J Prosthodont. 2026 Apr 14. doi: 10.1111/jopr.70136. Online ahead of print.
ABSTRACT
PURPOSE: The aim was to evaluate the ability of four large language models (LLMs; OpenAI’s ChatGPT-3.5, Microsoft 365 Copilot, DeepSeek-R1, and Google Gemini 2.5 Pro) to develop treatment options when presented with clinical cases published in the maxillofacial prosthodontics literature.
MATERIALS AND METHODS: Six maxillofacial case reports were presented to the LLMs with a prompt requesting prosthodontic treatment options from the perspective of a prosthodontist. Expert evaluators scored the relevance, clarity, depth, focus, and coherence of the responses. Statistical analyses, including descriptive statistics, two-way analysis of variance (ANOVA), post hoc Tukey tests, Pearson correlation analyses, and intraclass correlation coefficients (ICCs), were performed (α = 0.05).
RESULTS: There were significant differences among the total mean relevance (p = 0.003), clarity (p = 0.006), depth (p < 0.001), focus (p < 0.001), and coherence (p < 0.001) scores of the chatbots. Copilot consistently scored the lowest, and Gemini or DeepSeek scored the highest for all five factors. The depth (p = 0.006), focus (p = 0.024), and coherence (p = 0.013) scores given by senior prosthodontists were slightly higher than those given by junior prosthodontists. Pearson correlation analysis revealed positive correlations among the total mean scores for all five factors (p < 0.001).
CONCLUSIONS: The study demonstrates the ability of LLMs to develop maxillofacial prosthetic treatment plans tailored to specific clinical scenarios. Significant differences were found among the abilities of the LLMs evaluated. Copilot scored the lowest for all factors evaluated, and Gemini and/or DeepSeek scored the highest.
PMID:41978968 | DOI:10.1111/jopr.70136