Evaluating large language models using national endodontic specialty examination questions: are they ready for real-world dentistry?

BMC Med Educ. 2025 Oct 2;25(1):1308. doi: 10.1186/s12909-025-07896-z.

ABSTRACT

BACKGROUND: Large Language Models (LLMs) are artificial intelligence (AI) systems that simulate human language processing using deep learning techniques and neural networks. They are increasingly used for clinical decision support, student training, and enhancing educational processes. However, the reliability of AI models, particularly in answering different types of questions, remains a point of debate. Standard multiple-choice questions (MCQs) require selecting the one correct answer from five options, whereas combination-type MCQs (C-MCQs) require identifying all correct statements among several alternatives. This study aimed to evaluate and compare the performance of various LLMs in answering MCQs and C-MCQs in endodontics.

METHODS: A total of 151 endodontic questions were identified through a comprehensive review of the publicly available Dentistry Specialty Examinations administered in Turkey since 2012. The questions were presented in Turkish to eight LLMs (ChatGPT-4o, ChatGPT-4, Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini 2.0 Flash, Copilot, Deepseek-V3, and Qwen2.5-Max). Accuracy rates for both MCQs and C-MCQs were statistically compared using SPSS v23, with significance set at p < 0.05.
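
The abstract does not name the statistical test run in SPSS, so the following is a minimal sketch of one plausible analysis: tallying each model's correct and incorrect answers and applying a chi-square test of independence across models. Only the two reported accuracy rates (81.5% and 57%) are taken from the study; every other count and the test choice itself are assumptions.

```python
# Sketch of a model-comparison step, assuming a chi-square test of
# independence on correct/incorrect counts; the paper only states that
# SPSS v23 was used (p < 0.05), so the test and most tallies are assumed.
from scipy.stats import chi2_contingency

N_QUESTIONS = 151

# (correct, incorrect) counts per model; the first two rows are derived
# from the reported 81.5% and 57% accuracy rates, the rest are omitted.
counts = {
    "ChatGPT-4o":       (123, 28),  # ~81.5% of 151, as reported
    "Gemini 1.5 Flash": (86, 65),   # ~57.0% of 151, as reported
    # ... the remaining six models would be added here
}

# Contingency table: rows = models, columns = correct/incorrect.
table = [list(row) for row in counts.values()]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")  # significant if p < 0.05
```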

RESULTS: ChatGPT-4o achieved the highest overall accuracy (81.5%), while Gemini 1.5 Flash had the lowest (57%). On standard MCQs, ChatGPT-4o significantly outperformed the other models (p < 0.001), but on C-MCQs no significant difference was observed between models (p = 0.179). Across all models, accuracy on C-MCQs was significantly lower than on MCQs (p < 0.05). Deepseek-V3 maintained a more balanced performance across question types than the other models.
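
Because the MCQ and C-MCQ items are separate question sets, the per-model comparison between question types can be read as a test of two independent proportions; Fisher's exact test is one reasonable choice, though the abstract does not say which test was used. The counts below are placeholders, not figures from the study.

```python
# Sketch of the per-model MCQ vs. C-MCQ comparison, assuming independent
# proportions; both 2x2 rows are illustrative placeholders, not study data.
from scipy.stats import fisher_exact

mcq_counts   = (95, 25)  # hypothetical (correct, incorrect) on standard MCQs
c_mcq_counts = (14, 17)  # hypothetical (correct, incorrect) on C-MCQs

_, p = fisher_exact([list(mcq_counts), list(c_mcq_counts)])
print(f"Fisher exact p = {p:.4f}")  # C-MCQ accuracy significantly lower if p < 0.05
```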

CONCLUSIONS: LLMs show promising potential as educational tools in endodontics. However, their accuracy varies by question type and model. They can support student learning and clinical decision-making but cannot yet be considered a fully reliable standalone source in endodontics.

PMID:41039560 | DOI:10.1186/s12909-025-07896-z