Odontology. 2026 Feb 14. doi: 10.1007/s10266-026-01335-1. Online ahead of print.
ABSTRACT
Artificial intelligence (AI) chatbots are increasingly used by dental students for self-directed learning, yet their performance in specialty-level subjects such as oral medicine remains underexplored. Because oral medicine requires diagnostic and clinical reasoning across interdisciplinary domains, assessing AI competence in this field is necessary. This study aimed to evaluate and compare the performance of four advanced AI chatbots (ChatGPT-4, Microsoft Copilot, Google Gemini, and DeepSeek) in answering case-based oral medicine multiple-choice questions (MCQs) across Bloom’s cognitive levels and key subtopics. A total of 114 high-quality, case-based MCQs were developed and validated against authoritative references. Each question was classified according to Bloom’s taxonomy and mapped to one of six oral medicine subdomains. The chatbots’ responses were evaluated for accuracy, response time, and word count. Statistical comparisons were performed using Cochran’s Q test, the Friedman test, McNemar’s test, and Cohen’s kappa for inter-model agreement. All four chatbots demonstrated high overall accuracy (≥ 97.4%), with Microsoft Copilot achieving the numerically highest score (99.1%), although no statistically significant differences were observed among the models. ChatGPT-4 generated the fastest responses (mean: 7.0 s), while Copilot provided the most detailed explanations. Performance was consistent across cognitive levels, with near-perfect accuracy in the “Applying” and “Analyzing” domains. Accuracy across subtopics was also high, although minor discrepancies were noted in infectious diseases and oral potentially malignant disorders. Inter-chatbot agreement ranged from moderate to perfect (kappa = 0.315 to 1.00). Advanced AI chatbots, including ChatGPT-4, Copilot, Gemini, and DeepSeek, demonstrated similarly high performance in answering case-based MCQs in oral medicine.
PMID:41691106 | DOI:10.1007/s10266-026-01335-1
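The abstract names Cochran’s Q test, McNemar’s test, and Cohen’s kappa as the comparison and agreement statistics. The sketch below is a minimal illustration of how that kind of analysis can be run in Python with statsmodels and scikit-learn; it is not the authors’ code. The `correct` matrix, the simulated 90% accuracy, and the per-question correctness framing are placeholder assumptions, and the published study may have computed kappa on the selected answer options rather than on correctness.

```python
# Hypothetical sketch of the statistical comparison described in the abstract.
# Assumes a binary matrix `correct` of shape (114 questions x 4 chatbots),
# where 1 means the chatbot answered the question correctly.
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
models = ["ChatGPT-4", "Copilot", "Gemini", "DeepSeek"]
# Placeholder data for illustration only; the study's response data are not public.
correct = rng.binomial(1, 0.9, size=(114, len(models)))

# Omnibus comparison of correct/incorrect rates across the four chatbots.
q = cochrans_q(correct)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.3f}")

for i, j in combinations(range(len(models)), 2):
    # Pairwise comparison with McNemar's exact test on the 2x2 concordance table.
    table = np.zeros((2, 2), dtype=int)
    for a, b in zip(correct[:, i], correct[:, j]):
        table[a, b] += 1
    res = mcnemar(table, exact=True)

    # Inter-chatbot agreement on correctness via Cohen's kappa.
    kappa = cohen_kappa_score(correct[:, i], correct[:, j])
    print(f"{models[i]} vs {models[j]}: McNemar p = {res.pvalue:.3f}, kappa = {kappa:.3f}")
```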