Aust Endod J. 2026 Jul 5. doi: 10.1111/aej.70108. Online ahead of print.
ABSTRACT
This study aimed to compare the ability of three artificial intelligence-based large language models, ChatGPT-4, Copilot, and Gemini, to generate multiple-choice questions. Two position statements from the European Society of Endodontology were used as source documents. Each model produced 40 questions using an identical prompt, and the resulting 120 questions were assessed for distractor quality, ability to distinguish different performance levels, reliability, and content validity. Weighted Kappa, Kruskal-Wallis, and post hoc Mann-Whitney U tests were used for analysis. Inter-rater agreement ranged from 0.870 to 1.000. ChatGPT-4 produced the highest overall scores, and Gemini consistently received the lowest ratings. Overall scores differed significantly between Copilot and Gemini and between ChatGPT-4 and Gemini (p < 0.05), but all three models produced poorly constructed distractor options. The findings indicate that artificial intelligence-based tools can support the generation of assessment materials in endodontics; however, expert oversight remains essential to ensure accuracy, quality, and educational relevance.
PMID:42402001 | DOI:10.1111/aej.70108