JMIR Form Res. 2025 Dec 19;9:e75607. doi: 10.2196/75607.
ABSTRACT
BACKGROUND: Multiple-choice questions (MCQs) are essential in medical education for assessing knowledge and clinical reasoning. Traditional MCQ development involves expert reviews and revisions, which can be time-consuming and subject to bias. Large language models (LLMs) have emerged as potential tools for evaluating MCQ accuracy and efficiency. However, direct comparisons of these models in orthopedic MCQ assessments are limited.
OBJECTIVE: This study compared the performance of ChatGPT and DeepSeek in terms of correctness, response time, and reliability when answering MCQs from an orthopedic examination for medical students.
METHODS: This cross-sectional study included 209 orthopedic MCQs from summative assessments during the 2023-2024 academic year. ChatGPT (including the “Reason” function) and DeepSeek (including the “DeepThink” function) were used to identify the correct answers. Correctness and response times were recorded and compared using a χ2 test and Mann-Whitney U test where appropriate. The two LLMs’ reliability was assessed using the Cohen κ coefficient. The MCQs incorrectly answered by both models were reviewed by orthopedic faculty to identify ambiguities or content issues.
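As an illustration of the statistical workflow described above, the following Python sketch (using scipy and scikit-learn on hypothetical per-item data, not the study's results) shows how correctness could be compared with a χ2 test, response times with a Mann-Whitney U test, and reliability quantified as Cohen κ agreement between two repeated runs of the same model, which is one plausible reading of the per-model κ values reported.

```python
# Illustrative sketch of the statistical comparison described in METHODS.
# All data below are hypothetical placeholders, not the study's results.
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
n_items = 209  # number of MCQs

# Hypothetical per-item correctness (1 = correct, 0 = incorrect) for each model.
chatgpt_correct = rng.binomial(1, 0.80, n_items)
deepseek_correct = rng.binomial(1, 0.74, n_items)

# Chi-square test on the 2x2 table of correct/incorrect counts per model.
table = [
    [chatgpt_correct.sum(), n_items - chatgpt_correct.sum()],
    [deepseek_correct.sum(), n_items - deepseek_correct.sum()],
]
chi2, p_correct, _, _ = chi2_contingency(table)

# Mann-Whitney U test on per-item response times in seconds (non-normal).
chatgpt_times = rng.gamma(shape=1.0, scale=10.0, size=n_items)
deepseek_times = rng.gamma(shape=2.0, scale=17.0, size=n_items)
u_stat, p_time = mannwhitneyu(chatgpt_times, deepseek_times)

# Cohen kappa between two runs of the same model (test-retest reliability);
# the second run is simulated by flipping a small fraction of answers.
flip = rng.random(n_items) < 0.05
chatgpt_run2 = np.where(flip, 1 - chatgpt_correct, chatgpt_correct)
kappa = cohen_kappa_score(chatgpt_correct, chatgpt_run2)

print(f"Correctness: chi2={chi2:.2f}, P={p_correct:.3f}")
print(f"Response time: U={u_stat:.0f}, P={p_time:.3f}")
print(f"Test-retest kappa (ChatGPT runs): {kappa:.2f}")
```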
RESULTS: ChatGPT achieved a correctness rate of 80.4% (168/209), while DeepSeek achieved 74.2% (155/209; P=.04). ChatGPT’s Reason function also yielded a higher correctness rate than DeepSeek’s DeepThink function (177/209, 84.7% vs 168/209, 80.4%), although this difference was not statistically significant (P=.12). The average response time for ChatGPT was 10.40 (SD 13.29) seconds, significantly shorter than DeepSeek’s 34.42 (SD 25.48) seconds (P<.001). Regarding reliability, ChatGPT demonstrated almost perfect agreement (κ=0.81), whereas DeepSeek showed substantial agreement (κ=0.78). The same 16 of 209 MCQs (7.7%) were answered incorrectly by both models.
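The "almost perfect" and "substantial" labels follow the conventional Landis and Koch interpretation bands for Cohen κ (0.61-0.80 substantial, 0.81-1.00 almost perfect). A minimal helper illustrating this mapping is shown below; the function name is ours for illustration and does not come from the study.

```python
def interpret_kappa(kappa: float) -> str:
    """Map a Cohen kappa value to the Landis and Koch agreement bands."""
    bands = [
        (0.00, "slight"),
        (0.21, "fair"),
        (0.41, "moderate"),
        (0.61, "substantial"),
        (0.81, "almost perfect"),
    ]
    label = "poor"  # kappa below 0 indicates less-than-chance agreement
    for threshold, name in bands:
        if kappa >= threshold:
            label = name
    return label

print(interpret_kappa(0.81))  # almost perfect (ChatGPT)
print(interpret_kappa(0.78))  # substantial (DeepSeek)
```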
CONCLUSIONS: ChatGPT outperformed DeepSeek in correctness and response time, demonstrating its efficiency in evaluating orthopedic MCQs. Its high reliability suggests potential for integration into medical assessments. However, our results indicate that some MCQs will require revision by instructors to improve their clarity. Further studies are needed to evaluate the role of artificial intelligence in other disciplines and to validate other LLMs.
PMID:41418321 | DOI:10.2196/75607