
Comparison of ChatGPT and DeepSeek on a Standardized Audiologist Qualification Examination in Chinese: Observational Study

JMIR Form Res. 2025 Nov 28;9:e79534. doi: 10.2196/79534.

ABSTRACT

BACKGROUND: Generative artificial intelligence (GenAI), exemplified by ChatGPT and DeepSeek, is rapidly advancing and reshaping human-computer interaction with its growing reasoning capabilities and broad applications across fields such as medicine and education.

OBJECTIVE: This study aimed to evaluate the performance of 2 GenAI models (ie, GPT-4-turbo and DeepSeek-R1) on a standardized audiologist qualification examination in Chinese and to explore their potential applicability in audiology education and clinical training.

METHODS: The 2024 Taiwan Audiologist Qualification Examination, comprising 300 multiple-choice questions across 6 subject areas (ie, basic hearing science, behavioral audiology, electrophysiological audiology, principles and practice of hearing devices, health and rehabilitation of the auditory and balance systems, and hearing and speech communication disorders [including professional ethics]), was used to assess the performance of the 2 GenAI models. The complete answering process and reasoning paths of the models were recorded, and performance was analyzed by overall accuracy, subject-specific scores, and question-type scores. Statistical comparisons were performed at the item level using the McNemar test.

RESULTS: ChatGPT and DeepSeek achieved overall accuracies of 80.3% (241/300) and 79.3% (238/300), respectively, both exceeding the passing criterion of the Taiwan Audiologist Qualification Examination (ie, 60% correct answers). The accuracies for the 6 subject areas were 88% (44/50), 70% (35/50), 86% (43/50), 76% (38/50), 82% (41/50), and 80% (40/50) for ChatGPT and 82% (41/50), 72% (36/50), 78% (39/50), 80% (40/50), 80% (40/50), and 84% (42/50) for DeepSeek. No significant differences were found between the 2 models at the item level (overall P=.79), with a small effect size (accuracy difference=+1%, Cohen h=0.02, odds ratio 0.90, 95% CI 0.53-1.52) and substantial agreement (κ=0.71). ChatGPT scored highest in basic hearing science (88%), whereas DeepSeek performed best in hearing and speech communication disorders (84%). Both models scored lowest in behavioral audiology (ChatGPT: 70%; DeepSeek: 72%). Question-type analysis revealed that both models performed well on reverse logic questions (ChatGPT: 79/95, 83%; DeepSeek: 80/95, 84%) but only moderately on complex multiple-choice questions (ChatGPT: 9/17, 53%; DeepSeek: 11/17, 65%), and both performed poorly on graph-based questions (ChatGPT: 2/11, 18%; DeepSeek: 4/11, 36%).
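The item-level comparison reported above (an exact McNemar test on paired correct/incorrect outcomes, plus Cohen h as an effect size for the two proportions) can be sketched in a few lines of stdlib Python. This is a minimal illustration of the statistics, not the study's analysis code, and the per-item vectors below are hypothetical:

```python
import math

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the discordant counts:
    b = items model A got right and model B got wrong, c = the reverse."""
    n = b + c
    k = min(b, c)
    # Exact binomial tail P(X <= k) with p = 0.5, doubled and capped at 1
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)

def cohen_h(p1: float, p2: float) -> float:
    """Cohen h effect size for two proportions (arcsine transform)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical per-item outcomes (1 = correct) for the two models on the same items
gpt = [1, 1, 0, 1, 0, 1, 1, 0]
ds  = [1, 0, 1, 1, 0, 1, 0, 1]
b = sum(1 for g, d in zip(gpt, ds) if g == 1 and d == 0)  # GPT right, DeepSeek wrong
c = sum(1 for g, d in zip(gpt, ds) if g == 0 and d == 1)  # DeepSeek right, GPT wrong
p_value = mcnemar_exact(b, c)
h = cohen_h(sum(gpt) / len(gpt), sum(ds) / len(ds))
```

Only the discordant pairs (b and c) inform the McNemar test; items both models answer correctly or incorrectly carry no information about which model is stronger, which is why the study also reports agreement (κ) separately.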

CONCLUSIONS: Both GenAI models demonstrated strong professional knowledge and stable reasoning ability, meeting the basic requirements of clinical audiologists and suggesting their potential as supportive tools in audiology education. However, the presence of errors underscores the need for cautious use under educator supervision. Future research should explore their performance in open-ended, real-world clinical scenarios to assess practical applicability and limitations.

PMID:41313805 | DOI:10.2196/79534

By Nevin Manimala
