J Educ Eval Health Prof. 2025;22:36. doi: 10.3352/jeehp.2025.22.36. Epub 2025 Nov 18.
ABSTRACT
PURPOSE: This study systematically evaluates and compares the performance of large language models (LLMs) in answering medical licensing examination questions. By conducting subgroup analyses based on language, question format, and model type, this meta-analysis aims to provide a comprehensive overview of LLM capabilities in medical education and clinical decision-making.
METHODS: This systematic review, registered in PROSPERO and following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, searched MEDLINE (PubMed), Scopus, and Web of Science for relevant articles published up to February 1, 2025. The search strategy included Medical Subject Headings (MeSH) terms and keywords related to (“ChatGPT” OR “GPT” OR “LLM variants”) AND (“medical licensing exam*” OR “medical exam*” OR “medical education” OR “radiology exam*”). Eligible studies evaluated LLM accuracy on medical licensing examination questions. Pooled accuracy was estimated using a random-effects model, with subgroup analyses by LLM type, language, and question format. Publication bias was assessed using Egger’s regression test.
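For readers who want to see what the pooling described above involves, the sketch below outlines a random-effects meta-analysis of proportions with an I² heterogeneity estimate and Egger's regression test in Python. The DerSimonian-Laird estimator, the raw-proportion scale, and the per-study counts are all illustrative assumptions: the abstract does not specify the software, the between-study variance estimator, or the transformation used, and the numbers shown are placeholders, not the review's data.

```python
# Minimal sketch, NOT the authors' analysis: DerSimonian-Laird
# random-effects pooling of per-study accuracies, with I^2 and
# Egger's regression test. All study counts are hypothetical.
import numpy as np
import statsmodels.api as sm

# Hypothetical per-study data: k correct answers out of n questions.
k = np.array([180, 220, 150, 300])
n = np.array([250, 280, 240, 380])

p = k / n              # per-study accuracy (raw proportions for
var = p * (1 - p) / n  # simplicity; reviews often pool on the logit scale)

# Fixed-effect weights and Cochran's Q for heterogeneity
w = 1.0 / var
p_fixed = np.sum(w * p) / np.sum(w)
Q = np.sum(w * (p - p_fixed) ** 2)
df = len(p) - 1

# DerSimonian-Laird between-study variance tau^2 and the I^2 statistic
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / c)
I2 = max(0.0, 100.0 * (Q - df) / Q)

# Random-effects pooled accuracy and 95% confidence interval
w_re = 1.0 / (var + tau2)
p_re = np.sum(w_re * p) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
ci = (p_re - 1.96 * se_re, p_re + 1.96 * se_re)

# Egger's test: regress the standardized effect on precision;
# the intercept's p-value indicates small-study (publication) bias.
precision = 1.0 / np.sqrt(var)
fit = sm.OLS(p / np.sqrt(var), sm.add_constant(precision)).fit()
egger_p = fit.pvalues[0]

print(f"pooled={p_re:.3f}  CI=({ci[0]:.3f}, {ci[1]:.3f})  "
      f"I2={I2:.1f}%  Egger intercept p={egger_p:.3f}")
```

The subgroup analyses by model, language, and question format would apply the same machinery to the subset of studies in each subgroup, then compare pooled estimates across subgroups.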
RESULTS: The search identified 2,404 records. After duplicate removal and title and abstract screening, 36 studies met the inclusion criteria on full-text review. The pooled accuracy was 72% (95% confidence interval, 70.0% to 75.0%) with high heterogeneity (I²=99%, P<0.001). Among LLMs, GPT-4 achieved the highest accuracy (81%), followed by Bing (79%), Claude (74%), Gemini/Bard (70%), and GPT-3.5 (60%) (P=0.001). Performance differences across languages (range, 62% in Polish to 77% in German) were not statistically significant (P=0.170).
CONCLUSION: LLMs, particularly GPT-4, can match or exceed medical students’ examination performance and may serve as supportive educational tools. However, due to variability and the risk of errors, they should be used cautiously as complements rather than replacements for traditional learning methods.
PMID:41248547 | DOI:10.3352/jeehp.2025.22.36