JMIR Form Res. 2026 Jun 22. doi: 10.2196/86999. Online ahead of print.
ABSTRACT
BACKGROUND: Large language models (LLMs) have demonstrated expert-level performance on medical licensing examinations, but most benchmarks focus on final accuracy, obscuring model-specific behaviors. Critical gaps remain in understanding model efficiency (latency), the efficacy of tiered “rescue” protocols for error correction, and the systematic correlation between performance and human-rated question difficulty. The German M2 exam, paired with the AMBOSS platform’s user-data-driven difficulty ratings, provides a unique opportunity to map AI performance directly against human cognitive load.
OBJECTIVE: This study aimed to move beyond singular accuracy scores by (1) evaluating and comparing the baseline (Tier 1) accuracy and response latency of next-generation rapid-response LLMs; (2) analyzing the efficacy of a two-tiered rescue (Tier 2) protocol in correcting initial errors; and (3) correlating model performance with the user-data-driven AMBOSS difficulty rating.
METHODS: We evaluated four LLMs (Gemini 2.5 Flash/Pro and ChatGPT 5 Instant/Thinking) on the complete 316-item German M2 (Fall 2024) medical exam, including all multimodal (image-based) questions. We used a zero-shot copy-paste prompting strategy and scored outputs against ground-truth answers using a strict exact-match criterion. The protocol had two tiers: Tier 1 (Flash/Instant) provided baseline responses, and when a Tier 1 response was incorrect, a Tier 2 (Pro/Thinking) model was deployed as a “rescue.” Performance was analyzed using McNemar’s test, the Wilcoxon signed-rank test, Fisher’s exact test, and logistic regression.
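The two-tiered protocol described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Item`, `Result`, `run_two_tier`, and `summarize` are hypothetical names, `tier1` and `tier2` stand in for any callable that returns an answer string, and exact match is assumed here to mean string equality after trimming whitespace and normalizing case.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Item:
    question: str
    answer: str  # ground-truth option, e.g. "C"


@dataclass
class Result:
    tier1_correct: bool
    rescued: bool  # True only if Tier 1 failed and Tier 2 succeeded

    @property
    def final_correct(self) -> bool:
        return self.tier1_correct or self.rescued


def exact_match(predicted: str, truth: str) -> bool:
    # Assumption: "strict exact match" compares the normalized option string.
    return predicted.strip().upper() == truth.strip().upper()


def run_two_tier(
    items: List[Item],
    tier1: Callable[[str], str],
    tier2: Callable[[str], str],
) -> List[Result]:
    results = []
    for item in items:
        if exact_match(tier1(item.question), item.answer):
            # Tier 1 was right, so the Tier 2 model is never called.
            results.append(Result(tier1_correct=True, rescued=False))
        else:
            # Tier 1 was wrong: escalate the same question to Tier 2.
            rescued = exact_match(tier2(item.question), item.answer)
            results.append(Result(tier1_correct=False, rescued=rescued))
    return results


def summarize(results: List[Result]) -> Dict[str, Optional[float]]:
    n = len(results)
    errors = sum(not r.tier1_correct for r in results)
    rescued = sum(r.rescued for r in results)
    return {
        "tier1_accuracy": (n - errors) / n,
        # Rescue rate is defined over Tier 1 errors only.
        "rescue_rate": rescued / errors if errors else None,
        "final_accuracy": sum(r.final_correct for r in results) / n,
    }
```

Note that the rescue rate uses the number of Tier 1 errors as its denominator, whereas final accuracy uses all items; this is why a modest rescue rate can still move final accuracy by several points.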
RESULTS: Baseline (Tier 1) accuracy was identical for Gemini 2.5 Flash and ChatGPT 5 Instant at 91.46% (289/316; 95% CI 87.85-94.06), with 27 errors each. However, Gemini Flash (mean 1.57 s) was significantly faster than ChatGPT Instant (mean 2.07 s; P < .001). Additionally, ChatGPT Instant took significantly longer on incorrect answers than on correct ones (P = .002), whereas Gemini Flash showed no such difference (P = .814). The Tier 2 rescue rate was higher for ChatGPT 5 Thinking (48.15%, 13/27; 95% CI 30.74-66.01) than for Gemini 2.5 Pro (33.33%, 9/27; 95% CI 18.64-52.18), although the difference was not statistically significant (P = .406). The rescue protocol raised final accuracy to 94.30% (95% CI 91.18-96.37) for the Gemini system and 95.57% (95% CI 92.70-97.34) for the ChatGPT system (P = .481). Performance was strongly and inversely related to difficulty: for every one-point increase in difficulty, the odds of a correct Tier 1 response decreased by 42.1% (OR 0.579, 95% CI 0.425-0.788; P < .001) for Gemini Flash and by 47.7% (OR 0.523, 95% CI 0.379-0.720; P < .001) for ChatGPT Instant. This inverse association persisted even after the rescue (P = .013 and P = .006, respectively).
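The arithmetic behind these figures can be checked with a short sketch. The abstract does not name the confidence interval method; the Wilson score interval is assumed here because it appears to match the reported bounds to rounding. The function names and constants below are illustrative, and the counts are taken from the text above.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% standard normal quantile


def wilson_ci(successes: int, n: int, z: float = Z_95) -> tuple:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def pct_odds_decrease(odds_ratio: float) -> float:
    """Percent drop in odds per one-point difficulty increase: (1 - OR) * 100."""
    return (1 - odds_ratio) * 100


N_ITEMS = 316
TIER1_CORRECT = 289
TIER1_ERRORS = N_ITEMS - TIER1_CORRECT  # 27 for both Tier 1 models
RESCUED = {"Gemini 2.5 Pro": 9, "ChatGPT 5 Thinking": 13}


def final_accuracy(rescued: int) -> float:
    """Accuracy after Tier 2 corrects `rescued` of the Tier 1 errors."""
    return (TIER1_CORRECT + rescued) / N_ITEMS


if __name__ == "__main__":
    lo, hi = wilson_ci(TIER1_CORRECT, N_ITEMS)
    print(f"Tier 1: {TIER1_CORRECT / N_ITEMS:.2%} ({lo:.2%} to {hi:.2%})")
    for model, k in RESCUED.items():
        r_lo, r_hi = wilson_ci(k, TIER1_ERRORS)
        print(f"{model}: rescue {k / TIER1_ERRORS:.2%} "
              f"({r_lo:.2%} to {r_hi:.2%}), final {final_accuracy(k):.2%}")
    for odds_ratio in (0.579, 0.523):
        print(f"OR {odds_ratio}: odds fall by {pct_odds_decrease(odds_ratio):.1f}%")
```

The odds ratio conversion makes explicit that the reported 42.1% and 47.7% are decreases in odds, not in the probability of a correct answer.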
CONCLUSIONS: Expert-level LLM performance on the German M2 exam masks a critical, systematic vulnerability: accuracy decreases significantly as question difficulty increases. A two-tiered “rescue” system is an effective strategy to mitigate these difficulty-based failures, raising final accuracy to roughly 94% to 96% and rivaling the best-performing, full-capacity models. We conclude that relying on a single model is insufficient; hierarchical systems that manage query difficulty are essential for safe and effective integration into medical education.
PMID:42334858 | DOI:10.2196/86999