Urol Int. 2026 Mar 23:1-13. doi: 10.1159/000551610. Online ahead of print.
ABSTRACT
OBJECTIVE: To evaluate five advanced reasoning models on urology-related clinical multiple-choice questions from the MedQA dataset and to benchmark their performance against medical students and experienced urologists in terms of accuracy, response efficiency, and agreement patterns.
METHODS: We extracted 434 urology-relevant items and evaluated five models (DeepSeek-R1, ChatGPT O4-mini, Gemini 2.5 Pro, Claude 3.7 Sonnet, and Grok 3) using a standardized prompt. Accuracy was computed against reference answers; API response times and connection failures were recorded. In addition, 20 senior medical students and 20 experienced urologists answered subsets of the same item bank using a balanced block design; group-level majority-vote answers were used as human baselines. Statistical analyses included Cochran’s Q and McNemar tests (AI-only accuracy), a logistic generalized linear mixed-effects model (GLMM) with urologists as the reference (model-adjusted accuracy), Fleiss’ κ and Cohen’s κ (agreement), and Friedman and Wilcoxon signed-rank tests (response time).
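For readers wishing to run analyses of this kind on their own data, the sketch below shows how the agreement, accuracy, and response-time tests named above could be computed with standard Python libraries (statsmodels and SciPy). All arrays, shapes, and seeds are illustrative assumptions, not the study's data or the authors' code; the logistic GLMM with Dunnett-adjusted contrasts is omitted, as it is typically fit with dedicated mixed-model software.

# Illustrative sketch only: synthetic data standing in for the study's item bank.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# answers[i, j]: option (0-4) chosen on item i by responder j, for the five
# AI models plus the two human majority-vote baselines (7 answer sets total).
answers = rng.integers(0, 5, size=(434, 7))

# correct[i, j]: 1 if AI model j answered item i correctly, else 0.
correct = rng.integers(0, 2, size=(434, 5))

# Cochran's Q: do the five AI models differ in per-item accuracy?
print(cochrans_q(correct))

# Exact McNemar test for one pairwise comparison (e.g., models 0 and 1).
a, b = correct[:, 0], correct[:, 1]
table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
         [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
print(mcnemar(table, exact=True))

# Fleiss' kappa across all seven answer sets.
counts, _ = aggregate_raters(answers)
print(fleiss_kappa(counts, method="fleiss"))

# Response-time comparison: Friedman across the five models, Wilcoxon for a pair.
times = rng.gamma(shape=2.0, scale=5.0, size=(434, 5))  # seconds per item
print(friedmanchisquare(*[times[:, k] for k in range(5)]))
print(wilcoxon(times[:, 0], times[:, 1]))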
RESULTS: In the AI-only comparison, all models achieved high accuracy (86.9-93.3%), with DeepSeek-R1, ChatGPT O4-mini, and Gemini 2.5 Pro outperforming Claude 3.7 Sonnet and Grok 3. In the model-adjusted analysis, all five AI models showed significantly higher odds of correct answers than experienced urologists (all p < 0.001, Dunnett-adjusted), while medical students did not differ significantly from urologists. ChatGPT O4-mini had the shortest median API response time (5.03 s), whereas group-level median task completion times were 15.87 s for students and 17.57 s for urologists; Grok 3 was the slowest AI model (27.62 s). Connection failure rates were 0% for ChatGPT O4-mini, Gemini 2.5 Pro, and Claude 3.7 Sonnet; 1.6% for DeepSeek-R1; and 2.8% for Grok 3. Agreement across the five AI models and the two human majority-vote baselines was moderate-to-substantial (Fleiss’ κ = 0.685, p < 0.001).
CONCLUSION: Modern reasoning models achieve strong accuracy and efficiency on urology-focused benchmark questions, supporting their potential role as useful clinical assistants when implemented with appropriate human oversight. ChatGPT O4-mini’s low response latency further underscores its suitability for time-sensitive workflows, and the model-adjusted analyses indicate its accuracy was consistently higher than that of experienced urologists within this standardized assessment format.
PMID:41871224 | DOI:10.1159/000551610