Diagnostic accuracy of artificial intelligence models in childhood exanthematous diseases: a comparative analysis against clinical diagnosis

Eur J Pediatr. 2025 Dec 22;185(1):33. doi: 10.1007/s00431-025-06693-6.

ABSTRACT

PURPOSE: Differentiating among exanthematous diseases is frequently challenging because of their overlapping symptomatology. We therefore aimed to evaluate the diagnostic accuracy of a consultant physician, a resident physician, and three AI models (ChatGPT-5, Gemini, and Copilot) in this context.

METHODS: We prospectively enrolled 291 patients treated for exanthematous diseases at our clinic between January 2024 and July 2025. The AI models were first asked to make a diagnosis from cutaneous images alone, and then from the images together with the accompanying clinical findings. The diagnoses rendered by the consultant, the resident, and the AI models were then compared against the definitive diagnosis.
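
As a rough illustration of this comparison step (a hypothetical sketch, not the authors' analysis code), each rater's diagnostic accuracy reduces to the proportion of the 291 cases in which their diagnosis matches the definitive diagnosis. The function name and labels below are invented for the example.

```python
# Hypothetical sketch: per-rater diagnostic accuracy against the
# definitive diagnosis (not the study's actual analysis code).

def diagnostic_accuracy(rater_diagnoses, definitive_diagnoses):
    """Proportion of cases where the rater matches the definitive diagnosis."""
    if len(rater_diagnoses) != len(definitive_diagnoses):
        raise ValueError("Expected one diagnosis per case for each rater")
    matches = sum(r == d for r, d in zip(rater_diagnoses, definitive_diagnoses))
    return matches / len(definitive_diagnoses)

# Toy labels, invented for illustration only:
definitive = ["measles", "scarlet fever", "roseola", "rubella"]
model = ["measles", "scarlet fever", "measles", "rubella"]
print(f"accuracy = {diagnostic_accuracy(model, definitive):.1%}")  # 75.0%
```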

RESULTS: When benchmarked against the definitive diagnosis, the consultant achieved the highest diagnostic accuracy (96.6%), followed by ChatGPT with clinical data (86.9%), Copilot with clinical data (81.4%), Gemini with clinical data (78.7%), and the resident physician (72.5%). In contrast, the models without clinical data performed poorly, with the lowest accuracy, 30.6%, recorded by Copilot. In ROC analysis against the consultant, the resident (AUC .875) and the AI models with clinical data (ChatGPT, AUC .898; Gemini, AUC .856; Copilot, AUC .818) all demonstrated good diagnostic power (p < .001). ChatGPT without clinical data showed moderate diagnostic power, whereas the AUCs of Copilot and Gemini without clinical data did not reach statistical significance. Sensitivity and specificity, respectively, were: ChatGPT with data, 89.7% and 90.0%; Copilot with data, 83.6% and 80.0%; Gemini with data, 81.1% and 90.0%; the resident, 75.1% and 100.0%; ChatGPT without data, 51.6% and 90.0%; Gemini without data, 33.5% and 100.0%; and Copilot without data, 31.7% and 100.0%. The consultant's diagnostic performance was significantly superior to that of all other raters and models (p < .001 for all comparisons).
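
For readers less familiar with these metrics: sensitivity is the proportion of reference-positive cases correctly called positive, specificity the proportion of reference-negative cases correctly called negative, and the AUC summarizes discrimination across all decision thresholds. The sketch below shows how such figures are derived; it is a hypothetical illustration with invented counts and scores, using scikit-learn's roc_auc_score, not the study's statistical code.

```python
# Hypothetical sketch of the reported metrics; all numbers are invented.
from sklearn.metrics import roc_auc_score

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Toy confusion-matrix counts:
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=18, fp=2)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")  # 90.0%, 90.0%

# Toy per-case reference labels and model confidence scores for an ROC AUC:
y_true = [1, 1, 1, 0, 0]              # 1 = reference-positive case
y_score = [0.9, 0.3, 0.8, 0.4, 0.1]   # model confidence in a positive call
print(f"AUC = {roc_auc_score(y_true, y_score):.3f}")  # 0.833
```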

CONCLUSION: This study establishes the diagnostic utility of AI models in pediatric exanthematous diseases, with ChatGPT-5 demonstrating the greatest accuracy when augmented with clinical data. The findings position these models as powerful assistive tools for clinicians but show that they do not yet supplant the expertise of a consultant physician, who remains the gold standard for diagnosis.

WHAT IS KNOWN:
• Overlapping clinical features of exanthematous diseases often lead to diagnostic uncertainty.
• Rash-focused artificial intelligence models frequently perform better when supplemented with clinical context rather than image data alone.

WHAT IS NEW:
• This study provides the first large-scale, multimodal comparison of three next-generation artificial intelligence models (ChatGPT-5, Gemini, Copilot) specifically in pediatric exanthematous diseases.
• It demonstrates the diagnostic performance gap between image-only and image-plus-clinical-data modes across multiple artificial intelligence models, quantifying the improvement provided by clinical context.
• By benchmarking artificial intelligence performance simultaneously against both a consultant and a resident physician, the study introduces a dual-reference standard, offering more nuanced insight into real-world clinical use cases.

PMID:41428260 | DOI:10.1007/s00431-025-06693-6
