
Evaluating the clinical utility of multimodal large language models for detecting age-related macular degeneration from retinal imaging

Sci Rep. 2025 Sep 26;15(1):33214. doi: 10.1038/s41598-025-18306-1.

ABSTRACT

This single-center retrospective study evaluated the performance of four multimodal large language models (MLLMs), namely ChatGPT-4o, Claude 3.5 Sonnet, Google Gemini 1.5 Pro, and Perplexity Sonar Large, in detecting and grading the severity of age-related macular degeneration (AMD) from ultra-widefield fundus images. Images from 76 patients (136 eyes; mean age 81.1 years; 69.7% female) seen at the University of California San Diego were graded independently for AMD severity by two junior retina specialists (with an adjudicating senior retina specialist for disagreements) using the Age-Related Eye Disease Study (AREDS) classification. The cohort included 17 (12.5%) eyes with ‘No AMD’, 18 (13.2%) with ‘Early AMD’, 50 (36.8%) with ‘Intermediate AMD’, and 51 (37.5%) with ‘Advanced AMD’. Between December 2024 and February 2025, each MLLM was prompted with single images and standardized queries to assess the primary outcomes of accuracy, sensitivity, and specificity in binary disease classification, disease severity grading, open-ended diagnosis, and multiple-choice diagnosis (with distractor diseases). Secondary outcomes included precision, F1 scores, Cohen’s kappa, model performance comparisons, and error analysis. ChatGPT-4o demonstrated the highest accuracy for binary disease classification [mean 0.824; 95% confidence interval (CI): 0.743, 0.875], followed by Perplexity Sonar Large [mean 0.815 (95% CI: 0.744, 0.879)], both of which were significantly more accurate (P < 0.00033) than Gemini 1.5 Pro [mean 0.669 (95% CI: 0.581, 0.743)] and Claude 3.5 Sonnet [mean 0.301 (95% CI: 0.221, 0.375)]. For severity grading, Perplexity Sonar Large was most accurate [mean 0.463 (95% CI: 0.368, 0.537)], though differences among models were not statistically significant. ChatGPT-4o led in open-ended and multiple-choice diagnostic tasks.
In summary, while MLLMs show promise for automated AMD detection and grading from fundus images, their current reliability is insufficient for clinical application, highlighting the need for further model development and validation.
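For readers unfamiliar with the evaluation metrics named in the abstract (accuracy, sensitivity, specificity, precision, F1, and Cohen's kappa), the sketch below shows how they are conventionally computed for a binary classification task such as AMD-present vs. AMD-absent. The labels and predictions here are purely illustrative, not the study's data, and the function is a generic textbook implementation rather than the authors' analysis code.

```python
def binary_metrics(y_true, y_pred):
    """Standard binary-classification metrics; positive class = 1 (AMD present)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall on diseased eyes
    specificity = tn / (tn + fp) if tn + fp else 0.0  # recall on healthy eyes
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    # Cohen's kappa: agreement between grader and model, corrected for the
    # agreement expected by chance given each rater's marginal rates.
    p_o = accuracy
    p_e = (((tp + fn) / n) * ((tp + fp) / n)
           + ((tn + fp) / n) * ((tn + fn) / n))
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision,
            "f1": f1, "kappa": kappa}

# Illustrative (hypothetical) labels: 1 = AMD present, 0 = no AMD
truth = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
preds = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
print(binary_metrics(truth, preds))
```

Note that sensitivity and specificity trade off against each other: a model that labels every eye as diseased scores perfect sensitivity but zero specificity, which is why the abstract reports both alongside overall accuracy.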

PMID:41006661 | DOI:10.1038/s41598-025-18306-1

By Nevin Manimala