Evaluating Large Language Models in Ophthalmology: Systematic Review

J Med Internet Res. 2025 Oct 27;27:e76947. doi: 10.2196/76947.

ABSTRACT

BACKGROUND: Large language models (LLMs) have the potential to revolutionize ophthalmic care, but evaluation practices remain fragmented. A systematic assessment is crucial to identify gaps and guide future evaluation practices and clinical integration.

OBJECTIVE: This study aims to map the current landscape of LLM evaluations in ophthalmology and explore whether performance synthesis is feasible for a common task.

METHODS: A comprehensive search of PubMed, Web of Science, Embase, and IEEE Xplore was conducted up to November 17, 2024 (no language limits). Eligible publications quantitatively assessed an existing or modified LLM on ophthalmology-related tasks. Studies without full-text availability or those focusing solely on vision-only models were excluded. Two reviewers screened studies and extracted data across 6 dimensions (evaluated LLM, data modality, ophthalmic subspecialty, medical task, evaluation dimension, and clinical alignment); disagreements were resolved by a third reviewer. Descriptive statistics were analyzed and visualized in Python (NumPy, Pandas, SciPy, and Matplotlib). The Fisher exact test compared open- versus closed-source models. An exploratory random-effects meta-analysis (logit transformation; DerSimonian-Laird τ²) was performed for the diagnosis-making task; heterogeneity was quantified with I², with subgroup analyses by model, modality, and subspecialty.
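Editor's note: the following is a minimal Python sketch of the pooling approach described above (logit-transformed accuracies combined under a DerSimonian-Laird τ² random-effects model, with I² for heterogeneity). It is not the authors' code; the function name dl_pool_logit, the 0.5 continuity correction, and the toy success/total counts are my assumptions for illustration.

```python
import numpy as np

def dl_pool_logit(successes, totals):
    """Pool per-evaluation accuracies on the logit scale with a
    DerSimonian-Laird random-effects model (hypothetical sketch).

    successes / totals: correct answers and question counts per evaluation.
    Returns (pooled accuracy, (95% CI low, high), I^2 in %).
    """
    s = np.asarray(successes, dtype=float)
    n = np.asarray(totals, dtype=float)

    # 0.5 continuity correction guards against 0% or 100% accuracies.
    p = (s + 0.5) / (n + 1.0)
    y = np.log(p / (1.0 - p))                        # logit-transformed accuracy
    v = 1.0 / (s + 0.5) + 1.0 / (n - s + 0.5)        # within-study variance of the logit

    # Fixed-effect (inverse-variance) estimate feeds Cochran's Q.
    w = 1.0 / v
    y_fixed = np.sum(w * y) / np.sum(w)
    Q = np.sum(w * (y - y_fixed) ** 2)
    df = len(y) - 1

    # DerSimonian-Laird between-study variance tau^2 and I^2.
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (Q - df) / c)
    i2 = 100.0 * max(0.0, (Q - df) / Q) if Q > 0 else 0.0

    # Random-effects pooled logit and 95% CI, back-transformed to accuracy.
    w_re = 1.0 / (v + tau2)
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1.0 / np.sum(w_re))
    expit = lambda x: 1.0 / (1.0 + np.exp(-x))
    return expit(y_re), (expit(y_re - 1.96 * se), expit(y_re + 1.96 * se)), i2

# Toy example with made-up counts (three diagnostic evaluations).
acc, ci, i2 = dl_pool_logit([42, 18, 60], [70, 40, 75])
print(f"pooled accuracy={acc:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f}), I2={i2:.1f}%")
```

The continuity correction is one common way to handle evaluations with 0% or 100% accuracy; the abstract does not describe how the authors handled such cases.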

RESULTS: Of the 817 identified records, 187 studies met the inclusion criteria. Closed-source LLMs dominated: ChatGPT was evaluated in 170 studies, Gemini in 58, and Copilot in 32. Open-source LLMs appeared in only 25 (13.4%) studies overall, but in 17 (77.3%) of the evaluation-after-development studies versus 8 (4.8%) of the pure-evaluation studies (P<1×10⁻⁵). Evaluations were chiefly text-only (n=168); despite the centrality of imaging in ophthalmology, image-text tasks were used in only 19 studies. Subspecialty coverage was skewed toward comprehensive ophthalmology (n=72), retina and vitreous (n=32), and glaucoma (n=20); refractive surgery, ocular pathology and oncology, and ophthalmic pharmacology each appeared in 3 or fewer studies. Medical query (n=86), standardized examination (n=41), and diagnosis making (n=29) were the 3 predominant tasks, while research assistance (n=5), patient triaging (n=3), and disease prediction (n=3) received less attention. Accuracy was reported in most studies (n=176), whereas calibration and uncertainty were assessed in almost none (n=5). Real-world patient data (n=45), human performance comparison (n=63), non-English testing (n=24), and real-world deployment (n=4) were comparatively scarce. The exploratory meta-analysis pooled 28 diagnostic evaluations from 17 studies: overall accuracy was 0.594 (95% CI 0.488-0.692) with extreme heterogeneity (I²=94.5%). Subgroups remained heterogeneous (I²>80%), and findings were inconsistent (eg, pooled GPT-3.5 accuracy exceeded that of GPT-4).
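Editor's note: the open- versus closed-source comparison above can be reproduced approximately with SciPy's Fisher exact test. The 2×2 table below is back-calculated from the reported percentages (17 of ~22 evaluation-after-development studies vs 8 of ~165 pure-evaluation studies); that 22/165 split is my inference, not stated explicitly in the abstract.

```python
from scipy.stats import fisher_exact

# 2x2 table (rows: evaluation-after-development, pure-evaluation;
# columns: open-source LLM used, not used). The group sizes 22 and 165
# are back-calculated from the reported 77.3% and 4.8%, so treat the
# table as an approximation of the authors' data.
table = [[17, 22 - 17],
         [8, 165 - 8]]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio={odds_ratio:.1f}, P={p_value:.1e}")  # P falls well below 1e-5
```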

CONCLUSIONS: Evidence on LLM evaluations in ophthalmology is extensive but heterogeneous. Most studies have tested a few closed-source LLMs on text-based questions, leaving open-source systems, multimodal tasks, non-English contexts, and real-world deployment underexamined. High methodological variability precludes meaningful performance aggregation, as illustrated by the heterogeneous meta-analysis. Standardized, multimodal benchmarks and phased clinical validation pipelines are urgently needed before LLMs can be safely integrated into eye care workflows.

PMID:41144954 | DOI:10.2196/76947
