Eur Radiol. 2025 Aug 22. doi: 10.1007/s00330-025-11924-3. Online ahead of print.
ABSTRACT
OBJECTIVE: To evaluate the potential of large language models (LLMs) in differentiating intra-axial primary brain tumors using structured magnetic resonance imaging (MRI) reports, and to compare their performance with that of radiologists.
MATERIALS AND METHODS: Structured reports of preoperative MRI findings from 137 surgically confirmed intra-axial primary brain tumors, including Glioblastoma (n = 77), Central Nervous System (CNS) Lymphoma (n = 22), Astrocytoma (n = 9), Oligodendroglioma (n = 9), and others (n = 20), were analyzed by multiple LLMs, including GPT-4, Claude-3-Opus, Claude-3-Sonnet, GPT-3.5, Llama-2-70B, Qwen1.5-72B, and Gemini-Pro-1.0. The models provided the top 5 differential diagnoses based on the preoperative MRI findings, and their top 1, 3, and 5 accuracies were compared with board-certified neuroradiologists’ interpretations of the actual preoperative MRI images.
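The top 1, 3, and 5 accuracies described above can be illustrated with a minimal sketch. It assumes each model returns a ranked list of five diagnoses per case and that a case counts as correct at rank k when the surgically confirmed diagnosis appears among the first k entries. The function name, the exact string matching, and the toy cases are hypothetical and are not taken from the study.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    confirmed: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose confirmed diagnosis is among the first k ranked predictions."""
    if len(ranked_predictions) != len(confirmed):
        raise ValueError("one ranked prediction list is required per case")
    if not confirmed:
        raise ValueError("at least one case is required")
    # Exact string matching is an assumption made for this sketch; the study's
    # own criteria for judging a diagnosis correct are not given in the abstract.
    hits = sum(truth in preds[:k] for preds, truth in zip(ranked_predictions, confirmed))
    return hits / len(confirmed)


# Hypothetical toy cases (not study data): four ranked differential lists.
toy_predictions = [
    ["Glioblastoma", "CNS Lymphoma", "Astrocytoma", "Oligodendroglioma", "Ependymoma"],
    ["Glioblastoma", "Astrocytoma", "CNS Lymphoma", "Oligodendroglioma", "Ependymoma"],
    ["Astrocytoma", "Glioblastoma", "Ependymoma", "CNS Lymphoma", "Oligodendroglioma"],
    ["Glioblastoma", "Astrocytoma", "Ependymoma", "Oligodendroglioma", "Ganglioglioma"],
]
toy_confirmed = ["Glioblastoma", "CNS Lymphoma", "Oligodendroglioma", "CNS Lymphoma"]

for k in (1, 3, 5):
    print(f"top {k} accuracy: {top_k_accuracy(toy_predictions, toy_confirmed, k):.2f}")
```

Because the first k entries always include the first k - 1, top k accuracy can only stay the same or rise as k grows, which is why top 5 figures are never lower than top 1 figures for the same reader or model.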
RESULTS: Radiologists achieved top 1, 3, and 5 accuracies of 85.4%, 94.9%, and 94.9%, respectively. Among the LLMs, GPT-4 performed best with top 1, 3, and 5 accuracies of 65.7%, 84.7%, and 90.5%, respectively. Notably, GPT-4’s top 3 accuracy of 84.7% approached the radiologists’ top 1 accuracy of 85.4%. Other LLMs showed varying performance levels, with average accuracies ranging from 62.3% to 75.9%. LLMs demonstrated high accuracy for Glioblastoma but struggled with CNS Lymphoma and other less common tumors, particularly in top 1 accuracy.
CONCLUSION: LLMs show promise as assistive tools for differentiating intra-axial primary brain tumors using structured MRI reports. However, a significant gap remains between their performance and that of board-certified neuroradiologists interpreting actual images. The choice of LLM and tumor type significantly influences the results.
KEY POINTS: Question: How do large language models (LLMs) perform when differentiating complex intra-axial primary brain tumors from structured MRI reports compared to radiologists interpreting images? Findings: Radiologists outperformed all tested LLMs in diagnostic accuracy. The best model, GPT-4, showed promise but lagged considerably behind radiologists, particularly for less common tumors. Clinical relevance: LLMs show potential as assistive tools for generating differential diagnoses from structured MRI reports, particularly for non-specialists, but they cannot currently replace the nuanced diagnostic expertise of a board-certified radiologist interpreting the primary image data.
PMID:40847080 | DOI:10.1007/s00330-025-11924-3