Abdom Radiol (NY). 2026 Apr 7. doi: 10.1007/s00261-026-05492-3. Online ahead of print.
ABSTRACT
OBJECTIVES: This natural language processing feasibility study compared four large language models (LLMs) in terms of (a) reproducibility and (b) predictive accuracy for International Society of Urological Pathology Grade Groups (ISUP GGs), based on structured text reports from prostate multiparametric magnetic resonance imaging (mpMRI).
METHODS: The LLMs first performed a round of ISUP GG predictions based solely on the mpMRI text reports, followed by a second round of predictions that incorporated clinical information. Each prediction round was repeated three times to assess consistency. Three radiologists independently completed the same two rounds of ISUP GG predictions and then performed a third round of assessment after reviewing the LLMs’ predictions. Response times were recorded.
RESULTS: The study included 150 patients (median age, 69 years). Statistically significant differences were observed among the ISUP GGs in age, PSA level, prostate volume, PSA density, and PI-RADS score. The four LLMs demonstrated good to excellent reproducibility (Kappa 0.671-0.861). ChatGPT-4.1 had the shortest response time (0.95-17.19 s). The accuracy of the LLMs (32.7-50.0%) was significantly lower than that of the senior radiologist (72.7-76.0%) and the intermediate-level radiologist (66.0-68.7%), but comparable to that of the junior radiologist (59.3-65.3%).
CONCLUSION: General-purpose LLMs demonstrate excellent reproducibility. Although ChatGPT-4.1 outperforms the other LLMs in ISUP GG prediction accuracy and response time, its predictive accuracy remains inferior to that of intermediate-level and senior radiologists. Task-specific fine-tuning is therefore necessary before general-purpose LLMs can be applied in clinical practice.
PMID:41945149 | DOI:10.1007/s00261-026-05492-3