
Temporal evolution of large language models (LLMs) in oncology

J Transl Med. 2025 Nov 4;23(1):1219. doi: 10.1186/s12967-025-07227-2.

ABSTRACT

BACKGROUND: Large language models (LLMs) are increasingly being applied in healthcare; however, their performance in specialized fields, such as oncology, is subject to temporal factors, including knowledge decay and concept drift. The impact of these temporal dynamics on LLM question-answering accuracy in oncology remains inadequately evaluated. This study aims to systematically assess the temporal evolution of LLM accuracy in responding to oncology-related questions using real-world data.

METHODS: We systematically collected relevant literature through 2025 by searching the PubMed, Google Scholar, and Web of Science databases for LLM-related keywords. The inclusion criteria were as follows: (1) cancer-related research; (2) clear and complete question descriptions; and (3) complete answers. The final sample (n = 23 studies) contained 614 research questions, comprising subjective questions (n = 223) and multiple-choice questions (n = 391). After randomizing the responses generated by three LLMs (ChatGPT-3.5, ChatGPT-4, and Gemini), we evaluated their accuracy across different cancer categories using both the original scoring criteria and Likert scale scoring. Data analysis was performed in R, employing random- or fixed-effects models to calculate pooled mean differences (MD) and relative risks (RR) with their 95% confidence intervals (CI).
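The random-effects pooling step described above can be sketched as follows. This is a minimal illustration of DerSimonian-Laird inverse-variance pooling (a standard random-effects method, though the abstract does not name the specific estimator used); the function name and the example inputs are hypothetical, not the study's data, and the study itself performed the analysis in R rather than Python.

```python
import math

def pool_random_effects(effects, variances):
    """Pool study-level effect estimates (e.g. mean differences) with a
    DerSimonian-Laird random-effects model.

    effects   : list of per-study effect estimates
    variances : list of per-study sampling variances
    Returns (pooled effect, 95% CI lower bound, 95% CI upper bound).
    """
    # Fixed-effect (inverse-variance) weights and pooled estimate
    w = [1.0 / v for v in variances]
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)

    # Cochran's Q and the DerSimonian-Laird between-study variance tau^2
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each study's variance
    w_re = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se
```

When between-study heterogeneity is negligible (tau^2 = 0), the random-effects estimate reduces to the fixed-effect estimate, which is why meta-analyses often report both.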

RESULTS: In both subjective and objective oncology assessments, ChatGPT-3.5 (subjective questions MD = -3.30; objective questions RR = 0.92) and ChatGPT-4 (subjective questions MD = -7.17; objective questions RR = 0.93) showed declining performance over time, while Gemini exhibited significant improvement over time (subjective questions MD = 11.48; objective questions RR = 1.15). Notably, ChatGPT-3.5's performance on subjective questions revealed a significant turning point between March 14, 2023, and April 26, 2023: it shifted from performing better on newer questions to performing worse than on the original questions, with the performance gap progressively widening.

CONCLUSIONS: Our meta-analysis reveals temporal performance degradation in ChatGPT-3.5 and ChatGPT-4, which contrasts with the consistent improvement observed in Gemini. These findings provide essential guidance for the evidence-based deployment of LLMs in oncology.

PMID:41188901 | DOI:10.1186/s12967-025-07227-2

By Nevin Manimala
