Performance of large language models on the radiation and cancer biology practice exam

Front Oncol. 2026 May 19;16:1738955. doi: 10.3389/fonc.2026.1738955. eCollection 2026.

ABSTRACT

BACKGROUND/OBJECTIVES: Large language models (LLMs) are increasingly used in medicine for tasks ranging from patient communication to exam preparation. This study aimed to evaluate the feasibility of using a domain-specific, out-of-training-data radiation and cancer biology examination as a benchmarking framework for LLMs, and to compare the accuracy and consistency of commonly used LLMs available at the time of data collection.

METHODS: GPT-3.5, GPT-4, and Llama-2 were queried with 335 multiple-choice questions (MCQs) from the 2023 American Society for Radiation Oncology (ASTRO) Radiation and Cancer Biology Exam Study Guide, excluding image-based items. Each model answered all questions five times over three months to evaluate consistency. Model responses were scored against the official answer key and analyzed using one-way ANOVA with Bonferroni correction to determine statistical differences in accuracy.
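
The scoring and comparison steps described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names, question IDs, and all numbers are hypothetical, and only the F statistic and the Bonferroni-adjusted significance threshold are computed (a p-value would additionally require the F distribution, for example from a statistics package).

```python
from statistics import mean

def accuracy(responses, answer_key):
    """Fraction of questions in the answer key that one run answered correctly."""
    correct = sum(1 for qid, ans in answer_key.items() if responses.get(qid) == ans)
    return correct / len(answer_key)

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; each group is a list of per-run accuracies."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual runs around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold under Bonferroni correction."""
    return alpha / n_comparisons

# Hypothetical answer key and a single hypothetical run (not study data)
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
run = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}
run_accuracy = accuracy(run, answer_key)  # 3 of 4 answers match the key

# Invented per-run accuracies for three models (not the reported results)
toy_accuracies = [
    [0.70, 0.72, 0.71],
    [0.55, 0.58, 0.57],
    [0.40, 0.42, 0.41],
]
f_stat = one_way_anova_f(toy_accuracies)
pairwise_alpha = bonferroni_alpha(0.05, 3)  # three pairwise model comparisons
```

With three models there are three pairwise comparisons, so in this sketch each post hoc test would be judged against 0.05 / 3 rather than 0.05.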

RESULTS: GPT-4 achieved the highest accuracy, correctly answering 81% of questions, significantly outperforming GPT-3.5 (62%) and Llama-2 (51%) (p < 0.001). All models performed worse on questions requiring calculations, though differences were not statistically significant. In terms of reliability, GPT-4 and Llama-2 provided consistent responses more frequently than GPT-3.5. Despite stable overall scores, all models exhibited variability in individual responses across repeated trials. GPT-4 produced the longest explanations, averaging 183 words per answer.
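
The abstract does not state how response consistency was quantified. As one possible definition, assumed here purely for illustration, the sketch below counts the share of questions that received the same answer in every repeated run. The function name and the toy runs are hypothetical.

```python
def consistency_rate(runs):
    """Share of questions answered identically in every run.

    `runs` is a list of dicts mapping question ID to the chosen option;
    all runs are assumed to cover the same questions.
    """
    question_ids = list(runs[0].keys())
    unchanged = sum(
        1 for qid in question_ids if len({run[qid] for run in runs}) == 1
    )
    return unchanged / len(question_ids)

# Three hypothetical repeated runs over four questions (not study data)
toy_runs = [
    {"q1": "A", "q2": "C", "q3": "B", "q4": "D"},
    {"q1": "A", "q2": "C", "q3": "B", "q4": "A"},
    {"q1": "A", "q2": "B", "q3": "B", "q4": "D"},
]
rate = consistency_rate(toy_runs)  # q1 and q3 never change; q2 and q4 do
```

A model can keep a stable overall score while individual answers change between runs, which is why a per-question measure like this is reported separately from accuracy.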

CONCLUSIONS: This study demonstrates the feasibility of using a domain-specific, out-of-training-data examination to benchmark LLM knowledge in radiation and cancer biology. While performance differed among models, response variability across repeated trials and lower performance on calculation-based questions highlight the importance of rigorous benchmarking methodology and cautious interpretation when considering applications in medical education.

PMID:42239883 | PMC:PMC13225992 | DOI:10.3389/fonc.2026.1738955

By Nevin Manimala