Comparative evaluation of treatment recommendations generated by generative AI and breast cancer specialists for advanced and recurrent breast cancer: a multidimensional assessment of evidence interpretation and clinical decision support

Breast Cancer. 2026 May 20. doi: 10.1007/s12282-026-01858-z. Online ahead of print.

ABSTRACT

BACKGROUND: Large language models (LLMs), a form of generative artificial intelligence (AI), are increasingly explored for clinical applications due to their ability to synthesize medical information. In breast cancer care, where therapeutic decision-making is complex because of expanding treatment options, AI-based decision support tools may improve efficiency and consistency. However, their clinical validity and safety remain insufficiently evaluated in real-world settings. The present study aimed to compare the characteristics of treatment recommendations generated by a large language model with those proposed by breast cancer specialists using identical clinical information.

METHODS: This retrospective observational study included 100 patients with advanced or recurrent breast cancer. Clinical information was provided to ChatGPT using a standardized prompt to generate treatment recommendations. Six board-certified breast cancer specialists independently proposed treatment strategies for the same cases. Recommendations were evaluated using five predefined criteria: (1) comprehensiveness and appropriateness of literature selection, (2) accuracy of interpretation of key evidence, (3) clinical appropriateness and diversity of treatment options, (4) accuracy of adverse event information, and (5) time required to generate recommendations. Group comparisons were performed using linear mixed-effects models with case and evaluator as random effects.
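
For readers unfamiliar with the analysis, one plausible form of the model named above is sketched here. The abstract does not report the exact specification, so the notation, the single fixed effect for recommendation source, and the random-intercept structure are all assumptions made for illustration.

```latex
% Illustrative sketch only; the abstract does not state the exact model.
% y_{ijg}: score given by evaluator j to the recommendation for case i from source g
% x_g:     indicator for source (1 = ChatGPT, 0 = specialist)
\begin{align*}
  y_{ijg} &= \beta_0 + \beta_1 x_g + u_i + v_j + \varepsilon_{ijg}, \\
  u_i &\sim \mathcal{N}(0, \sigma_u^2) \quad \text{(random intercept for case)}, \\
  v_j &\sim \mathcal{N}(0, \sigma_v^2) \quad \text{(random intercept for evaluator)}, \\
  \varepsilon_{ijg} &\sim \mathcal{N}(0, \sigma^2).
\end{align*}
```

Under these assumptions, \beta_1 is the estimated mean score difference between ChatGPT and specialists after accounting for how much scores vary between cases and between evaluators, and the group comparison is a test of \beta_1 = 0.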

RESULTS: ChatGPT achieved higher scores than specialists across all five evaluation criteria, with the largest difference observed for accuracy of key evidence interpretation. The mean total score was 93.7 for ChatGPT and 44.1 for specialists, and linear mixed-effects model analysis confirmed a significantly higher total score for ChatGPT (p < 0.001). These findings suggest that AI-generated responses tended to provide more comprehensive and structured summaries of available evidence within the predefined evaluation framework.

CONCLUSIONS: LLMs were able to generate comprehensive and structured treatment recommendations based on the provided clinical information. However, these findings primarily reflect differences in information synthesis under predefined evaluation criteria rather than superiority in real-world clinical decision-making. Given the potential risk of hallucinations, AI should be positioned as an assistive tool to support specialist-led decision-making in breast cancer care.

PMID:42159946 | DOI:10.1007/s12282-026-01858-z

By Nevin Manimala