Evaluating the Performance of Large Language Models for Breast Cancer Patient Education: A Comparative Study

J Cancer Educ. 2026 Jun 2. doi: 10.1007/s13187-026-02918-w. Online ahead of print.

ABSTRACT

Breast cancer necessitates effective patient education. Large language models (LLMs) facilitate patient health consultation, yet the medical content they generate may contain misleading or unsafe information. Systematic evaluations of mainstream LLMs for breast cancer health guidance are currently lacking. This study used a structured checklist to evaluate the performance of six LLMs (ChatGPT-5.4-thinking, Claude-4.6-sonnet, Gemini-3.1-Pro, DeepSeek-V3.2, Doubao-2.2-thinking, and ERNIE 4.5 Turbo) in breast cancer consultation. A set of 61 standardized questions about breast cancer was developed based on Google Trends, clinical guidelines, practical experience, and expert review. Three breast cancer experts independently evaluated the responses from each LLM for quality, accuracy, comprehensiveness, and safety. In addition, four patients independently rated satisfaction and understandability for three questions of interest that they had selected. Quality was assessed with Bernard’s Global Quality Score (GQS) tool, and readability with the Chinese Resource Platform (CRP). Other indicators were evaluated using self-designed questionnaires. Statistical analyses were performed using RStudio. In expert evaluations, ERNIE 4.5 Turbo had the highest descriptive quality score and was among the top-performing models in safety (Bonferroni-adjusted P < 0.05), while several models performed comparably in comprehensiveness. There was no significant difference in accuracy among the models. ChatGPT-5.4-thinking scored significantly lower in safety, and Doubao-2.2-thinking had significantly lower reading difficulty, required age, and Chinese character count (adjusted P < 0.05). In patient evaluations, ERNIE 4.5 Turbo showed the highest descriptive satisfaction and understandability ratings. The six LLMs performed strongly in breast cancer question-answering, with ERNIE 4.5 Turbo ranking highest. However, issues such as poor readability and unsafe recommendations remain in their answers. Future research should prioritize improving readability for patients to facilitate the application of AI in precision cancer health education.

PMID:42228312 | DOI:10.1007/s13187-026-02918-w

By Nevin Manimala