
Generative Artificial Intelligence for Medical Summarization in Prostate Cancer: Comparative Evaluation by Physicians and Patient Advocates-A Pilot Study

JCO Clin Cancer Inform. 2026 Apr;10:e2500316. doi: 10.1200/CCI-25-00316. Epub 2026 Apr 2.

ABSTRACT

PURPOSE: The exponential growth of scientific publications presents increasing challenges for clinicians and patients seeking to access up-to-date medical information. Language models (LMs) have emerged as powerful tools for generating and summarizing scientific content, but their performance in oncology remains insufficiently characterized from both professional and patient perspectives.

MATERIALS AND METHODS: We conducted a prospective, survey-based pilot study evaluating four LMs (Llama 3, Mistral Large 2, Gemma 2B, and Consensus) applied to the summarization and French translation of seven recent prostate cancer (PCa) abstracts. Each model received a standardized prompt to generate a summary of each abstract. Physicians (medical and radiation oncologists, urologists) and patients treated for PCa independently assessed the outputs using structured Likert-scale questionnaires covering qualitative criteria such as accuracy, usefulness, organization, and comprehensibility. Descriptive statistical analyses were then performed to characterize the distribution of responses across evaluation items.
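The descriptive analysis described above can be sketched as a simple tally of Likert responses per model. The ratings below are illustrative placeholders (the study's per-item data are not reproduced here), assuming a 1-5 scale where 5 = "strongly agree":

```python
from collections import Counter

# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
# These values are invented for illustration, not taken from the study.
ratings = {
    "Llama 3":         [4, 4, 5, 3, 4, 4],
    "Mistral Large 2": [4, 3, 4, 4, 5, 4],
    "Gemma 2B":        [5, 5, 4, 5, 4, 5],
    "Consensus":       [5, 4, 5, 5, 5, 4],
}

def likert_summary(scores):
    """Return the share of 'strongly agree' (5) and of positive (4-5) ratings."""
    counts = Counter(scores)
    n = len(scores)
    return {
        "strongly_agree": counts[5] / n,
        "positive": (counts[4] + counts[5]) / n,
    }

for model, scores in ratings.items():
    s = likert_summary(scores)
    print(f"{model}: strongly agree {s['strongly_agree']:.0%}, "
          f"positive {s['positive']:.0%}")
```

Reporting both the "strongly agree" share and the combined positive share mirrors the distinction the study draws between top-rated and overall positive evaluations.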

RESULTS: A total of 40 respondents (14 physicians, 26 patients) provided 280 individual evaluations. Among physicians, Consensus received the highest proportion of strongly agree ratings for all criteria, including completeness, accuracy, currency, organization, and usefulness; notably, only two physicians evaluated this model. Among patients, Gemma 2B achieved the highest strongly agree ratings for conciseness and comprehensibility, whereas Consensus obtained the highest score for organization; both models were rated similarly for usefulness. When overall positive evaluations (agree or strongly agree) were considered, Llama 3 and Mistral Large 2 performed well across groups but generated fewer strongly agree responses. Descriptive analyses demonstrated clear differences in perceived accuracy, completeness, and clarity across model architectures.

CONCLUSION: Perceived quality varied across models and user groups. Consensus was preferred by physicians, whereas patients more often favored Gemma. These differences underscore the importance of selecting models aligned with specific clinical communication tasks when deploying generative artificial intelligence in oncology.

PMID:41926718 | DOI:10.1200/CCI-25-00316

By Nevin Manimala
