AI-enabled clinical decision support in breast cancer care: a blinded multicenter benchmarking study comparing medically specialized with a general-purpose system

J Med Syst. 2026 Jul 4;50(1):107. doi: 10.1007/s10916-026-02434-w.

ABSTRACT

Medically specialized AI systems that have obtained regulatory clearance as medical devices can be deployed for patient-specific clinical decision support under defined compliance requirements. However, it remains unclear whether medical specialization and regulatory status translate into higher-quality breast cancer treatment recommendations than those produced by a general-purpose large language model (LLM). This blinded, multicenter study compared the performance of two medically specialized AI systems with a general-purpose model in breast cancer care. Two medically specialized systems (Prof. Valmed and OpenEvidence) and one general-purpose system (ChatGPT-5 Thinking) were prompted to generate treatment plans for 20 standardized breast cancer patient cases. Outputs were rated and ranked by blinded, board-certified breast cancer specialists from seven university breast cancer centers for safety, guideline adherence, medical adequacy, completeness, overall quality, and logical coherence. Statistical analyses comprised descriptive statistics, inter-rater reliability assessment, non-parametric performance comparisons of rating and ranking outcomes, and correlation analyses. Mean (± standard deviation) processing time for ChatGPT-5 Thinking (159 ± 58 s) was more than fourfold longer than that of Prof. Valmed (35 ± 4 s) and OpenEvidence (9 ± 1 s). ChatGPT-5 Thinking achieved significantly higher ratings across all evaluation categories, with no significant differences between the two medically specialized systems. Treatment plans generated by ChatGPT-5 Thinking were ranked as the top choice in 96.4% of rater-case combinations, compared with 3.6% for OpenEvidence, while Prof. Valmed was never ranked first. In this blinded, multicenter evaluation, a general-purpose LLM outperformed two medically specialized, retrieval-augmented systems in generating breast cancer treatment plans across all assessed categories.
These results indicate that while regulatory clearance and domain specialization address key requirements for AI-enabled clinical decision support systems, these factors alone do not translate into superior performance in breast cancer care. At present, medically specialized systems may be best used as supportive tools under expert oversight, while further optimization and real-world validation are needed. Clinical trial number: Not applicable.
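
The processing-time comparison reported in the abstract can be checked with simple arithmetic. The following is a minimal Python sketch, not part of the study: it uses only the mean values quoted above, assumes all three are in seconds, and the names `mean_seconds` and `slowdown_ratio` are hypothetical, chosen for illustration.

```python
# Mean processing times as quoted in the abstract.
# Assumption: all three values are in seconds.
mean_seconds = {
    "ChatGPT-5 Thinking": 159,
    "Prof. Valmed": 35,
    "OpenEvidence": 9,
}


def slowdown_ratio(system: str, baseline: str) -> float:
    """Return the ratio of mean processing times (system / baseline)."""
    return mean_seconds[system] / mean_seconds[baseline]


# Compare the general-purpose model against each specialized system.
for baseline in ("Prof. Valmed", "OpenEvidence"):
    ratio = slowdown_ratio("ChatGPT-5 Thinking", baseline)
    print(f"ChatGPT-5 Thinking vs {baseline}: {ratio:.1f}x")
```

Under these assumptions the ratios come out to roughly 4.5 relative to Prof. Valmed and roughly 17.7 relative to OpenEvidence, so "more than fourfold" is the conservative bound that holds for both comparisons. The ratios are computed from the means alone and ignore the reported standard deviations.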

PMID:42400697 | DOI:10.1007/s10916-026-02434-w

By Nevin Manimala