Performance of large language models in mitral valve surgery patient education: a comparative analysis

BMC Med Inform Decis Mak. 2026 Jun 29. doi: 10.1186/s12911-026-03662-3. Online ahead of print.

ABSTRACT

BACKGROUND: Large language models (LLMs), a form of artificial intelligence, are increasingly being utilized in healthcare to support patient education and information delivery. The aim of this study was to perform a comparative analysis of five different LLMs (i.e., ChatGPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, DeepSeek-V3, and Microsoft Copilot) in terms of accuracy, completeness, and readability, based on their responses to frequently asked questions in preoperative patient education for mitral valve surgery (MVS).

METHODS: A standardized questionnaire comprising seven questions frequently asked by patients before MVS was developed. Prompting procedures and model parameters were fully reported to support reproducibility. The questions were presented to each LLM in an identical manner. Two academic experts in cardiac surgery evaluated the responses using structured assessment criteria across three main dimensions: accuracy, completeness, and readability. Readability was analyzed with the Simplified Measure of Gobbledygook (SMOG) Index and the Flesch-Kincaid Grade Level (FKGL).
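
For readers unfamiliar with the two readability measures, the following is a minimal sketch of how SMOG and FKGL are conventionally computed from sentence, word, and syllable counts. The abstract does not state which tool the authors used, so this is an illustration under stated assumptions and not their method: the function names and the vowel-group syllable counter are hypothetical, and the counter is only a rough heuristic, so scores will differ somewhat from those of dedicated readability software. Both formulas return an approximate U.S. school grade level, and SMOG was originally designed for samples of about 30 sentences.

```python
import math
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption, not a standard): count vowel groups
    and drop one for a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog(polysyllables: int, sentences: int) -> float:
    """SMOG Index; polysyllables are words with three or more syllables,
    normalized to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291


def readability(text: str) -> dict:
    """Compute both scores for a block of text using the heuristic counter."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = [count_syllables(w) for w in words]
    polysyllables = sum(1 for n in syllables if n >= 3)
    return {
        "fkgl": fkgl(len(words), len(sentences), sum(syllables)),
        "smog": smog(polysyllables, len(sentences)),
    }


if __name__ == "__main__":
    # Hypothetical sample text, not taken from any model response in the study.
    sample = (
        "Your surgeon will repair or replace the mitral valve. "
        "Most patients stay in the hospital for about one week."
    )
    print(readability(sample))
```

Lower values on either scale indicate text that is easier to read, which is why the lower SMOG and FKGL scores reported in the results are interpreted as better readability.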

RESULTS: The ChatGPT-4o and Gemini 2.5 Pro Preview models received statistically significantly higher scores than Claude 3.7 Sonnet and Microsoft Copilot for both accuracy (median 5 for ChatGPT-4o and Gemini 2.5 Pro Preview vs. 4 for Claude 3.7 Sonnet and Microsoft Copilot, p < 0.001) and completeness (median 5 for Gemini 2.5 Pro Preview vs. 3 for Claude 3.7 Sonnet, p < 0.001). Claude 3.7 Sonnet achieved the highest readability scores, with significantly lower SMOG (10.90 for Claude 3.7 Sonnet vs. 12.24 for ChatGPT-4o, p = 0.006) and FKGL (8.0 for Claude 3.7 Sonnet vs. 9.04 for ChatGPT-4o, p = 0.004) scores, indicating simpler and more comprehensible sentence structures. Significant differences were observed among the evaluated models across all three assessment dimensions (p < 0.001 for all comparisons).

CONCLUSIONS: LLMs are valuable supplementary tools for patient education. However, their implementation in clinical practice must be carefully evaluated, particularly with regard to accuracy and completeness. This study highlights the potential applicability of the ChatGPT-4o and Claude 3.7 Sonnet models for preoperative patient education in MVS, while emphasizing that all LLMs should be used under the supervision and guidance of healthcare professionals. For LLMs to be used reliably in the medical field, improvements in medical accuracy and standardization are essential.

PMID: 42374405 | DOI: 10.1186/s12911-026-03662-3

By Nevin Manimala