
Evaluation of large language models in cardiovascular surgery: a comparative study of board-level clinical question answering and generation

J Cardiothorac Surg. 2026 May 7. doi: 10.1186/s13019-026-04251-1. Online ahead of print.

ABSTRACT

BACKGROUND: Large language models (LLMs) are increasingly being explored for surgical training and clinical knowledge assessment. Although these models have performed well on standardized examinations, their accuracy in highly specialized fields such as cardiovascular surgery remains insufficiently investigated. This study aimed to evaluate how well current LLMs answer and generate board-level cardiovascular surgery questions reflecting guideline-based clinical reasoning.

METHODS: In this cross-sectional evaluation study, three LLMs (ChatGPT-5.1, Gemini 3, and DeepSeek v3.2) were evaluated in two stages. In the first stage, the models answered 150 multiple-choice questions that had been developed and validated through a Delphi process by five cardiovascular surgery specialists and designed to reflect the content scope and difficulty level of the American Board of Thoracic Surgery certification examination. Accuracy rates were compared pairwise using the McNemar test. In the second stage, model-generated questions were rated by expert cardiovascular surgeons for medical accuracy, clinical relevance, exam-level appropriateness, error type, and difficulty level. Statistical analyses included Spearman correlation, the Wilcoxon signed-rank test, and chi-square analysis.
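As an illustration of the first-stage analysis, the sketch below shows how a pairwise McNemar comparison of two models' accuracy might be run in Python. The response vectors are simulated placeholders drawn at the reported accuracy rates, not the study's data.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar

    # Simulated per-question outcomes (1 = correct, 0 = incorrect) for two
    # models over 150 items; placeholders only, not the published responses.
    rng = np.random.default_rng(42)
    chatgpt = rng.binomial(1, 0.807, 150)
    gemini = rng.binomial(1, 0.787, 150)

    # Build the 2x2 agreement table; McNemar tests only the discordant cells,
    # i.e., questions where exactly one of the two models was correct.
    table = np.zeros((2, 2), dtype=int)
    for a, b in zip(chatgpt, gemini):
        table[a, b] += 1

    result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
    print(f"McNemar statistic = {result.statistic}, p = {result.pvalue:.3f}")

The exact variant is the usual choice here because the number of discordant pairs in a 150-item set is often small.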

RESULTS: The models demonstrated comparable accuracy rates (ChatGPT 80.7%; Gemini 78.7%; DeepSeek 82.0%), with no statistically significant differences among them. Question difficulty level was not associated with model accuracy. Error distribution differed significantly among the models (χ² = 8.1; p = 0.02), with Gemini demonstrating the highest rate of valid question generation and DeepSeek showing a higher rate of major errors. A significant positive correlation was observed between model- and expert-assigned difficulty levels.
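A minimal sketch of the second-stage analyses follows: a chi-square test on error-type counts and a Spearman correlation between model- and expert-assigned difficulty. All counts and ratings below are invented for illustration, since the abstract reports only the summary statistics (χ² = 8.1, p = 0.02).

    import numpy as np
    from scipy.stats import chi2_contingency, spearmanr

    # Hypothetical error-type counts per model
    # (columns: valid / minor error / major error).
    counts = np.array([
        [40, 7, 3],   # ChatGPT
        [44, 4, 2],   # Gemini
        [36, 6, 8],   # DeepSeek
    ])
    chi2, p, dof, _ = chi2_contingency(counts)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.3f}")

    # Hypothetical 1-5 difficulty ratings for the same generated questions,
    # one from the generating model and one from the expert panel.
    model_difficulty = [2, 3, 4, 1, 5, 3, 2, 4, 5, 1]
    expert_difficulty = [2, 4, 4, 2, 5, 3, 1, 4, 4, 2]
    rho, p_rho = spearmanr(model_difficulty, expert_difficulty)
    print(f"Spearman rho = {rho:.2f}, p = {p_rho:.3f}")

Spearman's rank correlation suits this comparison because difficulty ratings are ordinal, so only their ordering, not their spacing, should drive the test.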

CONCLUSIONS: Current large language models demonstrate strong performance in board-level cardiovascular surgery knowledge assessment. However, the presence of major errors and variability in difficulty calibration, together with known limitations in clinical reasoning, indicate that these systems should be used cautiously as supportive tools in surgical training and knowledge assessment rather than as substitutes for clinical decision-making.

PMID:42098878 | DOI:10.1186/s13019-026-04251-1
