BMC Med Educ. 2026 Apr 9. doi: 10.1186/s12909-026-09141-7. Online ahead of print.
ABSTRACT
BACKGROUND: Large language models (LLMs) such as ChatGPT have attracted growing attention for their potential to support medical education, including in specialties such as cardiology. However, the factual accuracy, guideline adherence, and pedagogical usefulness of AI-generated materials remain incompletely characterized.
OBJECTIVE: This study aimed to evaluate the factual accuracy, guideline alignment, and teaching utility of ChatGPT-generated responses on 15 core cardiology topics.
METHODS: Fifteen high-yield cardiology subjects (e.g., acute coronary syndrome, heart failure, valvular disease, arrhythmias) were selected from standard medical curricula. Each topic was queried with 10 standardized prompts (150 prompts in total), all entered into the free web version of ChatGPT (OpenAI; GPT-4o) under standardized default settings during October-November 2024. Two cardiologists independently rated each response across five domains (factual accuracy, completeness, clarity/readability, guideline alignment, and teaching utility) using a 5-point rubric. Representative examples of higher- and lower-performing responses were also reviewed qualitatively to illustrate recurrent strengths and common error patterns. Inter-rater reliability was assessed with Cohen's kappa (κ), and descriptive statistics summarized performance by domain and by topic.
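The abstract does not state how the statistics were computed. As a minimal, hypothetical sketch of the reported measures (Cohen's κ for inter-rater agreement and mean ± SD on a 5-point rubric), the Python example below uses scikit-learn's cohen_kappa_score on invented placeholder ratings, not the study's data.

```python
# Minimal sketch (not the authors' analysis code): Cohen's kappa and
# mean ± SD for two raters scoring responses on a 5-point rubric.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point ratings from the two cardiologists for one domain
# (e.g., factual accuracy) across ten responses; placeholder values only.
rater_a = np.array([4, 3, 4, 5, 3, 4, 4, 2, 3, 4])
rater_b = np.array([4, 3, 3, 5, 3, 4, 4, 3, 3, 4])

# Unweighted Cohen's kappa (the abstract reports κ = 0.78-0.85);
# a weighted variant (weights="quadratic") is common for ordinal rubrics.
kappa = cohen_kappa_score(rater_a, rater_b)

# Descriptive statistics as reported in the abstract (mean ± SD),
# pooling both raters' scores for the domain.
pooled = np.concatenate([rater_a, rater_b])
print(f"kappa = {kappa:.2f}; "
      f"mean ± SD = {pooled.mean():.1f} ± {pooled.std(ddof=1):.1f}")
```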
RESULTS: Clarity/readability was the highest-scoring domain overall (4.4 ± 0.5), while factual accuracy showed moderate performance (3.7 ± 0.6). Guideline alignment varied across the 15 topics, ranging from 4.2 for stable angina to 3.0 for complex pericardial diseases; heart failure with reduced ejection fraction (HFrEF) also showed lower guideline concordance (3.1 ± 0.6). Illustrative examples indicated that lower scores were most often driven by omission of newer therapies, incomplete discussion of advanced procedural indications, and overly generic explanations of complex subspecialty topics. Cohen's kappa values (κ = 0.78-0.85) indicated substantial to almost perfect inter-rater agreement.
DISCUSSION: ChatGPT provided coherent explanations suitable for introductory learning across most cardiology topics, but gaps were evident in rapidly evolving areas such as advanced heart failure and congenital anomalies. These findings underscore ChatGPT's value as a supplementary educational tool while highlighting the need for expert oversight to verify content accuracy and currency. Strategies for safely integrating AI-generated materials into medical education are discussed.
CONCLUSIONS: While ChatGPT can aid cardiology instruction through clear, concise overviews, reliance on it as a primary source risks propagation of incomplete or outdated information. Ongoing refinement of LLMs, coupled with consistent expert review, will be crucial for leveraging AI effectively in cardiology education.
PMID:41957617 | DOI:10.1186/s12909-026-09141-7