BMC Med Educ. 2026 Feb 24. doi: 10.1186/s12909-026-08821-8. Online ahead of print.
ABSTRACT
BACKGROUND: In undergraduate medical education, the ability to manage clinical cases is a core competency expected of future physicians. Traditionally, this skill is developed through repeated exposure to real patient encounters in clinical settings. However, growing patient safety concerns, limited clinical opportunities, and faculty workload constraints have made it increasingly difficult for students to obtain sufficient clinical practice. As a result, innovative solutions such as AI-based simulations are being explored to supplement clinical training. Among these, large language models (LLMs) offer promising potential for generating diverse, interactive, and context-specific clinical scenarios that can support competency-based education. This study aims to evaluate and compare the effectiveness and educational utility of four widely used and accessible LLMs (ChatGPT-4o, Claude 3.7 Sonnet, Gemini 1.5, and DeepSeek Chat) in generating clinical scenarios for Turkish undergraduate medical education, and to identify the model that produces the most accurate, understandable, and pedagogically appropriate content aligned with national medical education standards.
METHODS: A convergent parallel mixed-methods design was employed. Using standardized prompts based on Türkiye's National Core Undergraduate Medical Education Program-2020, each LLM generated scenarios on three common infectious diseases. Twenty-five senior medical students and five expert clinicians evaluated the Turkish-language scenarios using structured rating forms and open-ended feedback. Quantitative data were analyzed with Friedman and Wilcoxon tests; qualitative data underwent thematic analysis.
RESULTS: Claude received the highest ratings for clarity, realism, and support for clinical reasoning. Statistically significant differences favored Claude over Gemini and DeepSeek (p < 0.05). Qualitative feedback supported these results, highlighting Claude's educational value and linguistic precision. ChatGPT performed moderately, while Gemini and DeepSeek exhibited issues with realism and coherence.
CONCLUSIONS: In this study, Claude was rated highest for generating Turkish-language scenarios perceived as clinically appropriate and pedagogically useful for undergraduate medical education in Türkiye. Overall, the findings provide preliminary evidence on perceived scenario quality across models and support further multicenter, outcomes-focused studies to evaluate feasibility, implementation, and educational impact in diverse settings. Future research should also examine how LLM-generated scenarios can serve as supplementary materials in simulation-based learning.
PMID:41736016 | DOI:10.1186/s12909-026-08821-8