Int J Qual Health Care. 2025 Dec 24:mzaf134. doi: 10.1093/intqhc/mzaf134. Online ahead of print.
ABSTRACT
BACKGROUND: As large language models (LLMs) are increasingly integrated into healthcare delivery, the absence of adequate quality-assurance frameworks raises patient safety concerns. This study aimed to develop risk-stratified quality indicators for AI-assisted medical consultations and to assess patient safety implications across varying clinical scenarios.
METHODS: Two hundred clinical scenarios representing different risk levels across four medical complexity categories (50 scenarios per category) were assessed against established quality and safety frameworks. Twelve healthcare specialists independently rated Claude Sonnet 4.0 responses using validated quality indicators. This single-model evaluation provides baseline performance metrics and a methodological framework for future comparative studies involving multiple LLM systems.
RESULTS: Performance varied significantly across clinical complexity categories. Routine care scenarios achieved excellent performance (89.2%, 95% CI: 86.1-92.3%), whereas acute care situations showed concerning limitations (65.4%, 95% CI: 61.8-69.0%). Inter-rater reliability was excellent across all domains (ICC: 0.85-0.91). Statistical analysis revealed significant between-group differences (F(3,196) = 47.3, p < 0.001), suggesting that AI performance varies systematically with clinical complexity.
CONCLUSION: These preliminary findings suggest that AI-assisted healthcare consultations may benefit from risk-stratified deployment strategies, pending validation in larger studies. The results inform the development of safety frameworks and quality indicators essential for maintaining high standards of care while leveraging the benefits of AI.
PMID:41442171 | DOI:10.1093/intqhc/mzaf134