JMIR Med Educ. 2026 Jun 16;12:e90064. doi: 10.2196/90064.
ABSTRACT
BACKGROUND: Feedback is essential for medical students’ learning during clinical clerkships, yet supervising physicians often struggle to provide meaningful written feedback because of time constraints. Large language models offer a promising approach to supplement human feedback, but how artificial intelligence (AI)-generated and human feedback differ in authentic clinical settings remains unclear, as most comparisons have been conducted in classroom or simulation contexts.
OBJECTIVE: The aim of the study is to examine how AI-generated feedback and supervisor-provided feedback differ when applied to medical students’ clinical clerkship logs, by identifying the distinct characteristics and complementary strengths of each feedback type.
METHODS: This cross-sectional convergent mixed methods study included 161 weekly clinical clerkship logs from 47 fifth- and sixth-year medical students across 12 clinical departments at Nagoya University, Japan (January-May 2024). Of 164 eligible logs, 3 were excluded because supervisors entered contact messages rather than substantive feedback. AI feedback was generated using GPT-4o. In total, 10 faculty physicians and 10 medical students evaluated both feedback types in blinded, randomized order using a validated 5-category rubric (criteria-based, clear direction, accuracy, prioritization, and supportive tone), followed by open-ended comments and source identification. Quantitative analyses (paired 2-tailed t tests, cumulative link mixed-effects models; α=.05 with Bonferroni correction) were complemented by qualitative thematic analysis and integrated using joint display analysis.
RESULTS: AI feedback was significantly longer than supervisor feedback (mean 382.02, SD 81.82 vs mean 98.87, SD 73.66 characters; Cohen d=2.84, 95% CI 2.50-3.19; P<.001). Cumulative link mixed-effects models showed that AI scored higher on criteria-based (odds ratio [OR] 11.81, 95% CI 7.64-18.27; P<.001) and clear direction (OR 6.61, 95% CI 4.35-10.06; P<.001), with no significant differences on accuracy (OR 1.35, 95% CI 0.91-2.00; P>.99), prioritization (OR 1.70, 95% CI 1.16-2.50; P=.10), or supportive tone (OR 1.34, 95% CI 0.87-2.06; P>.99). AI feedback showed greater consistency (variance ratio 3.9:1; Levene F(1,320)=73.20; P<.001). All 20 evaluators correctly identified feedback sources. Qualitative analysis revealed that AI provided structured, text-anchored feedback addressing rubric criteria, while supervisors offered experience-based feedback grounded in clinical context and professional expertise.
CONCLUSIONS: This study extends the comparison of AI-generated and supervisor feedback to an authentic clinical clerkship environment, moving beyond classroom and simulation settings examined in prior work. Through integrated mixed methods analysis, a key distinction emerged between text-anchored AI feedback, which systematically addresses written log content in alignment with rubric criteria, and experience-based supervisor feedback, which draws on clinical observation and professional judgment. AI consistently delivered structured feedback addressing gaps that arise when time-pressured supervisors provide brief comments, while supervisors contributed clinically grounded insights that AI cannot replicate. These complementary strengths suggest that AI feedback should supplement rather than replace supervisor feedback, and that hybrid models leveraging each type’s advantages warrant investigation in clinical education.
PMID:42302283 | DOI:10.2196/90064