JAMA Netw Open. 2026 Jun 1;9(6):e2620939. doi: 10.1001/jamanetworkopen.2026.20939.
ABSTRACT
IMPORTANCE: Emergency department (ED) quality review often uses administrative electronic triggers (eTriggers), but their yield for detecting missed opportunities for diagnosis (MODs) is low. Commercial large language models (LLMs) may help screen for MODs, yet evaluation data in real-world cohorts remain limited.
OBJECTIVE: To evaluate LLMs for identifying MODs in ED eTrigger cohorts.
DESIGN, SETTING, AND PARTICIPANTS: This retrospective diagnostic study examined 2 eTrigger cohorts: (1) ED discharge with return hospital admission within 72 hours and (2) ED admission to the floor with intensive care unit (ICU) escalation within 24 hours. Encounters occurred from April 2015 through March 2025 across 9 EDs (2 academic and 7 community) in 1 US health system. Samples included 200 encounters from the 72-hour return cohort and 100 encounters from the floor-to-ICU cohort; each case was adjudicated by 2 emergency physicians using a review process based on the Safer Dx framework.
EXPOSURES: Cases were evaluated by Claude Sonnet 4, Claude Sonnet 4.6, Claude Opus 4.6, Gemini 3 Pro, GPT-5, and GPT-5 mini.
MAIN OUTCOMES AND MEASURES: Main outcomes were sensitivity, specificity, positive predictive value, negative predictive value, area under the receiver operating characteristic curve (AUC), and reviewer-reviewer and reviewer-model concordance.
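The four threshold-based measures named above follow from a 2x2 table of model flags against the physician-adjudicated reference. The sketch below is illustrative only: the `Confusion` class, the `screening_metrics` function, and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Confusion:
    """2x2 counts of model flag vs adjudicated MOD status (hypothetical container)."""
    tp: int  # model flagged, reviewers judged MOD
    fp: int  # model flagged, reviewers judged no MOD
    fn: int  # model did not flag, reviewers judged MOD
    tn: int  # model did not flag, reviewers judged no MOD


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a margin of the table is empty.
    return numerator / denominator if denominator else math.nan


def screening_metrics(c: Confusion) -> dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV from 2x2 counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


# Made-up counts for illustration; these are NOT study data.
example = Confusion(tp=8, fp=20, fn=2, tn=70)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a low MOD prevalence, as in eTrigger cohorts, PPV can stay modest even when sensitivity and specificity are both reasonably high, which is one reason all four measures are reported together.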
RESULTS: Among 300 sampled encounters, 12 were excluded, leaving 288 analyzed encounters (median [IQR] age, 69 [54-79] years; 135 female [46.9%]) with 39 MODs (13.5%), including 21 of 191 (11.0%) in the 72-hour return cohort and 18 of 97 (18.6%) in the floor-to-ICU cohort. Interrater agreement was 81.9% (95% CI, 77.4%-86.1%), with Gwet AC1 of 0.77 (95% CI, 0.70-0.83). In the 72-hour return cohort, model sensitivity ranged from 42.9% (95% CI, 24.5%-63.5%) for GPT-5 mini to 85.7% (95% CI, 65.4%-95.0%) for Claude Sonnet 4, specificity from 55.9% (95% CI, 48.4%-63.1%) for Claude Sonnet 4 to 82.9% (95% CI, 76.6%-87.9%) for GPT-5 mini, and AUC from 0.65 (95% CI, 0.53-0.77) for GPT-5 mini to 0.73 (95% CI, 0.61-0.85) for Claude Sonnet 4. In the floor-to-ICU cohort, sensitivity ranged from 5.6% (95% CI, 1.0%-25.8%) for GPT-5 mini to 55.6% (95% CI, 33.7%-75.4%) for Claude Sonnet 4, specificity from 64.6% (95% CI, 53.6%-74.2%) for Claude Sonnet 4 to 97.5% (95% CI, 91.2%-99.3%) for GPT-5 mini, and AUC from 0.57 (95% CI, 0.46-0.67) for GPT-5 mini to 0.82 (95% CI, 0.73-0.91) for GPT-5. Across cohorts, LLMs showed similar discrimination but different sensitivity-specificity tradeoffs; Claude Sonnet 4 generally favored higher sensitivity, whereas GPT-5 mini favored higher specificity.
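The wide sensitivity intervals above reflect the small number of MODs per cohort (21 and 18). The abstract does not state how the confidence intervals were computed; the sketch below assumes a Wilson score interval, which appears consistent with the reported sensitivity bounds. The counts passed in (for example, 9 of 21 for a sensitivity of 42.9%) are back-calculated from the reported percentages and denominators and are an assumption, not figures given in the source.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Back-calculated counts (assumption): detected MODs out of all MODs in the cohort.
assumed_counts = {
    "72-hour return, GPT-5 mini": (9, 21),
    "72-hour return, Claude Sonnet 4": (18, 21),
    "floor-to-ICU, GPT-5 mini": (1, 18),
}

intervals = {label: wilson_interval(k, n) for label, (k, n) in assumed_counts.items()}
for label, (low, high) in intervals.items():
    k, n = assumed_counts[label]
    print(f"{label}: {k}/{n} = {100 * k / n:.1f}% ({100 * low:.1f}%-{100 * high:.1f}%)")
```

Under these assumptions, a single additional detected or missed MOD shifts sensitivity by roughly 5 percentage points, so between-model differences in sensitivity within one cohort should be read with the interval widths in mind.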
CONCLUSIONS AND RELEVANCE: In this diagnostic study of 2 ED eTrigger cohorts, model performance varied by cohort, with LLMs showing similar discrimination but different binary thresholds. These findings suggest that evaluation within the review workflow is needed before implementation and that reviewer-like concordance captures a dimension of model behavior distinct from discrimination.
PMID:42371624 | DOI:10.1001/jamanetworkopen.2026.20939