The Utility of Large Language Models to Assist With Emergency Triage Decisions Within Otolaryngology

Otolaryngol Head Neck Surg. 2026 Jun 17. doi: 10.1002/ohn.70313. Online ahead of print.

ABSTRACT

OBJECTIVE: To determine whether contemporary large language models can match clinician performance in evaluating the urgency of emergency otolaryngology referrals.

STUDY DESIGN: Blinded cross-sectional diagnostic reasoning study.

SETTING: Simulated emergency referral environment modeled on tertiary care otolaryngology practice.

METHODS: Thirty emergency referral scenarios spanning the spectrum of otolaryngologic urgency were independently evaluated by 4 large language models (GPT-5, GPT-4, DeepSeek, and Grok) and 4 clinicians (otolaryngology attending and resident, emergency attending and resident). Outputs were anonymized and scored by 10 blinded otolaryngologists for appropriateness of urgency and quality of explanation using a three-point scale. Statistical analyses included nonparametric group comparisons, adjusted ordinary least squares modeling with case-level control, and correlation of each entity’s case profile with that of the otolaryngology attending.
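The profile correlation named in the methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code. The abstract does not say which correlation coefficient was used, so Spearman rank correlation is assumed here. The 1-to-3 coding of the three-point scale is also an assumption, and every function name and rating below is a hypothetical placeholder rather than study data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    if den == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return num / den


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def case_profile(ratings_by_case):
    """Collapse each case's rater scores into one mean score per case."""
    return [mean(scores) for scores in ratings_by_case]


if __name__ == "__main__":
    # Hypothetical ratings: 5 cases x 3 raters, coded 1 (worst) to 3 (best).
    attending = [[3, 3, 3], [3, 2, 3], [2, 2, 3], [3, 3, 2], [1, 2, 2]]
    model = [[3, 3, 2], [3, 3, 3], [2, 2, 2], [3, 2, 2], [1, 1, 2]]
    rho = spearman(case_profile(attending), case_profile(model))
    print(f"profile correlation with attending: {rho:.2f}")
```

Averaging across raters before correlating yields one value per case for each entity, so the coefficient reflects whether an entity does well or poorly on the same cases as the reference clinician, independent of its overall mean score.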

RESULTS: Inter-rater reliability was excellent. The otolaryngology attending achieved the highest overall performance. GPT-5 demonstrated comparable mean performance, with no statistically significant difference from the attending in either domain (appropriateness of urgency or quality of explanation). GPT-4 scored modestly lower but received higher mean ratings than both emergency clinicians. DeepSeek and the otolaryngology resident demonstrated intermediate performance, while Grok and the emergency clinicians received the lowest ratings. Group-level analyses showed no significant difference between the large language model and otolaryngology cohorts; both were rated higher than emergency clinicians in this sample.

CONCLUSION: GPT-5 demonstrated triage performance comparable to the otolaryngology attending in this controlled sample. Large language models may support emergency decision-making and education when specialist consultation is limited, but require supervision, transparency, and local calibration.

PMID:42307998 | DOI:10.1002/ohn.70313

By Nevin Manimala