Sci Rep. 2026 Jun 11. doi: 10.1038/s41598-026-57682-0. Online ahead of print.
ABSTRACT
Automated ICD-10-CM coding is critical for hospital reimbursement under Diagnosis-Related Group (DRG) payment systems, yet standard metrics weight all errors equally. This study evaluated 11 models on MIMIC-IV under heterogeneous conditions (the full 7942-code space, top-50 self-trained baselines, and 200-admission zero-shot LLM samples) and proposed two revenue-sensitive metrics: the Revenue Sensitivity Index (RSI) and the Coding Reimbursement Score (CRS). Performance was compared across the US Medicare Severity DRG (MS-DRG) and Taiwan DRG (Tw-DRG) systems, and five human-AI review strategies were simulated. PLM-ICD achieved the highest micro-averaged F1 (0.5934), while open-source zero-shot LLMs performed markedly worse in this exploratory comparison. A 26.5% CRS gap separated the best and worst fine-tuned models. Model rankings were identical under both DRG schemes (Spearman ρ = 1.00); because Tw-DRG was represented by a tiered approximation (93.9% coverage) rather than the official grouper, this result indicates stability under that approximation only. At a 20% review rate, revenue-targeted prioritization achieved a 43.2% CRS reduction versus 20.0% for random sampling, reaching 91% of the oracle bound. Revenue-aware evaluation captures financially meaningful differences missed by standard metrics, and revenue-guided human-AI collaboration emerges as a candidate deployment framework requiring prospective validation.
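The comparison of review strategies at a fixed review rate can be illustrated with a minimal sketch. The abstract does not define the CRS or the five strategies, so everything below is an assumption made for illustration: `toy_losses` stands in for a per-admission revenue-weighted coding error, `toy_estimates` for a model-side estimate of that error, and a reviewed admission is assumed to be fully corrected. The function name `reduction_after_review` is hypothetical, and the numbers are toy values, not study data.

```python
from typing import Sequence


def reduction_after_review(
    losses: Sequence[float],
    priority: Sequence[float],
    review_rate: float,
) -> float:
    """Share of total loss removed when the highest-priority admissions are reviewed.

    Assumption: a reviewed admission is fully corrected, so its loss drops to zero.
    """
    if len(losses) != len(priority):
        raise ValueError("losses and priority must have the same length")
    total = sum(losses)
    if total == 0:
        return 0.0
    n_review = int(round(review_rate * len(losses)))
    # Rank admissions from highest to lowest priority score.
    order = sorted(range(len(losses)), key=lambda i: priority[i], reverse=True)
    return sum(losses[i] for i in order[:n_review]) / total


# Hypothetical per-admission revenue-weighted losses (illustrative, not study data).
toy_losses = [100.0, 5.0, 40.0, 0.0, 10.0, 25.0, 0.0, 60.0, 5.0, 5.0]
# Hypothetical model-side estimate of each admission's revenue impact.
toy_estimates = [80.0, 10.0, 70.0, 5.0, 5.0, 20.0, 0.0, 50.0, 5.0, 5.0]

REVIEW_RATE = 0.2
# Oracle: prioritize by the true loss, which is unknowable in deployment.
oracle = reduction_after_review(toy_losses, toy_losses, REVIEW_RATE)
# Revenue-targeted: prioritize by the estimated revenue impact.
targeted = reduction_after_review(toy_losses, toy_estimates, REVIEW_RATE)
# Under the full-correction assumption, uniform random review removes a share
# of the total loss equal to the review rate in expectation.
random_expected = REVIEW_RATE

print(f"oracle={oracle:.3f} targeted={targeted:.3f} random={random_expected:.3f}")
print(f"targeted as share of oracle: {targeted / oracle:.3f}")
```

Under these assumptions, the expected reduction from random sampling equals the review rate, which is consistent with the 20.0% reported for random sampling at a 20% review rate. A targeted strategy can exceed that figure only to the extent that its priority score tracks the true loss, and the oracle ordering is the upper limit.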
PMID:42277410 | DOI:10.1038/s41598-026-57682-0