
Automated ICD-10-Anchored Classification of Primary Care Text Data: Development and Evaluation of a Custom Multilabel Classifier

JMIR Med Inform. 2026 Apr 6;14:e86533. doi: 10.2196/86533.

ABSTRACT

BACKGROUND: Electronic medical records are a vast and valuable source of information, useful for tasks such as estimating disease prevalence. However, in routine primary care, much of this information is recorded as free text rather than in structured form and is therefore not readily amenable to analysis. Manual coding of these textual data is both time-consuming and resource-intensive, making it impractical for large datasets. Although powerful open-source language models offer new opportunities for automated coding, their use on short, heterogeneous primary care notes, particularly in German-language settings, remains insufficiently studied.

OBJECTIVE: This study aims to demonstrate effective and accurate automatic classification of free-text notes using a language model fine-tuned for automated International Statistical Classification of Diseases, Tenth Revision (ICD-10) coding, and to provide hands-on guidance for applied health researchers.

METHODS: Building on the extensive Family Medicine Research Using Electronic Medical Records (FIRE) routine database from the Institute of Primary Care at the University Hospital Zurich and the University of Zurich, we trained a large language model-based multilabel classifier on a dataset of 38,728 free-text notes, which had been manually categorized into 47 classes using specific ICD-10 codes and code ranges or nondiagnostic/ad hoc labels (eg, “unclear diagnosis,” “status post”). We stratified the labeled data into training (70%), validation (15%), and posttraining test (15%) sets, ensuring similar label distributions across these sets. Using the Transformers Python library, we trained the model over 10 epochs and evaluated it on the posttraining test dataset.
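The abstract names the Transformers Python library, the multilabel setup, and the 10-epoch training, but not the base model or the exact code. The following is a minimal sketch of such a pipeline under stated assumptions: the base model (bert-base-german-cased, chosen here only because the notes are German-language), the batch size, the sequence length, the output path, and the placeholder data are all illustrative, not taken from the paper; only the epoch count and the multilabel objective reflect the abstract.

```python
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

NUM_LABELS = 47  # the 47 manually assigned classes described above
BASE_MODEL = "bert-base-german-cased"  # assumption; the abstract names no base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    BASE_MODEL,
    num_labels=NUM_LABELS,
    # Switches the loss to BCE-with-logits so each class is predicted independently
    problem_type="multi_label_classification",
)

class NotesDataset(Dataset):
    """Wraps free-text notes and multi-hot label vectors for the Trainer."""

    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=256)
        self.labels = labels  # list of length-NUM_LABELS 0/1 vectors

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        # BCEWithLogitsLoss expects float targets
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

# Placeholder data standing in for the FIRE notes; the real texts and labels
# come from the stratified 70/15/15 split described in the abstract.
texts_train = ["Husten und Fieber seit 3 Tagen", "Kontrolle Hypertonie"]
y_train = [[0] * NUM_LABELS for _ in texts_train]
texts_val, y_val = texts_train, y_train

args = TrainingArguments(
    output_dir="fire-classifier",    # hypothetical output path
    num_train_epochs=10,             # matches the 10 epochs reported
    per_device_train_batch_size=16,  # illustrative hyperparameter
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=NotesDataset(texts_train, y_train),
    eval_dataset=NotesDataset(texts_val, y_val),
)
trainer.train()
trainer.evaluate()  # loss on the validation split
```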

RESULTS: Across 48 classes, the FIRE classifier achieved strong performance on the held-out posttraining set, with F1-scores of 0.85 (micro, overall across all predictions), 0.86 (macro, mean of per-class scores treating classes equally), and 0.84 (weighted, per-class scores weighted by class frequency).
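The abstract does not state the evaluation tooling, but the three averaging schemes it reports correspond directly to scikit-learn's f1_score averages. In this sketch, y_true and y_pred are assumed to be (n_samples, n_classes) multi-hot arrays from the held-out posttraining set:

```python
from sklearn.metrics import f1_score

# y_true, y_pred: (n_samples, n_classes) arrays of 0/1 multilabel decisions
f1_micro = f1_score(y_true, y_pred, average="micro")        # pools all label decisions
f1_macro = f1_score(y_true, y_pred, average="macro")        # unweighted mean of per-class F1
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # per-class F1 weighted by support
```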

CONCLUSIONS: This study demonstrates steps for training open-source large language models and highlights the potential to streamline and scale the extraction of diagnostic information for practical applications. Our model can be robustly deployed, for example, for prescreening and labeling of free-text information, thus potentially reducing the burden of repetitive and error-prone manual handling.
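The prescreening use case sketched above typically reduces to thresholding the model's per-class sigmoid scores at inference time. The helper below is a hypothetical illustration of that step, not code from the paper; the 0.5 threshold in particular is an assumption.

```python
import torch

def prescreen(texts, model, tokenizer, threshold=0.5):
    """Hypothetical prescreening helper: returns one multi-hot prediction
    per note by thresholding independent per-class sigmoid scores."""
    enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    probs = torch.sigmoid(logits)  # multilabel: classes are not mutually exclusive
    return (probs >= threshold).int().tolist()
```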

PMID:41941723 | DOI:10.2196/86533
