Scalable Identification of Clinically Relevant Chronic Obstructive Pulmonary Disease Documents in Large-Scale Electronic Health Record Datasets With a Lightweight Natural Language Processing Model: Retrospective Cohort Study

JMIR Med Inform. 2026 May 12;14:e84326. doi: 10.2196/84326.

ABSTRACT

BACKGROUND: The widespread adoption of electronic health records has generated large volumes of clinical notes. Learning algorithms and large language models can be trained on these resources, but they are susceptible to noise, that is, irrelevant or noninformative data. This sensitivity can degrade performance and produce inaccurate predictions or “hallucinations.” This study addresses a critical challenge in clinical informatics: efficiently filtering millions of documents for relevance before advanced language model processing, particularly in resource-constrained environments.

OBJECTIVE: We present a novel framework for determining document relevance in clinical settings using a chronic obstructive pulmonary disease (COPD) dataset.

METHODS: We developed a framework that uses weak supervision and domain-expert heuristics to generate “silver standard” labels for training data, alongside gold standard expert-annotated labels, yielding 2 datasets: one for optimizing the model during the development phase and one for the subsequent testing phase. Various text representation techniques (bag of words, term frequency-inverse document frequency, lightweight document embeddings, compression-based features, and Unified Medical Language System concept extraction) were evaluated. These representations were used to train random forest, extreme gradient boosting, and k-nearest neighbor classifiers. Models were optimized on a small expert-annotated dataset and evaluated on a held-out test set.
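
As an illustration only, the weak-supervision step and one of the listed representation and classifier pairings (compression-based features with a k-nearest neighbor classifier) might be sketched as follows. The abstract does not give the heuristic rules or say how the compression-based features were computed, so the keyword list `COPD_TERMS`, the use of normalized compression distance (NCD) with `zlib`, the example notes, and all function names here are assumptions, not the authors' implementation.

```python
import zlib
from collections import Counter

# Hypothetical keyword heuristic standing in for the domain-expert rules;
# the actual rules are not stated in the abstract.
COPD_TERMS = ("copd", "emphysema", "chronic bronchitis", "spirometry", "fev1")


def silver_label(text: str) -> int:
    """Weak ("silver standard") label: 1 if any heuristic term occurs, else 0."""
    lowered = text.lower()
    return int(any(term in lowered for term in COPD_TERMS))


def compressed_size(text: str) -> int:
    """Length in bytes of the zlib-compressed UTF-8 encoding of the text."""
    return len(zlib.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance: smaller means the texts share more structure."""
    ca, cb = compressed_size(a), compressed_size(b)
    cab = compressed_size(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)


def knn_predict(query: str, train_texts, train_labels, k: int = 3) -> int:
    """Majority vote among the k training documents closest to the query under NCD."""
    ranked = sorted(zip(train_texts, train_labels), key=lambda pair: ncd(query, pair[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up example notes, labeled by the heuristic rather than by an expert.
train_texts = [
    "Patient with COPD exacerbation, started on bronchodilators and steroids.",
    "Spirometry shows reduced FEV1 consistent with obstructive disease.",
    "Left ankle sprain after a fall, advised rest, ice, and elevation.",
    "Routine dental cleaning, no cavities noted at this visit.",
]
train_labels = [silver_label(t) for t in train_texts]
```

In this sketch the silver labels cost nothing to produce, so the training set can grow to millions of notes, while the small expert-annotated set would be reserved for tuning choices such as `k` and for the final held-out evaluation.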

RESULTS: The combination of lightweight document embedding with a random forest classifier demonstrated the best performance, achieving a precision of 0.73, recall of 0.86, and F1-score of 0.80 (95% CI 0.76-0.87) for identifying relevant COPD documents. This significantly outperformed baseline heuristics (precision=0.70; recall=0.38; F1-score=0.50, 95% CI 0.43-0.56) and other tested methods.
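
For reference, precision, recall, and F1-score follow from the counts of true positives, false positives, and false negatives, with F1 = 2PR/(P + R). The sketch below computes them and attaches a percentile bootstrap interval to the F1-score. The abstract does not state how its 95% CIs were obtained, so the bootstrap procedure, the function names, and the toy labels are assumptions for illustration.

```python
import random


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for F1, resampling documents with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        resampled_pred = [y_pred[i] for i in idx]
        scores.append(precision_recall_f1(resampled_true, resampled_pred)[2])
    scores.sort()
    lower = scores[int((alpha / 2) * n_boot)]
    upper = scores[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy labels (not study data): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ci_low, ci_high = bootstrap_f1_ci(y_true, y_pred)
```

Comparing a classifier against the heuristic baseline on the same held-out documents, as the results above do, keeps the two F1-scores and their intervals directly comparable.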

CONCLUSIONS: Our study presents a novel framework for identifying COPD-relevant clinical documents using lightweight embedding and machine learning. This approach effectively filters pertinent documents, enhancing information retrieval precision. The framework’s scalability and minimal annotation needs make it promising for diverse health care applications, potentially optimizing clinical outcomes through efficient document selection for data-driven decision support systems.

PMID:42119137 | DOI:10.2196/84326

By Nevin Manimala