Methods for Addressing Missingness in Electronic Health Record Data for Clinical Prediction Models: Comparative Evaluation

JMIR Med Inform. 2025 Nov 14;13:e79307. doi: 10.2196/79307.

ABSTRACT

BACKGROUND: Missing data are a common challenge in electronic health record (EHR)-based prediction modeling. Traditional imputation methods may not suit prediction or machine learning models, and real-world use requires workflows that are implementable for both model development and real-time prediction.

OBJECTIVE: We evaluated methods for handling missing data when using EHR data to build clinical prediction models for patients admitted to the pediatric intensive care unit (PICU).

METHODS: Using EHR data containing missing values from an academic medical center PICU, we generated a synthetic complete dataset. From this, we created 300 datasets with varying missingness mechanisms and proportions of missingness for 2 outcomes: (1) successful extubation (binary) and (2) blood pressure (continuous). We assessed strategies to address missing data, including simple methods (eg, last observation carried forward [LOCF]), complex methods (eg, random forest multiple imputation), and native support for missing values in outcome prediction models.
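As a minimal illustration of two of the simple strategies named above, the sketch below masks values in a complete series and then fills them with LOCF or mean imputation. It is not the study's pipeline. The function names, the `fallback` argument for leading gaps, and the use of a missing-completely-at-random mask are assumptions made for the example.

```python
import random

def mask_mcar(values, proportion, seed=0):
    """Replace a random share of values with None (missing completely at random).

    Assumption: this uniform mask stands in for one possible missingness
    mechanism; the abstract does not specify how missingness was induced.
    """
    rng = random.Random(seed)
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def locf(values, fallback):
    """Last observation carried forward.

    Gaps before the first observed value take `fallback` (a hypothetical
    choice, e.g. a population mean), since there is nothing to carry forward.
    """
    filled = []
    last = fallback
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def mean_impute(values):
    """Fill every gap with the mean of the observed values in the series."""
    observed = [v for v in values if v is not None]
    mean_value = sum(observed) / len(observed)
    return [mean_value if v is None else v for v in values]

# Hypothetical repeated measurements for one patient, in time order.
series = [None, 80.0, None, None, 90.0, None]
print(locf(series, fallback=75.0))
print(mean_impute(series))
```

In a real workflow LOCF would be applied within each patient's time-ordered record so that values never carry across patients. Because it uses only past observations, the same rule can be applied at prediction time.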

RESULTS: Across 886 patients and 1220 intubation events, 18.2% of original data were missing. LOCF had the lowest imputation error, followed by random forest imputation (average mean squared error [MSE] improvement over mean imputation: 0.41 [range: 0.30, 0.50] and 0.33 [0.21, 0.43], respectively). LOCF generally outperformed other imputation methods across outcome metrics and models (mean improvement: 1.28% [range: -0.07%, 7.2%]). Imputation methods showed more performance variability for the binary outcome (balanced accuracy coefficient of variation: 0.042) than for the continuous outcome (MSE coefficient of variation: 0.001).
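The two summary quantities reported above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. It assumes that "MSE improvement over mean imputation" is the mean-imputation MSE minus the method's MSE, that MSE is computed only over the masked positions, and that the coefficient of variation uses the population standard deviation. The abstract confirms none of these details.

```python
from statistics import mean, pstdev

def mse(truth, imputed, missing_idx):
    """Mean squared error over the positions that were masked and imputed."""
    return mean((truth[i] - imputed[i]) ** 2 for i in missing_idx)

def mse_improvement(truth, mean_imputed, method_imputed, missing_idx):
    """Assumed definition: baseline (mean imputation) MSE minus method MSE.

    Positive values mean the method imputes more accurately than the baseline.
    """
    return mse(truth, mean_imputed, missing_idx) - mse(truth, method_imputed, missing_idx)

def coefficient_of_variation(scores):
    """Standard deviation divided by the mean (population SD assumed)."""
    return pstdev(scores) / mean(scores)

# Hypothetical values: a complete series with positions 1 and 3 masked.
truth = [70.0, 72.0, 74.0, 76.0]
missing_idx = [1, 3]
locf_filled = [70.0, 70.0, 74.0, 74.0]   # gaps take the previous value
mean_filled = [70.0, 72.0, 74.0, 72.0]   # gaps take the observed mean (72)
print(mse_improvement(truth, mean_filled, locf_filled, missing_idx))
```

A coefficient of variation near zero, as reported for the continuous outcome, means that the choice of imputation method barely changed the performance metric.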

CONCLUSIONS: Traditional imputation methods for inferential statistics, such as multiple imputation, may not be optimal for prediction models. The amount of missingness influenced performance more than the missingness mechanism. In datasets with frequent measurements, LOCF and native support for missing values in machine learning models offer reasonable performance for handling missingness at minimal computational cost in predictive analyses.
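"Native support for missing values" can be pictured with a single decision split. The abstract does not say which models were used, so this toy sketch only shows one common approach in tree-based learners: each split learns a default branch for missing inputs, so no imputation step is needed. All names and values are hypothetical.

```python
def stump_predict(x, threshold, left_value, right_value, missing_goes_left):
    """Predict with one split; a missing x (None) follows the default branch."""
    if x is None:
        return left_value if missing_goes_left else right_value
    return left_value if x <= threshold else right_value

def choose_missing_direction(xs, ys, threshold, left_value, right_value):
    """Return True if sending missing values left gives lower or equal squared error.

    This mimics, in miniature, how a tree learner can pick the default
    direction for missing inputs from the training data.
    """
    def sse(goes_left):
        return sum(
            (y - stump_predict(x, threshold, left_value, right_value, goes_left)) ** 2
            for x, y in zip(xs, ys)
        )
    return sse(True) <= sse(False)

# Hypothetical training data where missing inputs co-occur with outcome 1.
xs = [1.0, None, 5.0, None]
ys = [0.0, 1.0, 1.0, 1.0]
goes_left = choose_missing_direction(xs, ys, threshold=3.0, left_value=0.0, right_value=1.0)
print(goes_left)
```

The learned default branch is stored with the model, so a record with a gap can be scored directly at prediction time. This is consistent with the minimal computational cost noted above.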

PMID:41237368 | DOI:10.2196/79307

By Nevin Manimala