
Privacy-enhancing sequential learning under heterogeneous selection bias in multi-site electronic health records data

J Am Med Inform Assoc. 2026 Jun 16:ocag083. doi: 10.1093/jamia/ocag083. Online ahead of print.

ABSTRACT

OBJECTIVES: To develop privacy-enhancing statistical methods for estimating disease risk parameters across multiple electronic health record (EHR) sites with heterogeneous selection mechanisms, avoiding individual-level data sharing. We illustrate their utility via a cross-biobank analysis of smoking and 97 cancer subtypes using NIH All of Us (AOU) and Michigan Genomics Initiative (MGI) data sites.

MATERIALS AND METHODS: Patient privacy protections on distributed health platforms often make centralized algorithms infeasible. We propose Sequential Pseudo-Likelihood (SPL) and Sequential Augmented Inverse Probability Weighting (SAIPW) to adjust for selection bias using summary statistics shared across sites and external population information. SAIPW employs flexible auxiliary models for multiple robustness. We compared SPL and SAIPW against unweighted and centralized/meta-learning benchmarks in simulations, and applied them to harmonized MGI (n = 50 935) and AOU (n = 241 563) data.
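
The general idea of adjusting for selection bias while sharing only summary statistics can be illustrated with a minimal sketch. This is not the SPL or SAIPW estimator from the paper, whose details the abstract does not give. It is a plain inverse-probability-weighted logistic regression in which each site computes a weighted score vector and information matrix locally, and only those two summaries are passed along. The function names, the simulated data, and the use of known selection probabilities as weights are all assumptions made for illustration; in practice the weights would have to be estimated, for example from external population information.

```python
import numpy as np

def site_summary(X, y, w, beta):
    """Weighted logistic score and information, computed locally at one site."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    score = X.T @ (w * (y - p))
    info = (X * (w * p * (1.0 - p))[:, None]).T @ X
    return score, info

def distributed_ipw_logistic(sites, n_iter=50, tol=1e-10):
    """Newton iterations that use only per-site summaries (hypothetical helper).

    `sites` is a list of (X, y, w) tuples. Sites are visited in sequence and
    only the (score, info) pair leaves a site, never individual-level rows.
    """
    d = sites[0][0].shape[1]
    beta = np.zeros(d)
    for _ in range(n_iter):
        score = np.zeros(d)
        info = np.zeros((d, d))
        for X, y, w in sites:
            s, i = site_summary(X, y, w, beta)
            score += s
            info += i
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def simulate_site(rng, n, sel_coef):
    """Simulate one site whose selection depends on exposure and outcome."""
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    p_y = 1.0 / (1.0 + np.exp(-(-1.0 + 0.5 * x)))
    y = rng.binomial(1, p_y)
    # Assumption: the selection probabilities are known exactly here.
    pi = 1.0 / (1.0 + np.exp(-(sel_coef[0] + sel_coef[1] * x + sel_coef[2] * y)))
    keep = rng.random(n) < pi
    return X[keep], y[keep], 1.0 / pi[keep]

rng = np.random.default_rng(0)
# Two sites with different (heterogeneous) selection mechanisms.
sites = [
    simulate_site(rng, 20000, (-1.0, 0.5, 1.0)),
    simulate_site(rng, 20000, (0.0, -0.5, 0.5)),
]

beta_dist = distributed_ipw_logistic(sites)

# Centralized benchmark: pool all rows and fit the same weighted model.
X_all = np.vstack([s[0] for s in sites])
y_all = np.concatenate([s[1] for s in sites])
w_all = np.concatenate([s[2] for s in sites])
beta_pooled = distributed_ipw_logistic([(X_all, y_all, w_all)])

print("distributed:", beta_dist)
print("pooled:     ", beta_pooled)
```

Because the weighted score and information are sums over individuals, adding up the per-site summaries reproduces the pooled fit up to floating-point error. That is why, in this simplified setting, avoiding individual-level data sharing costs no efficiency relative to the centralized fit.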

RESULTS: Unweighted estimators exhibited substantial bias. SPL and SAIPW yielded unbiased estimates with valid coverage, with SAIPW remaining robust to selection model misspecification. Both approaches showed negligible efficiency loss relative to centralized methods. Meta-learning methods proved unstable for rare outcomes. Real-data analyses consistently identified strong associations between smoking and lung, bladder, and larynx cancers.

DISCUSSION: These findings highlight the necessity of adjusting for site-specific selection biases in distributed health networks. SPL and SAIPW offer practical, scalable solutions that bypass the instability of meta-analysis for rare events, successfully harmonizing diverse biobanks while enhancing patient privacy.

CONCLUSION: Our framework enables valid, privacy-enhancing inference across EHR sites subject to heterogeneous selection, facilitating scalable, distributed research using real-world data.

PMID:42298300 | DOI:10.1093/jamia/ocag083

By Nevin Manimala
