A cross-validation statistical framework for asymmetric data integration

Biometrics. 2022 May 7. doi: 10.1111/biom.13685. Online ahead of print.

ABSTRACT

The proliferation of biobanks and large public clinical datasets enables their integration with a smaller amount of locally gathered data for the purposes of parameter estimation and model prediction. However, public datasets may be subject to context-dependent confounders and the protocols behind their generation are often opaque; naively integrating all external datasets equally can bias estimates and lead to spurious conclusions. Weighted data integration is a potential solution, but current methods still require subjective specification of weights and can become computationally intractable. Under the assumption that local data are generated from the set of unknown true parameters, we propose a novel weighted integration method that uses the external data to minimize the leave-one-out cross-validation (LOOCV) error on the local data. We demonstrate how the optimization of LOOCV errors for linear and Cox proportional hazards models can be rewritten as functions of external dataset integration weights. Significant reductions in estimation error and prediction error are shown using simulation studies mimicking the heterogeneity of clinical data as well as a real-world example using kidney transplant patients from the Scientific Registry of Transplant Recipients.
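
The idea of choosing external-dataset weights by minimizing the local LOOCV error can be illustrated for the linear model with a minimal sketch. This is not the authors' algorithm: the abstract says the paper rewrites the LOOCV error as a function of the integration weights, whereas the code below simply refits the model for each held-out local observation and searches a coarse weight grid. The function names (`weighted_ls`, `local_loocv_error`, `select_weights`), the grid, and the simulated data are all assumptions made for illustration.

```python
import itertools
import numpy as np


def weighted_ls(X_loc, y_loc, externals, weights):
    """Least squares in which local rows have weight 1 and every row of
    external dataset k has weight weights[k]."""
    A = X_loc.T @ X_loc
    b = X_loc.T @ y_loc
    for (X_k, y_k), w_k in zip(externals, weights):
        A = A + w_k * (X_k.T @ X_k)
        b = b + w_k * (X_k.T @ y_k)
    return np.linalg.solve(A, b)


def local_loocv_error(X_loc, y_loc, externals, weights):
    """Mean squared LOOCV error over the local observations only.
    External data are never held out; they only inform the fit."""
    n = X_loc.shape[0]
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i  # drop local observation i
        beta = weighted_ls(X_loc[keep], y_loc[keep], externals, weights)
        squared_errors.append((y_loc[i] - X_loc[i] @ beta) ** 2)
    return float(np.mean(squared_errors))


def select_weights(X_loc, y_loc, externals, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Exhaustive grid search (an assumed, naive optimizer) for the weight
    vector with the smallest local LOOCV error."""
    best_w, best_err = None, np.inf
    for w in itertools.product(grid, repeat=len(externals)):
        err = local_loocv_error(X_loc, y_loc, externals, w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err


# Simulated data (hypothetical): one external dataset shares the local
# coefficients, the other is generated from shifted coefficients.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0])

X_loc = rng.normal(size=(30, 2))
y_loc = X_loc @ beta_true + rng.normal(size=30)

X_good = rng.normal(size=(200, 2))
y_good = X_good @ beta_true + rng.normal(size=200)

X_bad = rng.normal(size=(200, 2))
y_bad = X_bad @ (beta_true + 5.0) + rng.normal(size=200)

externals = [(X_good, y_good), (X_bad, y_bad)]
best_w, best_err = select_weights(X_loc, y_loc, externals)
print("selected weights (good, shifted):", best_w)
print("local LOOCV error at selected weights:", best_err)
```

Because the all-zero weight vector is part of the grid, the selected weights can never do worse on local LOOCV error than ignoring the external data altogether. The brute-force refit costs one solve per local observation per grid point, which is the kind of expense a closed-form expression in the weights would avoid.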

PMID:35524490 | DOI:10.1111/biom.13685

By Nevin Manimala