Eur J Epidemiol. 2026 May 28. doi: 10.1007/s10654-026-01404-3. Online ahead of print.
ABSTRACT
Data harmonization is a prerequisite for joint cohort analyses. In this review, we aim to identify and contrast statistical methods for retrospective harmonization of longitudinal data. We performed a scoping review following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews guidelines. Studies were included if they described statistical methods for retrospectively harmonizing longitudinal data at the participant level. From 35 included papers out of 1,234 hits, we identified three types of statistical methods applicable to tabular data commonly collected in longitudinal epidemiological studies (e.g., questionnaires): (1) distribution-based methods, (2) the proportion score model, and (3) latent variable models. Our results suggest that the suitability of a statistical harmonization method mainly depends on the measurement scales of the original variables as well as on the type of target variable (directly measurable vs. latent). The chosen harmonization method influences how missing subsets of variables are addressed. None of the included studies applied more automated approaches such as machine learning-based procedures for deriving a harmonized dataset. Based on our findings, we present a roadmap that can guide researchers in selecting the most appropriate statistical method for a specific harmonization task and in handling variables collected only in a subset of studies. Data harmonization is still a demanding task that requires the development and application of novel tools for automating the procedures.
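As an illustration of the first category, a minimal sketch of one commonly used distribution-based approach, within-study z-score standardization, is given here. The abstract does not specify which distribution-based methods the review covers, so the choice of z-scores is an assumption, and the function name `zscore_within_study` and the `(study_id, value)` record layout are hypothetical.

```python
from statistics import mean, stdev


def zscore_within_study(records):
    """Standardize values within each study (illustrative sketch only).

    records: list of (study_id, value) tuples, one per observation.
    Returns a list of (study_id, z) tuples in the same order, where z is
    the value centered and scaled by that study's own mean and sample SD.
    """
    # Group raw values by study so each study gets its own mean and SD.
    by_study = {}
    for study, value in records:
        by_study.setdefault(study, []).append(value)

    stats = {}
    for study, values in by_study.items():
        if len(values) < 2:
            raise ValueError(f"study {study!r} needs at least two observations")
        sd = stdev(values)  # sample standard deviation
        if sd == 0:
            raise ValueError(f"study {study!r} has zero variance")
        stats[study] = (mean(values), sd)

    # Express every observation relative to its own study's distribution.
    return [(s, (x - stats[s][0]) / stats[s][1]) for s, x in records]


# Hypothetical example: the same construct recorded on two different scales.
records = [("A", 10.0), ("A", 20.0), ("A", 30.0),
           ("B", 1.0), ("B", 2.0), ("B", 3.0)]
harmonized = zscore_within_study(records)
```

Standardizing within study places both cohorts on a common unitless scale, but it assumes the underlying distributions are comparable across studies, which is not guaranteed when the study populations differ.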
PMID:42207414 | DOI:10.1007/s10654-026-01404-3