BMC Med Inform Decis Mak. 2025 Dec 24. doi: 10.1186/s12911-025-03326-8. Online ahead of print.
ABSTRACT
BACKGROUND: Synthetic data generation (SDG) has emerged as a critical enabler for data-driven healthcare research, offering privacy-preserving alternatives to real patient data. Temporal health data – ranging from physiological signals to electronic health records (EHRs) – pose unique challenges for SDG due to their complexity, irregularity, and clinical sensitivity.
OBJECTIVE: This review systematically examines SDG methods for longitudinal and time-series health data. Its aims are to (1) propose a lightweight taxonomy that organizes the SDG landscape along five structural dimensions, (2) characterize the major synthesis techniques and their alignment with temporal structures and data modalities, and (3) synthesize the utility and privacy evaluation strategies used in practice.
METHODS: A systematic literature review was conducted following PRISMA guidelines across four major databases (ACM, arXiv, IEEE Xplore, Europe PMC) for publications from 2017 to 2025. Eligible studies proposed or applied SDG techniques to healthcare-relevant temporal data with sufficient methodological transparency. Structured data extraction and thematic analysis were used to identify modeling trends, evaluation metrics, and domain-specific requirements, complemented by a comparative synthesis of SDG methods.
RESULTS: A total of 115 studies were included. Deep generative models – especially Generative Adversarial Networks (GANs), Autoencoders (AEs), and diffusion-based methods – dominate the field, with increasing adoption of autoregressive and hybrid simulation approaches. Event-based EHR data are most commonly targeted, while continuous and irregular time series remain underexplored. Utility evaluations vary widely, with strong emphasis on descriptive statistics and predictive performance, but limited attention to inferential validity and clinical realism. Privacy assessments are sparse and inconsistently reported: only 30% of studies included any privacy metric, and only about 6% implemented differential privacy (DP), often without disclosing the parameters used. This limited adoption may reflect technical challenges, limited expertise, and the absence of regulatory incentives.
CONCLUSIONS: Synthetic temporal data play an increasingly vital role across clinical prediction, public health modeling, and Artificial Intelligence (AI) development. However, SDG research remains fragmented in terminology, evaluation practices, and privacy safeguards. Responsible-AI considerations – such as fairness, transparency, and trust – along with evidence on clinical adoption remain underexplored but are critical for future integration. This review provides a unified conceptual and methodological framework to guide future research, standardization efforts, and interdisciplinary collaboration for responsible, effective use of synthetic health data.
PMID:41444887 | DOI:10.1186/s12911-025-03326-8