NPJ Digit Med. 2026 May 2;9(1):347. doi: 10.1038/s41746-026-02662-x.
ABSTRACT
Access to large, diverse biomedical datasets is critical for advancing medical research, yet privacy regulations severely restrict data sharing. We present an end-to-end framework for privacy-preserving health data synthesis that integrates advanced deep generative models (DGMs) with robust preprocessing, formal differential privacy (DP) training for select DGMs, empirical privacy risk evaluation, data-sufficiency analysis, domain-guided quality control, and biobank visualization tools. Released as open-source containerized software, the framework ensures reproducible deployment while preserving statistical fidelity, machine learning (ML) utility, and privacy guarantees. Empirical evaluations across diverse biobank datasets demonstrate that TabSyn, a transformer-based diffusion model, combined with our correlation- and distribution-aware CorrDst loss function, achieves superior performance, balancing fidelity, privacy, and computational efficiency. The tailored preprocessing pipeline effectively handles high missingness rates, substantially improving distributional accuracy and clinical plausibility. Across 26 biobank datasets spanning three regulatory levels, TabSyn with the correlation- and distribution-aware loss function consistently achieves superior performance in terms of fidelity, privacy, and computational efficiency.
PMID:42069937 | DOI:10.1038/s41746-026-02662-x