Longitudinal Synthetic Data Generation by Artificial Intelligence to Accelerate Clinical and Translational Research in Breast Cancer

JCO Clin Cancer Inform. 2025 Nov;9:e2500033. doi: 10.1200/CCI-25-00033. Epub 2025 Nov 6.

ABSTRACT

PURPOSE: Real-world data (RWD) are critical for breast cancer (BC) research but are limited by privacy concerns, missing information, and data fragmentation. This study explores synthetic data (SD) generated through advanced generative models to address these challenges and create harmonized longitudinal data sets.

METHODS: A data set of 1052 patients with human epidermal growth factor receptor 2-positive and triple-negative BC from the Informatics for Integrating Biology and the Bedside (i2b2) platform was used. Advanced generative models, including generative adversarial networks (GANs), variational autoencoders (VAEs), and language models (LMs), were applied to generate synthetic longitudinal data sets replicating disease progression, treatment patterns, and clinical outcomes. The Synthetic Validation Framework (SAFE) powered by Train was used to evaluate the fidelity, utility, and privacy of the SD. SD were tested across three settings: (A) integration with i2b2 for privacy-preserving data sets; (B) multistate disease modeling to predict clinical outcomes; and (C) generation of synthetic control groups for clinical trials.

RESULTS: The synthetic data sets exhibited high fidelity (score 0.94) and ensured privacy, with temporal patterns validated through time-series analyses and Uniform Manifold Approximation and Projection (UMAP) embeddings. In setting A, SD accurately mirrored RWD on the i2b2 platform while maintaining privacy. In setting B, incorporating SD improved the predictive performance of a multistate disease progression model, increasing the C-index by up to 10%. In setting C, SD replicated the end points of the APT trial, demonstrating the feasibility of generating synthetic control arms that preserve the statistical properties of the real data set.
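
For readers unfamiliar with the C-index reported in setting B, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is an illustration only, not the study's evaluation code: the abstract does not say how the C-index was computed for the multistate model, and the function name, the toy follow-up times, event flags, and risk scores are all hypothetical. Pairs with tied follow-up times are skipped here as a simplification.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: the share of comparable patient pairs in which
    the patient with the earlier observed event has the higher risk score."""
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient a has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # Skip tied times (simplification) and pairs where the earlier
        # time is censored, since their true ordering is unknown.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0   # model ranks the earlier event as riskier
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up in months, 1 = event observed, 0 = censored.
demo_times = [5, 8, 12, 20]
demo_events = [1, 1, 0, 1]
demo_risk = [0.9, 0.4, 0.6, 0.1]
demo_c = concordance_index(demo_times, demo_events, demo_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a gain in C-index after adding SD to the training data means the model orders patients by risk more accurately.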

CONCLUSION: AI-generated longitudinal SD effectively address key challenges in RWD use in BC. This approach can improve translational research and clinical trial design while ensuring robust privacy protection. Integration with platforms such as i2b2 highlights their scalability and potential for broader applications in oncology.

PMID:41197110 | DOI:10.1200/CCI-25-00033

By Nevin Manimala
