Enhancing Early Prediction of Gestational Diabetes Mellitus Through Data Augmentation and Feature Guidance: Model Development and Validation Study

JMIR Med Inform. 2026 May 25;14:e85335. doi: 10.2196/85335.

ABSTRACT

BACKGROUND: Early prediction of gestational diabetes mellitus (GDM) is critical for improving maternal health outcomes. However, predictive models are often challenged by limited early-pregnancy samples, severe class imbalance in datasets, and complex interrelationships among clinical features.

OBJECTIVE: This study aimed to develop and evaluate a unified dual-dimensional enhancement framework integrating data augmentation and feature engineering. By addressing data imbalance and leveraging medical prior knowledge, the framework is intended to improve early GDM prediction performance.

METHODS: We proposed a framework combining Generative Adversarial Network (GAN)-based data augmentation with large language model-inspired feature engineering. GAN sampling was used to generate clinically plausible synthetic minority class samples to mitigate data imbalance. The large language model was guided to organize features into domains (eg, basic demographics, metabolic syndrome, and core liver biomarkers) and generate higher-order composite features, integrating medical prior knowledge. Machine learning models were subsequently developed, and interpretability analyses were performed using Shapley additive explanations to identify key predictors.
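
As an illustration of the higher-order composite features described above, the following minimal Python sketch computes the one composite named in the Results, (fasting blood glucose+triglycerides)×prepregnancy BMI. The function names, the record field names, and the units (mmol/L and kg/m²) are assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
# Hypothetical helper names and field names; units are assumed (mmol/L, kg/m^2).

def composite_glucose_lipid_bmi(fbg, tg, bmi):
    """Return (fasting blood glucose + triglycerides) x prepregnancy BMI."""
    return (fbg + tg) * bmi


def add_composite(records):
    """Return copies of the input records with the composite feature added."""
    enriched_records = []
    for record in records:
        enriched = dict(record)  # copy so the caller's data is left unchanged
        enriched["fbg_tg_x_bmi"] = composite_glucose_lipid_bmi(
            record["fbg"], record["tg"], record["bmi"]
        )
        enriched_records.append(enriched)
    return enriched_records


# Made-up values for a single participant, for illustration only.
example = add_composite([{"fbg": 5.0, "tg": 1.5, "bmi": 22.0}])
```

A composite of this kind lets a tree-based model split on a single value that rises when glucose, lipids, and body mass are elevated together, rather than having to learn that interaction from separate splits.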

RESULTS: This study used a final analytical cohort of 8214 pregnant women, divided into dataset A comprising 966 out of 5251 (18.4%) participants with GDM, and dataset B comprising 598 out of 2963 (20.2%) participants with GDM. The random forest model enhanced by Tabular Variational Autoencoder-based feature augmentation demonstrated the best performance. On the test dataset, it achieved a recall of 0.7559, an accuracy of 0.8444, and an area under the receiver operating characteristic curve (AUROC) of 0.8873. Statistical evaluation confirmed that the Tabular Variational Autoencoder method significantly outperformed the baseline (Cohen d=2.894; P<.001) and the Conditional Tabular Generative Adversarial Network method (Cohen d=1.637; P=.02) in recall enhancement. Shapley additive explanations analysis identified the following 5 features as the most influential predictors: fasting blood glucose, the composite feature (fasting blood glucose+triglycerides)×prepregnancy BMI, activated partial thromboplastin time, leukocyte count, and neutrophil count.
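
The Cohen d values above are standardized mean differences. The abstract does not state which variant was computed, so the sketch below assumes the common pooled-standard-deviation form for two independent sets of recall scores (for example, recall across repeated runs or folds). The function name and the sample values are hypothetical.

```python
import math
import statistics


def cohen_d(group_a, group_b):
    """Cohen d using the pooled sample standard deviation (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Hypothetical recall scores from repeated runs; not values from the study.
augmented_recall = [0.74, 0.76, 0.75, 0.77]
baseline_recall = [0.70, 0.71, 0.69, 0.72]
effect_size = cohen_d(augmented_recall, baseline_recall)
```

By the usual convention, a d of about 0.8 or more is considered a large effect, so the reported values of 2.894 and 1.637 indicate large differences in recall between methods.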

CONCLUSIONS: The proposed dual-dimensional enhancement framework effectively alleviates data limitations and captures complex feature interactions in early GDM prediction. This strategy not only improves model performance, particularly in recall, but also provides interpretable biological evidence to support rapid clinical screening, stratified management, and early intervention in pregnancy.

PMID:42184375 | DOI:10.2196/85335
