
PSM-SMOTE: propensity score matching and synthetic minority oversampling for handling unbalanced microbiome data

Genes Genomics. 2025 Oct 4. doi: 10.1007/s13258-025-01688-x. Online ahead of print.

ABSTRACT

BACKGROUND: Predictive models using microbiome data often suffer from covariate imbalance and class imbalance, biasing results. Propensity Score Matching (PSM) balances covariates but reduces sample size, while borderline synthetic minority oversampling technique (borderline-SMOTE) oversamples minority classes but can generate uninformative examples.
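The trade-off described above, where matching balances covariates but discards unmatched samples, can be illustrated with a minimal sketch. It assumes propensity scores have already been estimated, and it uses greedy 1:1 nearest-neighbour matching without replacement; the function name `caliper_match`, the toy scores, and the two caliper values are hypothetical and are not taken from the paper.

```python
def caliper_match(case_scores, control_scores, caliper):
    """Greedy 1:1 nearest-neighbour matching on propensity scores.

    Each case is paired with the closest unused control, but only if
    their scores differ by at most `caliper`. Unmatched cases are dropped.
    Returns a list of (case_index, control_index) pairs.
    """
    available = set(range(len(control_scores)))
    pairs = []
    for i, score in enumerate(case_scores):
        if not available:
            break
        # Closest remaining control; ties are broken by the lower index.
        j = min(available, key=lambda k: (abs(control_scores[k] - score), k))
        if abs(control_scores[j] - score) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs


# Hypothetical propensity scores, for illustration only.
cases = [0.30, 0.55, 0.90]
controls = [0.28, 0.50, 0.52, 0.10, 0.60]

loose = caliper_match(cases, controls, caliper=0.2)
strict = caliper_match(cases, controls, caliper=0.025)
print(len(loose), len(strict))
```

In this toy example the stricter caliper keeps fewer matched pairs than the looser one, which is the sample-size cost that motivates oversampling after matching.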

OBJECTIVE: To develop and evaluate PSM-SMOTE, a novel hybrid sampling method that integrates PSM and borderline-SMOTE to handle both covariate and class imbalance in microbiome data.

METHODS: We developed PSM-SMOTE, a three-step hybrid sampling algorithm for microbiome data: (1) PSM at four caliper levels to balance covariates, (2) selection of at least ten robust differential markers via seven statistical tests with false discovery rate correction, and (3) application of borderline-SMOTE on the marker-based distance matrix to oversample minority classes. We evaluated PSM-SMOTE on three publicly available microbiome case-control datasets: pancreatic ductal adenocarcinoma (PDAC), colorectal cancer (CRC), and obesity, using logistic regression (LR), random forest (RF), and support vector machine (SVM) classifiers. Performance was assessed via area under the ROC curve (AUC).

RESULTS: PSM-SMOTE improved test AUCs in multiple model-dataset combinations compared with using PSM alone. Notably, for the RF model, PSM-SMOTE consistently enhanced AUC across nearly all oversampling settings in the PDAC and obesity cohorts. For the SVM model, PSM-SMOTE also achieved a significant AUC increase in the CRC cohort. For the LR model, PSM-SMOTE showed modest improvement under strict matching.
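For readers unfamiliar with the metric, AUC can be computed directly from its rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control, with ties counted as one half. The sketch below is a generic illustration with made-up labels and scores; the function name `roc_auc` is hypothetical and the numbers are not results from the paper.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ordered pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Made-up test-set labels and classifier scores, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.2, 0.7, 0.4]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```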

CONCLUSION: PSM-SMOTE effectively addresses dual imbalance in microbiome data and consistently enhances performance, providing a practical solution for imbalanced data analyses.

PMID: 41045399 | DOI: 10.1007/s13258-025-01688-x

By Nevin Manimala
