Prevalence aware feature selection improves biomarker identification in microbiome studies

Bioinformatics. 2026 Jun 24:btag371. doi: 10.1093/bioinformatics/btag371. Online ahead of print.

ABSTRACT

MOTIVATION: Identifying robust microbial biomarkers is crucial for disease diagnosis and prediction, for elucidating biological mechanisms, and for developing targeted therapies. Machine learning approaches, particularly random forest models, are widely used to identify biomarkers when stratifying samples. However, the biomarkers identified for the same disease often vary considerably, which limits their practical applicability, so a robust framework for reliable biomarker identification in microbiome research is needed. To address this gap, we propose a prevalence-aware feature selection framework (ParSlet) that incorporates a universal scaling relationship between taxon prevalence and selection frequency.

RESULTS: We first identified a universal exponential scaling law linking the probability that a taxon is consistently recognized as a biomarker to its prevalence. We then integrated this scaling law, together with taxon prevalence, into random forest-based biomarker identification. We systematically evaluated the approach on both simulated and real-world microbiome datasets and compared it with existing methods; the integrated approach generally improved feature stability and the reproducibility of biomarker identification. In colorectal cancer (CRC) datasets, our method robustly identified well-established microbial biomarkers such as Ruminococcus, Clostridium_XVIII, and Faecalibacterium. Overall, integrating a prevalence-based scaling adjustment into feature importance enhances the stability of microbiome biomarker identification. This approach holds promise for enabling more reliable disease diagnostics, uncovering generalizable microbial signatures across cohorts, and guiding the development of targeted microbiome-based interventions.
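
The idea of a prevalence-based adjustment to feature importance can be illustrated with a minimal sketch. This is not the ParSlet implementation: the abstract does not give the fitted form of the scaling law, so the weight exp(rate * prevalence), the direction of the effect (assumed here to favor more prevalent taxa), the `rate` value, the function names, and the toy data are all assumptions made for illustration. The raw importances stand in for values that a random forest would supply.

```python
import numpy as np


def taxon_prevalence(abundance, detection_threshold=0.0):
    """Fraction of samples in which each taxon exceeds the detection threshold.

    `abundance` is a samples x taxa matrix (rows are samples).
    """
    abundance = np.asarray(abundance, dtype=float)
    return (abundance > detection_threshold).mean(axis=0)


def prevalence_adjusted_importance(importance, prevalence, rate=2.0):
    """Reweight importances by an assumed exponential function of prevalence.

    The weight exp(rate * prevalence) is a hypothetical stand-in for the
    scaling law described in the abstract; `rate` is an illustrative value,
    not a fitted parameter. The result is renormalised to sum to 1.
    """
    importance = np.asarray(importance, dtype=float)
    prevalence = np.asarray(prevalence, dtype=float)
    weighted = importance * np.exp(rate * prevalence)
    total = weighted.sum()
    return weighted / total if total > 0 else weighted


# Toy data: 4 samples (rows) x 3 taxa (columns), invented for illustration.
abundance = np.array([
    [5, 0, 1],
    [3, 0, 2],
    [4, 7, 0],
    [6, 0, 3],
])

# Placeholder importances, e.g. as a random forest might report them.
raw_importance = np.array([0.3, 0.5, 0.2])

prevalence = taxon_prevalence(abundance)
adjusted = prevalence_adjusted_importance(raw_importance, prevalence)

# Taxa ordered from most to least important after the adjustment.
ranking = np.argsort(-adjusted)
print("prevalence:", prevalence)
print("adjusted importance:", adjusted)
print("ranking:", ranking)
```

In this toy case the second taxon has the highest raw importance but is present in only one of four samples, so under the assumed weighting it drops below the two more prevalent taxa. Whether a rarely detected taxon should be down-weighted in this way, and by how much, is exactly what a fitted scaling law would have to determine.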

AVAILABILITY: ParSlet is available at https://github.com/KelabatOSU/Feature_selection.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:42340677 | DOI:10.1093/bioinformatics/btag371

By Nevin Manimala
