Bioinformatics. 2026 Jun 17:btag398. doi: 10.1093/bioinformatics/btag398. Online ahead of print.
ABSTRACT
MOTIVATION: The heterogeneity of complex diseases, including cancer, leads to heavy-tailed distributions of disease traits. In such settings, non-robust variable selection methods are inherently susceptible to data contamination and can yield unstable or misleading results. This vulnerability is more severe for recently proposed approaches that introduce pseudo-features as negative controls: by expanding the genotype matrix, these methods amplify the curse of dimensionality when outliers and high-dimensional genomic features are both present.
RESULTS: We develop a robust variable selection framework with stability selection to prioritize genomic features in the presence of contamination. In contrast to existing approaches that rely on pseudo-features for error control, the proposed method achieves double robustness. First, it adopts least absolute deviation (LAD) LASSO to ensure robustness against outliers and heavy-tailed errors in disease traits. Second, it avoids augmenting the genotype matrix with pseudo-features, thereby mitigating the curse of dimensionality that is particularly problematic in high-dimensional genomic data. The proposed method has been extensively evaluated in simulation studies, which demonstrate its effectiveness relative to multiple competing variable selection methods. In addition, we have applied the proposed method and competing approaches to two real-data case studies: The Cancer Genome Atlas (TCGA) Skin Cutaneous Melanoma (SKCM) dataset and an eQTL dataset. The results demonstrate that the proposed method achieves superior performance by identifying genomic features with higher reproducibility.
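As an illustration of the two ingredients named above, the following is a minimal Python sketch that combines an LAD-LASSO fit with subsampling-based stability selection. It is not the authors' implementation. The function names (`lad_lasso`, `stability_selection`), the half-sample scheme, the number of subsamples, the selection threshold, the penalty level `lam`, and the omission of an intercept are all assumptions made for illustration; the abstract does not specify these details. The LAD-LASSO problem, minimizing (1/n) * sum|y - Xb| + lam * sum|b|, is written here as a linear program and solved with SciPy's `linprog`.

```python
import numpy as np
from scipy.optimize import linprog


def lad_lasso(X, y, lam):
    """Minimize (1/n) * sum|y - X b| + lam * sum|b| as a linear program.

    Hypothetical helper for illustration; no intercept is fitted.
    """
    n, p = X.shape
    # Decision variables, all nonnegative:
    # [b_plus (p), b_minus (p), u_plus (n), u_minus (n)]
    # with b = b_plus - b_minus and residual = u_plus - u_minus.
    cost = np.concatenate([lam * np.ones(2 * p), np.ones(2 * n) / n])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    return z[:p] - z[p:2 * p]


def stability_selection(X, y, lam, n_subsamples=50, threshold=0.6,
                        tol=1e-6, seed=0):
    """Refit LAD-LASSO on random half-samples and record selection frequency.

    The half-sample size, n_subsamples, and threshold are assumed defaults.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=n // 2, replace=False)
        beta = lad_lasso(X[idx], y[idx], lam)
        counts += np.abs(beta) > tol  # feature counts as selected if nonzero
    freq = counts / n_subsamples
    return freq, np.flatnonzero(freq >= threshold)


if __name__ == "__main__":
    # Synthetic data: 3 true signals among 20 features, heavy-tailed t(2) errors.
    rng = np.random.default_rng(42)
    X = rng.standard_normal((100, 20))
    beta_true = np.zeros(20)
    beta_true[:3] = 3.0
    y = X @ beta_true + rng.standard_t(df=2, size=100)
    freq, selected = stability_selection(X, y, lam=0.3)
    print("selection frequencies:", np.round(freq, 2))
    print("selected feature indices:", selected)
```

Note that this sketch works on the original design matrix only: no pseudo-features are appended, so the number of columns passed to each fit stays at p.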
AVAILABILITY AND IMPLEMENTATION: The source code for implementing the proposed methods is publicly available at https://github.com/cenwu/RSS with an archival DOI https://doi.org/10.6084/m9.figshare.32306883.
PMID:42308532 | DOI:10.1093/bioinformatics/btag398