BMC Bioinformatics. 2025 Nov 25. doi: 10.1186/s12859-025-06215-z. Online ahead of print.
ABSTRACT
As data complexity and volume increase rapidly, efficient statistical methods for identifying significant variables become crucial. Variable selection plays a vital role in establishing relationships between predictors and response variables. The challenge lies in achieving this goal while controlling the False Discovery Rate (FDR) and maintaining statistical power. The knockoff filter, a recent approach, generates inexpensive knockoff variables that mimic the correlation structure of the original variables and serve as negative controls for inference. In this study, we extend the use of knockoffs to the Light Gradient Boosting Machine (LightGBM), a fast and accurate machine learning technique. Shapley Additive Explanations (SHAP) values are employed to interpret the otherwise black-box model. In extensive experiments, our proposed method outperforms traditional approaches, accurately identifying the important variables for each class, and offers improved speed and efficiency across multiple datasets. To validate the approach, we conduct an extensive simulation study. Integrating knockoffs into LightGBM enhances performance and interpretability, contributing to the advancement of variable selection methods. Our research addresses the challenges of variable selection in the era of big data, providing a valuable tool for identifying relevant variables in statistical modeling and machine learning applications.
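A minimal sketch of the pipeline the abstract describes, assuming a Python environment with numpy, lightgbm, and shap: knockoff copies are appended to the design matrix, LightGBM is fit on the augmented data, mean absolute SHAP values serve as importance scores for original and knockoff features, and the knockoff+ data-dependent threshold controls the FDR at a target level q. The toy data and the permutation-based knockoff copies are placeholders for illustration only; the paper's actual knockoff construction preserves the correlation structure of the original variables.

```python
import numpy as np
import lightgbm as lgb
import shap

rng = np.random.default_rng(0)

# Toy data (placeholder): 200 samples, 20 features, only the first 5 relevant.
n, p = 200, 20
X = rng.normal(size=(n, p))
y = (X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

# Crude "knockoff" copies via column-wise permutation, used as negative controls.
# (A stand-in only: proper model-X knockoffs mimic the correlation structure.)
X_knock = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
X_aug = np.hstack([X, X_knock])  # originals followed by knockoffs

# Fit LightGBM on the augmented design.
model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
model.fit(X_aug, y)

# SHAP-based importances: mean |SHAP value| per feature of the augmented design.
explainer = shap.TreeExplainer(model)
sv = np.asarray(explainer.shap_values(X_aug))
# Collapse every axis except the feature axis (length 2p) into a single score,
# so the reduction works whether SHAP returns per-class or single-output values.
feat_axis = [ax for ax, s in enumerate(sv.shape) if s == X_aug.shape[1]][0]
imp = np.abs(sv).mean(axis=tuple(ax for ax in range(sv.ndim) if ax != feat_axis))

# Knockoff statistics: importance of each original minus its knockoff copy.
Z, Z_tilde = imp[:p], imp[p:]
W = Z - Z_tilde

# Knockoff+ threshold: smallest t with estimated FDP <= q.
q = 0.10
tau = np.inf
for t in np.sort(np.abs(W[W != 0])):
    fdp = (1 + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
    if fdp <= q:
        tau = t
        break

selected = np.where(W >= tau)[0]
print("selected features:", selected)
```

In this sketch the selection rule keeps a feature only when its SHAP importance clearly exceeds that of its knockoff copy, which is what allows the procedure to bound the proportion of false discoveries at the chosen level q.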
PMID:41291416 | DOI:10.1186/s12859-025-06215-z