Proc Natl Acad Sci U S A. 2022 May 31;119(22):e2118636119. doi: 10.1073/pnas.2118636119. Epub 2022 May 24.
ABSTRACT
SignificanceRandom Forests (RFs) are among the most successful machine-learning algorithms in terms of prediction accuracy. In many domain problems, however, the primary goal is not prediction, but to understand the data-generation process-in particular, finding important features and feature interactions. There exists strong empirical evidence that RF-based methods-in particular, iterative RF (iRF)-are very successful in terms of detecting feature interactions. In this work, we propose a biologically motivated, Boolean interaction model. Using this model, we complement the existing empirical evidence with theoretical evidence for the ability of iRF-type methods to select desirable interactions. Our theoretical analysis also yields deeper insights into the general interaction selection mechanism of decision-tree algorithms and the importance of feature subsampling.
PMID:35609192 | DOI:10.1073/pnas.2118636119