Ann Epidemiol. 2022 Mar 27:S1047-2797(22)00044-8. doi: 10.1016/j.annepidem.2022.03.010. Online ahead of print.
ABSTRACT
PURPOSE: Predictive models are used relatively narrowly in epidemiology: most studies report results from traditional statistical models such as linear, logistic, or Cox regression. In this study, a high-dimensional epidemiological cohort, collected within the Kuopio Ischemic Heart Disease Risk Factor Study (KIHD) in 1984-1989, was used to investigate the predictive ability of models with embedded variable selection.
METHODS: Simple Logistic Regression with seven preselected risk factors was compared with k-Nearest Neighbors, Logistic Lasso Regression, Decision Tree, Random Forest, and Multilayer Perceptron in predicting cardiovascular death among aged men from KIHD over a long horizon of 30±3 years. In total, 746 predictor variables were available for 2682 men (705 cardiovascular deaths were registered and 1977 men stayed alive). We considered two scenarios for handling competing risks: removing the affected subjects, or treating them as non-cases.
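The two competing-risk scenarios can be illustrated with a minimal sketch. The outcome codes and the function name `build_labels` are hypothetical, and the numbering (scenario 1 removes subjects, scenario 2 keeps them as non-cases) is an assumption based on the order in which the scenarios are listed; the abstract does not describe the actual coding.

```python
from typing import List, Tuple

# Hypothetical outcome codes; the abstract does not specify how outcomes were coded.
CVD_DEATH = "cvd_death"
OTHER_DEATH = "other_death"  # competing event
ALIVE = "alive"


def build_labels(outcomes: List[str], scenario: int) -> List[Tuple[int, int]]:
    """Return (subject_index, binary_label) pairs for the chosen scenario."""
    if scenario not in (1, 2):
        raise ValueError("scenario must be 1 or 2")
    labelled = []
    for i, outcome in enumerate(outcomes):
        if outcome == OTHER_DEATH:
            if scenario == 1:
                continue  # assumed scenario 1: drop subjects with a competing event
            labelled.append((i, 0))  # assumed scenario 2: keep them as non-cases
        else:
            labelled.append((i, 1 if outcome == CVD_DEATH else 0))
    return labelled


# Toy data, not taken from KIHD.
outcomes = [CVD_DEATH, ALIVE, OTHER_DEATH, ALIVE, CVD_DEATH]
scenario_1 = build_labels(outcomes, scenario=1)
scenario_2 = build_labels(outcomes, scenario=2)
```

Under this sketch the two scenarios differ only in the non-case group: scenario 1 shrinks the sample, while scenario 2 keeps the full sample but mixes competing events into the negative class.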
RESULTS: Logistic Lasso Regression achieved the best average AUC on the test sample: 0.8075 (95% CI, 0.8051-0.8099) in scenario 1 and 0.7155 (95% CI, 0.7128-0.7183) in scenario 2. These values were 6.04% and 5.50% higher, respectively, than the baseline AUC provided by Logistic Regression with manually preselected predictors.
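The quantities reported here can be sketched as follows. This illustration assumes that the average AUC and its interval summarize repeated train/test evaluations, and that the percentage gain is a relative difference; the abstract states neither, and all function names and numbers below are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ci95(values: Sequence[float]) -> Tuple[float, float, float]:
    """Mean with a normal-approximation 95% confidence interval for the mean."""
    m = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def relative_gain_pct(model_auc: float, baseline_auc: float) -> float:
    """Gain of a model over the baseline, as a percentage of the baseline AUC."""
    return 100.0 * (model_auc - baseline_auc) / baseline_auc


# Toy numbers, not taken from the study.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
toy_mean, toy_low, toy_high = mean_ci95([0.70, 0.72, 0.74])
toy_gain = relative_gain_pct(0.75, 0.50)
```

With many repeated evaluations the interval for the mean becomes narrow even when individual test AUCs vary, which would be consistent with the tight intervals reported above.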
CONCLUSIONS: In both scenarios, Logistic Lasso Regression, Random Forest, and Multilayer Perceptron outperformed Simple Logistic Regression.
PMID:35354081 | DOI:10.1016/j.annepidem.2022.03.010