Evaluating ensemble models for fair and interpretable prediction in higher education using multimodal data

Sci Rep. 2025 Aug 11;15(1):29420. doi: 10.1038/s41598-025-15388-9.

ABSTRACT

Early prediction of academic performance is vital for reducing attrition in online higher education. However, existing models often lack comprehensive data integration and comparison with state-of-the-art techniques. This study, which involved 2,225 engineering students at a public university in Ecuador, addressed these gaps. The objective was to develop a robust predictive framework by integrating Moodle interactions, academic history, and demographic data using SMOTE for class balancing. The methodology involved a comparative evaluation of seven base learners, including traditional algorithms, Random Forest, and gradient boosting ensembles (XGBoost, LightGBM), and a final stacking model, all validated using 5-fold stratified cross-validation. While the LightGBM model emerged as the best-performing base model (area under the curve [AUC] = 0.953, F1 = 0.950), the stacking ensemble (AUC = 0.835) did not offer a significant performance improvement and showed considerable instability. SHAP analysis confirmed that early grades were the most influential predictors across top models. The final model demonstrated strong fairness across gender, ethnicity, and socioeconomic status (consistency = 0.907). These findings enable institutions to identify at-risk students using state-of-the-art, interpretable, and fair models, advancing learning analytics by validating key success predictors against contemporary benchmarks.
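The abstract names SMOTE class balancing and 5-fold stratified cross-validation but does not describe how the two were combined. The following is a minimal standard-library sketch of one common arrangement, in which oversampling is applied only to the training portion of each fold so that held-out data stays untouched. It is not the authors' pipeline: the function names `stratified_folds` and `smote_like`, the toy two-feature data, and the choice of three neighbours are all assumptions made for illustration, and `smote_like` reproduces only the core interpolation idea of SMOTE rather than a full library implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds while preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal round-robin within each class
    return folds


def smote_like(minority, n_new, k_neighbors=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its nearest minority neighbours (needs >= 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(
            others,
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k_neighbors]
        mate = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> mate
        synthetic.append([a + t * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Hypothetical toy data: 40 majority (label 0) and 10 minority (label 1) rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
X += [[data_rng.gauss(2, 1), data_rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

summary = []
for fold_no, test_idx in enumerate(stratified_folds(y, k=5)):
    held_out = set(test_idx)
    train_idx = [j for j in range(len(y)) if j not in held_out]

    # Oversample the minority class using training rows only.
    minority = [X[j] for j in train_idx if y[j] == 1]
    n_majority = sum(1 for j in train_idx if y[j] == 0)
    extra = smote_like(minority, n_majority - len(minority), seed=fold_no)

    X_train = [X[j] for j in train_idx] + extra
    y_train = [y[j] for j in train_idx] + [1] * len(extra)
    # A model would be fitted on (X_train, y_train) and scored on test_idx here.

    summary.append((
        len(test_idx),
        sum(y[j] for j in test_idx),
        y_train.count(0),
        y_train.count(1),
    ))

for row in summary:
    print(row)  # (test size, test positives, train negatives, train positives)
```

Balancing inside the loop, rather than once over the whole dataset, keeps synthetic points derived from a held-out student from leaking into the training data for that fold. Whether the study followed this ordering is not stated in the abstract.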

PMID:40789907 | DOI:10.1038/s41598-025-15388-9

By Nevin Manimala