BMC Med Res Methodol. 2026 Jan 8. doi: 10.1186/s12874-025-02751-7. Online ahead of print.
ABSTRACT
INTRODUCTION AND OBJECTIVES: Over recent decades, the exponential growth of data, especially in healthcare, has necessitated advanced analytical methods. Conventional machine learning algorithms often assume independence among data points, limiting their effectiveness with longitudinal and hierarchical data. This study introduces a novel algorithm called GMEXGBoost, a methodological extension of generalized mixed-effects models that leverages the boosting framework of XGBoost for estimating fixed effects while simultaneously accounting for random effects. The innovation lies in GMEXGBoost’s ability to explicitly incorporate data correlations while retaining the predictive power of boosted trees.
METHODS: The GMEXGBoost model was evaluated through extensive simulations and a real-world cohort study, and benchmarked against GLMM, GLMMTree, GMERF, and XGBoost. Performance was assessed using predictive mean absolute deviation (PMAD), predictive misclassification rate (PMCR), sensitivity, specificity, accuracy, and AUC. Simulation analyses were conducted using multiple synthetic datasets, each comprising training and testing groups with varying effect structures, including random intercepts and slopes. All computations were performed in RStudio (version 2023.06.0).
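As an illustration of the evaluation metrics named above, the following is a minimal Python sketch, not the authors' code (the study itself was run in RStudio). It assumes that PMAD is the mean absolute difference between predicted probabilities and reference probabilities (for example, the true probabilities known in a simulation), and that PMCR is the share of test observations misclassified at a 0.5 threshold; the abstract does not give these formulas, so both definitions, the threshold, and the function names `pmad`, `pmcr`, and `confusion_metrics` are assumptions made for this example.

```python
from typing import Dict, Sequence


def _check(a: Sequence, b: Sequence) -> None:
    # Both inputs must describe the same, non-empty set of test observations.
    if len(a) == 0 or len(a) != len(b):
        raise ValueError("inputs must be non-empty and of equal length")


def pmad(reference_probs: Sequence[float], predicted_probs: Sequence[float]) -> float:
    """Assumed PMAD: mean absolute gap between reference and predicted probabilities."""
    _check(reference_probs, predicted_probs)
    total = sum(abs(r - p) for r, p in zip(reference_probs, predicted_probs))
    return total / len(reference_probs)


def pmcr(labels: Sequence[int], predicted_probs: Sequence[float],
         threshold: float = 0.5) -> float:
    """Assumed PMCR: fraction of observations whose thresholded prediction is wrong."""
    _check(labels, predicted_probs)
    wrong = sum(1 for y, p in zip(labels, predicted_probs)
                if y != (1 if p >= threshold else 0))
    return wrong / len(labels)


def confusion_metrics(labels: Sequence[int], predicted_probs: Sequence[float],
                      threshold: float = 0.5) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy at the given (assumed) threshold."""
    _check(labels, predicted_probs)
    tp = tn = fp = fn = 0
    for y, p in zip(labels, predicted_probs):
        y_hat = 1 if p >= threshold else 0
        if y == 1 and y_hat == 1:
            tp += 1
        elif y == 1 and y_hat == 0:
            fn += 1
        elif y == 0 and y_hat == 0:
            tn += 1
        else:
            fp += 1
    nan = float("nan")  # returned when a class is absent from the test set
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
        "accuracy": (tp + tn) / len(labels),
    }
```

Lower PMAD and PMCR indicate better predictions, while higher sensitivity, specificity, and accuracy do. AUC is omitted from the sketch because it is threshold-free and is usually computed with an established library routine.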
RESULTS: Although XGBoost achieved the lowest average errors across most simulation scenarios, GMEXGBoost was consistently more stable and accurate when the random-effect variance was large or correlations were strong. In the real-world cohort data, GMEXGBoost outperformed the other models on the performance metrics.
CONCLUSION: By combining the estimates of the GLMM and XGBoost models, the GMEXGBoost algorithm leverages the capabilities of both and delivers improved performance in complex problems. Although it is not universally superior, it demonstrates clear advantages in the analysis of hierarchical and longitudinal datasets with strong correlations. These properties make it a valuable tool for decision-making in healthcare and other domains that involve complex, structured data.
PMID:41501655 | DOI:10.1186/s12874-025-02751-7