
An ensemble approach improves the prediction of the COVID-19 pandemic in South Korea

J Glob Health. 2025 Mar 28;15:04079. doi: 10.7189/jogh.15.04079.

ABSTRACT

BACKGROUND: Modelling can contribute to disease prevention and control strategies. Accurate predictions of future cases and mortality rates were essential for establishing appropriate policies during the COVID-19 pandemic. However, no single model yielded definitive conclusions, as each has specific strengths and weaknesses. Here we propose an ensemble learning approach that can offset the limitations of each model and improve prediction performance.

METHODS: We generated predictions for the transmission and impact of COVID-19 in South Korea using seven individual models, including mathematical, statistical, and machine learning approaches. We integrated these predictions using three ensemble methods: stacking, average, and weighted average ensemble (WAE). We used training and test errors to measure model performance and selected the best covariate combinations based on the lowest training error. We then evaluated model performance using five error measures (R², weighted mean absolute percentage error (WMAPE), mean squared error (MSE), root mean squared error (RMSE), and mean absolute percentage error (MAPE)) and selected the optimal covariate combination accordingly. To validate the generalisability of our approach, we applied the same modelling framework to USA data.
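
As a rough illustration of the quantities named in the methods, the following minimal Python sketch computes WMAPE and RMSE and combines two sets of predictions with an average ensemble and a weighted average ensemble. It is not the authors' implementation. The WMAPE formula is the common definition (total absolute error divided by total absolute actual values), the inverse-training-error weighting is an assumed scheme because the abstract does not say how WAE weights were chosen, and all function names and toy numbers are hypothetical. Stacking, which trains a meta-model on the individual predictions, is omitted.

```python
from math import sqrt


def wmape(actual, predicted):
    """Weighted MAPE (common definition): sum of |error| / sum of |actual|."""
    total_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_abs_error / sum(abs(a) for a in actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def average_ensemble(predictions):
    """Unweighted mean of all model predictions at each time point."""
    return [sum(step) / len(step) for step in zip(*predictions)]


def weighted_average_ensemble(predictions, train_errors):
    """Weighted mean of model predictions.

    ASSUMPTION: weights are proportional to 1 / training error, so models
    with lower training error count more. The abstract does not specify
    the weighting actually used.
    """
    inverse = [1.0 / e for e in train_errors]
    total = sum(inverse)
    weights = [w / total for w in inverse]
    return [sum(w * p for w, p in zip(weights, step))
            for step in zip(*predictions)]


# Hypothetical toy data: four days of observed counts and two model forecasts.
actual = [100.0, 120.0, 90.0, 110.0]
model_a = [110.0, 115.0, 95.0, 100.0]
model_b = [90.0, 130.0, 80.0, 115.0]

avg = average_ensemble([model_a, model_b])
wae = weighted_average_ensemble([model_a, model_b], train_errors=[0.05, 0.10])

for name, pred in [("model_a", model_a), ("model_b", model_b),
                   ("average", avg), ("WAE", wae)]:
    print(f"{name}: WMAPE={wmape(actual, pred):.4f}, "
          f"RMSE={rmse(actual, pred):.3f}")
```

In this toy case the two models err in opposite directions on most days, so both ensembles have a lower WMAPE than either model alone. That offsetting effect is what the ensemble approach relies on, but it is not guaranteed for real data.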

RESULTS: Booster shot rate + Omicron variant BA.5 rate was the most commonly selected combination of covariates. For raw data evaluated using the WMAPE, generalised additive modelling (GAM) achieved 0.244 for the daily number of confirmed cases, time series Poisson achieved 0.172 for the daily number of confirmed deaths, and both autoregressive integrated moving average (ARIMA) and time series Poisson achieved 0.022 for the daily number of ICU patients. For smoothed data, the Holt-Winters model achieved 0.058 for the daily number of confirmed cases, while ARIMA attained 0.058 for the daily number of confirmed deaths and 0.013 for the daily number of ICU patients. Among ensemble models, the support vector machine (SVM)-based stacking ensemble achieved error values of 0.235 for the daily number of confirmed cases, 0.118 for the daily number of deaths, and 0.019 for the daily number of ICU patients on raw data. For smoothed data, the average ensemble and weighted average ensemble achieved 0.060 for the daily number of confirmed cases and 0.013 for the daily number of ICU patients. The ensemble models also generalised well when applied to data from the USA.

CONCLUSIONS: No single model performed best consistently. While the ensemble models did not always provide the best predictions, a comparison of first-best and second-best models showed that the ensembles performed considerably better than the single models. When an ensemble model was not the best-performing model, its performance was always close to that of the best single model: the mean and variance of the error measures show that ensemble models provided stable predictions, with little variation in performance compared with single models. These results can be used to inform policymaking during future pandemics.

PMID:40146993 | DOI:10.7189/jogh.15.04079

By Nevin Manimala
