JMIR Form Res. 2022 Apr 27. doi: 10.2196/35114. Online ahead of print.
BACKGROUND: The COVID-19 pandemic represents an unprecedented global challenge. As the global community attempts to manage the pandemic in the long term, it is pivotal to understand which factors drive prevalence rates and to predict the future trajectory of the virus.
OBJECTIVE: This study has two objectives. Firstly, it tests the statistical relationship between socioeconomic status and COVID-19 prevalence. Secondly, it utilises machine learning techniques to predict cumulative COVID-19 cases in a sample of 182 countries. Taken together, these objectives will shed light upon socioeconomic status as a global risk factor for COVID-19.
METHODS: This research utilised exploratory data analysis and supervised machine learning methods. Exploratory analysis included variable distribution, variable correlations and outlier detection. Following this, three supervised regression techniques were applied: linear regression, random forest, and adaptive boosting. Results were evaluated using k-fold cross validation and subsequently compared to analyse algorithmic suitability. The analysis involved two models. Firstly, the algorithms were trained to predict 2021 COVID-19 prevalence using only 2020 reported case data. Following this, socioeconomic indicators were added as features and the algorithms were trained again. The Human Development Index metrics of life expectancy, mean years of schooling, expected years of schooling, and Gross National Income were used to approximate socioeconomic status.
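The two-model comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a library, the value of k, or the data layout, so the sketch assumes k=5, uses only the Python standard library, and runs on synthetic data with a single hypothetical `indicator` column standing in for the four HDI metrics. Only linear regression is shown; random forest and adaptive boosting would replace `fit_ols` and `predict` inside the same cross-validation loop.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(X, y):
    """Ordinary least squares via the normal equations; returns [intercept, coefficients...]."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

def predict(beta, X):
    """Apply fitted coefficients to new rows."""
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def kfold_r2(X, y, k=5, seed=0):
    """Mean out-of-fold R2 over k shuffled, non-overlapping folds (k=5 is an assumption)."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        beta = fit_ols([X[i] for i in train], [y[i] for i in train])
        preds = predict(beta, [X[i] for i in test])
        scores.append(r2_score([y[i] for i in test], preds))
    return sum(scores) / k

# Synthetic stand-in data (NOT the study's data): 182 rows, one per hypothetical country.
rng = random.Random(42)
n = 182
indicator = [rng.uniform(0.4, 0.95) for _ in range(n)]      # hypothetical socioeconomic index
cases_2020 = [50 * h + rng.gauss(0, 8) for h in indicator]  # hypothetical 2020 case rate
cases_2021 = [0.5 * c + 80 * h + rng.gauss(0, 5) for c, h in zip(cases_2020, indicator)]

# Model 1: 2020 cases as the lone predictor. Model 2: 2020 cases plus the indicator.
model_1 = kfold_r2([[c] for c in cases_2020], cases_2021)
model_2 = kfold_r2([[c, h] for c, h in zip(cases_2020, indicator)], cases_2021)
print(f"Model 1 (2020 cases only): mean R2 = {model_1:.3f}")
print(f"Model 2 (+ indicator):     mean R2 = {model_2:.3f}")
```

Because the synthetic target is constructed to depend on the indicator, the second model scores higher by design; the sketch demonstrates the evaluation procedure only and says nothing about the study's actual findings.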
RESULTS: All variables correlated positively with 2021 COVID-19 prevalence, with R2 values ranging from 0.55 to 0.85. Using socioeconomic indicators, COVID-19 prevalence was predicted with a reasonable degree of accuracy. When 2020 reported case rates were used as the lone predictor of 2021 prevalence rates, the average predictive accuracy of the algorithms was low (R2=0.543). When the socioeconomic indicators were added alongside 2020 prevalence rates as features, average predictive performance improved considerably (R2=0.721) and all error statistics decreased. This suggested that adding socioeconomic indicators alongside 2020 reported case data considerably improved prediction of COVID-19 prevalence. Linear regression was the strongest learner, with R2=0.693 on the first model and R2=0.763 on the second model, followed by random forest (0.481 and 0.722) and AdaBoost (0.454 and 0.679). Following this, the second model was retrained using a selection of additional COVID-19 risk factors (population density, median age, and vaccination uptake) instead of the HDI metrics. However, average accuracy dropped to 0.649, which highlights the value of socioeconomic status as a predictor of COVID-19 cases in the chosen sample.
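As a quick arithmetic check, the reported averages can be reproduced from the per-algorithm R2 values, assuming each average is an unweighted mean across the three algorithms (the abstract does not state the averaging method, so this is an assumption). The variable names are illustrative.

```python
# Per-algorithm R2 values as reported in the abstract.
r2_model_1 = {"linear regression": 0.693, "random forest": 0.481, "AdaBoost": 0.454}
r2_model_2 = {"linear regression": 0.763, "random forest": 0.722, "AdaBoost": 0.679}

def mean_r2(scores):
    """Unweighted mean of the R2 values (assumed averaging method)."""
    return sum(scores.values()) / len(scores)

avg_1 = mean_r2(r2_model_1)   # 2020 cases only
avg_2 = mean_r2(r2_model_2)   # 2020 cases plus socioeconomic indicators
print(f"Model 1 average R2: {avg_1:.3f}")
print(f"Model 2 average R2: {avg_2:.3f}")
print(f"Improvement:        {avg_2 - avg_1:.3f}")
```

Under that assumption the means round to the reported 0.543 and 0.721, an improvement of roughly 0.18 in average R2.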
CONCLUSIONS: The results show that socioeconomic status should be considered an important variable in future epidemiological modelling, and they highlight the reality of the COVID-19 pandemic as a social phenomenon as well as a healthcare phenomenon. This paper also puts forward new considerations about the application of statistical and machine learning techniques to understand and combat the COVID-19 pandemic.