Comparison of Machine Learning Models for Colon Cancer Survival: Predictive Modeling Approach

JMIR Cancer. 2025 Nov 26;11:e72665. doi: 10.2196/72665.

ABSTRACT

BACKGROUND: Colon cancer is a leading cause of cancer-related deaths worldwide, with survival influenced by risk factors, treatment type, and patient characteristics. Traditional statistical methods, such as Kaplan-Meier curves, have been widely used to estimate survival probabilities. However, these methods often have difficulty handling complex interactions, many covariates, and nonlinear relationships between risk factors. Recently, machine learning (ML) techniques have emerged as promising tools for improving survival prediction by handling large numbers of covariates and capturing complex patterns.

OBJECTIVE: This study compares several ML models to accurately estimate colon cancer survival by leveraging data from the Kentucky Cancer Registry. By identifying key risk factors, these analyses aim to improve risk stratification, treatment planning, and prognosis for overall colon cancer survival within subgroups.

METHODS: We conducted a retrospective analysis of colon cancer cases diagnosed between 2010 and 2022 (n=33,825), using Kentucky Cancer Registry data linked to mortality records, with approval from the University of Kentucky Institutional Review Board (#63067). We compared multiple predictive modeling techniques, including Cox proportional hazards, accelerated failure time models, Extreme Gradient Boosting, random survival forests, least absolute shrinkage and selection operator (LASSO), and elastic net regression, to estimate survival probabilities. The Kaplan-Meier method provided baseline survival estimates, and multivariate models, including ML approaches, evaluated contributions of key risk factors. Model performance was compared across evaluation metrics such as the Brier score, concordance index, out-of-bag error, and Continuous Ranked Probability Score. Missing data were handled via multiple imputation, and leave-one-out cross-validation was applied to reduce overfitting.
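
As an illustration of one of the evaluation metrics named above, the concordance index can be sketched in a few lines of Python. This is a minimal sketch of Harrell's pairwise definition, not the study's actual implementation; the function name `concordance_index`, its arguments, and the toy data are assumptions introduced here, and pairs with tied survival times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs ranked correctly by risk.

    times       -- observed follow-up time per patient
    events      -- 1 if death was observed, 0 if censored
    risk_scores -- model output; higher means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore pairs with tied times
        # Let a be the patient with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if not events[a]:
            continue  # shorter time is censored, so the pair is not comparable
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk died first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four patients, the second one censored.
toy_times = [5, 10, 12, 20]
toy_events = [1, 0, 1, 1]
toy_risk = [0.9, 0.3, 0.6, 0.2]
toy_c = concordance_index(toy_times, toy_events, toy_risk)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which is the scale on which the concordance indices reported in the results should be read.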

RESULTS: The ML models identified key covariates influencing survival outcomes, such as age, treatment type, positive nodes, tumor stage, smoking, and comorbidities. In the overall model, patients who refused or received no treatment had a 3.24-fold higher risk of mortality compared to those who underwent surgery at primary and regional sites. Elevated mortality risk was also observed among smokers (24% higher than non-smokers) and Appalachian residents (7% higher than non-Appalachian residents). Our overall model achieved a concordance index of 0.8146, with strong discriminatory performance across subgroups, including early-age diagnosis (0.8175), late-age diagnosis (0.7841), Appalachia (0.8135), non-Appalachia (0.8126), White patients (0.8164), and Black patients (0.7881). The results highlight the strengths and limitations of each ML approach, with the random survival forest and LASSO models outperforming traditional methods such as the Cox model in prediction accuracy and model discrimination.
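
To relate the percentage figures above to model output, the sketch below converts between a hazard ratio, a percentage increase in risk, and a Cox-type log-hazard coefficient. It assumes the reported effects are hazard ratios (so that "24% higher" corresponds to a ratio of 1.24), which the abstract does not state explicitly; the function names are hypothetical.

```python
import math


def percent_higher(hazard_ratio):
    """Percentage increase in hazard implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


def log_hazard_coefficient(hazard_ratio):
    """Coefficient beta such that exp(beta) equals the hazard ratio."""
    return math.log(hazard_ratio)


# Assumed hazard ratios matching the percentages quoted in the results.
smoker_pct = percent_higher(1.24)       # about 24% higher
appalachia_pct = percent_higher(1.07)   # about 7% higher
smoker_beta = log_hazard_coefficient(1.24)
```

Under this assumption, a positive coefficient means elevated mortality risk relative to the reference group, and a coefficient of zero means no difference.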

CONCLUSIONS: Our study demonstrated the utility of ML in identifying risk factors associated with colon cancer survival, with positive lymph nodes, age at diagnosis, treatment received, clinical tumor size, tumor grade, smoking status, geographic region, and marital status emerging as dominant predictors across all statistical models. This comparative analysis offers valuable insights for clinical decision-making and prognosis, highlighting the potential of ML to identify risk factors specific to different subgroups, ultimately advancing personalized care for patients with colon cancer.

PMID:41297023 | DOI:10.2196/72665

By Nevin Manimala