JCO Clin Cancer Inform. 2026 Apr;10(2):e2500172. doi: 10.1200/CCI-25-00172. Epub 2026 Apr 27.
ABSTRACT
PURPOSE: To curate high-quality clinical and genomic data sets from patients with cancer and use them to predict hospital readmission with machine learning (ML) models.
METHODS: We extracted data from electronic health records for patients with cancer in the University of California, San Diego Health System to curate clinicogenomic data sets for lung, breast, and colon cancers. We constructed ML models to predict the risk of hospital readmission within 30, 60, and 90 days of discharge. We developed standard ML models (logistic regression, random forest [RF], gradient boosting [GB], neural network) as well as multitask neural network models that predict all three readmission outcomes simultaneously.
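As an illustration of the three prediction targets described in METHODS, the following is a minimal sketch of how 30-, 60-, and 90-day readmission labels could be derived from a discharge date and the next admission date. The function name `readmission_labels`, the `HORIZONS` constant, and the rule that a readmission must fall strictly after the discharge date are assumptions made for this example; the abstract does not describe the authors' labeling procedure.

```python
from datetime import date

# Prediction horizons in days after discharge, as named in the abstract.
HORIZONS = (30, 60, 90)


def readmission_labels(discharge, next_admission, horizons=HORIZONS):
    """Return a {horizon: 0 or 1} dict for one discharge event.

    Hypothetical helper: a label is 1 when the next admission falls
    after the discharge date and within `horizon` days of it.
    `next_admission` is None when no later admission was recorded.
    """
    if next_admission is None:
        return {h: 0 for h in horizons}
    gap_days = (next_admission - discharge).days
    # Assumption: same-day events (gap of 0) are not counted as readmissions.
    return {h: int(0 < gap_days <= h) for h in horizons}


# A patient discharged on 1 Jan 2024 and readmitted 45 days later.
example = readmission_labels(date(2024, 1, 1), date(2024, 2, 15))
```

Because the horizons are nested, a positive 30-day label implies positive 60- and 90-day labels, which is one reason a multitask model that predicts all three outcomes together is a natural candidate.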
RESULTS: Rehospitalization within 30 days was most frequent in colon cancer. For 30-day readmission prediction, GB achieved the highest area under the precision-recall curve (PR-AUC) for lung (0.415) and breast (0.470) cancers, and RF achieved the overall highest PR-AUC, for colon cancer (0.621). Explainability analysis revealed that health care metrics (such as the number of previous admissions and average length of stay), risk scores composed of diagnosis codes, and treatments are significant features in predicting readmission within cancer types. It also identified EGFR mutations as a potential predictor of readmission in colon cancer.
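For readers unfamiliar with PR-AUC, the sketch below computes average precision, one common estimator of the area under the precision-recall curve, using only the Python standard library. The abstract does not state which estimator the authors used, so the function `average_precision` and the toy labels and scores are illustrative assumptions only.

```python
def average_precision(y_true, y_score):
    """Average precision: mean of the precision values at each true positive.

    y_true  : list of 0/1 labels (1 = readmitted)
    y_score : list of predicted risk scores, higher = more likely readmitted
    Simplification: tied scores are ranked in input order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank patients from highest to lowest predicted risk.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy example (not study data): two readmitted patients among four.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Unlike the area under the ROC curve, the PR-AUC of an uninformative ranking is roughly the prevalence of the positive class, so values such as those reported here are best read against the readmission rate in each cohort.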
CONCLUSION: The study highlights the potential of integrating clinical and genomic data for predicting adverse outcomes in patients with cancer. The standard ML approaches captured patterns in readmission and outperformed the more complex models. Limitations include the relatively small data set from a single institution. Ultimately, this study highlights the value of curating and maintaining clinicogenomic information at the institution level to streamline data set curation and model development.
PMID:42044461 | DOI:10.1200/CCI-25-00172