Assessing the Data Quality Dimensions of Surgical Oncology Cohorts in the All of Us Research Program

JCO Clin Cancer Inform. 2025 Jul;9:e2500078. doi: 10.1200/CCI-25-00078. Epub 2025 Jul 8.

ABSTRACT

PURPOSE: Cancer is a leading cause of morbidity and mortality in the United States. Mapping electronic health record (EHR) data to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) may standardize data structure and allow for multidatabase oncology studies. However, the number of oncology studies produced with the OMOP CDM has been low. To investigate the discrepancy between the public health impact of cancer and the output of OMOP CDM clinical cancer studies, we evaluated the EHR data quality of five surgical oncology cohorts in the All of Us Research Program: mastectomy, prostatectomy, colectomy, melanoma excision, and lung cancer resection.

METHODS: We selected procedure codes that were the basis of each phenotype. We used a data quality checklist to evaluate five domains systematically: conformance, completeness, concordance, plausibility, and temporality.

RESULTS: Most phenotype-defining source codes were mapped to Current Procedural Terminology 4, which is an EHR standard. All cohorts had low concept prevalence. Most bivariate correlations between concepts were weak (ρ ≤ 0.5). The small number of biomarkers available for use limited our plausibility analysis. The median time between biopsy and surgery varied across cohorts.
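
The abstract does not include analysis code or name the correlation statistic used, so the following is only a minimal Python sketch of how three of the reported measures (concept prevalence, bivariate correlation between concepts, and median biopsy-to-surgery interval) could be computed. Everything in it is an assumption made for illustration: the toy records, the concept names "biopsy" and "tumor_marker", the function names, and the use of the phi coefficient on presence/absence indicators. It does not query All of Us or OMOP CDM tables and is not the authors' method.

```python
from datetime import date
from math import sqrt
from statistics import median

# Hypothetical toy cohort: one dict per participant (not real All of Us data).
cohort = [
    {"concepts": {"biopsy", "tumor_marker"}, "biopsy": date(2020, 1, 10), "surgery": date(2020, 2, 9)},
    {"concepts": {"biopsy"}, "biopsy": date(2020, 3, 1), "surgery": date(2020, 3, 21)},
    {"concepts": {"tumor_marker"}, "biopsy": None, "surgery": date(2020, 5, 5)},
    {"concepts": {"biopsy", "tumor_marker"}, "biopsy": date(2020, 6, 1), "surgery": date(2020, 7, 11)},
]


def concept_prevalence(records, concept):
    """Fraction of participants with at least one record of the concept."""
    return sum(concept in r["concepts"] for r in records) / len(records)


def phi_coefficient(records, concept_a, concept_b):
    """Correlation between two presence/absence indicators (assumed statistic)."""
    n11 = n10 = n01 = n00 = 0
    for r in records:
        a, b = concept_a in r["concepts"], concept_b in r["concepts"]
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Undefined when either concept is present (or absent) for everyone.
    return None if denom == 0 else (n11 * n00 - n10 * n01) / denom


def median_days_biopsy_to_surgery(records):
    """Median interval in days, skipping participants missing either date."""
    gaps = [(r["surgery"] - r["biopsy"]).days
            for r in records if r["biopsy"] and r["surgery"]]
    return median(gaps) if gaps else None


print(concept_prevalence(cohort, "biopsy"))
print(phi_coefficient(cohort, "biopsy", "tumor_marker"))
print(median_days_biopsy_to_surgery(cohort))
```

Participants without a biopsy date are excluded from the interval calculation rather than imputed, so in a sketch like this the completeness of a concept directly limits what a temporality check can say.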

CONCLUSION: We identified multiple data completeness issues, which limited the fitness-for-use evaluation. In addition, using the OMOP CDM procedure concepts and mappings presented challenges for our study. Variable amounts of missingness in OMOP CDM surgical oncology data may affect the fitness for use of cancer data. Further research is warranted to improve the quality of these data.

PMID:40627823 | DOI:10.1200/CCI-25-00078

By Nevin Manimala