Estimating 10-Year Cardiovascular Disease Risk in Primary Prevention Using UK Electronic Health Records and a Hybrid Multitask BERT Model: Retrospective Cohort Study

JMIR Med Inform. 2025 Nov 13;13:e76659. doi: 10.2196/76659.

ABSTRACT

BACKGROUND: Cardiovascular disease (CVD) remains a leading cause of preventable morbidity and mortality, highlighting the need for early risk stratification in primary prevention. Traditional Cox models assume proportional hazards and linear effects, limiting flexibility. While machine learning offers greater expressiveness, many models rely solely on structured data and overlook time-to-event (TTE) information. Integrating structured and textual representations may enhance prediction and support equitable assessment across clinical subgroups.

OBJECTIVE: This study aims to develop a hybrid multitask deep learning model (MT-BERT [multitask Bidirectional Encoder Representations from Transformers]) integrating structured and textual features from electronic health records (EHRs) to predict 10-year CVD risk, enhancing individualized stratification and supporting equitable assessment across diverse demographic groups.

METHODS: We used data from Clinical Practice Research Datalink (CPRD) Aurum comprising 469,496 patients aged 40-85 years to develop MT-BERT for 10-year CVD risk prediction. Structured EHR variables and their corresponding textual representations were jointly encoded using a multilayer perceptron and a distilled version of the BERT model (DistilBERT), respectively. A fusion layer and stacked multihead attention modules enabled cross-modal interaction modeling. The model generated both binary classification outputs and TTE risk scores, optimized using a custom FocalCoxLoss function with uncertainty-based weighting. Prediction targets encompassed composite and individual CVD outcomes. Model performance was evaluated using the area under the receiver operating characteristic curve (AUROC), concordance index, and Brier score, with subgroup analyses by ethnicity and deprivation, and heterogeneity assessed using Higgins I² and Cochran Q statistics. Generalizability was assessed via external validation in a held-out London cohort.
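The abstract does not define FocalCoxLoss, so the following is only a minimal sketch of one plausible reading: a binary focal loss for the classification head, a negative Cox partial log-likelihood for the TTE head, and a combination using learned log-variance weights (a common form of uncertainty-based weighting). All function names, the `gamma` and `alpha` defaults, and the exact weighting formula are assumptions for illustration, not the authors' implementation.

```python
import math


def focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Mean binary focal loss; gamma and alpha defaults are assumed, not from the paper."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))
    return total / len(probs)


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cox_neg_partial_log_likelihood(risk_scores, times, events):
    """Negative Cox partial log-likelihood, averaged over observed events."""
    n_events = sum(events)
    if n_events == 0:
        return 0.0
    total = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # censored patients contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        at_risk = [r for r, t in zip(risk_scores, times) if t >= t_i]
        total += risk_scores[i] - _log_sum_exp(at_risk)
    return -total / n_events


def uncertainty_weighted(loss_cls, loss_tte, log_var_cls, log_var_tte):
    """Combine two task losses with log-variance weights (assumed weighting form).

    In a real model the log-variances would be trainable parameters.
    """
    return (math.exp(-log_var_cls) * loss_cls + log_var_cls
            + math.exp(-log_var_tte) * loss_tte + log_var_tte)
```

In this form, a task whose learned log-variance grows is down-weighted, while the additive log-variance term discourages the model from ignoring a task altogether.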

RESULTS: The MT-BERT model yielded AUROC values of 0.744 (95% CI 0.738-0.749) in males and 0.782 (95% CI 0.768-0.796) in females on the test set (n=711,052), and 0.736 (95% CI 0.729-0.741) and 0.775 (95% CI 0.768-0.780), respectively, in “spatial external” validation (n=144,370). Brier scores were 0.130 in males and 0.091 in females. Individuals classified as high-risk (≥40% risk in males and ≥34% in females) demonstrated significantly reduced 10-year event-free survival relative to lower-risk individuals (log-rank P<.001). Model performance was consistently higher in females across all metrics. Subgroup analyses revealed substantial heterogeneity across ethnicity and deprivation (I²>70%), especially among males, with lower AUROC in South Asian and Black ethnic groups. These findings reflect variation in model performance across demographic groups while supporting its applicability to large-scale CVD risk stratification.
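For readers less familiar with the heterogeneity statistics reported above, the sketch below shows how Cochran Q and Higgins I² are conventionally computed from subgroup estimates using inverse-variance weights. The function names are hypothetical, and the abstract does not report subgroup-level AUROCs or standard errors, so any inputs are illustrative only.

```python
def se_from_ci(lower, upper, z=1.96):
    """Approximate a standard error from a symmetric 95% CI."""
    return (upper - lower) / (2.0 * z)


def cochran_q_and_i2(estimates, standard_errors):
    """Return (pooled estimate, Cochran Q, Higgins I² in percent).

    Uses fixed-effect inverse-variance weights w_i = 1 / se_i^2.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * x for w, x in zip(weights, estimates)) / sum(weights)
    q = sum(w * (x - pooled) ** 2 for w, x in zip(weights, estimates))
    df = len(estimates) - 1
    # I² is the share of variation beyond what chance alone would produce.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2
```

With this definition, I² above roughly 70% (as reported for ethnicity and deprivation subgroups) indicates that most of the observed variation in subgroup performance exceeds what sampling error would explain.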

CONCLUSIONS: The proposed hybrid MT-BERT model predicts 10-year CVD risk for primary prevention by integrating structured variables and unstructured clinical text from EHRs. Its multitask design facilitates both individualized risk stratification and TTE estimation. While performance was modestly reduced in deprived and minority ethnic subgroups, these findings provide preliminary support for advancing equity-aware, data-driven prevention strategies in increasingly diverse health care settings.

PMID:41232034 | DOI:10.2196/76659

By Nevin Manimala