JCO Clin Cancer Inform. 2025 Dec;9:e2500223. doi: 10.1200/CCI-25-00223. Epub 2025 Dec 1.
ABSTRACT
PURPOSE: Tobacco use is a major risk factor for diseases such as cancer. Granular quantitative details of smoking history (eg, pack-years and years since quitting) are essential for assessing disease risk and determining eligibility for lung cancer screening (LCS). However, existing natural language processing (NLP) tools struggle to extract detailed quantitative smoking data from clinical narratives.
METHODS: We cross-validated four pretrained Bidirectional Encoder Representations from Transformers (BERT)-based models (BERT, BioBERT, ClinicalBERT, and MedBERT) by fine-tuning them on 90% of 3,261 sentences mentioning smoking history to extract six quantitative smoking history variables from clinical narratives. The model with the highest cross-validated micro-averaged F1 scores across most variables was selected as the final SmokeBERT model and further fine-tuned on the full 90% training set. Model performance was evaluated on a 10% holdout test set and on an external validation set of 3,191 sentences.
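A minimal sketch of one plausible implementation of this setup follows. The abstract does not specify the extraction framing, label scheme, checkpoint, or hyperparameters, so everything concrete below is an assumption: extraction is framed as BIO token classification over six illustrative variable names, ClinicalBERT is loaded from the public emilyalsentzer/Bio_ClinicalBERT checkpoint, and the training settings are typical fine-tuning defaults rather than the authors' values.

```python
import torch
from sklearn.model_selection import KFold
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical BIO label scheme for the six quantitative variables; the
# abstract does not publish the actual label set, so these names are
# illustrative only.
VARIABLES = ["PACK_YEARS", "PACKS_PER_DAY", "YEARS_SMOKED",
             "YEARS_SINCE_QUIT", "START_AGE", "QUIT_AGE"]
LABELS = ["O"] + [f"{p}-{v}" for v in VARIABLES for p in ("B", "I")]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}

CHECKPOINT = "emilyalsentzer/Bio_ClinicalBERT"  # public ClinicalBERT weights
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)


class SmokingSentences(torch.utils.data.Dataset):
    """Sentences mentioning smoking history, with token-level BIO labels."""

    def __init__(self, sentences, sentence_labels):
        self.enc = tokenizer(sentences, truncation=True, padding=True)
        # Toy alignment: pad or truncate each label sequence to the wordpiece
        # length. A real pipeline would align word-level tags to wordpieces
        # and mask special tokens with -100.
        self.labels = [
            (seq + [LABEL2ID["O"]] * len(ids))[: len(ids)]
            for seq, ids in zip(sentence_labels, self.enc["input_ids"])
        ]

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


def cross_validate(sentences, sentence_labels, n_splits=5):
    """Fine-tune a fresh copy of the pretrained model on each fold."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold, (train_idx, _val_idx) in enumerate(folds.split(sentences)):
        model = AutoModelForTokenClassification.from_pretrained(
            CHECKPOINT, num_labels=len(LABELS))
        train_set = SmokingSentences(
            [sentences[i] for i in train_idx],
            [sentence_labels[i] for i in train_idx])
        args = TrainingArguments(output_dir=f"smokebert-fold{fold}",
                                 num_train_epochs=3,
                                 per_device_train_batch_size=16,
                                 learning_rate=2e-5)
        Trainer(model=model, args=args, train_dataset=train_set).train()
```

Each fold fine-tunes a fresh copy of the pretrained weights so that fold scores remain independent; the winning architecture would then be refit on the full 90% training split, as described above.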
RESULTS: ClinicalBERT was selected as the final model on the basis of cross-validation and was fine-tuned on the training data to create SmokeBERT. Compared with the state-of-the-art rule-based NLP model and the 20-billion-parameter Generative Pre-trained Transformer Open Source Series model, SmokeBERT demonstrated superior performance both in smoking data extraction (overall F1 score, holdout test: 0.97 v 0.88-0.90; external validation: 0.86 v 0.72-0.79) and in identifying LCS-eligible patients (97% v 59%-97% for ≥20 pack-years and 100% v 60%-84% for ≤15 years since quitting).
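To make the reported criteria concrete, here is a short sketch of micro-averaged F1 scoring and of the two LCS eligibility thresholds named above. The function names, data layout, and example values are hypothetical, and other screening criteria (eg, age) are outside the abstract's scope.

```python
from sklearn.metrics import f1_score


def overall_micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true/false positives across all variable
    classes before computing precision and recall, so frequent classes
    dominate the score."""
    return f1_score(y_true, y_pred, average="micro")


def lcs_eligible(pack_years, years_since_quit):
    """Apply the two thresholds reported in RESULTS: >=20 pack-years and,
    for former smokers, <=15 years since quitting (None is taken to mean a
    current smoker here; that convention is an assumption)."""
    if pack_years is None or pack_years < 20:
        return False
    return years_since_quit is None or years_since_quit <= 15


# Illustrative values only: micro-F1 over per-token class predictions,
# then the eligibility rule on extracted quantities.
assert overall_micro_f1([1, 2, 0, 1], [1, 2, 1, 1]) == 0.75
assert lcs_eligible(pack_years=30, years_since_quit=10)
assert not lcs_eligible(pack_years=10, years_since_quit=0)
```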
CONCLUSION: We developed SmokeBERT, a fine-tuned BERT-based model optimized for extracting detailed quantitative smoking histories. Future work includes evaluating performance on larger clinical data sets and developing a multilingual, language-agnostic version of SmokeBERT.
PMID:41325572 | DOI:10.1200/CCI-25-00223