J Imaging Inform Med. 2026 Mar 30. doi: 10.1007/s10278-026-01914-2. Online ahead of print.
ABSTRACT
Aligning radiological features with clinical text descriptions remains a key challenge for zero-shot disease recognition in chest radiography. We propose DVLM (Dual-Head Vision-Language Model with Neural Memory), a framework that combines Vision Transformer visual encoding with ClinicalBERT-based text processing through parallel contrastive and supervised learning branches. A neural memory module stores disease-relevant patterns during training, improving generalization to unseen pathologies. We evaluated DVLM on CheXpert, MIMIC-CXR, and PadChest using multi-seed validation (five seeds × fivefold cross-validation), controlled ablation studies, and statistical significance testing. DVLM achieved 90.0% ± 0.28% macro-averaged AUROC on CheXpert (95% CI, 89.5-90.6%), with the neural memory module contributing a +3.3% improvement (statistically significant, with effect size measured by Cohen's d). For zero-shot classification (25% held-out diseases), DVLM achieved 73.5% AUROC, outperforming MedKLIP by 2.3%. Temperature scaling reduced calibration error by 72%, and Grad-CAM localization achieved an IoU of 0.642 against radiologist annotations. Subgroup analysis confirmed equitable performance across demographic groups (maximum disparity, 1.3%). While DVLM demonstrates strong ranking capability suitable for triage applications, threshold-based classification for rare diseases remains limited (F1, 24.8-30.1%), indicating the need for radiologist confirmation in clinical deployment.
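The abstract does not specify the architecture in detail; the following is a minimal PyTorch sketch of a dual-head design of the kind described, assuming a CLIP-style symmetric InfoNCE loss for the contrastive branch, a multi-label BCE head for the supervised branch, and a slot-based key-value memory with soft-attention read-out. The module names, dimensions, and memory design are illustrative assumptions, not the authors' implementation; in practice the placeholder projections would sit on top of a Vision Transformer and ClinicalBERT encoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuralMemory(nn.Module):
    """Learnable key-value memory: queries soft-attend over stored slots."""

    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # Scaled dot-product attention over memory slots; read-out is
        # added residually to the query features.
        attn = F.softmax(query @ self.keys.t() / query.size(-1) ** 0.5, dim=-1)
        return query + attn @ self.values


class DualHeadVLM(nn.Module):
    """Sketch: contrastive branch for zero-shot transfer plus a
    supervised branch for seen-label classification (assumed design)."""

    def __init__(self, img_dim=768, txt_dim=768, proj_dim=256,
                 num_labels=14, memory_slots=128):
        super().__init__()
        # Placeholders for features from the real encoders
        # (e.g. a ViT image encoder and ClinicalBERT text encoder).
        self.img_proj = nn.Linear(img_dim, proj_dim)
        self.txt_proj = nn.Linear(txt_dim, proj_dim)
        self.memory = NeuralMemory(memory_slots, proj_dim)
        self.cls_head = nn.Linear(proj_dim, num_labels)  # supervised branch
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07)

    def forward(self, img_feat, txt_feat, labels=None):
        v = F.normalize(self.memory(self.img_proj(img_feat)), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # Contrastive branch: symmetric InfoNCE over matched
        # image/report pairs within the batch.
        logits = self.logit_scale.exp() * v @ t.t()
        targets = torch.arange(v.size(0), device=v.device)
        loss_con = (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2
        # Supervised branch: multi-label BCE on the seen pathologies.
        cls_logits = self.cls_head(v)
        loss_sup = (F.binary_cross_entropy_with_logits(cls_logits, labels)
                    if labels is not None else 0.0)
        return loss_con + loss_sup, cls_logits


if __name__ == "__main__":
    model = DualHeadVLM()
    img = torch.randn(8, 768)          # stand-in for ViT features
    txt = torch.randn(8, 768)          # stand-in for ClinicalBERT features
    labels = torch.randint(0, 2, (8, 14)).float()
    loss, _ = model(img, txt, labels)
    loss.backward()
```

At test time, zero-shot scoring for an unseen pathology would come from the contrastive branch alone: embed a textual disease prompt and rank images by cosine similarity, which is consistent with the AUROC-based zero-shot evaluation reported above.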
PMID:41912958 | DOI:10.1007/s10278-026-01914-2