Zhonghua Wai Ke Za Zhi. 2026 Feb 1;64(2):182-190. doi: 10.3760/cma.j.cn112139-20250814-00402.
ABSTRACT
Objective: To explore the performance of large language models (LLMs) in diagnosing clinically significant prostate cancer (csPCa), and the improvement in diagnostic performance of an open-source LLM after low-rank adaptation (LoRA) fine-tuning. Methods: This was a retrospective case series study. Data were collected from 1 077 patients who underwent ultrasound-guided systematic prostate biopsy at the Department of Urology, Peking University Third Hospital from January 2018 to December 2024. The patients were aged (M(IQR)) 69 (13) years (range: 38 to 90 years), and 391 of them were in the gray zone (prostate-specific antigen 4 to 10 μg/L). The collected data included patients' clinical characteristics, prostate MRI reports, and biopsy histopathological results. Four LLMs (GPT 4.1, DeepSeek R1, Qwen3-235B-A22B, Qwen3-32B) were used to diagnose csPCa based on patient information, and their performance was evaluated using biopsy histopathological results as the gold standard. Subsequently, the data from the 1 077 patients were divided into training and test sets at an 8:2 ratio, and LoRA fine-tuning was performed on Qwen3-32B. The fine-tuned model was named PCD-Qwen3, and its diagnostic efficacy was evaluated in the test set. Receiver operating characteristic (ROC) curves were plotted, and the area under the curve (AUC) with 95%CI was calculated to evaluate the diagnostic performance of the LLMs. The DeLong test was used to compare AUCs between models. Results: Among all patients, DeepSeek R1 had the highest AUC for diagnosing csPCa at 0.848 (95%CI: 0.826 to 0.871), with statistically significant differences compared to Qwen3-235B-A22B (0.827 (95%CI: 0.803 to 0.851)) and Qwen3-32B (0.753 (95%CI: 0.724 to 0.781)) (Z=2.34, P=0.020; Z=7.35, P<0.01), but no significant difference compared to GPT 4.1 (0.842 (95%CI: 0.819 to 0.865)) (P>0.05). The accuracy, sensitivity, and specificity of DeepSeek R1 for diagnosing csPCa were 77.3%, 70.2%, and 84.1%, respectively.
In the gray zone population (total prostate-specific antigen 4 to 10 μg/L), DeepSeek R1 had an AUC of 0.765 (95%CI: 0.715 to 0.816) for diagnosing csPCa. In this group, using DeepSeek R1 could have avoided an unnecessary biopsy in 46.3% (181/391) of patients while missing csPCa in 5.9% (23/391). The PI-RADS scores assigned by the three LLMs other than Qwen3-32B showed moderate agreement with those of radiologists. After LoRA fine-tuning, the diagnostic performance of PCD-Qwen3 was significantly improved compared to Qwen3-32B. In the test set of 216 patients, the accuracy, sensitivity, specificity, and AUC of PCD-Qwen3 were 77.3%, 75.5%, 79.1%, and 0.831 (95%CI: 0.776 to 0.885), respectively, comparable to the performance of DeepSeek R1 (all P>0.05). Conclusions: Among the four LLMs, DeepSeek R1 had the best performance in diagnosing csPCa. After LoRA fine-tuning, PCD-Qwen3 achieved performance comparable to DeepSeek R1. LLMs demonstrated promising application value in diagnosing csPCa.
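The gray-zone percentages reported above can be checked from the stated counts, and the accuracy, sensitivity, and specificity figures follow the standard confusion-matrix definitions. The following is a minimal sketch, not the authors' code: the function names `pct` and `diagnostic_metrics` are hypothetical, and the confusion-matrix counts in the usage example are illustrative placeholders because the abstract does not report them.

```python
def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard definitions, with biopsy histopathology as the reference.

    tp/fn: csPCa cases the model called positive/negative.
    tn/fp: non-csPCa cases the model called negative/positive.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts taken from the abstract: 391 gray-zone patients.
GRAY_ZONE_N = 391
biopsies_avoided_pct = pct(181, GRAY_ZONE_N)  # reported as 46.3%
cspca_missed_pct = pct(23, GRAY_ZONE_N)       # reported as 5.9%

# Illustrative counts only (NOT from the study), to show the call shape.
example = diagnostic_metrics(tp=8, fp=2, tn=6, fn=4)
```

Both percentages use all 391 gray-zone patients as the denominator, which is why they are fractions of the whole group and not of the csPCa or non-csPCa subgroups.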
PMID:41667933 | DOI:10.3760/cma.j.cn112139-20250814-00402