BMC Urol. 2026 Jan 19. doi: 10.1186/s12894-026-02054-z. Online ahead of print.
ABSTRACT
PURPOSE: To evaluate the performance of ChatGPT-4o in estimating International Prostate Symptom Score (IPSS) and Overactive Bladder Symptom Score (OABSS) based on patients’ natural language descriptions and full outpatient records, compared to actual questionnaire scores.
MATERIALS AND METHODS: This study included 91 patients, of whom 52 completed the IPSS and 77 completed the OABSS. ChatGPT-4o was prompted with verbatim symptom statements and full medical records written by a urologist. Predicted scores were compared to actual scores using paired t-tests, weighted Cohen’s kappa for item-level agreement, Spearman’s correlation for total scores, and Bland-Altman plots for bias. Diagnostic classifications (lower urinary tract symptoms [LUTS]: IPSS ≥8; overactive bladder [OAB]: OABSS ≥3 with urgency ≥2) were assessed using McNemar’s test and receiver operating characteristic curve analysis.
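The following Python sketch illustrates how the comparison analyses named above (paired t-test, weighted Cohen’s kappa, Spearman’s correlation, Bland-Altman bias, McNemar’s test, ROC AUC) could be assembled; the data, variable names, and quadratic kappa weighting are illustrative assumptions, not the authors’ code or data.

```python
# Illustrative sketch of the score-comparison analyses described in the abstract.
# All data below are hypothetical; the weighting scheme for Cohen's kappa is assumed.
import numpy as np
from scipy.stats import ttest_rel, spearmanr
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from statsmodels.stats.contingency_tables import mcnemar

# Placeholder totals: patient-reported vs. ChatGPT-estimated IPSS totals.
actual = np.array([14, 9, 21, 6, 17, 11, 25, 8])
predicted = np.array([12, 6, 18, 7, 15, 10, 22, 9])

# Paired t-test on total scores.
t_stat, p_val = ttest_rel(actual, predicted)

# Spearman correlation between total scores.
rho, rho_p = spearmanr(actual, predicted)

# Weighted Cohen's kappa for one ordinal item (0-5 responses);
# quadratic weighting is one common choice and is assumed here.
item_actual = [3, 1, 4, 0, 2, 2, 5, 1]
item_predicted = [3, 1, 3, 1, 2, 2, 4, 1]
kappa = cohen_kappa_score(item_actual, item_predicted, weights="quadratic")

# Bland-Altman bias and 95% limits of agreement for total scores.
diff = predicted - actual
bias = diff.mean()
loa = (bias - 1.96 * diff.std(ddof=1), bias + 1.96 * diff.std(ddof=1))

# Diagnostic classification (LUTS: IPSS >= 8) compared with McNemar's test.
actual_pos = actual >= 8
pred_pos = predicted >= 8
table = [[np.sum(actual_pos & pred_pos), np.sum(actual_pos & ~pred_pos)],
         [np.sum(~actual_pos & pred_pos), np.sum(~actual_pos & ~pred_pos)]]
mcnemar_res = mcnemar(table, exact=True)

# ROC AUC of the predicted total score against the actual classification.
auc = roc_auc_score(actual_pos, predicted)

print(f"t-test p={p_val:.3f}, rho={rho:.2f}, kappa={kappa:.2f}, "
      f"bias={bias:.2f}, LoA={loa}, McNemar p={mcnemar_res.pvalue:.3f}, AUC={auc:.2f}")
```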
RESULTS: Mean IPSS scores estimated by ChatGPT-4o were statistically significantly lower than patient-reported scores (11.2 vs. 13.6, p = 0.006), whereas OABSS scores did not differ significantly between the two methods (6.99 vs. 6.86, p = 0.686). Diagnostic agreement was high: LUTS in 42 (actual) vs. 38 (GPT) patients, and OAB in 51 vs. 50 patients. The area under the curve was 0.81 for IPSS and 0.91 for OABSS. Kappa values ranged from 0.23 to 0.81 (IPSS) and from 0.44 to 0.71 (OABSS), with the highest concordance for the quality of life (QoL) and urgency incontinence items. Spearman’s correlation coefficient was 0.60 (IPSS) and 0.70 (OABSS). Accuracy was lower in first-visit patients.
CONCLUSIONS: ChatGPT-4o estimated IPSS and OABSS with moderate but clinically acceptable accuracy. Its diagnostic classification performance was comparable to that of the questionnaires, particularly for QoL and the OABSS. ChatGPT-4o may complement traditional questionnaires, particularly when patient-reported data are missing or incomplete.
PMID:41555309 | DOI:10.1186/s12894-026-02054-z