Int Urol Nephrol. 2026 May 6. doi: 10.1007/s11255-026-05178-1. Online ahead of print.
ABSTRACT
PURPOSE: This study aimed to evaluate the concordance between treatment recommendations generated by large language models (LLMs) and decisions made by a multidisciplinary uro-oncology tumor board.
METHODS: Forty-eight consecutive prostate cancer cases previously discussed at a multidisciplinary tumor board were retrospectively analyzed. For each case, treatment recommendations were generated using five LLM platforms (ChatGPT-4o, ChatGPT, Perplexity, Copilot, and DeepSeek) based on standardized clinical summaries. Four independent urology specialists evaluated the concordance between LLM recommendations and tumor board decisions using a 5-point Likert scale. Differences among models were assessed using the Friedman test followed by Bonferroni-corrected Wilcoxon signed-rank tests. Inter-rater agreement was calculated using the intraclass correlation coefficient.
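The statistical workflow described above (Friedman test followed by Bonferroni-corrected pairwise Wilcoxon signed-rank tests) can be sketched in Python with SciPy. This is an illustrative reconstruction only: the score matrix below is synthetic, and the study's actual data, rater scores, and ICC computation are not reproduced here.

```python
# Hypothetical sketch of the abstract's statistical workflow (SciPy).
# The scores below are synthetic placeholders, NOT the study's data.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Rows = 48 cases, columns = 5 LLM platforms; Likert-like scores 1-5.
scores = rng.integers(1, 6, size=(48, 5)).astype(float)

# Friedman test: do concordance scores differ across the five platforms?
chi2, p_friedman = stats.friedmanchisquare(*scores.T)

# Post hoc pairwise Wilcoxon signed-rank tests with Bonferroni correction:
# 5 platforms -> 10 pairwise comparisons -> corrected alpha = 0.05 / 10.
pairs = list(combinations(range(scores.shape[1]), 2))
alpha_corrected = 0.05 / len(pairs)
posthoc = {}
for i, j in pairs:
    stat, p = stats.wilcoxon(scores[:, i], scores[:, j],
                             zero_method="zsplit")  # keep tied (zero) diffs
    posthoc[(i, j)] = (p, p < alpha_corrected)
```

Inter-rater agreement (the ICC reported in the abstract) would be computed separately on the four raters' scores, e.g. with a two-way ICC implementation such as `pingouin.intraclass_corr`.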
RESULTS: Significant differences in concordance were observed among the evaluated LLM platforms (χ² = 32.16, p < 0.001). Perplexity and ChatGPT-4o demonstrated the highest alignment with tumor board decisions, each achieving a median Likert score of 4.75, whereas Copilot showed the lowest concordance (median 3.00). DeepSeek and ChatGPT demonstrated intermediate performance. Post hoc analyses revealed that Perplexity significantly outperformed several lower-performing platforms; however, no statistically significant difference was observed between Perplexity and ChatGPT-4o (p = 0.149). Expert evaluations showed strong inter-rater agreement (ICC = 0.82).
CONCLUSION: Large language models can demonstrate substantial concordance with multidisciplinary tumor board decisions in prostate cancer management. However, variability among models and the risk of hallucinated information indicate that LLMs should function as clinical decision-support tools under expert supervision rather than as autonomous decision-makers.
PMID:42090098 | DOI:10.1007/s11255-026-05178-1