J Orthop Surg (Hong Kong). 2025 Sep-Dec;33(3):10225536251407382. doi: 10.1177/10225536251407382. Epub 2025 Dec 7.
ABSTRACT
Background: This study aims to compare the performance of two artificial intelligence (AI) models, ChatGPT-4.0 and DeepSeek-R1, in addressing clinical questions related to degenerative lumbar spinal stenosis (DLSS), using the North American Spine Society (NASS) guidelines as the benchmark.
Methods: Fifteen clinical questions spanning five domains (diagnostic criteria, non-surgical management, surgical indications, perioperative care, and emerging controversies) were designed based on the 2013 NASS evidence-based clinical guidelines for the diagnosis and management of DLSS. Responses from both models were independently evaluated by two board-certified spine surgeons across four metrics: accuracy, completeness, supplementality, and misinformation. Inter-rater reliability was assessed using Cohen's κ coefficient, while Mann-Whitney U and Chi-square tests were used to analyze statistical differences between the models.
Results: DeepSeek-R1 demonstrated superior performance over ChatGPT-4.0 in accuracy (median score: 3 vs 2, P = 0.009), completeness (2 vs 1, P = 0.010), and supplementality (2 vs 1, P = 0.018). Both models exhibited comparable performance in avoiding misinformation (P = 0.671). DeepSeek-R1 achieved higher inter-rater agreement in accuracy (κ = 0.727 vs 0.615), whereas ChatGPT-4.0 showed stronger consistency in supplementality (κ = 0.792 vs 0.762).
Conclusions: While both AI models demonstrate potential for clinical decision support, DeepSeek-R1 aligns more closely with the NASS guidelines. ChatGPT-4.0 excels at providing supplementary insights but exhibits variability in accuracy. These findings underscore the need for domain-specific optimization of AI models to enhance reliability in medical applications.
PMID:41353581 | DOI:10.1177/10225536251407382
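The statistical analysis named in the Methods (Mann-Whitney U for ordinal score differences, Cohen's κ for inter-rater agreement, and a Chi-square test for categorical outcomes such as misinformation) could be reproduced along the lines of the minimal Python sketch below. All rating arrays and the contingency table are hypothetical placeholders chosen only to illustrate the calculations; they are not the study's data. The calls mannwhitneyu, cohen_kappa_score, and chi2_contingency are standard SciPy and scikit-learn functions assumed here to match the tests reported in the abstract.

# Illustrative sketch of the statistical comparison described in the Methods.
# All scores below are hypothetical ordinal ratings (0-3) for the 15 questions,
# NOT the study's actual data.
from scipy.stats import mannwhitneyu, chi2_contingency
from sklearn.metrics import cohen_kappa_score

# Hypothetical consensus accuracy scores per question for each model
deepseek_acc = [3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3]
chatgpt_acc  = [2, 2, 3, 2, 1, 2, 3, 2, 2, 2, 1, 3, 2, 2, 2]

# Mann-Whitney U test for the between-model difference in ordinal scores
u_stat, p_value = mannwhitneyu(deepseek_acc, chatgpt_acc, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, P = {p_value:.3f}")

# Cohen's kappa for inter-rater agreement between the two spine surgeons
# (hypothetical ratings of one model's 15 responses by rater A and rater B)
rater_a = [3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3]
rater_b = [3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.3f}")

# Chi-square test on a 2x2 contingency table of responses containing
# misinformation vs not (hypothetical counts per model)
table = [[2, 13],   # DeepSeek-R1: with misinformation, without
         [3, 12]]   # ChatGPT-4.0: with misinformation, without
chi2, p_chi, _, _ = chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, P = {p_chi:.3f}")

With real per-question ratings substituted for the placeholder arrays, the same three calls yield the kind of P values and κ coefficients quoted in the Results.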