DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study

Int J Surg. 2025 Jun 1;111(6):4056-4059. doi: 10.1097/JS9.0000000000002386. Epub 2025 Apr 3.

ABSTRACT

BACKGROUND: Large language models (LLMs) have demonstrated potential in medical diagnostics, but their accuracy in complex cases remains a subject of investigation. DeepSeek-R1, an open-source model with advanced reasoning capabilities, has gained global attention. This study evaluates the diagnostic performance of DeepSeek-R1 compared to GPT-4 in complex clinical cases.

MATERIALS AND METHODS: A historical control study was conducted using 100 clinicopathologic cases from the New England Journal of Medicine (NEJM) published between 18 August 2022 and 30 January 2025. Each case was processed using DeepSeek-R1 with a structured diagnostic prompt. The model’s performance was assessed based on final diagnosis accuracy, differential diagnosis inclusion rate, ranking of correct diagnoses, and differential quality scores. Results were statistically compared to previously published GPT-4 performance data using chi-square, Mann-Whitney U, and t-tests.

RESULTS: DeepSeek-R1 correctly matched the final diagnosis in 35% of cases (35/100), which was comparable to GPT-4’s accuracy (39%; P = 0.634). However, DeepSeek-R1 included the correct diagnosis in its differential list in 48% of cases, significantly lower than GPT-4 (64%; P = 0.036). DeepSeek-R1 generated longer differential diagnoses (11.9 ± 2.0 vs. 9.0 ± 1.4; P = 0.000004) but maintained a similar mean rank for correct diagnoses (1.8 ± 2.2 vs. 2.5 ± 2.5; P = 0.288566) and equivalent differential quality scores (4.2 ± 0.10 vs. 4.2 ± 1.3; P = 0.099667).
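
The two proportion comparisons above can be checked with a short sketch of a Pearson chi-square test on a 2x2 table. This is an illustration under stated assumptions, not the authors' analysis code. The abstract gives only percentages for GPT-4, so the GPT-4 counts used here (27/70 and 45/70) are assumed values chosen to be consistent with the reported 39% and 64%. The function name `chi_square_2x2` is hypothetical, and the test is run without a continuity correction, which the abstract does not specify.

```python
import math


def chi_square_2x2(a_success, a_total, b_success, b_total):
    """Pearson chi-square test (no continuity correction) for two proportions.

    Returns the chi-square statistic and the two-sided P value (df = 1).
    """
    observed = [
        [a_success, a_total - a_success],
        [b_success, b_total - b_success],
    ]
    grand_total = a_total + b_total
    row_totals = [a_total, b_total]
    col_totals = [a_success + b_success, grand_total - a_success - b_success]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# DeepSeek-R1 counts come from the abstract (35/100 and 48/100).
# GPT-4 counts are ASSUMED (27/70 and 45/70); only 39% and 64% are reported.
_, p_final = chi_square_2x2(35, 100, 27, 70)
_, p_differential = chi_square_2x2(48, 100, 45, 70)

print(f"final diagnosis: P = {p_final:.3f}")
print(f"differential inclusion: P = {p_differential:.3f}")
```

Under these assumed counts, the sketch gives P values of roughly 0.634 and 0.036, in line with the values reported above. A different GPT-4 denominator or a continuity correction would shift both results.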

CONCLUSION: DeepSeek-R1 exhibits diagnostic accuracy comparable to GPT-4 while generating more diverse differential diagnoses. Its open-source nature and innovative reasoning strategies may enhance medical AI applications. Future studies should explore real-world clinical integration and refinement of differential diagnosis prioritization.

PMID:40505040 | DOI:10.1097/JS9.0000000000002386

By Nevin Manimala