
ChatGPT performance in answering medical residency questions in nephrology: a pilot study in Brazil

J Bras Nefrol. 2025 Oct-Dec;47(4):e20240254. doi: 10.1590/2175-8239-JBN-2024-0254en.

ABSTRACT

OBJECTIVE: This study evaluated the performance of ChatGPT-4 and ChatGPT-3.5 in answering nephrology questions from medical residency exams in Brazil.

METHODS: A total of 411 multiple-choice questions, with and without images, were analyzed and grouped into four main themes: chronic kidney disease (CKD), hydroelectrolytic and acid-base disorders (HABD), tubulointerstitial diseases (TID), and glomerular diseases (GD). Questions containing images were answered only by ChatGPT-4. Statistical analysis was performed using the chi-square test.
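For readers unfamiliar with the method, the sketch below shows how a chi-square comparison of two models' accuracy rates can be computed in Python with scipy.stats.chi2_contingency. The correct/incorrect counts are hypothetical, chosen only to roughly match the reported percentages; they are not taken from the study.

    # Minimal sketch of a chi-square test comparing two accuracy rates.
    # Counts are ILLUSTRATIVE approximations of the reported percentages
    # (79.80% for ChatGPT-4, 56.29% for ChatGPT-3.5), not study data.
    from scipy.stats import chi2_contingency

    # Rows: ChatGPT-4, ChatGPT-3.5; columns: correct, incorrect.
    observed = [
        [328, 83],   # ChatGPT-4:   ~79.8% of 411 questions
        [219, 170],  # ChatGPT-3.5: ~56.3% of the text-only questions
    ]

    chi2, p, dof, expected = chi2_contingency(observed)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2e}")

With counts of this magnitude, the test yields p < 0.001, consistent with the difference the abstract reports as statistically significant.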

RESULTS: ChatGPT-4 achieved an overall accuracy of 79.80%, while ChatGPT-3.5 achieved 56.29%, a statistically significant difference (p < 0.001). Across the main themes, ChatGPT-4 outperformed ChatGPT-3.5 in HABD (79.11% vs. 55.17%), TID (88.23% vs. 52.23%), CKD (75.51% vs. 61.95%), and GD (79.31% vs. 55.29%), all with p < 0.001. ChatGPT-4 achieved 81.49% accuracy on questions without images and 54.54% on questions with images, including 60% accuracy on electrocardiogram analysis. This study is limited by the small number of image-based questions and the use of outdated examination items, which reduce its ability to assess visual diagnostic skills and current clinical relevance. Furthermore, covering only four areas of nephrology may not fully represent the breadth of nephrology practice.

CONCLUSION: ChatGPT-3.5 showed limitations in nephrology reasoning compared with ChatGPT-4, revealing gaps in its knowledge. The study suggests that further exploration of other nephrology themes is needed to improve the use of these AI tools.

PMID:40623208 | DOI:10.1590/2175-8239-JBN-2024-0254en
