Medicine (Baltimore). 2025 Nov 14;104(46):e45928. doi: 10.1097/MD.0000000000045928.
ABSTRACT
Chat Generative Pre-trained Transformer (ChatGPT), a large language model developed by OpenAI, has shown potential in healthcare communication and patient education. However, its performance in specialized medical domains, such as pituitary adenomas (PAs), remains unclear. Therefore, this study aimed to evaluate the reliability and consistency of ChatGPT in answering PA-related questions. We hypothesized that ChatGPT would demonstrate high reliability in responding to general patient-oriented queries but lower reliability for specialized clinical questions. A total of 256 PA-related questions were collected from patients and families, clinical practice guidelines, and medical question banks. Each question was input into ChatGPT (GPT-4, March 2025 version), and the generated responses were independently reviewed by 2 senior neurosurgeons. Any discrepancies in their assessments were resolved by a third neurosurgeon with over 30 years of clinical experience. Responses were categorized as completely correct, partially correct but usable, partially correct, or incorrect. Responses rated as completely correct or partially correct but usable were considered reliable. Consistency was assessed based on the stability of response quality across similar question types. Responses were compared by question type (general vs professional) and by question source using univariate analysis. Among the 256 responses, 143 (55.8%) were completely correct, 68 (26.6%) were partially correct but usable, 19 (7.4%) were partially correct, and 26 (10.2%) were incorrect. Overall, 82.4% of the responses were considered reliable, and 68.4% demonstrated consistency. Reliability was significantly higher for general questions than for professional ones (95.0% vs 78.6%, OR = 5.182, 95% CI: 1.545-17.378, P = .003), and for guideline-derived questions than for question-bank-derived ones (100.0% vs 75.7%, OR = 1.321, 95% CI: 1.214-1.437, P = .017). Differences in consistency across subgroups were not statistically significant. ChatGPT exhibits high reliability and moderate consistency in answering PA-related questions, especially for general and guideline-based content. It may serve as a supplementary source of patient information but should not replace professional medical consultation, particularly in complex or surgical contexts. As this study was conducted in an artificial testing environment without validation in real patient consultations, the generalizability of the findings remains limited.
PMID:41239728 | DOI:10.1097/MD.0000000000045928
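As an illustrative check, the odds ratio reported for general versus professional questions is consistent with the reliability percentages given in the abstract; the underlying group counts are not reported, so the calculation below is only an approximation from the rounded proportions, not a re-analysis of the study data.

% Approximate odds ratio from the reported reliability proportions
% (general questions: 95.0% reliable; professional questions: 78.6% reliable).
% The exact reported value (OR = 5.182) depends on the raw counts, which the abstract omits.
\[
\mathrm{OR} \approx \frac{p_{\text{general}}/(1 - p_{\text{general}})}{p_{\text{professional}}/(1 - p_{\text{professional}})}
            = \frac{0.950/0.050}{0.786/0.214}
            \approx \frac{19.0}{3.67}
            \approx 5.2
\]

This approximation agrees with the reported OR of 5.182 to within rounding of the percentages.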