Eur Arch Otorhinolaryngol. 2025 Nov 1. doi: 10.1007/s00405-025-09785-z. Online ahead of print.
ABSTRACT
PURPOSE: This study aimed to compare, as evaluated by ten super-experts, responses to a 10-question survey on obstructive sleep apnea (OSA) management provided by three artificial intelligence chatbots (ChatGPT-3.5, ChatGPT-4.0, and Gemini) and by a panel of 100 otolaryngologists specialized in sleep medicine.
METHODS: A 10-question survey regarding OSA management was answered by ChatGPT-3.5, ChatGPT-4.0, Gemini, and a panel of 100 otolaryngologists. The responses were assessed by ten super-experts in sleep medicine for their agreement with expert consensus, using a Likert scale. Statistical analyses were performed to evaluate the level of agreement and significance.
RESULTS: Expert consensus had the highest mean score (4.5 ± 0.9), significantly outperforming all AI models. ChatGPT-3.5 performed best among the AI systems, with a score of 4.1 ± 1.2 (p=0.003), followed by ChatGPT-4.0 with 3.9 ± 1.4 (p<0.001) and Gemini with 3.6 ± 1.5 (p<0.001). The AI models achieved perfect agreement with expert consensus in specific scenarios, particularly regarding indications for bariatric surgery and lateral pharyngoplasty. However, significant differences emerged in complex clinical scenarios requiring integration of multiple factors, particularly in therapeutic management questions, where the performance of the AI models fell significantly below that of expert consensus (p<0.01).
CONCLUSIONS: Although AI models are promising in the management of OSA, especially for well-defined clinical scenarios, they currently serve best as complementary tools rather than replacements for expert clinical judgment. Notably, ChatGPT-3.5 outperformed its newer versions in many respects, indicating that updates improving general capabilities may not always translate into better performance in specialized medical domains. These findings emphasize the potential of AI as a supportive resource while underscoring the continuing need for human expertise in complex clinical decision-making.
PMID:41176557 | DOI:10.1007/s00405-025-09785-z