Eur Arch Otorhinolaryngol. 2025 Sep 18. doi: 10.1007/s00405-025-09656-7. Online ahead of print.
ABSTRACT
OBJECTIVES: This study aimed to compare the ability of two major language models, ChatGPT-4.0 and Gemini 1.5 Flash, to establish a research methodology based on scientific publications in laryngology.
METHODS: Of 80 articles screened from five prestigious otolaryngology journals, 60 with a methods section and statistical analysis were included and classified into six research types: cell culture, animal experiments, prospective, retrospective, systematic review, and artificial intelligence. Thirty studies were then analyzed, with five articles randomly selected from each type. For each article, both language models were prompted to produce a research methodology, and the responses were evaluated by two independent raters.
RESULTS: There was no statistically significant difference between the mean scores of the two models (p > 0.05). ChatGPT-4.0 achieved a higher mean score (5.17 ± 1.12), particularly in the data collection and measurement-assessment category, whereas Gemini showed relatively more balanced performance in the statistical analysis category. Weighted kappa values ranged from 0.54 to 0.71, indicating moderate to high inter-rater agreement. In the analysis by article type, Gemini's performance on Q1 varied significantly (p = 0.038).
CONCLUSION: Large language models such as ChatGPT and Gemini yield similarly consistent results when establishing the methodology of scientific studies in laryngology. Both can be considered supportive tools, but expert supervision remains necessary, especially for complex components such as statistical analysis. This study makes an original contribution to the literature on the usability of LLMs for study design in laryngology.
PMID:40968205 | DOI:10.1007/s00405-025-09656-7