J Oral Rehabil. 2025 Feb 6. doi: 10.1111/joor.13948. Online ahead of print.
ABSTRACT
BACKGROUND: Artificial Intelligence (AI) has been widely used in health research, but the effectiveness of large language models (LLMs) in providing accurate information on bruxism has not yet been evaluated.
OBJECTIVES: To assess the readability, accuracy and consistency of three LLMs in responding to frequently asked questions about bruxism.
METHODS: This cross-sectional observational study utilised the Google Trends tool to identify the 10 most frequently searched topics about bruxism. Thirty frequently asked questions were selected and submitted to ChatGPT-3.5, ChatGPT-4 and Gemini at two different times (T1 and T2). Readability was measured using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKG) metrics. Responses were evaluated for accuracy on a three-point scale, and consistency was verified by comparing responses between T1 and T2. Statistical analysis included ANOVA, chi-squared tests and Cohen's kappa coefficient, with significance set at p < 0.05.
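For reference, the two readability formulas named above are standard and can be computed as below. This is a minimal sketch, not the study's own scoring pipeline; the vowel-group syllable counter is a crude heuristic, whereas published work typically uses validated readability software.

import re

def count_syllables(word):
    # Crude heuristic: count runs of vowels, with a floor of 1 per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_metrics(text):
    # FRE = 206.835 - 1.015*(words/sentence) - 84.6*(syllables/word)
    # FKG = 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkg = 0.39 * wps + 11.8 * spw - 15.59
    return round(fre, 1), round(fkg, 1)

# Hypothetical usage on an LLM response about bruxism:
print(flesch_metrics("Bruxism is the grinding or clenching of teeth. It often occurs during sleep."))

Higher FRE scores indicate easier text; FKG expresses the same difficulty as a US school grade level, which is why a lower FKG (as reported for Gemini below) implies broader accessibility.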
RESULTS: In terms of readability, there was no significant difference among the models in FRE scores, but the Gemini model showed lower FKG scores than the Generative Pre-trained Transformer (GPT)-3.5 and GPT-4 models. The average accuracy of the responses was 68.33% for GPT-3.5, 65% for GPT-4 and 55% for Gemini, with no significant differences between the models (p = 0.290). Agreement between T1 and T2 was substantial for all three models, with the highest consistency in GPT-3.5 (95%).
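The "substantial" label conventionally follows the Landis-Koch bands for Cohen's kappa (0.61-0.80 substantial). A hedged sketch of the T1-vs-T2 kappa computation, using hypothetical ratings on the three-point accuracy scale described in the methods:

from collections import Counter

def cohens_kappa(t1, t2):
    # kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    # and p_e is the agreement expected by chance from marginal totals.
    n = len(t1)
    p_o = sum(a == b for a, b in zip(t1, t2)) / n
    c1, c2 = Counter(t1), Counter(t2)
    p_e = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical data: 10 responses rated 1-3 at T1 and again at T2.
print(cohens_kappa([3, 3, 2, 3, 1, 2, 3, 3, 2, 3],
                   [3, 3, 2, 3, 2, 2, 3, 3, 2, 3]))  # ~0.81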
CONCLUSION: Gemini's lower reading-grade level suggests its responses may be accessible to a broader patient population. All three LLMs demonstrated substantial consistency but only moderate accuracy, indicating that these tools should not replace professional dental guidance.
PMID:39912320 | DOI:10.1111/joor.13948