ScientificWorldJournal. 2026 Mar 3;2026:5479774. doi: 10.1155/tswj/5479774. eCollection 2026.
ABSTRACT
OBJECTIVE: This study aimed to evaluate and compare the scientific reliability of three large language models (LLMs), Perplexity, iASK, and ChatGPT 4o mini, based on their responses to orthodontic-related queries.
MATERIALS AND METHODS: The three LLMs were prompted with 10 clinical orthodontic questions, and their responses were assessed independently by two evaluators using a structured scoring system (0-10). Statistical analyses, including Pearson and Spearman correlations, Cronbach's alpha, and the Wilcoxon signed-rank test, were performed to determine inter-evaluator reliability and differences in model performance.
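The analyses named above can be sketched in Python with scipy and numpy. The score arrays below are hypothetical placeholders (the study's actual evaluator scores are not given in the abstract), and Cronbach's alpha is computed directly from its definition, treating the two evaluators as "items":

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, wilcoxon

# Hypothetical 0-10 scores from two evaluators for the same 10 responses
eval1 = np.array([7, 5, 8, 6, 7, 4, 9, 5, 6, 8], dtype=float)
eval2 = np.array([6, 5, 8, 7, 7, 5, 9, 4, 6, 8], dtype=float)

# Inter-evaluator agreement
r, _ = pearsonr(eval1, eval2)      # Pearson r
rho, _ = spearmanr(eval1, eval2)   # Spearman rho

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
items = np.vstack([eval1, eval2])              # shape: (k items, n responses)
k = items.shape[0]
item_vars = items.var(axis=1, ddof=1).sum()
total_var = items.sum(axis=0).var(ddof=1)
alpha = k / (k - 1) * (1 - item_vars / total_var)

# Paired comparison of two models' scores (hypothetical per-question means)
model_a = np.array([8, 7, 7, 6, 8, 7, 9, 6, 7, 7], dtype=float)
model_b = np.array([5, 6, 5, 4, 6, 5, 7, 4, 5, 5], dtype=float)
stat, p = wilcoxon(model_a, model_b)
```

With only two evaluators, Cronbach's alpha here reduces to a function of their covariance; the Wilcoxon signed-rank test is the appropriate paired nonparametric choice given small samples of ordinal scores.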
RESULTS: Perplexity achieved the highest mean score (7.2), followed by iASK (5.4) and ChatGPT 4o mini (5.2). High consistency between evaluators was observed (Cronbach's alpha = 0.947). A significant difference was noted between Perplexity and both ChatGPT 4o mini and iASK (p = 0.002). Pearson and Spearman correlations indicated strong agreement between evaluators (r = 0.982, ρ = 1.000).
CONCLUSION: Perplexity demonstrated superior performance in orthodontic-related queries compared to ChatGPT 4o mini and iASK. The findings highlight the importance of evaluating AI models for clinical applicability and reliability.
PMID:41789391 | PMC:PMC12957766 | DOI:10.1155/tswj/5479774