Nevin Manimala Statistics

Performance of large language models conducting systematic review tasks in prosthodontics

J Prosthet Dent. 2026 Mar 12:S0022-3913(26)00090-9. doi: 10.1016/j.prosdent.2026.02.009. Online ahead of print.

ABSTRACT

STATEMENT OF PROBLEM: Systematic reviews (SRs) are time-consuming and resource-intensive. Whether large language models (LLMs) can streamline the process is unclear.

PURPOSE: The purpose of this study was to evaluate the accuracy and reliability of 4 LLMs (GPT-4, Gemini, Claude, and Elicit) in performing SR tasks (full-text screening, data extraction, and risk of bias assessment) at 3 sequential time points (0, 15, and 30 days).

MATERIAL AND METHODS: A comprehensive systematic search of 5 databases was conducted in December 2024; 59 articles were evaluated for screening (2 used for a pilot) and 31 for data extraction (2 used for a pilot). A 3-pronged prompting strategy was used: persona-based initialization, few-shot learning, and structured population, intervention, control, outcome (PICO) criteria. Performance was assessed through 3 repeated evaluations at 2-week intervals. Screening accuracy and reliability were measured against expert assessments using standard metrics (accuracy, precision, F1-score, sensitivity, and specificity), data extraction quality was rated on a 0-to-5 scale, and risk of bias agreement was quantified with the Cohen kappa. Statistical analysis used the Kruskal-Wallis and Dunn post hoc tests (α=.05).
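The screening metrics and the Cohen kappa named above are standard quantities derived from a confusion matrix and paired rater labels. A minimal sketch, using invented counts and labels (not the study's data), shows how they are typically computed:

```python
# Hypothetical illustration of the metrics named in the abstract:
# accuracy, precision, sensitivity (recall), specificity, F1-score,
# and Cohen's kappa for agreement between two raters. All numbers are invented.

def screening_metrics(tp, fp, tn, fn):
    """Confusion-matrix metrics for binary include/exclude screening decisions."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # recall: share of truly included articles caught
    specificity = tn / (tn + fp)      # share of truly excluded articles caught
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, specificity, f1

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on binary labels (0/1)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_yes = (sum(rater_a) / n) * (sum(rater_b) / n)
    p_no = (1 - sum(rater_a) / n) * (1 - sum(rater_b) / n)
    expected = p_yes + p_no
    return (observed - expected) / (1 - expected)

# Invented example: 27 true inclusions, 4 false inclusions,
# 22 true exclusions, 1 missed inclusion.
acc, prec, sens, spec, f1 = screening_metrics(tp=27, fp=4, tn=22, fn=1)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f} F1={f1:.2f}")
```

A kappa near 1 indicates near-perfect agreement with the expert, while a value near 0 indicates agreement no better than chance; the abstract's 55% to 90% figures are raw percentage agreement across risk of bias criteria, a related but distinct measure.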

RESULTS: In full-text screening, Claude achieved the highest sensitivity at 97%, while Claude and Elicit both showed strong overall performance with 86% accuracy and 87% F1-scores. All models maintained sensitivity above 90%. For data extraction, GPT-4 consistently performed best with median scores of 5.0, while Claude and Gemini showed similar capabilities. Significant differences appeared only in labeling and modeling tasks during Week 1 (P=.04). Risk of bias assessment agreement with experts varied from 55% to 90% across criteria.

CONCLUSIONS: LLMs show potential for SR efficiency (especially for data extraction) but require human oversight because of variable performance across models and tasks.

PMID:41826091 | DOI:10.1016/j.prosdent.2026.02.009

By Nevin Manimala
