
Performance of AI Tools in Citing Retracted Literature: Content Analysis

J Med Internet Res. 2026 May 1;28:e88766. doi: 10.2196/88766.

ABSTRACT

BACKGROUND: Generative artificial intelligence (GenAI) tools are increasingly used in scientific research to support literature searches, evidence synthesis, and manuscript preparation. While these systems promise substantial efficiency gains, concerns have emerged regarding their reliability, particularly their tendency to cite inaccurate, fabricated, or retracted literature. The unrecognized inclusion of retracted studies poses a serious risk to research integrity and evidence-based decision-making. Whether commonly used GenAI tools can reliably detect, exclude, or transparently communicate the retraction status of scientific publications remains unclear.

OBJECTIVE: This study aimed to evaluate the ability of freely available GenAI tools to correctly handle retracted scientific articles during literature searches. Primary and secondary outcomes focused on accuracy, reliability, and consistency in recognizing retracted literature.

METHODS: In this pragmatic trial, 9 widely used free-access GenAI tools (ChatGPT 4, ChatGPT 5, Claude, Gemini, Perplexity, Microsoft Copilot, SciSpace, ScienceOS, and Consensus) were evaluated. Each tool was asked 5 predefined, standardized questions addressing topic overview, article identification, article summarization, and explicit assessment of retraction status. Overall, 15 retracted articles (the 10 most cited and the 5 most recently retracted as of May 23, 2025) were selected from the Retraction Watch database. All questions were repeated twice to assess intratool consistency. Responses were independently rated as correct or incorrect by 2 researchers. Performance was summarized with descriptive statistics, and general-purpose and research-focused AI tools were compared descriptively. Interreviewer agreement was assessed using the Cohen kappa coefficient.
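As context for the agreement measure named above: Cohen kappa corrects raw percent agreement between 2 raters for the agreement expected by chance. A minimal sketch of the computation follows; the ratings shown are hypothetical, since the study's actual rating data are not reported in the abstract.

```python
# Illustrative sketch of Cohen's kappa for two raters' correct/incorrect
# judgments. The rating lists below are hypothetical examples, not the
# study's data.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical correct/incorrect ratings for 10 responses
a = ["correct", "correct", "incorrect", "correct", "incorrect",
     "correct", "correct", "incorrect", "correct", "correct"]
b = ["correct", "correct", "incorrect", "incorrect", "incorrect",
     "correct", "correct", "incorrect", "correct", "correct"]
print(round(cohens_kappa(a, b), 3))  # prints 0.783
```

Values near 1 indicate near-perfect agreement beyond chance; values near 0 indicate agreement no better than chance.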

RESULTS: None of the evaluated AI tools consistently handled retracted articles correctly, and no model achieved perfect accuracy across all question sets. ChatGPT 5 performed best on the primary outcome, defined as producing fully correct responses to all 5 predefined tasks (5/5) for the highest number of retracted articles: it answered all 5 questions correctly for 8 of 15 articles (53.3%). The research-focused tools (SciSpace, ScienceOS, and Consensus) did not produce a single fully correct response set. Retracted articles were frequently included in topic overviews without warning, with error rates exceeding 40% for several tools. When asked specifically about retraction status, most systems failed to provide correct or complete information. OpenEvidence, which covers only health care literature, reported data for only a subset of the retracted articles; it demonstrated strong performance in topic overviews but low accuracy in identifying retracted articles.

CONCLUSIONS: Freely available GenAI tools are currently not able to detect, exclude, or appropriately flag retracted scientific literature. The widespread and confident reproduction of retracted studies represents a substantial threat to research integrity, particularly in medical and evidence-based fields. Until retraction-aware verification mechanisms are systematically integrated, independent source checking remains essential when using AI-assisted literature tools.

PMID:42066286 | DOI:10.2196/88766

By Nevin Manimala