Performance of Large Language Models in the Cognitive Analysis of Misinformation: Evaluation Study

JMIR Infodemiology. 2026 May 18;6:e72524. doi: 10.2196/72524.

ABSTRACT

BACKGROUND: The rapid spread of misinformation on social media platforms significantly affects public discourse. Human moderators can perform well, but human moderation is difficult to scale. Although large language models (LLMs) show great potential across various language tasks, their capacity for the cognitive and contextual analysis needed to detect and interpret misinformation has not been thoroughly evaluated.

OBJECTIVE: This study evaluates the effectiveness of LLMs in detecting and interpreting misinformation compared to human annotators, focusing on tasks requiring cognitive analysis and complex judgment. Additionally, we analyze the influence of different prompt engineering strategies on model performance and discuss ethical considerations for using LLMs in content moderation systems.

METHODS: We evaluated 4 OpenAI models against a panel of human annotators using a subset of posts from the MuMiN dataset. Each model and human annotator responded to structured questions on misinformation, following an established cognitive framework. Both human annotators and LLMs also provided scores indicating how confident they were in their responses. We tested several prompting strategies, including 0-shot, few-shot, and chain-of-thought prompting, and evaluated performance using precision, recall, F₁-score, and accuracy. We used statistical tests, including the McNemar test, to quantitatively assess differences between LLM and human ratings of misinformation.
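
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how precision, recall, F₁-score, accuracy, and a McNemar test can be computed for two raters judged against the same reference labels. It is not the authors' code: the labels are invented for illustration, the function names are hypothetical, and the exact (binomial) form of the McNemar test is an assumption, since the abstract does not state which variant was used.

```python
from math import comb

def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy for binary labels (1 = misinformation)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on the discordant pairs of two raters.

    b = items rater A got right and rater B got wrong; c = the reverse.
    Under the null hypothesis, each discordant item is equally likely
    to fall in b or c, so the smaller count follows Binomial(b + c, 0.5).
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no disagreement, nothing to test
    k = min(b, c)
    p_value = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)
    return b, c, p_value

# Invented labels for illustration only (not data from the study).
reference = [1, 1, 1, 0, 0, 1, 0, 1]
human     = [1, 1, 0, 0, 1, 1, 0, 1]
model     = [1, 0, 0, 0, 1, 1, 1, 1]

human_scores = binary_metrics(reference, human)
model_scores = binary_metrics(reference, model)
b, c, p_value = mcnemar_exact(reference, human, model)
print(human_scores, model_scores, (b, c, p_value))
```

Because the McNemar test looks only at items where the two raters disagree in correctness, it suits paired comparisons such as an LLM and a human panel rating the same posts.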

RESULTS: GPT-4 Turbo with chain-of-thought prompting achieved the highest performance of all LLMs for detecting misinformation, with an accuracy of 67.2% and an F₁-score of 78.3%, but was outperformed by human annotators, who achieved 70.1% accuracy and an F₁-score of 81%. LLMs performed well in tasks involving logical reasoning and straightforward misinformation detection, but struggled with complex judgments, including detecting sarcasm, understanding misinformation, and analyzing user intent. LLM confidence scores positively correlated with accuracy in simpler tasks (r=0.72, P<.01) but were less reliable in subjective and complex contextual evaluations.
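
As an illustration of the kind of statistic reported above, the sketch below computes a Pearson correlation between self-reported confidence and per-item correctness (coded 0 or 1). This is an assumption about one plausible way to relate confidence to accuracy; the abstract does not describe how r was calculated, and the values here are invented.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)

# Invented values for illustration only (not data from the study).
confidence = [0.90, 0.80, 0.95, 0.40, 0.50, 0.30]  # self-reported confidence
correct    = [1, 1, 1, 0, 1, 0]                    # 1 = response was correct

r = pearson_r(confidence, correct)
print(round(r, 2))
```

A positive r here means that higher stated confidence tends to accompany correct answers; it does not by itself show that the confidence scores are well calibrated.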

CONCLUSIONS: LLMs show significant potential for automating misinformation detection, but their limitations in understanding and interpreting social media posts highlight the continued need for human oversight. A hybrid framework combining LLMs for preliminary screening with human moderators for more complex evaluation presents a promising future direction. Future research could prioritize the fine-tuning of LLMs using datasets that emphasize cognitive and emotional linguistic features, alongside the development of advanced prompting techniques.
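
One way to picture the hybrid framework described above is a simple triage step, sketched below under stated assumptions. The `Screened` record, the `route` function, and the 0.8 threshold are all hypothetical and do not come from the study. Because the results indicate that LLM confidence was less reliable for subjective judgments, a real system would likely need routing criteria beyond a single confidence cutoff.

```python
from dataclasses import dataclass

@dataclass
class Screened:
    post_id: str       # hypothetical identifier for a post
    label: str         # LLM's preliminary label
    confidence: float  # LLM's self-reported confidence, 0.0 to 1.0

def route(items, threshold=0.8):
    """Split LLM-screened posts into auto-handled and human-review queues.

    The threshold is an illustrative assumption, not a recommended value.
    """
    auto_handled, human_review = [], []
    for item in items:
        if item.confidence >= threshold:
            auto_handled.append(item)
        else:
            human_review.append(item)
    return auto_handled, human_review

# Invented examples for illustration only.
screened = [
    Screened("p1", "misinformation", 0.93),
    Screened("p2", "not_misinformation", 0.55),
    Screened("p3", "misinformation", 0.80),
]
auto_handled, human_review = route(screened)
print([s.post_id for s in auto_handled], [s.post_id for s in human_review])
```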

PMID:42149639 | DOI:10.2196/72524

By Nevin Manimala