Diagnostic Performance of Contemporary Large Language Models on Free-Text Histopathologic Descriptions in Oral and Maxillofacial Pathology

Head Neck Pathol. 2026 Jun 11;20(1):65. doi: 10.1007/s12105-026-01929-9.

ABSTRACT

PURPOSE: To benchmark the diagnostic accuracy and inter-model agreement of contemporary large language models (LLMs) on text-only histopathologic narrative descriptions in oral and maxillofacial pathology (OMFP), and to assess how performance varies by diagnostic dependency and diagnostic category.

METHODS: In this retrospective diagnostic accuracy study, 171 de-identified OMFP cases were screened, and 155 were included after predefined selection criteria were applied. Each case comprised an edited histopathologic narrative and a single expert-verified reference diagnosis. Three general-purpose LLMs (ChatGPT-5.0, Gemini-2.5-Pro, Claude-Opus-4.1) were queried once per case using an identical zero-shot prompt requesting a single definitive diagnosis. The primary outcome was case-level accuracy; paired comparisons used McNemar’s test, and inter-model agreement was assessed with Cohen’s κ. Performance was stratified by histology-sufficient (HS) versus correlation-dependent (CD) lesions and by OMFP diagnostic category.
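
The paired statistics named in the methods can be illustrated with a minimal Python sketch. It is not the authors' analysis code: the function names and the ten-case toy vectors are hypothetical, the exact (binomial) form of McNemar's test is used although the abstract does not say which variant was applied, and κ is computed here on per-case correct/incorrect flags, which is also an assumption.

```python
from math import comb

def accuracy(correct):
    """Share of cases answered correctly (flags are 1 = correct, 0 = incorrect)."""
    return sum(correct) / len(correct)

def mcnemar_exact(a, b):
    """Two-sided exact McNemar p-value for two models scored on the same cases."""
    # Only discordant pairs (one model right, the other wrong) carry information.
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical toy data, NOT taken from the study.
model_a = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

print(accuracy(model_a), accuracy(model_b))
print(mcnemar_exact(model_a, model_b))
print(cohen_kappa(model_a, model_b))
```

In this toy example the two models disagree on three cases, all in favour of the first model, which is too few discordant pairs for the exact test to reach conventional significance.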

RESULTS: Overall accuracies were 83.2% (ChatGPT-5.0), 77.4% (Gemini-2.5-Pro), and 72.3% (Claude-Opus-4.1). ChatGPT-5.0 significantly outperformed Claude-Opus-4.1, whereas the differences between ChatGPT-5.0 and Gemini-2.5-Pro and between Gemini-2.5-Pro and Claude-Opus-4.1 were not statistically significant. Agreement was highest between Gemini-2.5-Pro and Claude-Opus-4.1 (κ = 0.73). All models showed higher accuracy for HS than CD lesions, with best performance in malignant, hematolymphoid, and immune-mediated categories.
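
As a quick arithmetic check, the reported percentages can be converted back to case counts. The sketch below assumes that each accuracy was computed over all 155 included cases and rounded to one decimal place; the abstract does not state the counts, and the helper name `counts_matching` is hypothetical.

```python
def counts_matching(pct, n, decimals=1):
    """Return every correct-case count k whose k/n rounds to the reported percentage."""
    return [
        k for k in range(n + 1)
        if abs(round(100 * k / n, decimals) - pct) < 1e-9
    ]

N_CASES = 155  # included cases, from the methods

# Reported overall accuracies, in percent.
reported = {
    "ChatGPT-5.0": 83.2,
    "Gemini-2.5-Pro": 77.4,
    "Claude-Opus-4.1": 72.3,
}

for model, pct in reported.items():
    print(model, counts_matching(pct, N_CASES))
```

Under those assumptions each percentage corresponds to a single count: 129, 120, and 112 correct cases out of 155, respectively.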

CONCLUSION: Contemporary LLMs can interpret OMFP histopathologic narratives with moderate overall diagnostic accuracy, particularly for lesions with distinctive microscopic features. Performance declined for CD entities, indicating persistent reliance on clinicoradiologic context. Although LLMs are not suitable as stand-alone diagnosticians, these findings provide a controlled benchmark of text-based diagnostic performance and suggest potential supportive value in OMFP education and exploratory pathology informatics applications under expert supervision.

PMID:42274902 | DOI:10.1007/s12105-026-01929-9

By Nevin Manimala