Cutan Ocul Toxicol. 2026 Jun 24:1-10. doi: 10.1080/15569527.2026.2692370. Online ahead of print.
ABSTRACT
BACKGROUND: Multimodal Large Language Models (LLMs) are increasingly positioned as diagnostic assistants in dermatology. However, current research often relies on clear-cut cases, leaving their performance in clinically ambiguous, gray-zone scenarios insufficiently explored. Specifically, it remains unknown whether integrating visual data helps LLMs correct initial human misdiagnoses or instead reinforces cognitive biases.
OBJECTIVES: To evaluate the diagnostic accuracy of three recent multimodal LLMs, all queried through their default web interfaces on 5 February 2026 using standardized single-turn prompts, in biopsy-confirmed dermatologic cases initially misdiagnosed by clinicians, and to assess the impact of visual integration on human error replication rates.
METHODS: A cross-sectional analysis was conducted on 30 diagnostic dilemmas confirmed by histopathology. Models were queried using a two-stage protocol: (1) Text-Only and (2) Multimodal. Primary outcomes were Top-1 accuracy, visual gain, and the rate of replicating the clinician’s initial error.
RESULTS: Gemini 3 achieved the highest multimodal Top-1 accuracy at 60.0% (18/30), followed by ChatGPT-5.2 at 56.7% (17/30) and Claude 4.5 Sonnet at 33.3% (10/30). In the inflammatory subgroup, Gemini 3's accuracy increased from 45.5% (5/11) in text-only mode to 72.7% (8/11) in multimodal mode; this difference was not statistically significant (McNemar’s test, p = 0.248). All models showed limited accuracy for malignant lesions using macro-images.
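As an illustration of the paired comparison reported above, the following minimal Python sketch computes McNemar's test from discordant-pair counts. The abstract does not report those counts, so the values used here (3 cases correct only in multimodal mode, 0 correct only in text-only mode) are an assumption: they are the simplest split consistent with a change from 5/11 to 8/11. The function names are hypothetical, and whether the authors used the continuity-corrected or the exact form is not stated.

```python
import math


def mcnemar_cc(b: int, c: int) -> tuple[float, float]:
    """McNemar's chi-square test with continuity correction.

    b and c are the two discordant-pair counts (cases where the two
    modes disagree). Returns (statistic, two-sided p-value).
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / n
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the Binomial(b + c, 0.5) tail."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# ASSUMED discordant counts (not reported in the abstract):
# 3 cases correct only in multimodal mode, 0 correct only in text-only mode.
stat, p_cc = mcnemar_cc(3, 0)
p_exact = mcnemar_exact(3, 0)
print(f"chi-square = {stat:.3f}, corrected p = {p_cc:.3f}, exact p = {p_exact:.3f}")
```

Under these assumed counts, the continuity-corrected p-value rounds to 0.248, which matches the reported figure; the exact binomial form gives 0.25. With only three discordant pairs, neither form can reach conventional significance, which is why an 11-case subgroup gives limited statistical support for the observed gain.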
CONCLUSIONS: While Gemini 3 shows promise as a de-biasing tool in complex inflammatory dermatoses, multimodal LLMs currently lack the granular precision required for malignancy detection without dermoscopic data. These findings underscore the need for cautious integration of AI in high-stakes diagnostic scenarios.
PMID:42340682 | DOI:10.1080/15569527.2026.2692370