
Evaluation of large language models in terms of safety, contraindications, and adverse effect information related to botulinum toxin applications

Cutan Ocul Toxicol. 2026 Jun 3:1-6. doi: 10.1080/15569527.2026.2680300. Online ahead of print.

ABSTRACT

BACKGROUND: Large language models (LLMs) are increasingly used for the dissemination of health-related information. However, data regarding the accuracy and adequacy with which they present information on the safety, contraindications, and adverse effects of botulinum toxin (BoNT) applications remain limited.

METHODS: The performance of three LLMs, ChatGPT-5 mini (OpenAI), DeepSeek-V3.2, and Gemini 3 Flash, in presenting safety and risk-related information on BoNT applications was compared using 10 categorized, structured questions. Three dermatologists independently scored the responses on a predefined four-point evaluation scale.

RESULTS: ChatGPT-5 mini (OpenAI) and DeepSeek-V3.2 achieved higher and more consistent scores in the domains of general safety, contraindications, and adverse effects. In contrast, the Gemini 3 Flash model demonstrated lower performance, particularly in patient safety-critical areas such as systemic spread and toxicity, as well as drug interactions. A statistically significant difference was observed in the distribution of response quality among the models (p < 0.05).
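
The abstract does not name the statistical test behind the reported p-value. As a minimal sketch, assuming a Pearson chi-square test of independence on a model-by-score contingency table, the comparison could look like the following. The model labels and all rating counts are hypothetical placeholders, not data from the study, and the function names are illustrative.

```python
from collections import Counter


def contingency_table(ratings, levels=(1, 2, 3, 4)):
    """Count how often each model received each score on the four-point scale."""
    table = {}
    for model, scores in ratings.items():
        counts = Counter(scores)
        table[model] = [counts[level] for level in levels]
    return table


def chi_square(table):
    """Return the Pearson chi-square statistic and degrees of freedom."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            if expected > 0:  # skip score levels that no model received
                stat += (observed - expected) ** 2 / expected
    used_cols = sum(1 for c in col_totals if c > 0)
    dof = (len(rows) - 1) * (used_cols - 1)
    return stat, dof


# HYPOTHETICAL ratings: 10 questions x 3 raters = 30 scores per model.
# These numbers are invented for illustration only.
ratings = {
    "model_a": [4] * 18 + [3] * 9 + [2] * 3,
    "model_b": [4] * 17 + [3] * 10 + [2] * 3,
    "model_c": [4] * 8 + [3] * 10 + [2] * 8 + [1] * 4,
}

stat, dof = chi_square(contingency_table(ratings))
CRITICAL_0_05_DF6 = 12.592  # chi-square critical value for alpha = 0.05, 6 degrees of freedom
print(f"chi2 = {stat:.2f}, dof = {dof}, exceeds critical value: {stat > CRITICAL_0_05_DF6}")
```

With only 30 ratings per model, some expected cell counts are small, so the chi-square approximation is rough; an exact or rank-based test may be more appropriate for such data.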

CONCLUSION: The findings suggest that the accuracy and adequacy with which LLMs present information on BoNT applications vary between models. Therefore, these tools should be considered supportive aids used under physician supervision rather than independent sources for clinical decision-making.

PMID:42235011 | DOI:10.1080/15569527.2026.2680300

By Nevin Manimala
