Vet Rec. 2026 May 12. doi: 10.1002/vetr.70741. Online ahead of print.
ABSTRACT
BACKGROUND: Large language models (LLMs) are emerging as decision-support tools in human medicine; however, their evaluation in veterinary anaesthesiology remains limited.
METHODS: We retrospectively analysed 225 anonymised feline and canine cases (American Society of Anesthesiologists [ASA] physical status classifications 1‒5) from Atatürk University Veterinary Hospital. ChatGPT-4o, ChatGPT-5 and Gemini 2.5 Pro independently assigned ASA classifications and generated anaesthetic protocols using standardised prompts. Protocol adequacy was evaluated for all cases, regardless of ASA classification agreement, by two experienced veterinary anaesthesiologists using a four-point scale. Statistical analyses included Friedman and Bonferroni-adjusted Wilcoxon tests, effect sizes and inter-panellist reliability (assessed by quadratic-weighted Cohen's kappa and the intraclass correlation coefficient).
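The statistical pipeline described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the ratings are invented, and the choice of `scipy`/`scikit-learn` functions is an assumption about how such an analysis is typically run.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from sklearn.metrics import cohen_kappa_score

# Hypothetical protocol-adequacy ratings (four-point scale) for the
# three models on the same 8 cases -- values invented for illustration.
gpt5   = np.array([4, 4, 3, 4, 4, 3, 4, 4])
gpt4o  = np.array([3, 4, 3, 3, 4, 2, 3, 4])
gemini = np.array([2, 3, 2, 3, 3, 2, 2, 3])

# Omnibus Friedman test across the three paired (same-case) ratings.
stat, p = friedmanchisquare(gpt5, gpt4o, gemini)

# Pairwise Wilcoxon signed-rank tests with Bonferroni adjustment:
# three comparisons, so each p-value is multiplied by 3 (capped at 1).
pairs = [(gpt5, gpt4o), (gpt5, gemini), (gpt4o, gemini)]
p_adj = [min(wilcoxon(a, b).pvalue * 3, 1.0) for a, b in pairs]

# Inter-panellist agreement on the ordinal adequacy scores,
# assessed with quadratic-weighted Cohen's kappa.
rater1 = [4, 3, 3, 4, 2, 4, 3, 4]
rater2 = [4, 3, 2, 4, 2, 3, 3, 4]
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
```

Quadratic weighting penalises disagreements by the square of their distance on the scale, which is why it suits ordinal ratings such as a four-point adequacy score.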
RESULTS: ChatGPT-5 achieved the highest ASA classification accuracy (53.3%), followed by ChatGPT-4o (46.7%) and Gemini 2.5 Pro (30.7%). Performance was strongest for ASA 3‒5 cases, whereas ASA 1 cases were frequently misclassified, mainly through overestimation of ASA status. ChatGPT-5 also generated the most clinically sufficient anaesthetic protocols, outperforming the other models.
LIMITATIONS: The retrospective, single-centre design and inclusion of only feline and canine cases may limit generalisability.
CONCLUSIONS: LLMs can generate clinically relevant ASA classifications and anaesthetic protocols in veterinary anaesthesiology, although performance varies across models; expert oversight remains essential.
PMID:42117364 | DOI:10.1002/vetr.70741