J Plast Reconstr Aesthet Surg. 2026 Mar 16;116:215-222. doi: 10.1016/j.bjps.2026.03.009. Online ahead of print.
ABSTRACT
INTRODUCTION: Assessing the ability of AI chatbots to provide information consistent with clinical guidelines is essential for evaluating the accuracy of the information that patients may receive. We evaluated the ability of three widely used chatbots to reference and respond to clinical questions in alignment with the American Society of Plastic Surgeons’ (ASPS) clinical guidelines.
METHODS: Evidence-based clinical practice guidelines from ASPS and the American Association of Plastic Surgeons (AAPS) were used to develop prompts for ChatGPT-4, Meta Llama 3.1, and Microsoft Copilot. Reviewers determined whether each chatbot's answers aligned with the ASPS guidelines. Any reference to ASPS by the chatbots was recorded. Descriptive statistics were used for data analysis.
RESULTS: Forty-nine total recommendations from five clinical guidelines were included: reduction mammoplasty, autologous breast reconstruction, breast implant-associated anaplastic large cell lymphoma, eyelid surgery, and reconstruction after skin cancer. Copilot cited ASPS recommendations most frequently (Copilot: 67.3%, Llama: 34.7%, ChatGPT: 16.3%; p<0.0001) and had the highest rate of ASPS- and AAPS-aligned responses (Copilot: 79.6%, Llama: 73.5%, ChatGPT: 69.4%; p>0.05). Among the misaligned responses, neutral responses were most common, with no significant differences among the chatbots (Copilot: 60%, Llama: 69.2%, ChatGPT: 40%; p=0.62).
CONCLUSION: In our study, up to 30% of chatbot responses did not align with ASPS and AAPS guidance. These results indicate a need for plastic surgery societies to advocate regarding patient reliance on AI chatbots and to support training of AI models specific to the specialty.
PMID:41985209 | DOI:10.1016/j.bjps.2026.03.009