
Assessing the quality and readability of patient education materials on chemotherapy cardiotoxicity from artificial intelligence chatbots: An observational cross-sectional study

Medicine (Baltimore). 2025 Apr 11;104(15):e42135. doi: 10.1097/MD.0000000000042135.

ABSTRACT

Artificial intelligence (AI) chatbots built on large language models (LLMs) have become a common resource for patient inquiries in healthcare. The quality and readability of AI-generated patient education materials (PEM) are the subject of many studies across multiple medical topics; most demonstrate acceptable quality but poor readability. However, one area yet to be investigated is chemotherapy-induced cardiotoxicity. This study assesses the quality and readability of chatbot-generated PEM on chemotherapy-induced cardiotoxicity. We conducted an observational cross-sectional study in August 2024 by posing 10 questions to 4 chatbots: ChatGPT, Microsoft Copilot (Copilot), Google Gemini (Gemini), and Meta AI (Meta). The generated material was assessed for readability using 7 tools: Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), Simple Measure of Gobbledygook (SMOG) Index, Automated Readability Index (ARI), and FORCAST Grade Level. Quality was assessed using modified versions of 2 validated tools: the Patient Education Materials Assessment Tool (PEMAT), which yields a 0% to 100% score, and DISCERN, a 1 (unsatisfactory) to 5 (highly satisfactory) scoring system. Descriptive statistics were used to evaluate performance and compare the chatbots with one another. The mean reading grade level (RGL) across all chatbots was 13.7; calculated RGLs for ChatGPT, Copilot, Gemini, and Meta were 14.2, 14.0, 12.5, and 14.2, respectively. The mean DISCERN score across the chatbots was 4.2; DISCERN scores for ChatGPT, Copilot, Gemini, and Meta were 4.2, 4.3, 4.2, and 3.9, respectively. Median PEMAT scores for understandability and actionability were 91.7% and 75%, respectively; understandability and actionability scores for ChatGPT, Copilot, Gemini, and Meta were 100% and 75%, 91.7% and 75%, 90.9% and 75%, and 91.7% and 50%, respectively. AI chatbots produce high-quality PEM with poor readability. We do not discourage using chatbots to create PEM, but recommend cautioning patients that the material may read above their level. AI chatbots are not an alternative to a healthcare provider. Furthermore, there is no consensus on which chatbot creates the highest-quality PEM. Future studies are needed to assess the effectiveness of AI chatbots in providing PEM to patients and to track how their capabilities change over time.
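For readers unfamiliar with these instruments, the 7 readability indices are closed-form formulas over counts of sentences, words, letters, and syllables. The Python sketch below shows the standard published formulas; the vowel-group syllable counter is a naive approximation (validated readability software uses calibrated, dictionary-based counters), and this is an illustration, not the tooling the study authors used.

import re

def count_syllables(word):
    # Naive vowel-group heuristic, for illustration only.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # treat a final silent 'e' as non-syllabic
    return max(n, 1)

def readability_scores(text):
    # Counts shared by all 7 formulas.
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    w = max(len(words), 1)
    syllables = sum(count_syllables(t) for t in words)
    letters = sum(len(t) for t in words)
    # GFI's "complex word" count normally excludes proper nouns and
    # familiar jargon; this heuristic ignores those exclusions.
    poly = sum(1 for t in words if count_syllables(t) >= 3)  # 3+ syllables
    mono = sum(1 for t in words if count_syllables(t) == 1)  # 1 syllable
    return {
        # Flesch Reading Ease: higher = easier (60-70 is roughly plain English).
        "FRES": 206.835 - 1.015 * (w / sentences) - 84.6 * (syllables / w),
        # The remaining indices estimate a US school reading grade level.
        "FKGL": 0.39 * (w / sentences) + 11.8 * (syllables / w) - 15.59,
        "GFI": 0.4 * ((w / sentences) + 100 * (poly / w)),
        "CLI": 0.0588 * (100 * letters / w) - 0.296 * (100 * sentences / w) - 15.8,
        "SMOG": 1.0430 * (30 * poly / sentences) ** 0.5 + 3.1291,
        "ARI": 4.71 * (letters / w) + 0.5 * (w / sentences) - 21.43,
        "FORCAST": 20 - (150 * mono / w) / 10,
    }

sample = "Some chemotherapy drugs can weaken the heart. Your care team will monitor you."
print(readability_scores(sample))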
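The pooled figures in the results appear to be simple unweighted means of the per-chatbot values, which is consistent with the reported numbers: for RGL, (14.2 + 14.0 + 12.5 + 14.2) / 4 = 13.7 to one decimal, and for DISCERN, (4.2 + 4.3 + 4.2 + 3.9) / 4 ≈ 4.2.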

PMID:40228277 | DOI:10.1097/MD.0000000000042135
