
AI-generated patient education for ankylosing spondylitis: a comparative study of readability and quality

Clin Rheumatol. 2025 Dec 13. doi: 10.1007/s10067-025-07771-8. Online ahead of print.

ABSTRACT

OBJECTIVE: To evaluate and compare the quality and readability of patient education materials (PEM) related to ankylosing spondylitis (AS) generated by four AI-based large language models (LLMs): ChatGPT-4o, ChatGPT-3.5, DeepSeek R1, and DeepSeek V3.

METHODS: On May 1, 2025, the ten most frequently searched AS-related questions were identified using Google Trends (Turkey). These questions were posed to the four LLMs, and the responses were recorded without modification. Quality was assessed independently by two rheumatologists using the DISCERN tool. Readability and comprehensibility were assessed using the Flesch Reading Ease Score (FRES) and the Flesch-Kincaid Grade Level (FKGL). Inter-rater reliability was analyzed using the intraclass correlation coefficient (ICC). Mean scores and 95% confidence intervals (CI) were reported.
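
For context, both readability metrics are standard formulas based on sentence length and syllable density; the abstract does not reproduce them, so they are given here for reference:

\[
\text{FRES} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\]
\[
\text{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\]

Higher FRES values indicate easier text, while FKGL approximates the US school grade level needed to understand it, which is why FKGL scores above roughly 13 correspond to university-level reading.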

RESULTS: ChatGPT-4o achieved the highest mean DISCERN score (72.38), followed by DeepSeek R1 (69.76), ChatGPT-3.5 (68.82), and DeepSeek V3 (68.79), although the differences between models were not statistically significant. Inter-rater reliability for DISCERN was excellent (ICC, 0.931). In the readability analysis, DeepSeek V3 had the highest FRES (14.93), suggesting that its responses were more easily understandable than those of the other LLMs, whereas ChatGPT-3.5 received the lowest score (5.29). FKGL scores varied within a narrow range (15.33-15.93) across models, indicating that the texts required university-level reading skills.

CONCLUSION: AI-generated PEMs for AS were generally written at a level suited only to highly educated patients. The responses were information-dense and complex, demanding considerable expertise regardless of the recipient's educational level. In the future, tailoring the clarity and comprehensibility of the language to personal characteristics (e.g., educational level) and providing evidence-based citations could make LLMs more useful in clinical settings and for the public.

Key Points
• This study compared how different AI chatbots explain ankylosing spondylitis to patients.
• Although the information quality was high, the language used was too complex for most patients.
• ChatGPT-4o gave the most accurate content, while DeepSeek V3 used the easiest words.
• Future AI tools should use simpler language and include reliable references to better support patient education.
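
As a rough illustration of how such scores are computed in practice, the minimal Python sketch below uses the textstat package; the abstract does not state which software the authors used, so both the package choice and the sample response are assumptions for illustration only.

# Minimal sketch (assumption: the 'textstat' Python package, which the
# abstract does not name) of scoring an LLM-generated answer for
# readability the way the study describes.
import textstat

# A hypothetical LLM response to one of the ten AS-related questions.
response = (
    "Ankylosing spondylitis is a chronic inflammatory disease that "
    "primarily affects the sacroiliac joints and the axial skeleton, "
    "causing progressive stiffness and pain."
)

fres = textstat.flesch_reading_ease(response)   # higher = easier to read
fkgl = textstat.flesch_kincaid_grade(response)  # approximate US grade level

print(f"FRES: {fres:.2f}, FKGL: {fkgl:.2f}")

An FKGL in the 15-16 range, as reported across all four models, corresponds to text pitched at third- or fourth-year university students.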

PMID:41390886 | DOI:10.1007/s10067-025-07771-8
