J Craniofac Surg. 2025 Sep 10. doi: 10.1097/SCS.0000000000011930. Online ahead of print.
ABSTRACT
BACKGROUND: With the development of artificial intelligence, obtaining patient-centered medical information through large language models (LLMs) has become crucial for patient education. However, existing digital resources in online health care are of heterogeneous quality, and the reliability and readability of content generated by different AI models must be evaluated to meet the needs of patients with varying levels of health literacy.
OBJECTIVE: This study aimed to compare the accuracy and readability of different LLMs in providing medical information related to gynecomastia, and to identify the most promising patient education tool for practical clinical application.
METHODS: The 10 most frequently searched questions about gynecomastia were selected from PubMed and Google Trends. Responses were generated by 3 LLMs (DeepSeek-R1, OpenAI-O3, Claude-4-Sonnet), and text quality was assessed with the DISCERN-AI and PEMAT-AI scales. Readability was evaluated using word count, syllable count, Flesch-Kincaid Grade Level (FKGL), Flesch-Kincaid Reading Ease (FKRE), the SMOG index, and the Automated Readability Index (ARI).
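For reference, the four indices named above are defined by standard published formulas; a minimal Python sketch is given below. The abstract does not state which tooling the authors used, so this is an illustration only, and the syllable counter here is a simple vowel-group heuristic rather than a dictionary-based one.

import re

def count_syllables(word: str) -> int:
    # Heuristic: count vowel groups; real tools use pronunciation dictionaries.
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a likely silent trailing 'e'
    return max(n, 1)

def readability(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    chars = sum(len(w) for w in words)
    return {
        # Flesch-Kincaid Grade Level
        "FKGL": 0.39 * n_words / sentences + 11.8 * syllables / n_words - 15.59,
        # Flesch Reading Ease (higher = easier)
        "FKRE": 206.835 - 1.015 * n_words / sentences - 84.6 * syllables / n_words,
        # SMOG index (formally intended for samples of 30+ sentences)
        "SMOG": 1.0430 * (polysyllables * 30 / sentences) ** 0.5 + 3.1291,
        # Automated Readability Index (character-based, no syllable counting)
        "ARI": 4.71 * chars / n_words + 0.5 * n_words / sentences - 21.43,
    }

FKGL, SMOG, and ARI all approximate the US school grade needed to understand a text, so lower scores indicate easier reading; FKRE runs in the opposite direction.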
RESULTS: In the quality evaluation, among the 10 items of the DISCERN-AI scale, only the overall content quality score differed significantly (P = 0.001), with DeepSeek-R1 performing best at a median score of 5 (5, 5). Regarding readability, DeepSeek-R1 had the highest average word count and syllable count (both P < 0.001). The 3 models showed no significant differences in FKGL, FKRE, or ARI; the average FKGL score was 14.08 for DeepSeek-R1, 14.1 for OpenAI-O3, and 13.31 for Claude-4-Sonnet. The SMOG evaluation indicated that Claude-4-Sonnet was the most readable, with an average score of 11 (P = 0.028).
CONCLUSION: DeepSeek-R1 demonstrated the highest overall quality of generated content, followed by Claude-4-Sonnet. The FKGL, SMOG index, and ARI all indicated that Claude-4-Sonnet offered the best readability. Given that improvements in quality and readability can enhance patient engagement and reduce anxiety, these 2 models should be prioritized for patient education applications. Future work should integrate these strengths to develop more reliable medical LLMs.
PMID:40929657 | DOI:10.1097/SCS.0000000000011930