Assessment of AI-Generated Patient Education Materials for Bladder Training and Pelvic Floor Muscle Therapy: Comparison with an International Society Leaflet

Int Urogynecol J. 2026 Apr 17. doi: 10.1007/s00192-026-06660-1. Online ahead of print.

ABSTRACT

INTRODUCTION AND HYPOTHESIS: High-quality patient education materials are essential in urogynecology. We hypothesized that patient handouts generated by different large language models (LLMs) would vary in quality and readability and would differ from an established society-produced leaflet.

METHODS: Twelve leaflets on bladder training and pelvic floor muscle therapy were produced or obtained from six sources (GPT-4, Gemini-2.5 Pro, Sonnet-4, Llama-4, Perplexity, and the International Urogynecological Association [IUGA]) and standardized into plain text. Three blinded reviewers assessed completeness, information quality (DISCERN), and the Patient Education Materials Assessment Tool (PEMAT-A: actionability; PEMAT-U: understandability). The statistical plan included ordinary least squares fixed-effects models per metric with type II analysis of variance for source effects, estimated marginal means with Holm-adjusted pairwise comparisons, and a crossed mixed-effects model for topic groups; inter-rater reliability was measured with intraclass correlation coefficients. Readability and text analyses used standard indices.
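
The abstract does not include the authors' code; as a rough illustration only, the per-metric fixed-effects analysis could be sketched in Python with statsmodels and pingouin. The file name, column names, and long-format layout below are assumptions rather than details from the study, and the crossed mixed-effects model for topic groups is omitted for brevity.

import pandas as pd
import pingouin as pg
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format table with one row per leaflet x rater x metric;
# the columns leaflet, source, rater, metric, and score are assumed names.
df = pd.read_csv("leaflet_scores.csv")

for metric, sub in df.groupby("metric"):
    # Ordinary least squares with source as a fixed effect, fit per metric
    fit = smf.ols("score ~ C(source)", data=sub).fit()
    # Type II analysis of variance for the overall source effect
    print(metric)
    print(anova_lm(fit, typ=2))
    # Holm-adjusted pairwise comparisons between sources
    print(fit.t_test_pairwise("C(source)", method="holm").result_frame)
    # Inter-rater reliability across the three blinded reviewers (ICC)
    print(pg.intraclass_corr(data=sub, targets="leaflet",
                             raters="rater", ratings="score"))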

RESULTS: Sources varied in completeness (p < 0.001), DISCERN (p < 0.001), and PEMAT-A (p = 0.0018); PEMAT-U showed a trend (p = 0.063). Llama-4 scored significantly lower on completeness and DISCERN, and lower than GPT-4, IUGA, and Perplexity on PEMAT-A; Sonnet-4 outperformed Llama-4 on PEMAT-U. No single source dominated all metrics. Readability varied widely: the mean Flesch-Kincaid grade level was approximately 6.6 for GPT-4, 7.4 for Gemini-2.5 Pro, 15 for Sonnet-4, and 17 for Llama-4. IUGA leaflets were the longest, with grade levels of about 9-10. Bladder-training materials were modestly more complete than pelvic floor muscle therapy materials (p = 0.045). Inter-rater reliability was high (ICC ≥ 0.87).
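
For context, the Flesch-Kincaid grade level cited above is the standard readability index computed from average sentence length and average syllables per word:

\mathrm{FKGL} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59

On this scale, GPT-4's score of about 6.6 sits near the sixth-to-eighth-grade level commonly recommended for patient education materials, while the roughly 15-17 scores of Sonnet-4 and Llama-4 correspond to college-level text.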

CONCLUSIONS: The quality of patient education materials varies substantially across AI tools and relative to society-produced leaflets. AI-generated content can meet readability targets but requires expert review to ensure completeness and reliability before clinical use.

PMID:41998329 | DOI:10.1007/s00192-026-06660-1
