Evaluation of generative artificial intelligence in producing anatomically distinct lipedema subtypes: A diagnostic accuracy study

Phlebology. 2026 Jul 2:2683555261467340. doi: 10.1177/02683555261467340. Online ahead of print.

ABSTRACT

Objectives: Generative artificial intelligence (AI) models capable of producing photorealistic medical images are increasingly proposed for patient education, clinical illustration, and trainee instruction. However, their ability to accurately represent anatomically distinct disease subtypes remains unclear. This study evaluated the diagnostic accuracy of a widely used generative AI model in producing images corresponding to the five anatomical lipedema types defined by the Schmeller classification.

Methods: In this prospective audit, ChatGPT’s image-generation interface was prompted to create 60 images for each lipedema type (Types I-V), yielding 300 images. Prompts were standardized and limited to the subtype label without additional descriptors. Two clinicians independently classified each image into one of the five lipedema types or as indeterminate, blinded to the original prompt; disagreements were resolved by a third clinician. Diagnostic performance was assessed using a confusion matrix and per-type sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), F1-score, and one-vs-rest receiver operating characteristic area under the curve (ROC AUC). Overall accuracy and Cohen’s κ statistics were also calculated.

Results: All 300 images were evaluable. The model generated anatomically consistent images for Types I, II, and III (sensitivity = 1.00 for each). Specificity was 1.00 for Types I and II but 0.50 for Type III because all images requested as Types IV and V were classified as Type III. Consequently, the model failed to generate any images consistent with Type IV (arm-predominant) or Type V (calf-isolated) lipedema (sensitivity = 0.00 for both). Overall accuracy was 0.600. Unweighted and quadratic-weighted Cohen’s κ values were 0.500 and 0.667, respectively. Micro- and macro-averaged ROC AUC were both 0.750.

Conclusion: The model reproduces severity gradients within lower-extremity lipedema but systematically collapses anatomically distinct subtypes into the dominant Type III phenotype, failing to depict arm-predominant and calf-isolated disease. Current generative AI systems may therefore encode lipedema as a single visual phenotype rather than a distributed anatomical entity, limiting their reliability for medical education and clinical communication.
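The reported figures can be cross-checked with a short calculation. The sketch below is not the authors' analysis code. It assumes a confusion matrix reconstructed from the abstract (60 images per prompted type, with every Type IV and Type V request classified as Type III), and it assumes that one-vs-rest ROC AUC was computed from hard class labels, in which case it reduces to (sensitivity + specificity) / 2. The function and variable names are illustrative. PPV is omitted because it is undefined for Types IV and V, which were never assigned by the raters.

```python
# Confusion matrix reconstructed from the reported results (an assumption):
# rows = prompted type, columns = clinician-assigned type, order I..V.
TYPES = ["I", "II", "III", "IV", "V"]
CM = [
    [60, 0, 0, 0, 0],   # Type I prompts
    [0, 60, 0, 0, 0],   # Type II prompts
    [0, 0, 60, 0, 0],   # Type III prompts
    [0, 0, 60, 0, 0],   # Type IV prompts, all rated Type III
    [0, 0, 60, 0, 0],   # Type V prompts, all rated Type III
]


def one_vs_rest_counts(cm, k):
    """Return (tp, fn, fp, tn) for class index k against all other classes."""
    n = sum(map(sum, cm))
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    return tp, fn, fp, n - tp - fn - fp


def sens_spec_auc(tp, fn, fp, tn):
    """Sensitivity, specificity, and hard-label AUC = (sens + spec) / 2."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, (sens + spec) / 2


def cohen_kappa(cm, quadratic=False):
    """Cohen's kappa as 1 - (weighted observed / weighted expected) disagreement."""
    n = sum(map(sum, cm))
    size = len(cm)
    rows = [sum(r) / n for r in cm]
    cols = [sum(r[j] for r in cm) / n for j in range(size)]

    def weight(i, j):
        if quadratic:
            return (i - j) ** 2 / (size - 1) ** 2
        return 0.0 if i == j else 1.0

    cells = [(i, j) for i in range(size) for j in range(size)]
    observed = sum(weight(i, j) * cm[i][j] / n for i, j in cells)
    expected = sum(weight(i, j) * rows[i] * cols[j] for i, j in cells)
    return 1 - observed / expected


counts = [one_vs_rest_counts(CM, k) for k in range(len(TYPES))]
per_type = {t: sens_spec_auc(*c) for t, c in zip(TYPES, counts)}
accuracy = sum(CM[k][k] for k in range(len(TYPES))) / sum(map(sum, CM))
macro_auc = sum(m[2] for m in per_type.values()) / len(TYPES)
# Micro-averaging pools the one-vs-rest counts before computing the metrics.
micro_auc = sens_spec_auc(*[sum(c[i] for c in counts) for i in range(4)])[2]
kappa = cohen_kappa(CM)
kappa_quadratic = cohen_kappa(CM, quadratic=True)

for t in TYPES:
    sens, spec, auc = per_type[t]
    print(f"Type {t}: sensitivity={sens:.2f} specificity={spec:.2f} AUC={auc:.2f}")
print(f"accuracy={accuracy:.3f} kappa={kappa:.3f} "
      f"weighted kappa={kappa_quadratic:.3f} "
      f"micro AUC={micro_auc:.3f} macro AUC={macro_auc:.3f}")
```

Under these assumptions the script reproduces the abstract's values: accuracy 0.600, κ 0.500 (unweighted) and 0.667 (quadratic-weighted), and micro- and macro-averaged AUC of 0.750. The gap between the two κ values arises because quadratic weighting penalizes the Type IV to Type III confusions (one category apart) less than the Type V to Type III confusions (two categories apart).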

PMID:42389893 | DOI:10.1177/02683555261467340

By Nevin Manimala