Diabetes Obes Metab. 2026 Apr 13. doi: 10.1111/dom.70747. Online ahead of print.
ABSTRACT
AIMS: To evaluate the accuracy of four general-purpose artificial intelligence (AI) models, ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic) and DeepSeek (DeepSeek AI), in calculating the carbohydrate content of meals compared with clinician-calculated reference values.
MATERIALS AND METHODS: The primary endpoint was equivalence between clinician- and AI-generated calculations within an error margin of ±5%. One hundred twenty-four meals were analysed, equally distributed among breakfast, lunch, dinner and snacks. Carbohydrate contents were jointly determined by two paediatric diabetologists and one clinical nutritionist using the USDA FoodData Central and the CREA Italian Food Composition Tables. Each AI model received identical, standardized prompts in English describing the meals. Statistical analyses included the Two One-Sided Tests (TOST) procedure, Bland-Altman plots, the Wilcoxon signed-rank test and Spearman correlation.
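As a rough illustration of the analysis pipeline described above, the sketch below shows how the TOST equivalence test, Bland-Altman statistics, Wilcoxon signed-rank test and Spearman correlation could be run in Python. The toy data, the choice of statsmodels' ttost_paired, and the interpretation of the ±5% margin as a fraction of the mean clinician value are assumptions for illustration, not details taken from the paper.

# Minimal sketch of the reported statistical analyses (all data hypothetical).
import numpy as np
from scipy.stats import wilcoxon, spearmanr
from statsmodels.stats.weightstats import ttost_paired

rng = np.random.default_rng(0)
clinician = rng.uniform(10.0, 80.0, size=124)           # toy reference values (g), not study data
ai_model = clinician + rng.normal(0.0, 2.0, size=124)   # toy AI estimates (g)

# TOST equivalence: +/-5% margin, here interpreted as 5% of the mean
# clinician value (an assumption; the paper may define the margin differently).
margin = 0.05 * clinician.mean()
p_tost, res_low, res_upp = ttost_paired(ai_model, clinician, -margin, margin)
print(f"TOST p-value: {p_tost:.4f} (p < 0.05 suggests equivalence within +/-{margin:.2f} g)")

# Bland-Altman statistics: bias and 95% limits of agreement on paired differences.
diff = ai_model - clinician
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"Bias: {bias:.2f} g, limits of agreement: [{bias - half_width:.2f}, {bias + half_width:.2f}] g")

# Wilcoxon signed-rank test for a systematic paired difference.
w_stat, p_wilcoxon = wilcoxon(ai_model, clinician)
print(f"Wilcoxon p-value: {p_wilcoxon:.4f}")

# Spearman rank correlation between AI and clinician estimates.
rho, p_spearman = spearmanr(ai_model, clinician)
print(f"Spearman rho: {rho:.3f}")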
RESULTS: The clinicians' median carbohydrate content was 30.32 g. Model medians were 30.75 g (ChatGPT), 30.40 g (Gemini), 29.75 g (DeepSeek) and 29.25 g (Claude). ChatGPT showed the smallest bias, the narrowest limits of agreement and the highest correlation with the clinicians' calculations. Only ChatGPT met the predefined ±5% equivalence criterion, whereas Gemini and DeepSeek achieved equivalence within a ±10% margin. Claude displayed the largest negative bias and the widest dispersion.
CONCLUSIONS: Among the tested AI models, ChatGPT most accurately approximated the clinicians' carbohydrate calculations and fulfilled strict clinical equivalence criteria. Although the other models tended to underestimate carbohydrate content, their mean deviations remained within clinically acceptable limits. These findings suggest that AI tools, particularly ChatGPT, may serve as useful adjuncts to carbohydrate counting for people with type 1 diabetes, supporting self-management.
PMID:41969183 | DOI:10.1111/dom.70747