
Comparative evaluation of ChatGPT versions in training program design: scientific approach, accuracy, and practical applicability

BMC Sports Sci Med Rehabil. 2025 Dec 5. doi: 10.1186/s13102-025-01409-7. Online ahead of print.

ABSTRACT

OBJECTIVE: This study aimed to comparatively evaluate the scientific approach, accuracy, and practical applicability of different versions of ChatGPT in generating training programs.

METHOD: Adopting a mixed-methods design, the study employed seven distinct instruction sets, each developed with input from seven experts with at least 10 years of professional experience in their respective fields (Certified Strength and Conditioning Specialists (CSCS) and academicians holding PhDs in Sports Sciences with a research focus on exercise physiology and training methodology). Using these instruction sets, three versions of ChatGPT (ChatGPT-3.5, ChatGPT-4o, and ChatGPT-4.1) were tasked with generating 12-week resistance training programs for a hypothetical, healthy young adult male with a moderate training background. Using a rubric scoring scale, the generated programs were systematically evaluated and scored separately against the following criteria: compliance with the initial program request, inclusion of literature references, adherence to exercise variety and progressive-loading principles, individualization and justification of progression, program modifications, inclusion of warm-up and cool-down components, injury risk considerations, presence of incorrect recommendations, practical applicability, and accessibility. To determine whether differences in mean scores were statistically significant, the Friedman non-parametric test was applied; when significant differences were identified, pairwise comparisons were conducted using the Wilcoxon signed-rank test to determine which groups accounted for these differences. In addition, qualitative data analysis, employing both content analysis and descriptive analysis techniques, was performed to explore expert evaluations in depth.
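The statistical workflow described above (an omnibus Friedman test across the three related sets of rubric scores, followed by pairwise Wilcoxon signed-rank comparisons when the omnibus test is significant) can be sketched in a few lines of Python with SciPy. The sketch assumes one total rubric score per instruction set per model; the score values below are hypothetical placeholders, not the study's data.

    # Sketch of the analysis pipeline: Friedman test on three related
    # samples, then pairwise Wilcoxon signed-rank tests if significant.
    # Scores are hypothetical placeholders, not the study's data.
    from itertools import combinations

    import numpy as np
    from scipy import stats

    # Rows: the seven expert-designed instruction sets.
    # Columns: total rubric score awarded to each model's program.
    models = ["ChatGPT-3.5", "ChatGPT-4o", "ChatGPT-4.1"]
    scores = np.array([
        [22, 31, 38],
        [20, 29, 36],
        [24, 33, 40],
        [21, 30, 37],
        [23, 32, 39],
        [19, 28, 35],
        [22, 31, 38],
    ])

    # Omnibus Friedman test on the three related samples.
    stat, p = stats.friedmanchisquare(*scores.T)
    print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

    # If significant, locate the source of the difference with
    # pairwise Wilcoxon signed-rank tests on each pair of versions.
    if p < 0.05:
        for (i, a), (j, b) in combinations(enumerate(models), 2):
            w, pw = stats.wilcoxon(scores[:, i], scores[:, j])
            print(f"{a} vs {b}: W = {w}, p = {pw:.4f}")

With seven paired observations per comparison, the exact Wilcoxon signed-rank test has limited resolution, which is consistent with the identical p-values reported across all three pairwise contrasts in the results below.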

RESULTS: Statistically significant differences were identified among the ChatGPT versions: between ChatGPT-4o and ChatGPT-3.5 (p = .018), between ChatGPT-4.1 and ChatGPT-3.5 (p = .018), and between ChatGPT-4.1 and ChatGPT-4o (p = .018). Expert content analysis further indicated that ChatGPT-4.1 produced responses that were more detailed, internally consistent, and better supported by scientific literature than the other versions.

CONCLUSION: Although the ChatGPT versions examined in this study exhibited certain limitations, they demonstrated the potential to deliver structured exercise programs aligned with established training principles and relevant scientific literature. Nonetheless, to ensure that AI-assisted training plans provide safe, evidence-based, and individualized content, the involvement of qualified human expertise remains essential.

TRIAL REGISTRATION: No official trial registration number was assigned.

PMID:41345701 | DOI:10.1186/s13102-025-01409-7

By Nevin Manimala