
“Can We Trust Them?” An Expert Evaluation of Large Language Models to Provide Sleep and Jet Lag Recommendations for Athletes

Sports Med. 2025 Oct 3. doi: 10.1007/s40279-025-02303-5. Online ahead of print.

ABSTRACT

BACKGROUND: With the increasing use of artificial intelligence in healthcare and sports science, large language models (LLMs) are being explored as tools for delivering personalized, evidence-based guidance to athletes.

OBJECTIVE: This study evaluated the capabilities of LLMs (ChatGPT-3.5, ChatGPT-4, and Google Bard) to deliver evidence-based advice on sleep and jet lag for athletes.

METHODS: The study was conducted in two phases between January and June 2024. The first phase identified ten frequently asked questions on these topics with input from experts and LLMs. In the second phase, 20 experts (mean age 43.9 ± 9.0 years; ten females, ten males) assessed LLM responses using Google Forms surveys administered at two intervals (T1 and T2). Inter-rater reliability was evaluated using Fleiss’ Kappa, intra-rater agreement using the Jaccard Similarity Index (JSI), and content validity using the content validity ratio (CVR). Differences among LLMs were analyzed using Friedman and Chi-square tests.
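
For readers unfamiliar with two of the metrics named in the methods, the following is a minimal Python sketch of how a content validity ratio and a Jaccard similarity are commonly computed. It assumes Lawshe's standard CVR formula, (n_e - N/2) / (N/2), and the usual set-based Jaccard definition; the abstract does not state the exact procedure the authors used, and the function names, item labels, and counts below are hypothetical illustrations, not study data.

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR: (n_e - N/2) / (N/2), ranging from -1 to 1.

    n_essential: number of experts who rated the item as appropriate/essential.
    n_experts:   total number of experts on the panel.
    """
    if n_experts <= 0 or not 0 <= n_essential <= n_experts:
        raise ValueError("need 0 <= n_essential <= n_experts and n_experts > 0")
    half = n_experts / 2
    return (n_essential - half) / half


def jaccard_similarity(first: set, second: set) -> float:
    """Jaccard index: |A intersect B| / |A union B|.

    Two empty sets are treated as identical (1.0); this is a convention
    chosen for this sketch, not something taken from the study.
    """
    union = first | second
    if not union:
        return 1.0
    return len(first & second) / len(union)


# Hypothetical example: 17 of 20 panel members endorse a response.
cvr_example = content_validity_ratio(17, 20)

# Hypothetical example: items one expert endorsed at T1 versus T2.
t1_choices = {"Q1", "Q2", "Q3", "Q4"}
t2_choices = {"Q1", "Q2", "Q3", "Q5"}
jsi_example = jaccard_similarity(t1_choices, t2_choices)

print(f"CVR = {cvr_example:.2f}, JSI = {jsi_example:.2f}")
```

Under these assumptions, a CVR of 0 means exactly half of the panel endorsed a response, and a JSI of 1 means an expert made identical selections at both time points.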

RESULTS: Experts’ response rates were high (100% at T1 and 95% at T2). Inter-rater reliability was minimal (Fleiss’ Kappa: 0.21-0.39), while intra-rater agreement was high, with 53% of experts achieving a JSI ≥ 0.75. ChatGPT-4 had the highest CVR for sleep (0.67) and was the only model with a valid CVR for jet lag (0.68). Google Bard showed the lowest CVR for jet lag (0%), with significant differences compared to ChatGPT-3.5 (p = 0.0073) and ChatGPT-4 (p < 0.0001). Reasons for inappropriate responses varied significantly for jet lag (p < 0.0001), with Google Bard criticized for insufficient information and frequent errors. ChatGPT-4 outperformed other models overall.
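
To give a sense of what the reported Fleiss' Kappa values measure, the following is a minimal Python sketch of the standard Fleiss' Kappa calculation. It assumes every item is rated by the same number of raters and that ratings are supplied as a table of counts (one row per item, one column per category); the function name and the toy tables are hypothetical and are not drawn from the study.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' Kappa for a table of rating counts.

    counts[i][j] is the number of raters who assigned item i to category j.
    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    if n_items == 0:
        raise ValueError("need at least one rated item")
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of raters")

    # Observed agreement per item, then averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Agreement expected by chance, from overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    if expected == 1.0:
        # All ratings fall in one category; kappa is undefined here.
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1 - expected)


# Toy tables: 2 items, 3 raters, 2 categories (hypothetical).
full_agreement = [[3, 0], [0, 3]]
split_ratings = [[2, 1], [1, 2]]
print(fleiss_kappa(full_agreement), fleiss_kappa(split_ratings))
```

Kappa is 1 when raters agree perfectly and falls to 0 or below when agreement is no better than chance, which is why values in the 0.21-0.39 range indicate limited agreement among the experts.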

CONCLUSIONS: This study highlights the potential of LLMs, particularly ChatGPT-4, to provide evidence-based advice on sleep but underscores the need for improved accuracy and validation for jet lag recommendations.

PMID:41042486 | DOI:10.1007/s40279-025-02303-5

By Nevin Manimala
