
Comparison of Emotional Content in Text Responses From Physicians and AI Chatbots to Patient Health Queries: Cross-Sectional Study

J Med Internet Res. 2026 Mar 6;28:e85516. doi: 10.2196/85516.

ABSTRACT

BACKGROUND: Surveys show that many people are willing to use generative artificial intelligence (AI) for health questions. Prior research has largely focused on chatbot accuracy, with some studies finding that both physicians and consumers overwhelmingly prefer chatbot-generated text over physician responses.

OBJECTIVE: This study aimed to characterize and compare the emotional content of responses from physicians and 2 AI chatbots (OpenAI’s ChatGPT and Google’s Gemini) and to assess differences in reading level and use of medical disclaimers.

METHODS: A public, patient-deidentified telehealth website was used to compile 100 physician-answered questions. The same questions were posed to both chatbots between May 18 and 19, 2025. Two coders classified the emotional content of each sentence using a predefined codebook and reviewed for agreement. Emotions were ranked as primary, secondary, and tertiary by the proportion of sentences classified as each emotion per response. Multinomial logistic regression compared emotional rankings using physician responses as the reference. Word count, Flesch Reading Ease, and Flesch-Kincaid Grade Level were analyzed via ANOVA with the Tukey honestly significant difference test. Disclaimer use was compared between chatbots using a χ² test.
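As a rough illustration of this analysis pipeline, the following is a minimal Python sketch on synthetic data. The variable names, group labels, and data are hypothetical, not drawn from the study, whose analysis code is not published with the abstract.

    # Minimal sketch of the abstract's statistical pipeline on hypothetical data.
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.multicomp import pairwise_tukeyhsd
    from scipy.stats import f_oneway, chi2_contingency

    rng = np.random.default_rng(0)
    n = 300  # 100 questions x 3 response sources
    df = pd.DataFrame({
        "source": np.repeat(["physician", "chatgpt", "gemini"], 100),
        "secondary_emotion": rng.choice(["none", "hope", "fear", "compassion"], n),
        "word_count": rng.normal(500, 200, n),
        "disclaimer": rng.integers(0, 2, n),
    })

    # Multinomial logistic regression of the emotion ranking on response
    # source, with physician responses as the reference level (absorbed
    # into the intercept).
    X = sm.add_constant(pd.get_dummies(df["source"])[["chatgpt", "gemini"]].astype(float))
    fit = sm.MNLogit(df["secondary_emotion"].astype("category").cat.codes, X).fit(disp=0)
    print(np.exp(fit.params))  # exponentiated coefficients are odds ratios

    # One-way ANOVA across the three sources, then Tukey HSD for pairwise
    # contrasts; the abstract applies the same approach to word count and
    # to both Flesch readability metrics. Readability itself could be
    # scored with, e.g., textstat.flesch_reading_ease(text) and
    # textstat.flesch_kincaid_grade(text).
    groups = [g.to_numpy() for _, g in df.groupby("source")["word_count"]]
    print(f_oneway(*groups))
    print(pairwise_tukeyhsd(df["word_count"], df["source"]))

    # Chi-square test of disclaimer use between the two chatbots only.
    bots = df[df["source"] != "physician"]
    print(chi2_contingency(pd.crosstab(bots["source"], bots["disclaimer"])))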

RESULTS: Primary emotions were overwhelmingly neutral, except for one response from each chatbot in which anger was primary. For secondary emotions, the odds of hope were 80.28% (95% CI 37.71%-93.76%) lower for ChatGPT, while the odds of fear were 3.29 (95% CI 1.44-7.49) times higher for Gemini. For tertiary emotions, the odds of compassion were 1.94 (95% CI 1.06-3.54) times higher, and the odds of having no tertiary emotion were 84.33% (95% CI 64.72%-93.04%) lower, for Gemini. Gemini responses averaged 889.1 (SD 305.7) words, ChatGPT responses 476.5 (SD 109.5), and physician responses 193.5 (SD 113.6). Gemini had the lowest average Flesch Reading Ease score at 39.9 (SD 8.8), followed by ChatGPT at 45.8 (SD 12.8); physicians had the highest at 51.9 (SD 13.6). Gemini had the highest average Flesch-Kincaid Grade Level at 11.3 (SD 1.5), followed by ChatGPT at 9.9 (SD 1.9) and physicians at 9.2 (SD 2.4). Gemini was significantly more likely than ChatGPT to include a disclaimer (χ²₁=49.2; P<.001).
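A note on the odds-ratio reporting above: effects below 1 are expressed as a percentage reduction in odds, using the standard conversion

    \text{percent lower} = (1 - \mathrm{OR}) \times 100\%

so, for example, the 80.28% lower odds of hope for ChatGPT correspond to $\mathrm{OR} = 1 - 0.8028 = 0.1972$, and the reported 37.71%-93.76% interval maps to an OR CI of roughly 0.06-0.62.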

CONCLUSIONS: Chatbot responses were significantly (P<.001) longer and more difficult to read than physician responses, and they were more likely to contain a broader range of emotions. Qualitatively, chatbot responses also varied more in presentation and in the breadth of the emotions expressed. These findings could inform more emotionally connected physician responses to patient message queries.

PMID:41791109 | DOI:10.2196/85516
