Nevin Manimala Statistics

Alignment of Large Language Model Responses With Human Therapists in Motivational Interviewing

JAMA Netw Open. 2026 Mar 2;9(3):e262750. doi: 10.1001/jamanetworkopen.2026.2750.

ABSTRACT

IMPORTANCE: Large language models (LLMs) are increasingly applied to mental health contexts, yet their capacity to generate responses that align with evidence-based psychotherapy remains uncertain. Motivational interviewing (MI), a structured counseling approach, provides an empirically grounded setting for evaluating alignment between LLM-generated and human therapist responses.

OBJECTIVE: To evaluate how closely an LLM’s responses align with therapist responses in MI sessions, using automated similarity metrics.

DESIGN, SETTING, AND PARTICIPANTS: This cross-sectional study used high-fidelity therapist-client transcripts annotated with the Motivational Interviewing Treatment Integrity system. Transcripts were sourced from publicly available counseling videos. For each therapist turn, the GPT-4o LLM generated a response using a standardized, MI-informed prompt based on the preceding conversation context. Analyses were conducted between March and May 2025.

MAIN OUTCOMES AND MEASURES: Alignment between LLM-generated and therapist responses was assessed using (1) cosine similarity based on sentence embeddings to capture semantic overlap and (2) DeepEval, a contextual deep-learning-based metric assessing coherence and contextual appropriateness. A therapist topic-consistency index quantified within-session thematic coherence and was examined as a moderator of alignment.
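The first metric above, cosine similarity over sentence embeddings, can be sketched in a few lines. The helper function and the toy 4-dimensional vectors below are illustrative assumptions for exposition only; the study's actual embedding model and pipeline are not specified in this abstract, and real sentence encoders produce vectors with hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (range [-1, 1])."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "sentence embeddings" for a therapist turn and an LLM-generated turn.
therapist_vec = np.array([0.2, 0.7, 0.1, 0.4])
llm_vec = np.array([0.3, 0.6, 0.2, 0.5])

print(round(cosine_similarity(therapist_vec, llm_vec), 3))  # → 0.973
```

Because cosine similarity compares direction rather than magnitude, it captures semantic overlap between the two responses while ignoring differences in response length, which is consistent with its role here as a measure of semantic rather than contextual alignment.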

RESULTS: A total of 3706 therapist turns from 154 MI sessions were evaluated. Mean (SD) DeepEval scores were higher than mean (SD) cosine similarity scores (0.72 [0.31] vs 0.29 [0.20]; P < .001), suggesting limited semantic overlap despite greater contextual appropriateness. Therapist topic consistency significantly moderated similarity: cosine similarity was higher in high-consistency than in low-consistency sessions (mean [SD] difference, 0.027 [0.007]; t₃₇₀₆ = 3.987; P < .001), as was the DeepEval score (mean [SD] difference, 0.038 [0.010]; t₃₇₀₆ = 3.747; P < .001). The correlation between the two metrics was negligible (Spearman ρ, -0.01), indicating that they captured distinct aspects of response alignment. LLM performance declined slightly across longer conversations (mean [SD] slope reduction for cosine similarity, -0.0005 [0.0016], and for DeepEval, -0.0005 [0.0022]), with increased verbosity and signs of reduced contextual grounding.

CONCLUSIONS AND RELEVANCE: In this cross-sectional study of 154 MI sessions, prompted LLMs showed general alignment with therapist responses in MI-oriented conversations, as judged by automated similarity metrics. However, limitations in long-range coherence, stylistic alignment, and the use of indirect proxies for therapeutic quality highlight the need for improved prompt design, MI-specific evaluation methods, and clinical validation before integration into mental health care.

PMID:41870428 | DOI:10.1001/jamanetworkopen.2026.2750

By Nevin Manimala
