JMIR AI. 2026 Mar 24;5:e70017. doi: 10.2196/70017.
ABSTRACT
BACKGROUND: Despite the growing potential of large language models (LLMs) in mental health services, evidence on their capabilities in diagnostic processes remains limited.
OBJECTIVE: This study describes the development and evaluation of CapyEngine, an LLM-powered diagnostic tool designed to assist in the diagnosis of mental disorders.
METHODS: We developed and evaluated CapyEngine through 3 phases. In phase 1, we created a disorder and symptom database using the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition, Text Revision (DSM-5-TR). We then designed and developed CapyEngine’s architecture using LLMs, embedding models, and vector search. In phase 2, we conducted usability testing with mental health professionals (n=7). In phase 3, we compared CapyEngine’s diagnostic accuracy against ChatGPT-4o and clinicians using 35 standardized case scenario exam questions from psychiatry and clinical psychology board exams. Questions were input into CapyEngine, and the top 10 recommended diagnoses were obtained. ChatGPT-4o was prompted to provide the top 10 potential diagnoses for each question. Clinicians (n=3) received similar instructions to generate at least 10 potential diagnoses for each question. Responses were then analyzed to compare the diagnostic accuracy of CapyEngine, ChatGPT-4o, and clinicians. Accuracy was measured as the percentage of questions for which the correct answer appeared among the top 10 (least stringent), top 5, or top 1 (most stringent) entries of the diagnosis list.
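To make the ranking and scoring steps concrete, the sketch below is a minimal illustration, not the authors' code: it assumes cosine similarity over precomputed embeddings, and all names (rank_diagnoses, top_k_accuracy, disorder_vecs) are hypothetical. It shows one way a vector-search ranker over a DSM-5-TR-style disorder database and the top-k accuracy metric used across the three benchmarks could be implemented.

```python
import numpy as np

def rank_diagnoses(case_vec: np.ndarray,
                   disorder_vecs: np.ndarray,
                   disorder_names: list[str],
                   k: int = 10) -> list[str]:
    """Return the k disorders whose embeddings are most
    cosine-similar to the embedded case vignette."""
    sims = disorder_vecs @ case_vec / (
        np.linalg.norm(disorder_vecs, axis=1) * np.linalg.norm(case_vec)
    )
    top = np.argsort(-sims)[:k]          # indices of the k highest similarities
    return [disorder_names[i] for i in top]

def top_k_accuracy(ranked_lists: list[list[str]],
                   correct: list[str],
                   k: int) -> float:
    """Share of questions whose correct diagnosis appears
    in the first k entries of the ranked diagnosis list."""
    hits = sum(ans in ranked[:k]
               for ranked, ans in zip(ranked_lists, correct))
    return hits / len(correct)
```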
RESULTS: Preliminary user interviews reflected high acceptability and feasibility of CapyEngine. Across diagnostic accuracy thresholds, ChatGPT-4o consistently outperformed both CapyEngine and clinicians in broader rankings (top 10 and top 5 benchmarks; all P<.03). Clinicians showed significantly higher accuracy than CapyEngine using the top 5 benchmark (odds ratio 0.26, 95% CI 0.09-0.78; P=.02). For the top 1 benchmark, no significant differences were observed, although clinicians showed a borderline advantage over ChatGPT-4o (odds ratio 0.34, 95% CI 0.13-0.91; P=.05). Regarding the range and slope of diagnostic accuracy decline across benchmarks (least to most stringent), CapyEngine showed the smallest decline (0.14) and flattest slope (-0.07), reflecting more consistent and constrained diagnostic ranking behavior as evaluation thresholds became more stringent. Clinicians exhibited a moderate decline (0.26), whereas ChatGPT-4o demonstrated a sharp decrease (0.69) in accuracy when only the top-ranked diagnosis was considered, consistent with broader diagnostic coverage at less stringent thresholds.
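One plausible reading of the range and slope statistics is sketched below; it rests on an assumption not stated in the abstract, namely that the slope is the mean per-step change across the three benchmarks (top 10, top 5, top 1) treated as evenly spaced. The helper name decline_and_slope is hypothetical.

```python
def decline_and_slope(acc: list[float]) -> tuple[float, float]:
    """acc = [top-10, top-5, top-1] accuracy for one system or rater,
    ordered least to most stringent."""
    decline = acc[0] - acc[-1]           # total drop in accuracy (range)
    slope = -decline / (len(acc) - 1)    # mean change per benchmark step
    return decline, slope

# Consistency check against the reported CapyEngine figures: a decline
# of 0.14 spread over two benchmark steps gives a slope of -0.07.
```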
CONCLUSIONS: Overall, ChatGPT-4o achieved the highest accuracy at less stringent benchmarks (top 10 and top 5), while clinician performance did not differ significantly from that of ChatGPT-4o in identifying the single most likely diagnosis. Although CapyEngine was less accurate overall, it exhibited more consistent and constrained diagnostic ranking across evaluation benchmarks, likely reflecting its DSM-5-TR-based, domain-specific design rather than broader diagnostic coverage. Nonetheless, CapyEngine shows promise as a tool to augment the mental health diagnostic process, and further research is needed to evaluate the risks and benefits of integrating artificial intelligence systems, such as CapyEngine, into clinical workflows.
PMID:41875403 | DOI:10.2196/70017