Artificial intelligence chain-of-thought reasoning in nuanced medical scenarios: mitigation of cognitive biases through model intransigence

BMJ Qual Saf. 2025 Nov 24:bmjqs-2025-019299. doi: 10.1136/bmjqs-2025-019299. Online ahead of print.

ABSTRACT

BACKGROUND: Artificial intelligence large language models (LLMs) are increasingly used to inform clinical decisions but sometimes exhibit human-like cognitive biases when facing nuanced medical choices.

METHODS: We tested whether new chain-of-thought reasoning LLMs might mitigate cognitive biases observed in physicians. We presented medical scenarios (n=10) to models released by DeepSeek, OpenAI and Google. Each scenario was presented in two versions that differed according to a specific bias (eg, surgery framed in survival vs mortality statistics). Responses were categorised and the extent of bias was measured by the absolute discrepancy between responses to different versions of the same scenario. The extent of intransigence (also termed dogma or inflexibility) was measured by Shannon entropy. The extent of deviance in each scenario was measured by comparing the average model response to the average practicing physician response (n=2507).
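
As a rough illustration of these outcome measures (the study does not publish code, so the function names and example figures below are hypothetical), the bias, intransigence and deviance metrics described above could be computed along the following lines in Python:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (bits) of a categorical response distribution.
    Near-zero entropy indicates intransigent, all-or-none answers."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def bias_discrepancy(rate_version_a, rate_version_b):
    """Absolute discrepancy between the share of responses favouring an
    option under the two framings of the same scenario."""
    return abs(rate_version_a - rate_version_b)

def deviance(model_mean, physician_mean):
    """Gap between the average model response and the average response
    of practicing physicians for one scenario."""
    return abs(model_mean - physician_mean)

# Hypothetical framing scenario: 44% of runs choose surgery under the
# survival framing vs 5% under the mortality framing.
print(f"{bias_discrepancy(0.44, 0.05):.2f}")   # 0.39
print(shannon_entropy([100, 0]))               # 0.0 -> fully intransigent
print(shannon_entropy([50, 50]))               # 1.0 -> maximally mixed
```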

RESULTS: DeepSeek-R1 mitigated 6 out of 10 cognitive biases observed in practicing physicians by generating intransigent all-or-none responses. The four biases that persisted were post hoc fallacy (34% vs 0%, p<0.001), decoy effects (44% vs 5%, p<0.001), Occam’s razor fallacy (100% vs 0%, p<0.001) and hindsight bias (56% vs 0%, p<0.001). In every scenario, the average model response deviated substantially from the average response of practicing physicians (p<0.001 for all). Similar patterns of persistent specific biases, intransigent responses and substantial deviance from practicing physicians were also apparent in the OpenAI and Google models.
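
A minimal sketch of how such between-version comparisons might be tested, assuming a two-proportion comparison such as Fisher's exact test (the abstract does not name the statistical test, and the counts below are illustrative only):

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table for one scenario: counts of runs endorsing an option
# under version A vs version B of the vignette (e.g. 34% vs 0%); the
# denominators are illustrative because the abstract does not report them.
version_a = [34, 66]    # [endorsed, did not endorse] out of 100 runs
version_b = [0, 100]

_, p_value = fisher_exact([version_a, version_b])
print(f"p = {p_value:.1e}")   # far below 0.001 for these counts
```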

CONCLUSION: Some biases persist in chain-of-thought reasoning LLMs, and these models tend to produce intransigent recommendations. These findings highlight the need for clinicians to think broadly, respect diversity and remain vigilant when interpreting the output of chain-of-thought reasoning LLMs in nuanced medical decisions for patients.

PMID:41285583 | DOI:10.1136/bmjqs-2025-019299
