Assessing and mitigating demographic bias in large language models for diagnostic radiology

Jpn J Radiol. 2026 Jun 5. doi: 10.1007/s11604-026-02021-6. Online ahead of print.

ABSTRACT

PURPOSE: Large language models (LLMs) are increasingly integrated into radiology workflows, but their demographic biases have not been evaluated in diagnostic radiology. This study aimed to investigate racial and sex biases in the diagnostic performance of LLMs (text-only and vision models) in radiology and to evaluate whether prompting strategies mitigate these biases.

MATERIALS AND METHODS: This retrospective study included consecutive Diagnosis Please cases published in Radiology from April 1998 to October 2024, excluding cases with sex-specific diseases. For each case, eight race-sex scenarios were generated by altering four race/ethnicity categories (Asian, Black, Hispanic, White) and two sex categories (male, female). Three LLMs (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash) were evaluated as text-only models (medical history and imaging findings) and vision models (medical history and images) using three prompting strategies (basic, self-consistency, chain-of-thought prompting). Generalized estimating equations were used to compare diagnostic accuracy across race/ethnicity, sex, and prompting strategies.
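
The factorial design described above can be sketched in a few lines of Python. This is only an illustrative enumeration, not the authors' code: the category lists come from the abstract, while the function names and the wording produced by `describe_patient` are hypothetical, since the abstract does not give the actual prompt text.

```python
from itertools import product

# Categories as listed in the abstract.
RACES = ["Asian", "Black", "Hispanic", "White"]
SEXES = ["male", "female"]
MODELS = ["GPT-5", "Claude Sonnet 4.5", "Gemini 2.5 Flash"]
MODALITIES = ["text-only", "vision"]
PROMPTING = ["basic", "self-consistency", "chain-of-thought"]


def build_scenarios():
    """Return the eight race-sex scenarios generated for each case."""
    return [{"race": r, "sex": s} for r, s in product(RACES, SEXES)]


def build_conditions():
    """Return every model x modality x prompting-strategy combination."""
    return [
        {"model": m, "modality": mod, "prompting": p}
        for m, mod, p in product(MODELS, MODALITIES, PROMPTING)
    ]


def describe_patient(scenario):
    """Hypothetical demographic line; the study's real wording is not given."""
    return f"Patient demographics: {scenario['race']}, {scenario['sex']}."


scenarios = build_scenarios()
conditions = build_conditions()

# Scenario-condition combinations per case, not counting any repeated
# sampling that self-consistency prompting may involve.
combinations_per_case = len(scenarios) * len(conditions)

print(len(scenarios), len(conditions), combinations_per_case)
print(describe_patient(scenarios[0]))
```

Under these assumptions each case yields 8 scenarios and 18 model conditions. Only the race/ethnicity and sex fields vary between scenarios for a given case, so any difference in diagnostic accuracy can be attributed to the stated demographics.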

RESULTS: A total of 286 cases were included. Across models and conditions, ten significant race-related and four significant sex-related differences in diagnostic accuracy were observed. Among the four race/ethnicity groups, Black patients were most likely to have significantly lower accuracy (4/10 significant race-related comparisons [40%]) and least likely to have significantly higher accuracy (1/10 [10%]). Vision models with female patients under basic prompting accounted for more significant race-related differences (6/10 [60%]) than vision models with male patients or any text-only model. Overall diagnostic accuracy did not differ significantly across prompting strategies for either text-only or vision models (p = 0.78 and 0.95, respectively). However, basic and self-consistency prompting produced ten and four significant race- or sex-related differences, respectively, whereas chain-of-thought prompting produced none.
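
The counts reported above can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the abstract; the variable names are illustrative. Note that the split by prompting strategy (ten and four) mixes race- and sex-related findings, so it does not correspond one-to-one to the split by bias type (also ten and four).

```python
# Significant differences by bias type, as reported in the abstract.
race_related = 10
sex_related = 4

# Significant race- or sex-related differences by prompting strategy.
by_prompting = {"basic": 10, "self-consistency": 4, "chain-of-thought": 0}


def percent(count, total):
    """Share of `total` represented by `count`, as a whole-number percentage."""
    return round(100 * count / total)


total_by_type = race_related + sex_related
total_by_prompting = sum(by_prompting.values())

# Shares of the ten significant race-related comparisons.
black_lower = percent(4, race_related)
black_higher = percent(1, race_related)
vision_female_basic = percent(6, race_related)

print(total_by_type, total_by_prompting)
print(black_lower, black_higher, vision_female_basic)
```

Both ways of tallying give 14 significant differences in total, and the three percentages match the bracketed values in the text.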

CONCLUSION: Large language models exhibited racial and sex biases in diagnostic radiology, and chain-of-thought prompting may help mitigate these biases.

PMID:42247082 | DOI:10.1007/s11604-026-02021-6

By Nevin Manimala