J Med Syst. 2025 Oct 7;49(1):127. doi: 10.1007/s10916-025-02276-y.
ABSTRACT
Artificial intelligence (AI), specifically large language models (LLM), have gained significant popularity over the last decade with increased performance and expanding applications. AI could improve the quality of patient care in medicine but hidden biases introduced during training could be harmful. This work utilizes GPT-4o-mini to generate patient communications based on systematically generated, synthetic patient data that would be commonly available in a patient’s medical record. To evaluate the AI generated communications for disparities, GPT-4o-mini was used to score the generated communications on empathy, encouragement, accuracy, clarity, professionalism, and respect. Disparities in scores associated with specific components of a patient’s history were used to detect potential biases. A patient’s sex and religious preference were found to have a statistically significant impact on scores. However, further work is needed to evaluate a wider collection of LLMs utilizing more specific and human validated scoring criteria. Overall, this work proposes a novel method of evaluating bias in LLMs by creating synthetic patient histories to formulate AI generated communications and score them with opportunities for further investigation.
PMID:41055822 | DOI:10.1007/s10916-025-02276-y