Proc Mach Learn Res. 2025 Dec;297:136-151.
ABSTRACT
Clinical decisions often require balancing conflicting priorities rather than simply selecting a single “correct” answer. We present an evaluation framework that probes the value judgments embedded in large language models (LLMs) by testing how they assess the quality of glycemic control from continuous glucose monitoring (CGM) data. Using synthetic type 1 diabetes profiles, we asked five commercial LLMs to perform pairwise comparisons of CGM summary statistics and derived a percentile ranking for each profile. We then quantified alignment with two reference metrics: time in range (TIR) and the expert-derived Glycemia Risk Index (GRI), which was developed with clinician input regarding preferences across glycemic ranges. Across three insulin therapy modalities, newer models showed stronger correlation with GRI than older models, suggesting a generational shift toward expert consensus. However, a perturbation analysis revealed instances in which the models disagreed with the GRI on how heavily to weight mild hypoglycemia and mild hyperglycemia. These results demonstrate that high average agreement with clinical metrics can mask clinically meaningful misalignments in how LLMs prioritize risks. Our proposed framework reveals how LLM outputs reflect competing priorities in clinical contexts.
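The pipeline described above (pairwise comparisons, a percentile ranking per profile, then agreement with a reference metric) can be illustrated with a minimal Python sketch. Several parts are assumptions and not taken from the paper: the percentile is computed as a simple win rate, agreement is measured with Spearman rank correlation, the four profiles are made-up numbers, and `judge` is a hypothetical stand-in that prefers higher TIR where the study would query an LLM. The GRI weights follow the commonly published definition (3.0, 2.4, 1.6, and 0.8 for very low, low, very high, and high, capped at 100).

```python
from itertools import combinations

def gri(vlow, low, high, vhigh):
    """Glycemia Risk Index from percent time in each out-of-range band (capped at 100)."""
    return min(100.0, 3.0 * vlow + 2.4 * low + 1.6 * vhigh + 0.8 * high)

def tir(vlow, low, high, vhigh):
    """Time in range as the remainder after all out-of-range bands."""
    return 100.0 - (vlow + low + high + vhigh)

def percentile_ranks(n_profiles, outcomes):
    """Win rate (0-100) per profile from a list of (winner, loser) index pairs."""
    won = [0] * n_profiles
    played = [0] * n_profiles
    for winner, loser in outcomes:
        won[winner] += 1
        played[winner] += 1
        played[loser] += 1
    return [100.0 * won[i] / played[i] if played[i] else 0.0 for i in range(n_profiles)]

def average_ranks(values):
    """1-based ascending ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def judge(a, b, profiles):
    """Hypothetical stand-in for an LLM call: prefers the profile with higher TIR."""
    return a if tir(*profiles[a]) >= tir(*profiles[b]) else b

# Illustrative (made-up) profiles: percent time (very low, low, high, very high).
profiles = [(0, 1, 10, 2), (1, 4, 20, 5), (0, 2, 30, 10), (3, 6, 15, 3)]

outcomes = []
for a, b in combinations(range(len(profiles)), 2):
    winner = judge(a, b, profiles)
    outcomes.append((winner, b if winner == a else a))

percentiles = percentile_ranks(len(profiles), outcomes)
# Lower GRI is better, so correlate the ranking against negated GRI.
rho = spearman(percentiles, [-gri(*p) for p in profiles])
print(percentiles, round(rho, 3))
```

With these illustrative numbers the TIR-based judge and the GRI agree on the best and worst profiles but swap the middle two, because the GRI penalizes the fourth profile's extra time in hypoglycemia more heavily than its higher TIR rewards it. The resulting correlation is about 0.8, a small-scale picture of how a high overall correlation can coexist with disagreement on individual comparisons.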
PMID:42389650 | PMC:PMC13322355