Clin Chem Lab Med. 2026 Apr 20. doi: 10.1515/cclm-2026-0435. Online ahead of print.
ABSTRACT
OBJECTIVES: Large language models (LLMs) show promise for interpreting laboratory reports, yet real-world validation remains limited. This study evaluated five advanced LLMs in interpreting urinalysis reports for kidney diseases using real-world clinical data, providing empirical evidence for the utility of LLM-assisted result interpretation.
METHODS: We retrospectively collected 120 urinalysis reports from patients with primary glomerular diseases and secondary nephropathies. The testing platforms included the Sysmex UF5000 and Mindray EU8600. Five LLMs (ChatGPT-5, Claude-4.5, Gemini-2.5, DeepSeek-V3.1, Qwen-3) were tasked with interpreting reports across five functional dimensions. Four certified laboratory technologists and four licensed physicians evaluated outputs using a 5-point Likert scale across six quality dimensions. Statistical analyses employed Friedman and Wilcoxon signed-rank tests.
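The statistical approach named above (an omnibus Friedman test across the five related rating samples, followed by pairwise Wilcoxon signed-rank comparisons) can be sketched as follows. This is a minimal illustration only: the rating data are synthetic, and the Bonferroni correction for the pairwise comparisons is an assumption, not a detail stated in the abstract.

```python
# Hedged sketch of the Friedman + Wilcoxon analysis described in METHODS.
# All ratings below are SYNTHETIC; 120 reports x 8 raters = 960 Likert
# scores per model is inferred from the study design, and the Bonferroni
# adjustment is an assumption not stated in the abstract.
from itertools import combinations

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
models = ["ChatGPT-5", "Claude-4.5", "Gemini-2.5", "DeepSeek-V3.1", "Qwen-3"]

# Synthetic 5-point Likert ratings (values 3-5) for each model.
ratings = {m: rng.integers(3, 6, size=960).astype(float) for m in models}

# Omnibus test: do the five related samples differ?
stat, p = friedmanchisquare(*ratings.values())
print(f"Friedman chi2 = {stat:.2f}, p = {p:.4f}")

# Post-hoc pairwise Wilcoxon signed-rank tests, Bonferroni-corrected.
pairs = list(combinations(models, 2))
for a, b in pairs:
    diff = ratings[a] - ratings[b]
    if np.any(diff):  # wilcoxon requires at least one nonzero difference
        _, pw = wilcoxon(ratings[a], ratings[b])
        p_adj = min(pw * len(pairs), 1.0)
        print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```

Nonparametric tests are the natural choice here because Likert ratings are ordinal and paired (every rater scores every model on the same reports), which matches the related-samples assumption of both tests.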
RESULTS: All five LLMs demonstrated clinical utility in interpreting urinalysis reports. Proprietary LLMs achieved higher overall scores (Claude-4.5: 4.78 ± 0.47; ChatGPT-5: 4.73 ± 0.50; Gemini-2.5: 4.69 ± 0.54) than open-source LLMs (DeepSeek-V3.1: 4.58 ± 0.66; Qwen-3: 4.57 ± 0.69). Across functional dimensions, the models performed proficiently in identifying abnormal parameters and analyzing their correlations, but suboptimally in interpreting instrument flags. Instrument-dependent variability was observed (Sysmex vs. Mindray, p < 0.001). In quality assessments, Claude-4.5 exhibited the best overall performance, ChatGPT-5 excelled in accuracy and clarity, and Gemini-2.5 demonstrated strong practicality. Regarding safety, Claude-4.5 exhibited the lowest hallucination rate (7.5 %). Common hallucinations included misinterpretation, definition errors, and over-interpretation.
CONCLUSIONS: LLMs demonstrate substantial capability in urinalysis interpretation, though proprietary models currently excel in reasoning and hallucination resistance. Instrument-specific flag interpretation and hallucination mitigation remain critical challenges requiring Retrieval-Augmented Generation (RAG) integration and human oversight.
PMID:42033087 | DOI:10.1515/cclm-2026-0435