Streamlining Ophthalmic Documentation With Anonymized, Fine-Tuned Language Models: Feasibility Study

Interact J Med Res. 2025 Nov 26;14:e72894. doi: 10.2196/72894.

ABSTRACT

BACKGROUND: The growing administrative burden on clinicians, particularly in medical documentation, contributes to burnout and may compromise patient safety. Recent advancements in generative artificial intelligence (AI) offer a promising solution to improve documentation processes and address these challenges.

OBJECTIVE: This study aims to evaluate the feasibility of using a fine-tuned OpenAI Curie model to automate the generation of medical report summaries (epicrises) in ophthalmology. By assessing the model’s performance through human and automated evaluations, this study seeks to determine its potential for reducing clinician workload while ensuring accuracy, usefulness, and compliance with regulatory requirements.

METHODS: A data set of around 60,000 anonymized medical letters was created using a custom algorithm to comply with General Data Protection Regulation guidelines. The Curie model was fine-tuned on this data set to generate epicrises from medical histories, diagnoses, and findings. The performance evaluation involved various human assessments and automated evaluations from 2 large language models (LLMs).

RESULTS: In the clinical context, 49.9% (384/769) of epicrises were evaluated as helpful or excellent, whereas only 25.2% (194/769) were considered disturbing. In a human (manual) evaluation, formal correctness was rated significantly higher than the neutral midpoint of 2.5 on the 4-point rating scale, as determined by a 1-sample Wilcoxon signed-rank test (mean 3.59, SD 0.85; W=1686; P<.001). Using paired t tests, we found a significant reduction in time: manually writing an epicrisis took longer than correcting an AI-generated one (mean 109.52, SD 53.30 s vs mean 54.25, SD 63.34 s; t₆₈=3.39; P<.01). While medical accuracy and usefulness showed positive trends, these did not reach statistical significance when compared to the neutral midpoint (medical accuracy: W=7456; P=.08; usefulness: W=7652.5; P=.18). Epicrises generated or corrected with AI were significantly shorter than manually written ones (mean 330.43, SD 115.42 vs mean 501.07, SD 243.50 characters; t₆₈=-6.10; P<.001). Automated assessments by the 2 LLMs showed alignment with human ratings, with 52.4% (356/679) and 65.8% (489/743) of responses in the top agreement categories, respectively. This supports overall consistency, though the comparison remains a proof of concept given methodological limitations.
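
The paired t test and the reported proportions can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `paired_t` and `pct` and the timing values are hypothetical and chosen only for illustration, while the counts passed to `pct` are taken from the results above.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t, df) for a paired t test on two equal-length samples."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("samples must have equal length of at least 2")
    diffs = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)  # sample SD of the paired differences
    n = len(diffs)
    return mean_d / (sd_d / math.sqrt(n)), n - 1


def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# Hypothetical per-case timings in seconds (NOT study data).
manual_s = [120.0, 95.0, 140.0, 80.0, 110.0]
corrected_s = [60.0, 50.0, 90.0, 45.0, 40.0]
t_stat, df = paired_t(manual_s, corrected_s)

# Proportions recomputed from the counts reported in the abstract.
helpful = pct(384, 769)
disturbing = pct(194, 769)
```

A positive t statistic here means the first sample (manual writing) took longer on average than the second (correcting an AI draft), which matches the direction of the reported t₆₈=3.39.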

CONCLUSIONS: Our study demonstrates the technical and practical feasibility of introducing fine-tuned commercial LLMs into clinical practice. The AI-generated epicrises were formally and clinically correct in many cases and showed time-saving potential. While medical accuracy and usefulness varied across cases and should be a focus of further development, a significant workload reduction is likely. Our anonymization process showed that regulatory challenges in using AI with patient data can be addressed effectively. In summary, this study highlights the promise of transformer-based LLMs in reducing administrative tasks in health care. It outlines a pipeline for integrating LLMs into European Union clinical practice, emphasizing the need for careful implementation to ensure efficiency and patient safety.

PMID:41297038 | DOI:10.2196/72894

By Nevin Manimala