Categories
Nevin Manimala Statistics

International Normalized Ratio (INR) Sample Rejection in Neck of Femur Fracture Patients: A Retrospective Closed-Loop Study From a Major UK Trauma Centre

Cureus. 2025 Nov 17;17(11):e97097. doi: 10.7759/cureus.97097. eCollection 2025 Nov.

ABSTRACT

Background Patients presenting with neck-of-femur (NOF) fractures often require urgent surgery, as delays beyond 36 hours are associated with increased morbidity, mortality, and length of hospital stay, whereas shorter time-to-surgery intervals have been shown to improve outcomes. Many of these patients are elderly and on anticoagulant therapy; therefore, accurate International Normalized Ratio (INR) assessment is crucial for determining surgical readiness and anaesthetic safety. The INR reflects the extrinsic pathway of coagulation and is prolonged in patients who are on warfarin or have underlying coagulopathies. Inaccurate or rejected INR samples delay operative clearance, prolong fasting, and increase bed occupancy and cost of treatment. A frequent pre-analytical cause of INR rejection is underfilling of sodium citrate tubes, which alters the required 9:1 blood-to-anticoagulant ratio.

Objective To reduce the rejection rate of INR samples through a simple phlebotomy intervention involving staff education and the appropriate use of a discard tube before citrate collection.

Methods A retrospective two-cycle closed-loop audit was conducted at Heartlands Hospital, part of the University Hospitals Birmingham (UHB) NHS Foundation Trust. The first cycle included NOF fracture patients admitted between July and August 2023, during which 399 INR samples were analysed. Of these, 66 (16.5%) were rejected, 62 (94%) due to underfilling and four (6%) due to haemolysis. Following targeted interventions, including staff education on correct discard-tube use with butterfly systems and the introduction of shorter-tubing blood collection sets, a second audit cycle was performed and included NOF patients admitted between July and August 2025. In this cycle, 261 INR samples were reviewed, of which 29 (11.1%) were rejected, 27 (93%) for underfilling and two (7%) for haemolysis. Rejection proportions were compared between cycles, and absolute and relative changes were calculated. Statistical significance of the observed difference was assessed using a two-proportion z-test (two-sided, α = 0.05). Data collection and review were performed using the UHB trust Prescribing Information and Communication System (PICS) electronic system.

Results The INR sample rejection rate decreased from 16.5% in cycle 1 to 11.1% in cycle 2, an absolute reduction of 5.4 percentage points and a relative reduction of ~33% (p ≈ 0.11). Among patients taking some form of blood thinner (e.g. warfarin, direct oral anticoagulants, low molecular weight heparin; 132 (33%) in cycle 1 and 56 (25%) in cycle 2), INR rejection occurred in 17.4% and 19.6%, respectively. Most rejections were due to underfilling the INR sample tube.

Conclusions In trauma patients, particularly those awaiting urgent NOF surgery, preventing INR sample rejection can significantly reduce avoidable operative delays. This closed-loop audit demonstrated that a simple, low-cost intervention focused on correct tube filling, discard-tube use, and appropriate equipment selection led to a clinically meaningful reduction in INR sample rejection rates. Most remaining rejections are preventable, underscoring the importance of continuous education, reinforcement of best practice, and regular re-audit to sustain long-term improvement.
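The absolute and relative changes quoted in this abstract follow directly from the reported counts (66 of 399 samples rejected in cycle 1, 29 of 261 in cycle 2). The sketch below reproduces that arithmetic only; the function name `rejection_change` is a hypothetical helper, not something from the study, and the significance test is deliberately left out.

```python
def rejection_change(rejected_1, total_1, rejected_2, total_2):
    """Return both rejection rates plus the absolute and relative reduction."""
    rate_1 = rejected_1 / total_1
    rate_2 = rejected_2 / total_2
    absolute = rate_1 - rate_2      # difference in proportions
    relative = absolute / rate_1    # reduction as a share of the cycle 1 rate
    return rate_1, rate_2, absolute, relative


# Counts taken from the abstract: cycle 1 (2023) and cycle 2 (2025).
rate_1, rate_2, absolute, relative = rejection_change(66, 399, 29, 261)

print(f"cycle 1: {rate_1:.1%}, cycle 2: {rate_2:.1%}")
print(f"absolute reduction: {absolute * 100:.1f} percentage points")
print(f"relative reduction: {relative:.0%}")
```

Keeping the absolute change in percentage points and the relative change as a share of the baseline rate avoids confusing the two figures, which differ sixfold here.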

PMID:41416323 | PMC:PMC12711245 | DOI:10.7759/cureus.97097

Role of Diffusion Tensor Imaging-Derived Metrics for the Assessment of Deranged Myelination in Children With Developmental Delay: A Hospital-Based Observational Study

Cureus. 2025 Nov 16;17(11):e96972. doi: 10.7759/cureus.96972. eCollection 2025 Nov.

ABSTRACT

Background Diffusion tensor imaging (DTI) is a sensitive neuroimaging modality that evaluates white matter microstructure by measuring parameters such as fractional anisotropy (FA), mean diffusivity (MD), radial diffusivity (RD), and axial diffusivity (AD). Alterations in these metrics can indicate myelin disruption, axonal injury, or generalized microstructural changes. This study aimed to assess white matter integrity in children with developmental delay and evaluate the diagnostic performance of DTI metrics using receiver operating characteristic (ROC) analysis. Methods A hospital-based observational study was conducted on pediatric subjects with neurodevelopmental abnormalities. DTI was performed, and FA, MD, RD, and AD were quantified across major tracts, including the genu and splenium of the corpus callosum, anterior and posterior limbs of the internal capsule, superior longitudinal fasciculus (SLF), inferior fronto-occipital fasciculus (IFOF), and frontal and parietal white matter. Age-matched controls served as the comparison group. Statistical analysis included group comparisons and ROC curve evaluation for diagnostic accuracy. Results Significant reductions in FA were observed in the genu of the corpus callosum (0.34±0.07), frontal white matter (0.31±0.06), parietal white matter (0.33±0.05), SLF (0.35±0.06), and IFOF (0.37±0.05) (p<0.01 vs. controls). RD was significantly elevated in these regions (0.92-0.97×10⁻³ mm²/s; p<0.01), consistent with demyelination or delayed myelination. MD values were diffusely elevated (1.06-1.12×10⁻³ mm²/s), supporting generalized microstructural disruption, whereas AD showed mild but non-significant changes. ROC analysis demonstrated that FA had the highest diagnostic accuracy: genu of the corpus callosum (AUC 0.89), frontal white matter (AUC 0.91), and SLF (AUC 0.88). RD and MD also showed strong discriminatory ability, while AD performed less robustly. 
Conclusion Reduced FA, along with elevated RD and MD, reliably reflects white matter microstructural injury in pediatric populations. ROC analysis confirmed FA as the most sensitive biomarker, with high sensitivity and specificity. DTI metrics hold strong clinical potential for the early detection of neurodevelopmental white matter abnormalities.

PMID:41416317 | PMC:PMC12709428 | DOI:10.7759/cureus.96972

A Randomized Controlled Trial Comparing Isobaric Versus Hypobaric Plus Isobaric Bupivacaine in Thoracic Segmental Spinal Anesthesia for the Reduction of Shoulder Pain During Laparoscopic Cholecystectomy

Cureus. 2025 Nov 17;17(11):e97048. doi: 10.7759/cureus.97048. eCollection 2025 Nov.

ABSTRACT

Introduction Laparoscopic cholecystectomy (LC) is traditionally performed under general anesthesia (GA). However, thoracic segmental spinal anesthesia (TSSA), where low doses of local anesthetics (LA), often with adjuvants, are used at thoracic spinal levels, is also being explored by some researchers. Shoulder pain is a common issue during LC, adversely impacting the patient’s perioperative experience. A combination of hypobaric and isobaric LA at the thoracic level has been described to mitigate this complication. The primary objective of the study was to compare the efficacy of a combination of hypobaric and isobaric bupivacaine versus isobaric bupivacaine alone during TSSA in LC in reducing intraoperative shoulder pain. The secondary objectives were to assess the incidence of adverse effects (hypotension, bradycardia, nausea, vomiting, etc.) and to evaluate patient and surgeon satisfaction. Methods This randomized, controlled, open-label study was conducted at a tertiary care center after receiving ethical approval and registration at the Clinical Trial Registry of India. A total of 90 patients were recruited, with 45 participants in each group, aged 20-70 years, with ASA Physical Status I-II, scheduled for elective LC. Exclusion criteria included BMI > 35, contraindications to regional anesthesia, allergy to study drugs, spinal deformity, and previous abdominal surgery. Patients were randomly assigned to two groups using computer-generated random numbers. Anesthesia was provided by a senior consultant proficient in TSSA. Group 2 received hypobaric and isobaric bupivacaine, and Group 1 received only isobaric bupivacaine. Both groups received 11 mg bupivacaine with 5 μg dexmedetomidine in TSSA. Data were collected using Microsoft Excel (Microsoft® Corp., Redmond, WA, USA) and analyzed using IBM SPSS Statistics for Windows, Version 23 (Released 2015; IBM Corp., Armonk, NY, USA). 
Continuous variables were expressed as means ± SD and analyzed using an independent t-test or the Mann-Whitney U-test, depending on the distribution of data. Categorical variables were compared using the Chi-square or Fisher’s exact test. A p-value of <0.05 was considered statistically significant. Results All 90 patients in both groups successfully underwent LC under TSSA with no conversion to GA. The mean age of the participants was 49.11 ± 7.4 years with 58 (64.4%) females. Both groups were comparable in terms of demographic parameters. Intraoperative clinical parameters were comparable in both groups, without any statistically significant differences. Six (13.3%) patients in Group 1 and five (11.1%) patients in Group 2 had hypotension, which was easily corrected with a fluid bolus and a single 6 mg dose of intravenous mephentermine. Six (13.3%) patients in Group 1 reported shoulder pain, whereas in Group 2 only one (2.2%) patient had shoulder pain intraoperatively. Patient and surgeon satisfaction scores were better in Group 2, which was statistically significant. The number needed to treat (NNT) of nine indicates that approximately nine patients would need to receive the hypobaric + isobaric regimen to prevent one case of intraoperative shoulder pain. Conclusions LC can be performed successfully under TSSA with stable hemodynamics. The addition of hypobaric bupivacaine to isobaric bupivacaine provided better shoulder-tip pain control and fewer postoperative complications. Further studies with larger sample sizes are needed to validate these findings.
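The number needed to treat quoted in the Results can be recovered from the reported shoulder-pain counts (six of 45 patients in Group 1, one of 45 in Group 2). This is a minimal sketch of that calculation; `number_needed_to_treat` is a hypothetical helper name, and the equal group sizes of 45 are taken from the Methods.

```python
def number_needed_to_treat(events_control, n_control, events_treated, n_treated):
    """Return (absolute risk reduction, NNT) for a two-arm comparison."""
    arr = events_control / n_control - events_treated / n_treated
    if arr <= 0:
        raise ValueError("NNT is undefined when the treated arm is not better")
    return arr, 1 / arr


# Group 1 (isobaric only): 6/45 with shoulder pain.
# Group 2 (hypobaric + isobaric): 1/45 with shoulder pain.
arr, nnt = number_needed_to_treat(6, 45, 1, 45)

print(f"absolute risk reduction: {arr:.1%}")
print(f"number needed to treat: {nnt:.1f}")
```

The absolute risk reduction is 5/45, so its reciprocal is nine, matching the figure in the abstract.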

PMID:41416305 | PMC:PMC12709560 | DOI:10.7759/cureus.97048

The Impact of Probiotics on Acne Vulgaris: A Meta-Analysis of Randomized Controlled Trials

Cureus. 2025 Nov 16;17(11):e97010. doi: 10.7759/cureus.97010. eCollection 2025 Nov.

ABSTRACT

Acne vulgaris is a multifactorial inflammatory skin disorder influenced by hormonal activity, microbial imbalance, and immune dysregulation. While conventional treatments such as antibiotics and retinoids remain effective, their long-term use is often limited by side effects, resistance, and poor adherence. This meta-analysis evaluated the efficacy of probiotics as an adjunct or alternative therapy for acne management. Four randomized controlled trials involving 227 participants were analyzed, showing that probiotic supplementation reduced acne severity scores (OR 0.48; 95% CI 0.29-0.79) and non-inflammatory lesion counts (mean difference (MD) -4.62; 95% CI -8.10 to -1.15) compared with controls. A trend toward improvement in inflammatory lesions was observed (MD -2.03; 95% CI -5.46 to 1.41) but was not statistically significant. Heterogeneity across studies ranged from moderate to high, reflecting variation in probiotic strains, formulations, and treatment durations. While these findings suggest a potential benefit of probiotics, the limited number and quality of trials warrant cautious interpretation. Larger, standardized clinical studies are needed to confirm efficacy and identify optimal probiotic regimens for acne management.

PMID:41416302 | PMC:PMC12709052 | DOI:10.7759/cureus.97010

Early Weight-Bearing Following Ankle Fixation in a United Kingdom Major Trauma Centre: Real-World Adoption Versus a Contemporary Randomised Trial and British Orthopaedic Association Standards for Trauma

Cureus. 2025 Nov 17;17(11):e97076. doi: 10.7759/cureus.97076. eCollection 2025 Nov.

ABSTRACT

Background Ankle fractures are common and place a substantial burden on services. Contemporary guidance generally permits weight-bearing as tolerated after stable fixation with early outpatient review. Recent randomised evidence suggests that beginning weight-bearing at around two weeks after open reduction and internal fixation (ORIF) can achieve at least comparable functional outcomes without increased complications and may be resource-efficient. Despite this, the adoption of early weight-bearing remains variable. We evaluated our centre’s timing to first weight-bearing and early safety in the year following the dissemination of new evidence. Methods We conducted a single-centre retrospective cohort study at a UK major trauma centre (September 2024-August 2025). Adults (≥18 years) undergoing ankle ORIF were included; exclusions were hindfoot nails, tibial plafond (pilon) fractures, open fractures, or missing follow-up. Data sources were the local trauma database, operative notes, discharge summaries, fracture-clinic letters, imaging, and general practitioner records. Variables included age, sex, length of inpatient stay (LOS), fracture pattern (unimalleolar/bimalleolar/trimalleolar), posterior malleolus fixation (yes/no), syndesmosis fixation (none/screw/suture-button), and discharge device (cast/boot). The primary outcome was time to first weight-bearing (bands: at 2 weeks, 2-6 weeks, >6 weeks; sub-bands 6-8 and >8 weeks). Safety within eight weeks comprised unplanned emergency department/clinic contact, re-operation, and radiographic loss of reduction. Analyses used descriptive statistics (mean/standard deviation (SD); median/interquartile range (IQR)); between-group comparisons employed Kruskal-Wallis or Mann-Whitney U for days to weight-bearing and chi-square for proportions >6 weeks (two-sided p<0.05). Continuous outcomes (LOS, days to first weight-bearing) were non-normally distributed (Shapiro-Wilk p<0.001); hence, non-parametric tests were used. 
Results Forty-two patients were included. The mean age was 51.1 years (SD 16.5), with a median age of 49.5 years (IQR 39.2-62.5). LOS had a mean of 7.0 days (SD 7.9) and a median of 3.5 days (IQR 1.0-13.0). Time to first weight-bearing: at 2 weeks 2/42 (4.8%), 2-6 weeks 9/42 (21.4%), >6 weeks 31/42 (73.8%) (including 6-8 weeks 21/42 (50.0%), >8 weeks 10/42 (23.8%)). Safety ≤8 weeks in the 6-week or longer group showed unplanned contact 1/42 (2.4%), re-operation 0/42, loss of reduction 0/42, and delayed union 2/42 (4.8%). Safety outcomes in the 2-week group showed no complications (0/42 for all parameters). Days to first weight-bearing did not differ significantly by fracture pattern (p=0.066) or syndesmosis fixation (p=0.383); posterior malleolus fixation was associated with a longer time (Mann-Whitney p=0.036). Proportions exceeding six weeks did not differ significantly across subgroups. Conclusions Early weight-bearing after ankle ORIF was seldom implemented locally, with most patients first weight-bearing at ≥6 weeks despite reassuring short-term safety. In light of the recent clinical guidance, a default “weight-bearing as tolerated from two weeks” pathway (with clearly defined exceptions), standardised discharge/clinic instructions, and planned re-audit may improve implementation without compromising safety.

PMID:41416288 | PMC:PMC12710795 | DOI:10.7759/cureus.97076

The Risk Factors for Complications Following Intestinal Stoma Reversal

Cureus. 2025 Nov 16;17(11):e97018. doi: 10.7759/cureus.97018. eCollection 2025 Nov.

ABSTRACT

Background and objective The restoration of bowel continuity after temporary intestinal stoma formation is a routine general surgical procedure. However, stoma reversal is associated with significant postoperative morbidity and mortality. Surgeons may face challenges such as dense adhesions, iatrogenic bowel injury, or even procedure abandonment, as well as postoperative complications such as surgical site infection (SSI), anastomotic leak, intra-abdominal sepsis, and death. This study aimed to identify common complications and their risk factors to facilitate strategies for optimizing surgical outcomes. Specifically, it sought to determine the incidence and pattern of early (within 30 days) postoperative complications, grade their severity using the Clavien-Dindo classification, and analyze the risk factors associated with complications following intestinal stoma reversal. Methods A prospective observational study was conducted in the Department of General Surgery at a tertiary care medical college hospital. All consenting adult patients undergoing reversal of a temporary intestinal stoma were included. Demographic, clinical, and operative variables – including age, sex, comorbidities, stoma type, indication, local pathology, interval between stoma creation and closure, preoperative chemo- or radiotherapy, hemoglobin, serum albumin, surgical technique, and use of postoperative nutritional support – were recorded. Outcome measures included intraoperative technical difficulty, iatrogenic bowel injury, postoperative complications, SSI, anastomotic leak, and mortality. Statistical analysis was performed using the Chi-square or Fisher’s exact test for categorical data and the t-test for continuous data. A p-value <0.05 was considered statistically significant. Results Seventy patients undergoing stoma reversal were included. 
Technical difficulty and iatrogenic bowel injury occurred in 23 (32.8%) patients and were significantly associated with colostomy reversal (p=0.002) and end stoma reversal (p=0.0059). Postoperative complications occurred in 32 patients (45.7%). The most common complication was SSI in 26 (37.1%), followed by anastomotic leak in six (8.6%), intra-abdominal abscess in four (5.7%), abdominal wall dehiscence in four (5.7%), and enterocutaneous fistula in three (4.2%). There were four deaths (5.7%), all due to sepsis following anastomotic leak in patients with comorbidities. Preoperative serum albumin <3.5 g/dL was significantly associated with mortality (p=0.0007), while postoperative nutritional support significantly reduced complications (p=0.001). Conclusions Stoma reversal is linked to considerable morbidity and mortality; hence, the decision to create a diverting stoma should be made judiciously. Ileostomy reversal is technically easier and safer than colostomy or Hartmann’s reversal and may be preferred when diversion is indicated. Delayed reversal beyond three months, optimization of comorbidities, correction of hypoalbuminemia (to >3.5 g/dL), and postoperative nutritional support are recommended to minimize complications and improve outcomes.

PMID:41416276 | PMC:PMC12709135 | DOI:10.7759/cureus.97018

A Single-Center Retrospective Study on Noninvasive Prediction of Terson Syndrome in Aneurysmal Subarachnoid Hemorrhage (aSAH) Patients: The Role of CT-Measured Posterior Globe Thickness and Age

Cureus. 2025 Nov 17;17(11):e97061. doi: 10.7759/cureus.97061. eCollection 2025 Nov.

ABSTRACT

Background Terson syndrome (TS), an intraocular hemorrhage secondary to aneurysmal subarachnoid hemorrhage (aSAH), has a high incidence rate. Clinically, patients with aSAH often present with concomitant TS; however, owing to the difficulty in performing ophthalmic examinations in critically ill patients, many cases may be missed. This study aimed to develop and evaluate a CT-based diagnostic model incorporating posterior globe thickness to predict TS in patients with aSAH. Materials and methods This was a retrospective study on patients who underwent direct surgery or endovascular treatment for ruptured cerebral aneurysms at our institution between January 1, 2018, and August 31, 2025 (analyzed by eye). We extracted data from eyes definitively diagnosed with TS via ophthalmic examination. In addition to collecting epidemiological and clinical data, posterior globe thickness was measured for each eye. Statistical analyses included the Mann-Whitney U test, chi-square test, generalized estimating equation (GEE) logistic regression analysis, and receiver operating characteristic (ROC) analysis. Statistical significance was set at p < 0.05. Results A total of 177 patients (354 eyes) received aSAH treatment, of whom 26 individuals (52 eyes) underwent ophthalmic examination, and within this subgroup, 11 patients (17 eyes) were diagnosed with TS. In the univariate GEE logistic regression analysis, the presence of TS was significantly correlated with age (p=0.005), World Federation of Neurosurgical Societies (WFNS) grade (p=0.021), complaints of visual and visual field impairment (p=0.021), and posterior globe thickness (p=0.038). The multivariate GEE logistic regression analysis demonstrated that age and posterior globe thickness significantly influenced the risk of developing TS. 
In this final multivariate model, the odds of having TS were multiplied by 0.85 for every one-year increase in age (p=0.007) and by 13.74 for every 1 mm increase in posterior globe thickness (p=0.027). ROC analysis, performed using this final multivariate model, yielded an age-dependent cutoff for posterior globe thickness: Cutoff (mm) ≈ -1.295 + 0.0637 × Age (years), which showed a sensitivity and specificity of 82.4% and 82.9%, respectively. Conclusion This study proposes a noninvasive prediction model for estimating TS based on CT measurements of posterior globe thickness. The findings suggest that incorporating age significantly enhances the diagnostic utility of this measurement as a practical triage tool. To ensure broad generalizability and facilitate its application in clinical practice, prospective multicenter trials are necessary to validate these results.
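The age-dependent cutoff reported above is a simple linear function of age, so it can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and treating a thickness at or above the cutoff as a positive screen is an assumption, since the abstract does not state how values exactly on the threshold are classified.

```python
def posterior_globe_cutoff_mm(age_years: float) -> float:
    """Cutoff reported in the abstract: -1.295 + 0.0637 * age (years)."""
    return -1.295 + 0.0637 * age_years


def flags_possible_ts(thickness_mm: float, age_years: float) -> bool:
    # Assumption: thickness at or above the cutoff counts as a positive screen.
    return thickness_mm >= posterior_globe_cutoff_mm(age_years)


# Show how the threshold shifts with age.
for age in (40, 60, 80):
    print(f"age {age}: cutoff {posterior_globe_cutoff_mm(age):.2f} mm")
```

The cutoff rises with age because older age lowers the odds of TS in the model, so a greater thickness is needed to reach the same predicted risk.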

PMID:41416274 | PMC:PMC12710446 | DOI:10.7759/cureus.97061

The therapeutic functions of poetry in mental health: A systematic review and meta-analysis

Psychiatry Res. 2025 Dec 11;356:116897. doi: 10.1016/j.psychres.2025.116897. Online ahead of print.

ABSTRACT

BACKGROUND: Poetry therapy, whether through reading, writing, or discussing poems in a therapeutic context, is increasingly applied in mental and physical health care, yet its empirical support remains unclear. This systematic review and meta-analysis examined the effectiveness of poetry-based interventions across psychiatric and somatic outcomes.

METHODS: PubMed and Google Scholar were searched up to November 2023 for original studies evaluating poetry-based interventions on mental or physical health outcomes. Studies in English or French using individual or group poetry activities were eligible. Fifteen studies (randomized controlled, case-control, and pre-post designs) met inclusion criteria; those scoring ≥6 on the Newcastle-Ottawa Scale were included in the meta-analysis. Random-effects models were used to pool standardized mean differences for post-traumatic stress disorder (PTSD), major depressive disorder (MDD), anxiety, resilience, stress, and perceived pain. Heterogeneity, prediction intervals, and publication bias (Egger’s test) were assessed.

RESULTS: Poetry-based interventions were associated with large reductions in PTSD symptoms and significant improvements in depressive symptoms, anxiety, and stress, with effect sizes generally in the moderate-to-large range. In contrast, effects on resilience were statistically non-significant and highly imprecise, and no reliable benefit was found for perceived pain, where heterogeneity and evidence of small-study effects were substantial. Across outcomes, most trials were small, at risk of bias, and methodologically heterogeneous.

CONCLUSIONS: Poetry therapy shows promising benefits for trauma-related, depressive, anxiety, and stress outcomes, but the evidence base is limited by small samples, variable quality, and potential publication bias. High-quality, preregistered randomized controlled trials are needed before poetry-based interventions can be firmly recommended beyond an adjunctive role in routine psychiatric care.

PMID:41411711 | DOI:10.1016/j.psychres.2025.116897

A Hybrid Rule- and Large Language Model-Based Embodied Voice Assistant (GRACE) for Cognitive Stimulation in Older Adults: Usability Study Assessing Technical Feasibility, Technology Acceptance, and Working Alliance

JMIR Aging. 2025 Dec 18;8:e76489. doi: 10.2196/76489.

ABSTRACT

BACKGROUND: The health and economic burden of dementia has led the World Health Organization to recognize it as a public health priority. Although no cure for dementia currently exists, multiple interventions aim to reduce the risk of dementia and improve the quality of life of people with dementia. Voice assistants (VAs), particularly those using large language models (LLMs), have emerged as promising tools to deliver these interventions to older adults due to their accessible and natural interface.

OBJECTIVE: This pilot study aimed to evaluate the technical feasibility (ie, functional performance and usability) and user acceptance of GRACE, an embodied VA combining rule-based and LLM components, as well as the perceived strength of the collaborative relationship, or working alliance, between GRACE and healthy older adults during the delivery of cognitive stimulation interventions.

METHODS: A pilot study was conducted with 21 healthy German-speaking adults aged 60 years and older. Participants interacted with GRACE in a laboratory setting for 10-15 minutes. The interaction involved a structured cognitive stimulation session using rule-based and LLM components. Data were collected using pre- and postinteraction questionnaires and semistructured interviews. Quantitative analysis included descriptive statistics and Wilcoxon signed rank tests. Qualitative data were analyzed thematically.

RESULTS: Participants rated GRACE positively, with statistically significant scores above neutral (P<.001 for perceived ease of use, usefulness, enjoyment, and working alliance; P=.009 for perceived control; and P=.009 for intention to continue interacting). Thematic analysis revealed that GRACE was perceived as easy to understand and unambiguous, friendly, and supportive, with intervention components viewed as enjoyable and appropriately challenging. Areas for improvement included personalization, response delays, and voice quality.

CONCLUSIONS: The results suggest that embodied rule-based and LLM VAs like GRACE are feasible and well-received tools for delivering cognitive interventions to older adults. Future iterations will incorporate feedback and extend testing to individuals at risk for dementia.

PMID:41411649 | DOI:10.2196/76489

Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation

JMIR Form Res. 2025 Dec 18;9:e77357. doi: 10.2196/77357.

ABSTRACT

BACKGROUND: Despite the transformative potential of artificial intelligence (AI)-based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operations. However, as of 2025, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots.

OBJECTIVE: This study aims to evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves answer accuracy on German medical multiple-choice questions (MCQs) comparable to that of leading chatbots (ie, ChatGPT and Gemini), thereby assessing its potential as a privacy-preserving alternative for clinical use.

METHODS: A total of 200 interdisciplinary MCQs from the German Progress Test Medicine were administered to ChatGPT (GPT-o3-mini), DeepSeek (DeepSeek-R1), and Gemini (Gemini 2.0 Flash). Accuracy was defined as the proportion of correctly solved questions. Overall differences among the 3 models were tested with the Cochran Q test, while pairwise comparisons were conducted using the McNemar test. Subgroup analyses were performed by medical domain (Fisher exact test) and question length (Wilcoxon rank-sum test). An a priori power analysis indicated a minimum sample size of 195 questions.

RESULTS: All 3 chatbots surpassed the conventional passing threshold of 60%, with accuracies of 96% (192/200) for DeepSeek, 94% (188/200) for Gemini, and 92.5% (185/200) for ChatGPT. The overall difference among models was not statistically significant (P=.10), nor were the pairwise comparisons. However, incorrect responses were significantly associated with longer question length for DeepSeek (P=.049) and ChatGPT (P=.04) but not for Gemini. No significant differences in performance were observed across clinical versus preclinical domains or medical specialties (all P>.05).
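Accuracy is defined in the Methods as the proportion of correctly solved questions, so the reported percentages can be recomputed from the counts. The following sketch does only that and checks each model against the 60% passing threshold; the variable names are illustrative. The Cochran Q and McNemar tests are not reproduced because they need per-question paired results, which the abstract does not provide.

```python
TOTAL_QUESTIONS = 200
PASS_THRESHOLD = 0.60  # conventional passing threshold cited in the abstract

# Correct answers per model, as reported in the Results.
correct = {
    "DeepSeek-R1": 192,
    "Gemini 2.0 Flash": 188,
    "GPT-o3-mini": 185,
}

# Accuracy = proportion of correctly solved questions.
accuracy = {model: n / TOTAL_QUESTIONS for model, n in correct.items()}
passed = {model: acc >= PASS_THRESHOLD for model, acc in accuracy.items()}

for model, acc in accuracy.items():
    status = "pass" if passed[model] else "fail"
    print(f"{model}: {acc:.1%} ({status})")
```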

CONCLUSIONS: Overall, DeepSeek demonstrates outstanding performance on German medical MCQs comparable to the widely used chatbots ChatGPT and Gemini. Similar to ChatGPT, DeepSeek’s performance declined with increasing question length, highlighting verbosity as a persistent challenge for large language models. While DeepSeek’s offline capability and lower operational costs are advantageous, its safe and reliable application in clinical contexts requires further investigation.

PMID:41411646 | DOI:10.2196/77357