Categories
Nevin Manimala Statistics

A Two-Tiered Rescue Protocol to Mitigate Difficulty-Based Failures of ChatGPT 5 and Gemini on the German M2 Medical Exam: Evaluation Study

JMIR Form Res. 2026 Jun 22. doi: 10.2196/86999. Online ahead of print.

ABSTRACT

BACKGROUND: Large language models (LLMs) have demonstrated expert-level performance on medical licensing examinations, but most benchmarks focus on final accuracy, obscuring model-specific behaviors. Critical gaps remain in understanding model efficiency (latency), the efficacy of tiered “rescue” protocols for error correction, and the systematic correlation between performance and human-rated question difficulty. The German M2 exam, paired with the AMBOSS platform’s user-data-driven difficulty ratings, provides a unique opportunity to map AI performance directly against human cognitive load.

OBJECTIVE: This study aimed to move beyond singular accuracy scores by (1) evaluating and comparing the baseline (Tier 1) accuracy and response latency of next-generation rapid-response LLMs; (2) analyzing the efficacy of a two-tiered rescue (Tier 2) protocol in correcting initial errors; and (3) correlating model performance with the user-data-driven AMBOSS difficulty rating.

METHODS: We evaluated four LLMs (Gemini 2.5 Flash/Pro and ChatGPT 5 Instant/Thinking) on the complete 316-item German M2 (Fall 2024) medical exam, including all multimodal (image-based) questions. A zero-shot copy-paste prompting strategy was utilized, and outputs were evaluated against ground-truth answers using a strict exact-match criterion. A two-tiered protocol was used: Tier 1 (Flash/Instant) provided baseline responses. If incorrect, a Tier 2 (Pro/Thinking) model was deployed as a “rescue.” Performance was analyzed using McNemar’s test, Wilcoxon signed-rank test, Fisher’s exact test, and logistic regression.

RESULTS: Baseline (Tier 1) accuracy was identical at 91.46% (95% CI 87.85-94.06; n = 289/316) for both Gemini 2.5 Flash and ChatGPT 5 Instant, with 27 errors each. However, Gemini Flash (Mean=1.57s) was significantly faster than ChatGPT Instant (Mean = 2.07s; P < .001). Additionally, ChatGPT Instant expended significantly more time on incorrect answers compared to correct ones (P = .002), whereas Gemini Flash showed no such hesitation (P = .814). The Tier 2 rescue rate for ChatGPT 5 Thinking (48.15%, 13/27; 95% CI 30.74-66.01) was higher, though not statistically significant (P = .406), than for Gemini 2.5 Pro (33.33%, 9/27; 95% CI 18.64-52.18). This rescue protocol elevated final accuracy to 94.30% (95% CI 91.18-96.37) for the Gemini system and 95.57% (95% CI 92.70-97.34) for the ChatGPT system (P = .481). A strong, inverse relationship with difficulty was found: for every one-point difficulty increase, the odds of a correct Tier 1 response decreased by 42.1% (OR 0.579, 95% CI 0.425-0.788; P < .001) for Gemini Flash and 47.7% (OR 0.523, 95% CI 0.379-0.720; P < .001) for ChatGPT Instant. This negative correlation persisted even after the rescue (P = .013 and P = .006, respectively).
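The arithmetic behind these figures can be re-derived from the reported counts. The sketch below is only a check under stated assumptions: the abstract does not name its confidence interval method, and a Wilson score interval is assumed here because it agrees with the reported bounds for 289/316. The function names are illustrative and do not come from the study.

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a proportion (assumed method, not stated in the abstract)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

def rescued_accuracy(tier1_correct, rescued, n):
    """Final accuracy once Tier 2 corrects some of the Tier 1 errors."""
    return (tier1_correct + rescued) / n

def odds_change_percent(odds_ratio):
    """Percent change in odds per one-point difficulty increase."""
    return (odds_ratio - 1) * 100

# Counts taken from the abstract: 289/316 correct at Tier 1, 9 and 13 rescues.
tier1_low, tier1_high = wilson_interval(289, 316)
gemini_final = rescued_accuracy(289, 9, 316)
chatgpt_final = rescued_accuracy(289, 13, 316)

print(f"Tier 1 CI: {tier1_low:.4f} to {tier1_high:.4f}")
print(f"Final accuracy: Gemini {gemini_final:.4f}, ChatGPT {chatgpt_final:.4f}")
print(f"Odds change for OR 0.579: {odds_change_percent(0.579):.1f}%")
```

Under this assumption, the same function applied to the rescue counts (13/27 and 9/27) also lands close to the intervals quoted for the Tier 2 rescue rates.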

CONCLUSIONS: Expert-level LLM performance on the German M2 exam masks a critical, systematic vulnerability: a significant decrease in accuracy directly correlated with increased question difficulty. A two-tiered “rescue” system is an effective strategy to mitigate these difficulty-based failures and achieve >95% accuracy, rivaling the best-performing, full-capacity models. We conclude that a simple reliance on a single model is insufficient; hierarchical systems that manage query difficulty are essential for safe and effective integration into medical education.

PMID:42334858 | DOI:10.2196/86999

Palliative Care Coaching for Family Caregivers of Patients With Advanced Cancer: A Randomized Clinical Trial

JAMA Netw Open. 2026 Jun 1;9(6):e2619807. doi: 10.1001/jamanetworkopen.2026.19807.

ABSTRACT

IMPORTANCE: African American and rural-dwelling family caregivers of persons with newly diagnosed advanced cancer perform critical, time-intensive tasks and historically have had limited resources to support their role.

OBJECTIVE: To determine the effect of a lay coach-led, early palliative care telehealth intervention (Educate, Nurture, Advise, Before Life Ends [ENABLE] Cornerstone) for African American and rural-dwelling family caregivers of patients with advanced cancer on caregiver and patient outcomes at 24 weeks.

DESIGN, SETTING, AND PARTICIPANTS: This single-blind randomized clinical trial was conducted from January 2020 to May 2025 at outpatient oncology clinics at 2 large cancer centers in the Southeastern US. Participants were African American and rural-dwelling family caregivers aged 21 years or older self-identifying as an unpaid close friend or family member who is involved with the day-to-day medical care of a patient with advanced cancer.

INTERVENTION: The intervention included 6 weekly, 20- to 60-minute psychosocial telephonic sessions facilitated by a trained lay coach plus monthly follow-up. Usual care consisted of mailed pamphlets outlining resources for families at each of the cancer centers.

MAIN OUTCOMES AND MEASURES: The primary outcome was caregiver distress (anxiety and depressive symptoms as measured by the Hospital Anxiety and Depression Scale [HADS]) at 24 weeks. Secondary outcomes were caregiver and patient quality of life (QOL; measured with the Patient-Reported Outcomes Measurement Information System Global Health Short Form), caregiver burden (Montgomery-Borgatta Caregiver Burden Scale), and patient distress (HADS). Outcomes were assessed using baseline-constrained linear mixed-effects models.

RESULTS: A total of 222 family caregivers (mean [SD] age, 55.5 [14.7] years; 169 [76.1%] female; 114 [51.4%] African American; 101 White [45.5%]; 7 other race [3.2%]) and 165 patients (mean [SD] age, 60.7 [12.2] years; 98 [59.4%] female; 79 African American [47.9%]; 84 White [50.9%]; 2 other race [1.2%]) were randomized. At week 24, no relevant between-group differences were observed in caregiver HADS anxiety (mean [SE] baseline-adjusted difference, 0.23 [0.44]; Cohen d = 0.05; 95% CI, -0.14 to 0.24; P = .60) or HADS depressive symptom scores (mean [SE] baseline-adjusted difference, 0.04 [0.41]; Cohen d = 0.01; 95% CI, -0.19 to 0.21; P = .91). For all other outcomes, 24-week differences were of small magnitude and not statistically significant. Exploratory sensitivity analyses of caregivers distressed at baseline indicated improvements in caregiver anxiety (mean [SE] baseline-adjusted difference, -1.21 [0.53]; Cohen d = -0.38; 95% CI, -0.70 to -0.05) and patient mental health QOL (mean [SE] baseline-adjusted difference, 3.00 [1.37]; Cohen d = 0.45; 95% CI, 0.04 to 0.86), but no statistically significant differences in caregiver burden (mean [SE] baseline-adjusted difference, -1.15 [0.69]; Cohen d = -0.32; 95% CI, -0.71 to 0.06) and patient depression (mean [SE] baseline-adjusted difference, -1.30 [0.71]; Cohen d = -0.37; 95% CI, -0.77 to 0.03).

CONCLUSIONS: This randomized clinical trial of a telehealth intervention for African American and rural-dwelling caregivers of patients with advanced cancer found no differences in caregiver and patient outcomes at 24 weeks. However, an exploratory sensitivity analysis indicated potential improvements in caregiver anxiety and patient mental health QOL.

TRIAL REGISTRATION: ClinicalTrials.gov Identifier: NCT04318886.

PMID:42334850 | DOI:10.1001/jamanetworkopen.2026.19807

Development and Implementation of an AI System for Generating Clinical Urine Drug Test Sign-Outs

JAMA Netw Open. 2026 Jun 1;9(6):e2619816. doi: 10.1001/jamanetworkopen.2026.19816.

ABSTRACT

IMPORTANCE: Modern natural language processing tools have potential to improve clinical workflows, but few have been successfully deployed in practice.

OBJECTIVE: To describe the development, deployment, and evaluation of an artificial intelligence (AI) language tool for generating preliminary sign-outs to support a urine drug testing service.

DESIGN, SETTING, AND PARTICIPANTS: In this prognostic study, large language models (LLMs) were used to extract substance use patterns from clinical urine drug test interpretations at a single medical center between January 1, 2014, and February 29, 2024. An AI model using these data was trained to predict substance use from qualitative and quantitative urine testing results. Predicted substance use patterns were used to create preliminary clinical sign-out statements, which were then integrated into an existing clinical workflow.

MAIN OUTCOMES AND MEASURES: Predeployment and postdeployment user studies were performed to evaluate model performance and user experience within the workflow. Statistical differences between event rates were calculated using χ2 tests, and between means using t tests. Differences between human and LLM labelers were calculated using the McNemar test.

RESULTS: A total of 83 553 urine tests from 26 459 patients (12 413 male [46.9%]; mean [SD] age, 47.5 [16.7] years) were analyzed. LLM-based extraction of substance-use patterns was 99.9% accurate (13 509 of 13 520 tests), outperforming human labeling. Substance use prediction was similarly accurate, with area under the receiver operating curve greater than 0.99 for 23 of 26 substances. Workflow integration of the AI tool reduced clinical sign-out times by 28.5 seconds per case (23% efficiency gain), and by 65 seconds per case (51% efficiency gain) when integrated alongside a second, non-AI workflow improvement.
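A quick consistency check of the reported percentages is sketched below. It assumes that "efficiency gain" means time saved divided by the baseline sign-out time; the abstract does not define the term, so the implied baselines are an inference rather than reported values. The helper names are illustrative.

```python
def percent(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100 * part / whole

def implied_baseline_seconds(saved_seconds, gain_percent):
    """Baseline time implied by a saving, ASSUMING gain = saved / baseline."""
    return saved_seconds / (gain_percent / 100)

# Counts and savings as reported in the abstract.
extraction_accuracy = percent(13509, 13520)
male_share = percent(12413, 26459)
baseline_ai_only = implied_baseline_seconds(28.5, 23)
baseline_combined = implied_baseline_seconds(65, 51)

print(f"Extraction accuracy: {extraction_accuracy:.1f}%")
print(f"Male patients: {male_share:.1f}%")
print(f"Implied baseline (AI only): {baseline_ai_only:.0f} s")
print(f"Implied baseline (AI plus workflow change): {baseline_combined:.0f} s")
```

Both savings imply a baseline of roughly two minutes per case, which suggests the two efficiency figures were computed against a similar reference time.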

CONCLUSIONS AND RELEVANCE: In this prognostic study, AI-based interpretation of urine drug testing was fast and accurate, providing notable efficiency gains to the clinical service. These findings suggest that natural language processing tool integration can provide substantial clinical benefit, without compromising quality of care.

PMID:42334849 | DOI:10.1001/jamanetworkopen.2026.19816

Association of Weight-Adjusted-Waist Index With Brain Health: A 16-Year Population-Based Longitudinal Cohort Study

CNS Neurosci Ther. 2026 Jun;32(6):e71001. doi: 10.1002/cns.71001.

ABSTRACT

BACKGROUND: The long-term association of central obesity with brain structural integrity remains poorly understood. This study aimed to investigate the longitudinal association between cumulative Weight-adjusted-waist Index (WWI) exposure and multi-modal neuroimaging markers of brain health.

METHODS: This prospective community-based cohort study included 935 participants from the META-KLS Study. Cumulative WWI was calculated as the time-weighted average over 12 years prior to MRI acquisition. Neuroimaging outcomes included regional gray matter volume, white matter hyperintensity (WMH), and diffusion tensor imaging (DTI) metrics. Generalized linear models, restricted cubic splines, and mediation analyses were performed.

RESULTS: Elevated cumulative WWI was associated with adverse brain structural outcomes, particularly in females. In women, higher WWI was linked to extensive WMH burden (pFDR = 0.002), widespread microstructural disintegration (pFDR = 0.023), and specific atrophy in the orbital frontal cortex. A J-shaped dose-response relationship was identified for white matter injury, suggesting a tipping point for metabolic resilience. In exploratory mediation analyses, FBG, SBP, and hs-CRP statistically accounted for 14.7%, 11.3%, and 11.3% of the association between cumulative WWI and WMH burden, respectively, while SBP accounted for 17.8% of the association with global MD.

CONCLUSION: Cumulative WWI serves as a potential predictor of adverse brain structural outcomes, particularly manifesting as white matter injury and atrophy in women. Early monitoring of WWI offers a vital window for targeted metabolic interventions to preserve brain structural integrity.

PMID:42334833 | DOI:10.1002/cns.71001

Use of Sedative-Hypnotic Drugs and the Risk of Developing Alzheimer’s Disease: A Systematic Review, Meta-Analysis and Meta-Regression

Drugs. 2026 Jul;86(7):1103-1119. doi: 10.1007/s40265-026-02335-9. Epub 2026 May 27.

ABSTRACT

BACKGROUND: Sedative-hypnotics, including benzodiazepines (BZDs) and non-benzodiazepine hypnotics (Z-drugs), are widely prescribed for insomnia and anxiety, particularly in older adults. Their long-term cognitive safety and potential association with Alzheimer’s disease (AD) remain uncertain. We examined whether use of BZDs and Z-drugs is associated with incident AD and assessed variation by drug class, pharmacokinetics, and methodological factors.

METHODS: PubMed, Embase, and the Cochrane Central Register of Controlled Trials were searched from inception to 16 August 2025 without language restrictions. Reference lists of eligible articles and reviews were screened. We included observational cohort and nested case-control studies enrolling adults without dementia at baseline that compared BZD or Z-drug users with non-users and reported incident AD diagnosed using validated clinical or administrative criteria (e.g., ICD-9/10, NINCDS-ADRDA, or NIA-AA). We excluded reviews, case reports, conference abstracts, studies with overlapping populations, and studies without extractable effect estimates. Two reviewers independently screened studies, extracted data, and assessed risk of bias using ROBINS-E. Random-effects meta-analyses were performed separately for odds ratios (ORs) and hazard ratios (HRs). Heterogeneity was quantified with I2. Publication bias was evaluated with funnel plots and Egger test when applicable. Subgroup and meta-regression analyses assessed clinical and methodological modifiers. Certainty of evidence was rated using GRADE. The protocol was prospectively registered (PROSPERO CRD420251141623).

RESULTS: Thirteen studies (N = 721,354 subjects) were included. Overall sedative-hypnotic use was associated with higher odds of AD (OR 1.29; 95% CI, 1.10-1.53; I2 = 86.5%). Estimates restricted to HRs were attenuated and not statistically significant (HR 1.17; 95% CI, 0.87-1.58; I2 = 73.1%). In subgroup analyses, BZDs overall (OR 1.21; 95% CI 1.07-1.36), Z-drugs (OR 1.14; 95% CI 1.10-1.18; I2 = 0%), and short-acting agents (OR 1.19; 95% CI 1.04-1.36) were associated with higher odds of AD, whereas broad-acting BZDs were not (OR 1.01; 95% CI 0.98-1.05). Long-acting agents showed a borderline estimate (OR 1.44; 95% CI 0.99-2.09). Age-stratified analyses showed higher odds in individuals aged <75 years (OR 1.36; 95% CI 1.24-1.49), but not in those aged ≥75 years (OR 1.14; 95% CI 0.61-2.11). Estimates were also higher in studies using ICD-based definitions (OR 1.47; 95% CI 1.16-1.86) than in those using clinical criteria (OR 1.13; 95% CI 0.84-1.52). Meta-regression identified drug class and publication year as significant moderators. Risk of bias was rated moderate to serious in several studies, mainly due to residual confounding and exposure misclassification. Certainty of evidence ranged from very low to moderate.
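The contrast between the pooled OR and the pooled HR can be made concrete by recovering the standard error on the log scale from each reported 95% CI. This is a minimal sketch that assumes the intervals are symmetric Wald intervals on the log scale, which is the usual convention for ratio measures but is not stated in the abstract. The function names are illustrative.

```python
import math

def se_from_ci(lower, upper, z=1.959964):
    """Standard error of the log ratio, ASSUMING a symmetric 95% CI on the log scale."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def wald_test(estimate, lower, upper):
    """Approximate z statistic and two-sided p-value for a ratio estimate."""
    se = se_from_ci(lower, upper)
    z = math.log(estimate) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p

# Pooled estimates as reported in the abstract.
or_z, or_p = wald_test(1.29, 1.10, 1.53)
hr_z, hr_p = wald_test(1.17, 0.87, 1.58)

print(f"Pooled OR: z = {or_z:.2f}, p = {or_p:.4f}")
print(f"Pooled HR: z = {hr_z:.2f}, p = {hr_p:.4f}")
```

The wider interval around the HR translates into a standard error nearly twice that of the OR, which is why a point estimate above 1 still fails to reach significance.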

CONCLUSIONS: Use of BZDs and Z-drugs was associated with increased odds of AD, with variation across drug classes and pharmacokinetic profiles. Short-acting agents, BZDs overall, and Z-drugs were associated with higher risk, whereas broad-acting BZDs were not; this finding should be interpreted with caution given subgroup heterogeneity and limited statistical power. Residual confounding and reverse causation limit causal inference. These results support careful prescribing and the need for prospective studies with detailed characterization of exposure, dose, duration, and clinical indication to clarify whether observed associations reflect drug-related effects or underlying disease processes.

TRIAL REGISTRATION: PROSPERO protocol number: CRD420251141623.

PMID:42334823 | DOI:10.1007/s40265-026-02335-9

A randomised controlled trial comparing effectiveness of audio-visual aid and AI-personalised video self-modelling interventions to reduce dental fear and anxiety in paediatric patients

Eur Arch Paediatr Dent. 2026 Jun 23. doi: 10.1007/s40368-026-01241-8. Online ahead of print.

ABSTRACT

PURPOSE: Paediatric dental anxiety remains a significant challenge in clinical practice, often impacting co-operation and treatment outcomes. This randomised controlled trial compares the anxiety-reducing effects of AI-based self-modelling versus standard video modelling.

METHODS: A single-blind, parallel-arm randomised controlled trial was conducted on 80 children aged 6-12 years requiring restorative dental treatment. Participants were randomised into two groups. Dental fear and anxiety were assessed using CFSS-DS and MCDASf, along with pulse and heart rate monitoring. Data were collected pre- and post-intervention by a blinded assessor and statistically analysed.

RESULTS: Both groups showed significant within-group reductions in dental fear and anxiety (p < 0.001); however, no statistically significant between-group difference was observed for the primary outcome (overall CFSS-DS score). The AI-based personalised video self-modelling app group demonstrated a greater reduction in heart rate (7.65 vs. 2.18 bpm), with a significant between-group difference and a large effect size (p < 0.001; Cohen’s d = 0.89), indicating reduced short-term physiological arousal rather than overall superiority of the intervention. Intergroup analysis also revealed moderate-to-large effects for selected anxiety items, particularly those related to injections and dental examinations.

CONCLUSION: Both interventions were effective in reducing dental fear and anxiety. However, no superiority was demonstrated for the primary psychological outcome. The AI-based personalised intervention showed greater reduction in physiological arousal (heart rate), suggesting potential benefits in short-term anxiety modulation.

PMID:42334822 | DOI:10.1007/s40368-026-01241-8

Validation of the three-level hepatectomy complexity classification and its AI application in robotic liver surgery

Updates Surg. 2026 Jun 23. doi: 10.1007/s13304-026-02724-5. Online ahead of print.

ABSTRACT

Robotic liver surgery (RLS) has expanded in recent years. Complication prediction is crucial for postoperative outcomes. Traditional minimally invasive surgery (MIS) scores are poorly studied in RLS, and conventional statistics often oversimplify the multifactorial and interrelated nature of these complications. This study aimed to evaluate the three-level complexity Institut Mutualiste Montsouris (IMM) classification in RLS and assess its integration into an AI algorithm to predict major complications. We retrospectively analyzed data from patients who underwent RLS. Surgical complexity was stratified into grades I (low complexity), II (intermediate), and III (high). The cumulative incidence rate and conditional probability of postoperative complications and risk factors for complications ≥ Clavien-Dindo grade II were assessed. The prediction model was developed by training and testing a machine learning (ML) algorithm after feature selection with univariate and multivariate analysis. We calculated the receiver operating characteristic (ROC) curve and model accuracy. We analyzed 1,045 patients who underwent RLS, classifying them into three complexity levels: Grade I (n = 581), Grade II (n = 267), and Grade III (n = 109). Significant differences were observed in intra- and postoperative outcomes across the three grades. Multivariate analysis identified ASA score (HR 2.1, p = 0.02), number of lesions (HR 1.8, p = 0.001), and operative time (OR 1, p = 0.004) as key predictors of complications. When combined with the three-level complexity classification, the neural network showed the best performance, with an AUC of 0.653 and a precision of 0.996. The three-level complexity IMM classification is a useful tool in RLS for predicting intra- and postoperative outcomes. It can be integrated into the neural network algorithm to predict major complications.

PMID:42334817 | DOI:10.1007/s13304-026-02724-5

Bridging to surgery versus palliation in malignant colorectal obstruction: complication risks and mediation by clinical success

Updates Surg. 2026 Jun 23. doi: 10.1007/s13304-026-02730-7. Online ahead of print.

ABSTRACT

Self-expandable metal stents (SEMS) are routinely used in malignant colorectal obstruction (MCO) for palliation or as a bridge to surgery. However, the association between treatment intent and complication risk, as well as the potential role of clinical success as an intermediate procedural endpoint, remains unclear. We retrospectively analyzed 413 patients with MCO who underwent SEMS placement between 2014 and 2024. Patients were categorized by therapeutic intent (palliation vs. bridge to surgery), and complication rates were compared. Mediation analysis was performed using the Sobel test, structural equation modeling (SEM), and bootstrap-based causal mediation to assess whether clinical success mediated the relationship between therapeutic purpose and complications. Complications occurred in 60 patients (14.5%). Palliation was associated with a higher complication rate compared to bridging (20.0% vs. 8.0%, p = 0.001). In the exploratory mediation analysis, clinical success showed a statistically significant indirect association with the effect of therapeutic intent (Sobel p = 0.035). SEM confirmed a positive association between therapeutic purpose and clinical success (standardized β = 0.171, p < 0.001) and a negative association between clinical success and complications (β = -0.191, p = 0.009). Bootstrap mediation analysis revealed that 13.0% of the total effect was mediated through clinical success (p = 0.031). Therapeutic intent was associated with complication risk after SEMS placement, and clinical success may partially account for this association. However, the modest mediated proportion suggests that complications are likely influenced by multiple additional clinical and procedural factors. Optimizing decompression remains important but should be integrated with careful patient selection and follow-up management, particularly in palliative settings.
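For readers unfamiliar with the Sobel test, a minimal sketch of the first-order formula follows. The abstract does not report unstandardized path coefficients or their standard errors, so the inputs below are hypothetical values chosen only to show the mechanics; they are not the study's data, and the function names are illustrative.

```python
import math

def sobel_test(a, se_a, b, se_b):
    """First-order Sobel test for an indirect effect a*b.

    a: effect of exposure on the mediator; b: effect of mediator on the outcome.
    """
    indirect = a * b
    se_indirect = math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)
    z = indirect / se_indirect
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return indirect, z, p

def proportion_mediated(indirect_effect, total_effect):
    """Share of the total effect that runs through the mediator."""
    return indirect_effect / total_effect

# HYPOTHETICAL path estimates for illustration only (not from the study).
indirect, z, p = sobel_test(a=0.5, se_a=0.1, b=-0.4, se_b=0.15)
print(f"indirect = {indirect:.3f}, z = {z:.3f}, p = {p:.4f}")
```

The bootstrap approach mentioned in the abstract is generally preferred over the Sobel test because the product of two coefficients is not normally distributed in small samples; the closed-form version is shown here only because it fits in a few lines.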

PMID:42334812 | DOI:10.1007/s13304-026-02730-7

Scenario-based comparative evaluation of ChatGPT-4o and physician groups in pediatric minor head trauma

Ir J Med Sci. 2026 Jun 23. doi: 10.1007/s11845-026-04512-x. Online ahead of print.

ABSTRACT

BACKGROUND: Interest in the use of ChatGPT-4o in scenario-based clinical assessment has increased substantially in recent years. However, studies evaluating ChatGPT-4o in pediatric head trauma scenarios and comparing it with different physician groups remain limited.

AIMS: To evaluate the scenario-based performance of ChatGPT-4o in pediatric head trauma and compare it with that of emergency physicians, neurosurgeons, and pediatricians.

METHODS: This study included 60 pediatric patients who presented between 15 December 2024 and 15 June 2025 and met the inclusion criteria. After clinical follow-up, cases were converted into multiple-choice case scenarios and classified into red, yellow, and green zones according to PECARN. These scenarios were answered by 42 physicians from emergency medicine, neurosurgery, and pediatrics (n=14 per group) and by ChatGPT-4o. Concordance of scenario-based management responses with PECARN recommendations was compared statistically.

RESULTS: Of the 60 cases, 25.0% (n=15) were classified as red zone, 50.0% (n=30) as yellow zone, and 25.0% (n=15) as green zone. ChatGPT-4o showed lower scenario-based performance than all physician groups in red-zone cases. When non-contrast brain CT was accepted as the correct option in the yellow zone, ChatGPT-4o had the lowest overall accuracy (median: 24.50). When observation was accepted as correct, ChatGPT-4o showed the highest accuracy both in the yellow zone (median: 17.00; p=0.001) and overall (median: 35.50; p<0.001). ChatGPT-4o showed the highest accuracy in green-zone cases (median: 8.50).

CONCLUSION: ChatGPT-4o did not demonstrate adequate scenario-based performance in critical pediatric head trauma cases. However, it may have potential as a supportive tool in non-critical case scenarios.

PMID:42334770 | DOI:10.1007/s11845-026-04512-x

The association between composite dietary antioxidant index and the presence of cancer in elderly

Ir J Med Sci. 2026 Jun 23. doi: 10.1007/s11845-026-04510-z. Online ahead of print.

ABSTRACT

BACKGROUND: Oxidative stress and antioxidant balance play critical roles in carcinogenesis, particularly among older adults who experience increased oxidative burden. This study explored the association between the composite dietary antioxidant index (CDAI) and cancer prevalence in the elderly.

METHODS: Data from 4,907 elderly participants were analyzed and categorized into quartiles (Q1-Q4) according to CDAI. Logistic regression models estimated odds ratios (ORs) and 95% confidence intervals (CIs) for cancer across quartiles, adjusting for demographic and clinical covariates. Subgroup analyses were performed to evaluate the consistency across subgroups.

RESULTS: Participants with higher CDAI levels were more likely to be male, educated, and have lower diabetes prevalence. The prevalence of cancer increased across CDAI quartiles (19.6% in Q1 to 27.5% in Q4, P < 0.001). In fully adjusted models, Q4 had higher odds of cancer (OR = 1.26, 95% CI 1.02-1.54, P = 0.029). Subgroup analyses indicated stronger associations among women and those with diabetes.
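To see how much covariate adjustment changes the picture, the crude odds ratio implied by the quartile prevalences can be compared with the adjusted estimate. The sketch below assumes Q1 is the reference category for the reported OR, which the abstract does not state explicitly. The function names are illustrative.

```python
def odds(probability):
    """Convert a probability to odds."""
    return probability / (1 - probability)

def crude_odds_ratio(p_exposed, p_reference):
    """Unadjusted odds ratio from two prevalences."""
    return odds(p_exposed) / odds(p_reference)

# Prevalences from the abstract: 27.5% in Q4, 19.6% in Q1 (Q1 ASSUMED as reference).
crude_or_q4_vs_q1 = crude_odds_ratio(0.275, 0.196)
print(f"Crude OR, Q4 vs Q1: {crude_or_q4_vs_q1:.2f}")
```

The crude value is noticeably larger than the fully adjusted OR of 1.26, so a substantial part of the raw difference between quartiles appears to be explained by the demographic and clinical covariates.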

CONCLUSIONS: Unexpectedly, higher CDAI was associated with a greater prevalence of cancer among elderly individuals, suggesting complex, context-dependent effects of dietary antioxidants on cancer prevalence.

PMID:42334769 | DOI:10.1007/s11845-026-04510-z