Categories
Nevin Manimala Statistics

Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study

JMIR Med Inform. 2025 May 16;13:e66917. doi: 10.2196/66917.

ABSTRACT

BACKGROUND: The ability of large language models (LLMs) to assess their own confidence when answering biomedical questions remains underexplored.

OBJECTIVE: This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs’ ability to accurately judge their own responses.

METHODS: We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score: range 0%-100%). We calculated the correlation between each model’s mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests.
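
The two analyses described above (a correlation across models and a two-sample comparison of confidence for correct versus incorrect answers) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and all numeric values are hypothetical, and the Welch (unequal-variance) form of the t statistic is an assumption, since the abstract does not say which variant was used.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances (assumed variant)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))


# Hypothetical per-model summaries (NOT the study's data).
accuracy = [0.72, 0.64, 0.55, 0.48]
confidence = [0.62, 0.67, 0.73, 0.75]
r = pearson_r(accuracy, confidence)

# Hypothetical confidence scores for one model, split by answer correctness.
conf_correct = [0.70, 0.72, 0.68, 0.71]
conf_incorrect = [0.66, 0.69, 0.67, 0.65]
t = welch_t(conf_correct, conf_incorrect)

print(f"r = {r:.2f}, t = {t:.2f}")
```

A negative `r` here corresponds to the pattern the abstract reports, in which less accurate models state higher confidence. A P value would additionally require the t distribution, which a statistics package would normally supply.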

RESULTS: The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model (GPT-4o) had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model (Qwen2-7B) showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003).

CONCLUSIONS: Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.

PMID:40378406 | DOI:10.2196/66917

Evaluation of the Digital Support Tool Gro Health W8Buddy as Part of Tier 3 Weight Management Service: Observational Study

J Med Internet Res. 2025 May 16;27:e62661. doi: 10.2196/62661.

ABSTRACT

BACKGROUND: The escalating prevalence of obesity worldwide increases the risk of chronic diseases and diminishes life expectancy, with a growing economic burden necessitating urgent intervention. The existing tiered approach to weight management, particularly specialist tier 3 services, falls short of meeting the population’s needs. The emergence of digital health tools, while promising, remains underexplored in specialized National Health Service weight management services (WMSs).

OBJECTIVE: This service evaluation study assessed the use, effectiveness, and clinical impact of the W8Buddy digital support tool as part of the National Health Service WMS.

METHODS: W8Buddy, a personalized digital platform, provides a tailored weight management plan to empower individuals and was collaboratively developed with input from patients, the clinical team, and DDM Health. It launched at the University Hospitals Coventry and Warwickshire tier 3 WMS in 2022. All patients accessing University Hospitals Coventry and Warwickshire WMS were offered W8Buddy as part of standard care. Data were analyzed using independent samples t tests and Fisher exact tests for continuous and categorical outcomes, respectively. Multiple linear regression analysis explored associations between participant weight, engagement with W8Buddy, and time in the service.

RESULTS: Complete datasets for weights were available for 421 patients (220 W8Buddy group and 192 nonuser control group). W8Buddy users, predominantly female (n=185, 84.1%) and Caucasian, had a mean age of 43 years, while nonusers averaged 46 years (P=.02). Starting weights were comparable: 134 kg in the W8Buddy group and 130.2 kg in controls (P=.14); however, W8Buddy users had slightly higher starting BMI (49.6 vs 46.8 kg/m2, P=.08). A total of 33.5% (n=392) of patients activated W8Buddy and engaged with it. There was significant weight loss among W8Buddy users, with a 0.74 kg monthly loss compared to standard care (β=-.74, 95% CI -1.28 to -0.21; P=.007). The longer an individual stayed in this study and used W8Buddy, the more weight was lost. W8Buddy users with type 2 diabetes mellitus experienced a significant hemoglobin A1c reduction (59.8 to 51.2 mmol/mol, P=.02) compared to nonusers with type 2 diabetes. W8Buddy users also showed significant improvement across the Satisfaction With Life Scale, the Karolinska Sleepiness Scale, and quality of life visual analog scale (P<.001) during follow-up.

CONCLUSIONS: Participants engaging with W8Buddy as part of a digitally enabled tier 3 WMS demonstrated significant improvements in clinical and psychological outcomes, with weight changes statistically significant compared to those not engaging with the digital tool. Reduction in hemoglobin A1c was present in both groups; however, statistical significance was only reached among those engaging with W8Buddy. These findings suggest digital tools can augment traditional services and promote patient empowerment. Future studies must provide long-term data to understand if the benefits from the digital tool are sustained.

PMID:40378402 | DOI:10.2196/62661

Characteristics and incidence trends of adults hospitalized with community-acquired pneumonia in Portugal, pre-pandemic

PLoS One. 2025 May 16;20(5):e0322623. doi: 10.1371/journal.pone.0322623. eCollection 2025.

ABSTRACT

Community-acquired pneumonia (CAP) is a major cause of hospitalization that leads to substantial morbidity, mortality, and costs. Evaluating CAP trends over time is important to understand patterns and the impact of public health interventions. This study aims to describe the characteristics and trends in the incidence of adults hospitalized with CAP in Portugal between 2010 and 2018. In this study, we included hospitalization data, prevalence of comorbidities, and population data. CAP hospitalizations of adults (≥18 years) living in mainland Portugal discharged from public hospitals were identified using ICD-9-CM or ICD-10-CM codes. Based on previous CAP studies, we selected nine relevant comorbidities. We described the frequency and incidence of CAP hospitalizations per sex, age group, comorbidity, and year of discharge. Trends were explored using Joinpoint regression. We observed 470,545 CAP hospitalizations in the 2010-18 period. The majority of patients were male (54.8%) and aged ≥75 years (65.3%). The most frequently recorded comorbidities were congestive heart failure (26.4%), diabetes (25.5%), and chronic pulmonary disease (19.2%). The Joinpoint regression identified a gradual decline in the incidence rates of CAP hospitalizations for both sexes and all age groups. Of the nine comorbidities selected, seven showed a progressive increase in incidence rates followed by a subsequent decline (all except HIV/AIDS and chronic renal disease). Our findings offer valuable insights for selecting priority groups for public health interventions and designing strategies to mitigate the burden of CAP.

PMID:40378392 | DOI:10.1371/journal.pone.0322623

Initial validity and reliability testing of the SGBA-5

PLoS One. 2025 May 16;20(5):e0323834. doi: 10.1371/journal.pone.0323834. eCollection 2025.

ABSTRACT

BACKGROUND: A growing body of research indicates that sex (biological) and gender (sociocultural) influence health through a variety of distinct mechanisms. Sex- and Gender-Based Analysis (SGBA) techniques could examine these influences; however, there is a lack of nuanced and easily implementable measurement tools for health research. To address this gap, we created the Sex- and Gender-Based Analysis Tool – 5 item (SGBA-5).

OBJECTIVES: This research aims to assess the validity and reliability of the SGBA-5 for use in health sciences research where sex or gender are not primary variables of interest.

METHODS: A Delphi consensus study was conducted with Canadian researchers (n = 14). The Delphi experts rated the validity of each SGBA-5 item on a 5-point Likert scale each round, receiving summary statistics of other experts’ responses after the first round. A conservative threshold for consensus agreement (75% rating an item 4+ of 5) was used given the novelty of this scale’s items. Reliability was assessed through a two-armed test-retest study. The university student arm (n = 89) was conducted in-person (on paper), and the older adult arm (n = 71) was conducted online (digitally).
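
The consensus rule described above (at least 75% of experts rating an item 4 or higher on the 5-point scale) can be sketched as follows. This is an illustrative sketch only: the function names and the ratings are hypothetical, and treating the threshold as inclusive (share >= 0.75) is an assumption, since the abstract does not say how a share of exactly 75% would be handled.

```python
def consensus_share(ratings, cutoff=4):
    """Fraction of expert ratings at or above the cutoff on a 5-point scale."""
    return sum(r >= cutoff for r in ratings) / len(ratings)


def reaches_consensus(ratings, threshold=0.75, cutoff=4):
    """True if the share of ratings >= cutoff meets the threshold (assumed inclusive)."""
    return consensus_share(ratings, cutoff) >= threshold


# Hypothetical ratings from a 14-member panel (NOT the study's raw data).
item_a = [5] * 13 + [3]      # 13 of 14 experts rate the item 4 or higher
item_b = [4] * 9 + [2] * 5   # 9 of 14 experts rate the item 4 or higher

for name, ratings in [("item_a", item_a), ("item_b", item_b)]:
    share = consensus_share(ratings)
    print(name, f"{share:.0%}", reaches_consensus(ratings))
```

With 14 raters, each expert shifts the share by about 7 percentage points, so a small panel moves across the 75% line in coarse steps.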

RESULTS: The Delphi study ended after three rounds; experts reached consensus agreement on the validity of the biological sex item of the SGBA-5 (93%) and consensus non-agreement on each of the gendered aspects of health items (identity: 64%, expression: 64%, roles: 50%, relations: 57%). Both the student arm (sex item: [Formula: see text], gendered items: [Formula: see text]) and the older adult arm (sex item: [Formula: see text], gendered items: [Formula: see text]) of the test-retest study indicated that all items were reliable.

CONCLUSIONS: The novel SGBA-5 tool demonstrated reliability across all scale items and validity of the biological sex item. The gendered aspects of health items may be valid. Future research can further develop the SGBA-5 as a tool for use in health research.

PMID:40378387 | DOI:10.1371/journal.pone.0323834

Seroprevalence of hepatitis A virus infection in urban and rural areas in Vietnam

PLoS One. 2025 May 16;20(5):e0323139. doi: 10.1371/journal.pone.0323139. eCollection 2025.

ABSTRACT

BACKGROUND/OBJECTIVES: The prevalence of hepatitis A virus (HAV) is associated with socioeconomic conditions, access to clean drinking water, and improvements in sanitation. In Vietnam, epidemiological data on HAV have been limited over the past two decades. This study aims to assess age-specific HAV seroprevalence across two distinct geographic regions, urban and rural areas, and identify the risk factors associated with HAV seropositivity in Vietnam.

METHODS: This cross-sectional seroprevalence study was conducted in two distinct areas in Vietnam. Serological testing for anti-HAV total antibodies was performed, and socio-demographic questionnaires were administered to all participants. The age at the midpoint of population immunity (AMPI) was calculated and analyzed.
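
The AMPI is commonly understood as the youngest age at which half of the population shows serological evidence of past HAV exposure. The abstract does not state how it was computed, so the sketch below is only one plausible approach: linear interpolation between age-group midpoints. The function name, the age midpoints, and the seroprevalence values are all hypothetical.

```python
def ampi(ages, seroprev, target=0.5):
    """Interpolate the first age at which seroprevalence reaches the target.

    ages     -- increasing age-group midpoints in years (hypothetical input)
    seroprev -- seropositive fraction in each age group
    Returns None if the target is never reached.
    """
    points = list(zip(ages, seroprev))
    if points[0][1] >= target:
        return points[0][0]
    for (a0, p0), (a1, p1) in zip(points, points[1:]):
        if p0 < target <= p1:
            # Linear interpolation between the two bracketing age groups.
            return a0 + (target - p0) * (a1 - a0) / (p1 - p0)
    return None


# Hypothetical age-specific seroprevalence (NOT the study's data).
age_midpoints = [5, 15, 25, 35, 45]
seropositive = [0.10, 0.30, 0.40, 0.60, 0.85]
print(ampi(age_midpoints, seropositive))
```

Other approaches, such as fitting a smooth curve to age-specific seroprevalence before locating the 50% crossing, would be equally consistent with the abstract.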

RESULTS: A total of 1,281 participants aged 1-80 years were included, with 649 from urban areas and 632 from rural areas. Of the total participants, 33.2% were aged <15 years. Overall, HAV seropositivity was 69.2%, with urban areas exhibiting significantly lower seropositivity (57.9%) compared to rural areas (80.7%) (p < 0.001). The AMPI was 29 years, indicating Vietnam is at intermediate HAV endemicity. Multivariate analysis identified key risk factors for HAV infection, including age and rural residence. Conversely, participants with higher educational levels and those who consumed boiled drinking water were less likely to be HAV seropositive.

CONCLUSIONS: The study identified significant differences in the HAV seroprevalence between urban and rural areas, providing critical data for public health officials. These findings highlight the key role of targeted public health interventions and vaccination programs in mitigating HAV infection rates and reducing the disease burden, particularly among high-risk populations in Vietnam.

PMID:40378373 | DOI:10.1371/journal.pone.0323139

Verity plots: A novel method of visualizing reliability assessments of artificial intelligence methods in quantitative cardiovascular magnetic resonance

PLoS One. 2025 May 16;20(5):e0323371. doi: 10.1371/journal.pone.0323371. eCollection 2025.

ABSTRACT

BACKGROUND: Artificial intelligence (AI) methods have established themselves in cardiovascular magnetic resonance (CMR) as automated quantification tools for ventricular volumes, function, and myocardial tissue characterization. Quality assurance approaches focus on measuring and controlling AI-expert differences but there is a need for tools that better communicate reliability and agreement. This study introduces the Verity plot, a novel statistical visualization that communicates the reliability of quantitative parameters (QP) with clear agreement criteria and descriptive statistics.

METHODS: Tolerance ranges for the acceptability of the bias and variance of AI-expert differences were derived from intra- and interreader evaluations. AI-expert agreement was defined by bias confidence and variance tolerance intervals being within bias and variance tolerance ranges. A reliability plot was designed to communicate this statistical test for agreement. Verity plots merge reliability plots with density and a scatter plot to illustrate AI-expert differences. Their utility was compared against Correlation, Box and Bland-Altman plots.
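
The bias half of the agreement criterion described above (the bias confidence interval lying inside a bias tolerance range) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the normal-approximation confidence interval, the tolerance limits, and the difference values are all hypothetical, and the variance tolerance interval check is omitted because the abstract does not describe its construction.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def bias_ci(diffs, level=0.95):
    """Normal-approximation confidence interval for the mean AI-expert difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    m = mean(diffs)
    half_width = z * stdev(diffs) / sqrt(len(diffs))
    return m - half_width, m + half_width


def bias_within_tolerance(diffs, tol_low, tol_high, level=0.95):
    """True if the whole bias confidence interval lies inside the tolerance range."""
    lo, hi = bias_ci(diffs, level)
    return tol_low <= lo and hi <= tol_high


# Hypothetical AI-minus-expert differences for one quantitative parameter.
diffs = [-1.0, 0.5, 1.0, -0.5, 0.0, 0.5, -0.5, 0.0]

print(bias_ci(diffs))
print(bias_within_tolerance(diffs, -2.0, 2.0))   # wide, hypothetical tolerance
print(bias_within_tolerance(diffs, -0.2, 0.2))   # narrow, hypothetical tolerance
```

For small samples a t-based interval would be the more usual choice; the normal approximation is used here only to keep the sketch dependency-free.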

RESULTS: Bias and variance tolerance ranges were established for volume, function, and myocardial tissue characterization QPs. Verity plots provided insights into statistical properties, outlier detection, and parametric test assumptions, outperforming Correlation, Box and Bland-Altman plots. Additionally, they offered a framework for determining the acceptability of AI-expert bias and variance.

CONCLUSION: Verity plots offer markers for bias, variance, trends and outliers, in addition to deciding AI quantification acceptability. The plots were successfully applied to various AI methods in CMR and decisively communicated AI-expert agreement.

PMID:40378365 | DOI:10.1371/journal.pone.0323371

Chemotherapy-related adverse drug reaction and associated factors among adult cancer patient attending Jimma medical center oncology unit, Southwest Ethiopia

PLoS One. 2025 May 16;20(5):e0321785. doi: 10.1371/journal.pone.0321785. eCollection 2025.

ABSTRACT

BACKGROUND: In 2017, reports of adverse drug reactions worldwide reached an estimated 35 million. Chemotherapeutic agents were among the pharmacological classes most often implicated in inducing adverse drug reactions. Adverse drug reactions increase morbidity, mortality, hospitalization rates, and financial expenses. Therefore, this study intended to assess chemotherapy-related adverse drug reactions and associated factors among adult cancer patients.

PATIENTS AND METHOD: A facility-based prospective observational study was conducted from July 2022 to October 2022 at Jimma Medical Center’s oncology unit. Standard data collection tools (Naranjo’s algorithm, modified Hartwig’s severity scale, and modified Schumock-Thornton criteria) were used to assess the causality, severity, and preventability of adverse reactions, respectively. Socio-demographic profiles and any adverse drug reactions reported were collected separately. The data were collected by one pharmacist and two nurses after they received training. Data were entered into Epidata version 4.6.0 and analyzed with SPSS version 25. Bivariate and multivariable logistic regression were conducted to identify independent predictors of the pattern of adverse drug reaction occurrence. A P-value of less than 0.05 was taken as statistically significant.

RESULT: Out of 154 patients enrolled in the study, 66.2% were female. The mean age of patients was 41.20 ± 13.54 years. Of the total, 98 (63.6%) patients developed a total of 198 adverse drug reactions. Of these, 59.2% were female. The most commonly encountered adverse drug reactions were nausea and vomiting (33.8%) and hair loss (29.3%). Most of the reactions were probable (61.1%) in causality, mild (66.2%) in severity, and not preventable (43.9%) in nature. Female sex (AOR = 1.054; 95% CI 1.021-1.087; P = 0.001), number of chemotherapy treatments (AOR = 3.33; 95% CI 1.301-8.52; P = 0.012), and elderly age (AOR = 3.065; 95% CI 1.01-9.296; P = 0.048) were associated with the occurrence of adverse drug reactions.

CONCLUSION: We can deduce from the data that adverse drug reactions are a significant concern for patients undergoing chemotherapy, with nearly two-thirds experiencing ADRs. The most common reactions are nausea and vomiting, which are mostly mild and probable. Age, gender, and the use of several chemotherapy drugs were associated with an increased risk of adverse drug reactions. Hence, all concerned bodies should work toward early detection and prevention of chemotherapy-related adverse drug reactions. Where feasible, use chemotherapy protocols with a lower risk of ADRs. Evaluate dose adjustments for elderly patients. Implement protocols for risk assessment before initiating chemotherapy.

PMID:40378362 | DOI:10.1371/journal.pone.0321785

Experience of Using Electronic Inhaler Monitoring Devices for Patients With Chronic Obstructive Pulmonary Disease or Asthma: Systematic Review of Qualitative Studies

JMIR Mhealth Uhealth. 2025 May 16;13:e57645. doi: 10.2196/57645.

ABSTRACT

BACKGROUND: Electronic inhaler monitoring devices (EIMDs) can enhance medication adherence in patients with chronic obstructive pulmonary disease (COPD) and asthma, yet patient perceptions and experiences with these devices vary widely. A systematic qualitative synthesis is required to comprehensively understand patient perspectives on EIMDs, to lay the foundation for developing strategies to improve patient compliance.

OBJECTIVE: This study aims to systematically evaluate qualitative studies on the experiences of patients with COPD and asthma using EIMDs, providing insights to support their clinical application and improve patient engagement.

METHODS: This review synthesized qualitative data from reports found through a systematic search of PubMed, Web of Science, CINAHL, Embase, Cochrane Library, and PsycInfo from January 1983 to July 2024. The reports assessed patient experiences with EIMDs for COPD and asthma. The quality of the included reports was appraised using the Critical Appraisal Skills Program criteria developed by the Centre for Evidence-Based Medicine, University of Oxford, UK.

RESULTS: A total of 7 reports were included, encompassing data from 44 patients with COPD and 146 with asthma. Findings were organized into 9 sub-themes and 3 themes: positive experiences with EIMDs (usability and easy acceptance, enhanced self-management); stresses and challenges of using these devices (negative emotional stress, device trust issues, social difficulties, economic burdens, and technical challenges); and patient expectations from these devices (expectations related to device construction and function and external support).

CONCLUSIONS: Patients have positive experiences using electronic inhaler monitoring devices but also face various social, psychological, and technical challenges. Health care workers should consider patient experiences with EIMDs to tailor these devices to patient needs, ultimately enhancing device acceptance and adherence. Further research should focus on increasing the convenience and usability of EIMDs for patients with COPD and asthma.

PMID:40378281 | DOI:10.2196/57645

Long-Term Safety and Effectiveness of Cold-Crosslinked Hyaluronic Acid Fillers: Multicenter, Randomized, Controlled, Double-Blind Study

Aesthet Surg J. 2025 May 16:sjaf080. doi: 10.1093/asj/sjaf080. Online ahead of print.

ABSTRACT

BACKGROUND: EVOLYSSE FORM (EVLF) and EVOLYSSE SMOOTH (EVLS) are new hyaluronic acid fillers created using an innovative cold crosslinking process.

OBJECTIVES: To collect safety and effectiveness data on new cold-crosslinked fillers to support US approval for the correction of moderate to severe dynamic facial wrinkles and folds.

METHODS: In this randomized, controlled, split-face study, 140 subjects with moderate to severe nasolabial folds received a cold-crosslinked filler in 1 nasolabial fold (EVLF = 70, EVLS = 70) and a traditionally-crosslinked filler, Restylane-L (RESL), in the contralateral fold and were followed through 12 months with an optional retreatment at that timepoint and subsequent 3 months of safety follow-up.

RESULTS: The primary endpoint of mean Wrinkle Severity Rating Scale change from baseline to Month 6 as rated by photographic review panel demonstrated non-inferiority and statistical superiority for the cold-crosslinked fillers. Blinded evaluator Wrinkle Severity Rating Scale assessments showed a mean change from baseline that was statistically significantly better than RESL for EVLF at all visits through 12 months and for EVLS at 6 and 9 months. Most subjects were responders on the Global Aesthetic Improvement Scale throughout the study according to ratings by blinded evaluators, treating investigators, and subjects. The FACE-Q Appraisal of Nasolabial Folds overall mean score showed significant improvement from baseline (p < 0.0001) at all timepoints through Month 12 for all treatment groups. All treatments were well tolerated.

CONCLUSIONS: The new cold-crosslinked fillers were shown to be safe and effective for correction of nasolabial folds, with results lasting for 1 year.

PMID:40378267 | DOI:10.1093/asj/sjaf080

A Personalized Predictive Model That Jointly Optimizes Discrimination and Calibration

Stat Med. 2025 May;44(10-12):e70077. doi: 10.1002/sim.70077.

ABSTRACT

Precision medicine is accelerating rapidly in the field of health research. This includes fitting predictive models for individual patients based on patient similarity in an attempt to improve model performance. We propose an algorithm that fits a personalized predictive model (PPM) using an optimal size of a similar subpopulation that jointly optimizes model discrimination and calibration. This joint criterion responds to the criticism that calibration is assessed far less often than discrimination, even though poorly calibrated models can be misleading. We define a mixture loss function that considers model discrimination and calibration, and allows for flexibility in emphasizing one performance measure over another. We empirically show that the relationship between the size of subpopulation and calibration is quadratic, which motivates the development of our jointly optimized model. We also investigate the effect of within-population patient weighting on performance and conclude that the size of subpopulation has a larger effect on the predictive performance of the PPM compared to the choice of weight function.
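
The idea of a mixture loss that trades discrimination against calibration can be sketched as a weighted sum. The abstract does not give the authors' formula, so everything below is an assumption made for illustration: the names `auc`, `brier`, and `mixture_loss`, the weight `alpha`, the use of 1 - AUC as the discrimination term, and the Brier score as a simple stand-in for the calibration term (the Brier score also reflects discrimination, so it is not a pure calibration measure).

```python
def auc(y, p):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def brier(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def mixture_loss(y, p, alpha=0.5):
    """Hypothetical mixture: alpha weights discrimination, 1 - alpha the Brier term."""
    return alpha * (1 - auc(y, p)) + (1 - alpha) * brier(y, p)


# Hypothetical outcomes and predicted risks for a small similar subpopulation.
y_true = [0, 0, 1, 1]
p_pred = [0.10, 0.40, 0.35, 0.80]

for alpha in (0.0, 0.5, 1.0):
    print(alpha, round(mixture_loss(y_true, p_pred, alpha), 4))
```

In a setup like this, the subpopulation size would be chosen by evaluating the loss over a grid of candidate sizes and keeping the minimizer; `alpha` controls which performance measure dominates that choice.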

PMID:40378188 | DOI:10.1002/sim.70077