Categories
Nevin Manimala Statistics

Application of unified health large language model evaluation framework to In-Basket message replies: bridging qualitative and quantitative assessments

J Am Med Inform Assoc. 2025 Mar 10:ocaf023. doi: 10.1093/jamia/ocaf023. Online ahead of print.

ABSTRACT

OBJECTIVES: Large language models (LLMs) are increasingly utilized in healthcare, transforming medical practice through advanced language processing capabilities. However, the evaluation of LLMs predominantly relies on human qualitative assessment, which is time-consuming, resource-intensive, and may be subject to variability and bias. There is a pressing need for quantitative metrics to enable scalable, objective, and efficient evaluation.

MATERIALS AND METHODS: We propose a unified evaluation framework that bridges qualitative and quantitative methods to assess LLM performance in healthcare settings. This framework maps evaluation aspects (such as linguistic quality, efficiency, content integrity, trustworthiness, and usefulness) to both qualitative assessments and quantitative metrics. We apply our approach to empirically evaluate the Epic In-Basket feature, which uses an LLM to generate patient message replies.

RESULTS: The empirical evaluation demonstrates that while Artificial Intelligence (AI)-generated replies exhibit high fluency, clarity, and minimal toxicity, they face challenges with coherence and completeness. Clinicians’ manual decision to use AI-generated drafts correlates strongly with quantitative metrics, suggesting that quantitative metrics have the potential to reduce human effort in the evaluation process and make it more scalable.

DISCUSSION: Our study highlights the potential of a unified evaluation framework that integrates qualitative and quantitative methods, enabling scalable and systematic assessments of LLMs in healthcare. Automated metrics streamline evaluation and monitoring processes, but their effective use depends on alignment with human judgment, particularly for aspects requiring contextual interpretation. As LLM applications expand, refining evaluation strategies and fostering interdisciplinary collaboration will be critical to maintaining high standards of accuracy, ethics, and regulatory compliance.

CONCLUSION: Our unified evaluation framework bridges the gap between qualitative human assessments and automated quantitative metrics, enhancing the reliability and scalability of LLM evaluations in healthcare. While automated quantitative evaluations are not ready to fully replace qualitative human evaluations, they can be used to enhance the process and, with relevant benchmarks derived from the unified framework proposed here, they can be applied to LLM monitoring and evaluation of updated versions of the original technology evaluated using qualitative human standards.

PMID:40063081 | DOI:10.1093/jamia/ocaf023

Exploring Heart Disease-Related mHealth Apps in India: Systematic Search in App Stores and Metadata Analysis

J Med Internet Res. 2025 Mar 10;27:e53823. doi: 10.2196/53823.

ABSTRACT

BACKGROUND: Smartphone mobile health (mHealth) apps have the potential to enhance access to health care services and address health care disparities, especially in low-resource settings. However, when developed without attention to equity and inclusivity, mHealth apps can also exacerbate health disparities. Understanding and creating solutions for the disparities caused by mHealth apps is crucial for achieving health equity. There is a noticeable gap in research that comprehensively assesses the entire spectrum of existing health apps and extensively explores apps for specific health priorities from a health care and public health perspective. With its vast and diverse population, India presents a unique setting for studying the landscape of mHealth apps.

OBJECTIVE: This study aimed to create a comprehensive dataset of mHealth apps available in India with an initial focus on heart disease (HD)-related apps.

METHODS: We collected individual app data from apps in the “medical” and “health and fitness” categories from the Google Play Store and the Apple App Store in December 2022 and July 2023, respectively. Using natural language processing techniques, we selected HD apps, performed statistical analysis, and applied latent Dirichlet allocation for clustering and topic modeling to categorize the resulting HD apps.
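The keyword-based selection step described above could be sketched roughly as follows. This is a minimal illustration only: the abstract does not publish the actual term lists or matching rules, so `HD_UNIGRAMS`, `HD_BIGRAMS`, and `is_hd_app` are hypothetical names with made-up example terms.

```python
import re

# Hypothetical keyword lists; the study's real terms are not given in the abstract.
HD_UNIGRAMS = {"cardiac", "cardiology", "cardiovascular"}
HD_BIGRAMS = {("heart", "disease"), ("heart", "failure"), ("blood", "pressure")}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_hd_app(description):
    """Return True if the description contains any listed unigram or bigram."""
    tokens = tokenize(description)
    if any(tok in HD_UNIGRAMS for tok in tokens):
        return True
    # Adjacent token pairs form the bigrams of the description.
    return any(pair in HD_BIGRAMS for pair in zip(tokens, tokens[1:]))


# Toy app descriptions, invented for illustration.
apps = {
    "app_a": "Track your blood pressure and pulse every day",
    "app_b": "Guided yoga and meditation timer",
    "app_c": "Cardiology reference for medical students",
}
hd_apps = sorted(name for name, desc in apps.items() if is_hd_app(desc))
```

In practice the matched apps would then be passed to the statistical analysis and topic-modeling steps; this sketch covers only the filtering.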

RESULTS: We collected 118,555 health apps from the Apple App Store and 108,945 health apps from the Google Play Store. Within these datasets, we found that approximately 1.7% (1990/118,555) of apps on the Apple App Store and 0.5% (548/108,945) on the Google Play Store included support for Indian languages. Using monograms and bigrams related to HD, we identified 1681 HD apps from the Apple App Store and 588 HD apps from the Google Play Store. HD apps make up only a small fraction of the total number of health apps available in India. About 90% (1496/1681 on Apple App Store and 548/588 on Google Play Store) of the HD apps were free of cost. However, more than 70% (1329/1681, 79.1% on Apple App Store and 423/588, 71.9% on Google Play Store) of HD apps had no reviews and rating-scores, indicating low overall use.

CONCLUSIONS: Our study proposed a robust method for collecting and analyzing metadata from a wide array of mHealth apps available in India through the Apple App Store and Google Play Store. We revealed the limited representation of India’s linguistic diversity within the health and medical app landscape, evident from the negligible presence of Indian-language apps. We observed a scarcity of mHealth apps dedicated to HD, along with a lower level of user engagement, as indicated by reviews and app ratings. While most HD apps are financially accessible, uptake remains a challenge. Further research should focus on app quality assessment and factors influencing user adoption.

PMID:40063078 | DOI:10.2196/53823

Comparing oncological outcomes and safety between photodynamic diagnosis-assisted and white-light transurethral resection in elderly patients with non-muscle invasive bladder cancer

Jpn J Clin Oncol. 2025 Mar 10:hyaf047. doi: 10.1093/jjco/hyaf047. Online ahead of print.

ABSTRACT

OBJECTIVES: This study aimed to assess the prognostic outcomes and risk of adverse events in elderly non-muscle invasive bladder cancer (NMIBC) patients receiving photodynamic diagnosis-assisted transurethral resection of bladder cancer (PDD-TURBT).

METHODS: This study retrospectively included 326 patients who were over 70 years old and received either PDD-TURBT (n = 114, PDD group) or white-light TURBT (n = 212, WL group). Oncological outcomes, namely recurrence-free survival (RFS) and progression-free survival (PFS), and adverse event profiles were compared between the two groups.

RESULTS: In the PDD and WL groups, the median RFS periods were not reached and 41.7 months (P < 0.001), and the median PFS periods were not reached and 160.2 months (P = 0.057), respectively. The Gray test, which accounts for overall death as a competing risk event, revealed that recurrence tended to decrease in the PDD group (P = 0.050). Multivariate Cox regression analyses identified WL-TURBT as an independent prognostic factor for RFS. After propensity score matching, statistically favorable RFS in the PDD group was shown (P = 0.018). The incidences of AST/ALT elevation and intraoperative hypotension (defined as systolic blood pressure ≤ 80 mmHg) were significantly higher in the PDD group than in the WL group (P = 0.003 and 0.003, respectively).

CONCLUSIONS: Prolonged RFS is expected for PDD-TURBT using oral 5-aminolevulinic acid in elderly NMIBC patients. However, the risks of liver injury and intraoperative hypotension are higher for PDD-TURBT.

PMID:40063065 | DOI:10.1093/jjco/hyaf047

Patient preferences toward herpes zoster vaccination among individuals aged 50 years or older in South Korea: Findings from a discrete choice experiment

Hum Vaccin Immunother. 2025 Dec;21(1):2469419. doi: 10.1080/21645515.2025.2469419. Epub 2025 Mar 10.

ABSTRACT

In South Korea, the increasing incidence of herpes zoster (HZ) and the aging population warrant consideration of HZ vaccination for older adults. There is a need to understand the HZ vaccine-related preferences of adults aged ≥50 years and adult children (working or financially independent adults contributing to healthcare decision-making for their parents aged ≥50 years). A discrete choice experiment was conducted to elicit HZ vaccine preferences of the HZ-naïve general public aged ≥50 years (n = 500), current/former HZ patients aged ≥50 years (n = 150), and adult children (n = 150). An online questionnaire was administered from March to May 2023; for each preference-elicitation question, respondents chose among three hypothetical HZ vaccine profiles, characterized by five attributes with varying levels, or “no vaccine”. Respondents generally accepted an increased number of doses (from one to two) for a longer protection duration (from ≥4 to ≥7 or ≥10 years). By mean relative importance (RI), protection duration (RI: 37.1%; 95% confidence interval [CI]: 36.0%, 38.1%), lifetime HZ risk reduction (27.3%; 95% CI: 26.3%, 28.4%) and short-term side effects (14.9%; 95% CI: 14.1%, 15.6%) had the strongest impact on respondents’ HZ vaccine decision-making. Adult children viewed short-term side effects with significantly greater RI than the general public and current/former HZ patients (19.1%, 13.5%, 15.2%, respectively, p < .001). Respondents with selected comorbidities placed higher RI than those without comorbidities on protection duration (39.3% versus 34.2%, p < .001) and lower RI on prevention of HZ-related complications (8.7% versus 10.4%, p = .007). Findings may guide health policy design/refinement and physician-patient conversations on HZ vaccination/vaccines.
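For readers unfamiliar with relative importance, the sketch below shows one common convention from conjoint analysis: each attribute's RI is the range of its level utilities divided by the sum of ranges across all attributes. The abstract does not state the exact formula the authors used, and the utilities below are invented for illustration, so `relative_importance` and `part_worths` are assumptions rather than the study's method or data.

```python
def relative_importance(part_worths):
    """Return each attribute's share (in %) of the total utility range.

    part_worths maps attribute name -> {level name: estimated utility}.
    """
    ranges = {
        attr: max(levels.values()) - min(levels.values())
        for attr, levels in part_worths.items()
    }
    total = sum(ranges.values())
    if total == 0:
        raise ValueError("all attributes have zero utility range")
    return {attr: 100.0 * r / total for attr, r in ranges.items()}


# Hypothetical utilities for a single respondent (not taken from the study).
part_worths = {
    "protection_duration": {"4_years": 0.0, "7_years": 0.6, "10_years": 1.2},
    "number_of_doses": {"one": 0.3, "two": 0.0},
    "short_term_side_effects": {"mild": 0.5, "moderate": 0.0},
}
ri = relative_importance(part_worths)
```

A mean RI such as the one reported would typically come from computing this per respondent and averaging, with the shares for each respondent summing to 100%.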

PMID:40063054 | DOI:10.1080/21645515.2025.2469419

What Is a Stepped-Wedge Cluster Randomized Trial?

JAMA Intern Med. 2025 Mar 10. doi: 10.1001/jamainternmed.2024.8216. Online ahead of print.

NO ABSTRACT

PMID:40063042 | DOI:10.1001/jamainternmed.2024.8216

Implementing Accuracy, Completeness, and Traceability for Data Reliability

JAMA Netw Open. 2025 Mar 3;8(3):e250128. doi: 10.1001/jamanetworkopen.2025.0128.

ABSTRACT

IMPORTANCE: While it is well known that data quality underlies evidence validity, the measurement and impacts of data reliability are less well understood. The need has been highlighted in the 21st Century Cures Act of 2016 and US Food and Drug Administration (FDA) Real-World Evidence Program framework in 2018, draft guidance in 2021 and final guidance in 2024. Timely visibility into implementation may be provided by the Transforming Real-World Evidence With Unstructured and Structured Data to Advance Tailored Therapy (TRUST) study, a Verantos Inc-led FDA-funded demonstration project to explore data quality and inform regulatory decision-making.

OBJECTIVE: To report early learnings from the TRUST study on distilling data reliability to practice including developing a practical approach to quantify accuracy, completeness, and traceability of real-world data (routinely collected patient health data) and comparing traditional to advanced data and technologies on these dimensions.

DESIGN, SETTING, AND PARTICIPANTS: This quality improvement study was performed using data from 58 hospitals and more than 1180 associated outpatient clinics from academic and community settings in the US. Participants included patients with asthma treated between January 1, 2014, and December 31, 2022. Data were analyzed from January 1 to June 30, 2024.

EXPOSURES: The traditional approach used medical and pharmacy claims as source documentation. The advanced approach used medical and pharmacy claims, electronic health records with unstructured data extracted using artificial intelligence methods, and mortality registry data.

MAIN OUTCOMES AND MEASURES: Accuracy was assessed using the F1 score. Completeness was estimated as a weighted mean of available data sources during each calendar year under study for each patient. Traceability was estimated as the proportion of data elements identified in clinical source documentation.
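The three measures named above can be illustrated with a short sketch. The F1 formula is standard; the completeness weights and the counts are hypothetical, because the abstract does not specify how sources were weighted or how elements were tallied. Function names are illustrative only.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


def weighted_completeness(available, weights):
    """Weighted share of data sources available for one patient-year.

    available maps source -> bool; weights maps source -> weight (assumed values).
    """
    total = sum(weights.values())
    present = sum(weights[src] for src, ok in available.items() if ok)
    return present / total


def traceability(traced_elements, total_elements):
    """Proportion of data elements found in clinical source documentation."""
    return traced_elements / total_elements


# Toy inputs, not study data.
accuracy = f1_score(true_pos=8, false_pos=2, false_neg=2)
weights = {"claims": 1.0, "ehr": 2.0, "mortality_registry": 1.0}
claims_only = weighted_completeness(
    {"claims": True, "ehr": False, "mortality_registry": False}, weights
)
traced = traceability(traced_elements=3, total_elements=4)
```

Averaging the per-patient-year completeness values over all patients and years would give a study-level score comparable in form to the one reported.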

RESULTS: In total, 120 616 patients met the minimum data requirements (mean [SD] age, 43.2 [18.5] years; 41 011 male [34.0%]). For accuracy, traditional approaches had F1 scores of 59.5% and advanced approaches had scores of 93.4%. For completeness, traditional approaches yielded mean scores of 46.1% (95% CI, 38.2%-54.0%); advanced approaches, 96.6% (95% CI, 85.8%-1.1%). For traceability, traditional approaches had 11.5% (95% CI, 11.4%-11.5%) and advanced approaches had 77.3% (95% CI, 77.3%-77.3%) of data elements traceable to clinical source data.

CONCLUSIONS AND RELEVANCE: In this study, practical implementation of data reliability measurement is described. Findings suggest the potential of using multiple data sources and applying advanced methods to increase real-world data reliability. The inclusion of data reliability standards when generating evidence from these sources has the potential to strengthen support for the use of real-world evidence in the prescription, reimbursement, and approval of medications.

PMID:40063029 | DOI:10.1001/jamanetworkopen.2025.0128

Connectome-Based Predictive Modeling of PTSD Development Among Recent Trauma Survivors

JAMA Netw Open. 2025 Mar 3;8(3):e250331. doi: 10.1001/jamanetworkopen.2025.0331.

ABSTRACT

IMPORTANCE: The weak link between subjective symptom-based diagnostics for posttraumatic psychopathology and objective neurobiological indices hinders the development of effective personalized treatments.

OBJECTIVE: To identify early neural networks associated with posttraumatic stress disorder (PTSD) development among recent trauma survivors.

DESIGN, SETTING, AND PARTICIPANTS: This prognostic study used data from the Neurobehavioral Moderators of Posttraumatic Disease Trajectories (NMPTDT) large-scale longitudinal neuroimaging dataset of recent trauma survivors. The NMPTDT study was conducted from January 20, 2015, to March 11, 2020, and included adult civilians who were admitted to a general hospital emergency department in Israel and screened for early PTSD symptoms indicative of chronic PTSD risk. Enrolled participants completed comprehensive clinical assessments and functional magnetic resonance imaging (fMRI) scans at 1, 6, and 14 months post trauma. Data were analyzed from September 2023 to March 2024.

EXPOSURE: Traumatic events included motor vehicle incidents, physical assaults, robberies, hostilities, electric shocks, fires, drownings, work accidents, terror attacks, or large-scale disasters.

MAIN OUTCOMES AND MEASURES: Connectome-based predictive modeling (CPM), a whole-brain machine learning approach, was applied to resting-state and task-based fMRI data collected at 1 month post trauma. The primary outcome measure was PTSD symptom severity across the 3 time points, assessed with the Clinician-Administered PTSD Scale for DSM-5 (CAPS-5). Secondary outcomes included Diagnostic and Statistical Manual of Mental Disorders (Fifth Edition) (DSM-5) PTSD symptom clusters (intrusion, avoidance, negative alterations in mood and cognition, hyperarousal).

RESULTS: A total of 162 recent trauma survivors (mean [SD] age, 33.9 [11.5] years; 80 women [49.4%] and 82 men [50.6%]) were included at 1 month post trauma. Follow-up assessments were completed by 136 survivors (84.0%) at 6 months and by 133 survivors (82.1%) at 14 months post trauma. Among the 162 recent trauma survivors, CPM significantly predicted PTSD severity at 1 month (ρ = 0.18, P < .001) and 14 months (ρ = 0.24, P < .001) post trauma, but not at 6 months post trauma (ρ = 0.03, P = .39). The most predictive edges at 1 month included connections within and between the anterior default mode, motor sensory, and salience networks. These networks, with the additional contribution of the central executive and visual networks, were predictive of symptoms at 14 months. CPM predicted avoidance and negative alterations in mood and cognition at 1 month, but it predicted intrusion and hyperarousal symptoms at 14 months.

CONCLUSIONS AND RELEVANCE: In this prognostic study of recent trauma survivors, individual differences in large-scale neural networks shortly after trauma were associated with variability in PTSD symptom trajectories over the first year following trauma exposure. These findings suggest that CPM may identify potential targets for interventions.

PMID:40063028 | DOI:10.1001/jamanetworkopen.2025.0331

Long-Term Outcomes and Determinants of New-Onset Mental Health Conditions After Trauma

JAMA Netw Open. 2025 Mar 3;8(3):e250349. doi: 10.1001/jamanetworkopen.2025.0349.

ABSTRACT

IMPORTANCE: Evidence suggests that trauma-related mortality and morbidities may follow a multiphasic pattern, with outcomes extending beyond hospital discharge.

OBJECTIVES: To determine the incidence of having new mental health conditions after the first (or index) trauma admission and their association with long-term health outcomes.

DESIGN, SETTING, AND PARTICIPANTS: This population-based, linked-data cohort study was conducted between January 1994 and September 2020, with data analyzed in April 2024. Participants were adult patients with trauma admitted to 1 of the 5 adult trauma hospitals in Western Australia. All patients with major trauma with an Injury Severity Score (ISS) greater than 15 were included. For each patient with major trauma, 2 patients with trauma with a lower ISS (<16) were randomly selected.

EXPOSURE: A new mental health condition recorded in either subsequent public or private hospitalizations after trauma admission.

MAIN OUTCOMES AND MEASURES: The primary outcomes were the associations between new mental health conditions after trauma and subsequent risks of trauma readmission, suicide, and all-cause mortality, as determined by Cox proportional hazards regression. Logistic regression was used to determine which factors were associated with developing a new mental health condition after trauma.

RESULTS: Of 29 191 patients (median [IQR] age, 42 [27-65] years; 19 383 male [66.4%]; median [IQR] ISS, 9 [5-16]; 9405 with ISS >15 and 19 786 with ISS <16) considered, 2233 (7.6%) had a mental health condition before their trauma admissions. The median (IQR) follow-up time after the index trauma admission was 99.8 (61.2-148.5) months. Of 26 958 patients without a prior mental health condition, 3299 (11.3%) developed a mental health condition subsequently, including drug dependence (2391 patients [8.2%], with 419 patients [1.4%] experiencing opioid dependence) and neurotic disorders (1574 patients [5.4%]), including posttraumatic stress disorder. Developing a new mental health condition after trauma was associated with subsequent trauma readmissions (adjusted hazard ratio [aHR], 1.30; 95% CI, 1.23-1.37; P < .001), suicides (aHR, 3.14; 95% CI, 2.00-4.91; P < .001), and all-cause mortality (aHR, 1.24; 95% CI, 1.12-1.38; P < .001). Younger age, unemployment, being single or divorced (vs married), Indigenous ethnicity, and a lower socioeconomic status were all associated with developing a new mental health condition after the first trauma admission.

CONCLUSIONS AND RELEVANCE: This cohort study of 29 191 patients with trauma found that mental health conditions after trauma were common and associated with an increased risk of adverse long-term outcomes, indicating that mental health follow-up of patients with trauma, particularly those from vulnerable subgroups, may be warranted.

PMID:40063026 | DOI:10.1001/jamanetworkopen.2025.0349

Second-Line Medications for Women Aged 10 to 50 Years With Idiopathic Generalized Epilepsy

JAMA Netw Open. 2025 Mar 3;8(3):e250354. doi: 10.1001/jamanetworkopen.2025.0354.

ABSTRACT

IMPORTANCE: Women with idiopathic generalized epilepsy (IGE) face challenges in treatment due to limited options that are both effective and safe.

OBJECTIVE: To evaluate the effectiveness and safety of substitution monotherapy vs add-on therapy as second-line options for women with IGE who might become pregnant, after failure of first-line antiseizure medications (ASMs) other than valproic acid.

DESIGN, SETTING, AND PARTICIPANTS: Multicenter retrospective comparative effectiveness cohort study at 18 primary, secondary, and tertiary adult and children epilepsy centers across 4 countries, analyzing data from 1995 to 2023. Participants were women aged 10 to 50 years diagnosed with IGE who were prescribed a second line of ASM.

MAIN OUTCOMES AND MEASURES: Treatment failure (TF), defined as the replacement or addition of a second ASM due to ineffectiveness, was compared between patients receiving ASM add-on or substitution monotherapy using inverse probability of treatment weighting (IPTW)-adjusted Cox proportional hazards regression. Exploratory analyses were also conducted to assess the effectiveness of individual ASMs and various ASM combinations.
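As a rough illustration of the weighting step, the sketch below computes unstabilized inverse probability of treatment weights from propensity scores. The abstract does not say which IPTW variant, propensity model, or covariates were used, so the variant, the scores, and the name `iptw_weights` are assumptions.

```python
def iptw_weights(treated, propensity):
    """Unstabilized IPTW: 1/p for treated patients, 1/(1 - p) for the others.

    treated is a sequence of booleans (True = add-on therapy in this toy
    example); propensity holds each patient's estimated probability of
    receiving the treatment, strictly between 0 and 1.
    """
    weights = []
    for is_treated, p in zip(treated, propensity):
        if not 0.0 < p < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / p if is_treated else 1.0 / (1.0 - p))
    return weights


# Hypothetical propensity scores, e.g. from a logistic regression on baseline covariates.
treated = [True, True, False, False]
propensity = [0.5, 0.8, 0.2, 0.6]
weights = iptw_weights(treated, propensity)
```

The resulting weights would then be supplied to a weighted Cox model, so that patients who received an unlikely treatment for their covariate profile count more in the comparison.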

RESULTS: This study included 249 women with a median (IQR) age of 18.0 (15.5-22.0) years. Among them, 146 (58.6%) received an add-on regimen, and 103 (41.4%) received substitution monotherapy. During follow-up, TF occurred in 48 patients (32.9%) receiving add-on therapy and 36 (35.0%) using substitution monotherapy, with no significant differences between groups (IPTW-adjusted hazard ratio [HR], 0.89; 95% CI, 0.53-1.51; P = .69). ASM discontinuation due to ineffectiveness or adverse effects occurred in 36 patients (24.7%) receiving add-on therapy and 29 (28.2%) receiving substitution monotherapy, showing no significant differences (IPTW-adjusted HR, 0.97; 95% CI, 0.57-1.65; P = .92). Rates of ASM discontinuation due to adverse effects only were low in both groups, occurring in 13 patients (9.0%) receiving add-on therapy and 9 (8.7%) receiving substitution monotherapy. Among add-on regimens other than valproic acid, the combination of levetiracetam and lamotrigine demonstrated a lower risk of TF compared with levetiracetam plus another ASM (adjusted HR, 2.41; 95% CI, 1.12-5.17; P = .02) and lamotrigine plus another ASM (adjusted HR, 4.03; 95% CI, 1.73-9.39; P = .001). However, valproic acid remained the most effective second-line ASM when considering individual agents.

CONCLUSIONS AND RELEVANCE: In this comparative effectiveness study of second-line treatment strategies for women with IGE, no significant differences were observed between substitution monotherapy and add-on therapy.

PMID:40063025 | DOI:10.1001/jamanetworkopen.2025.0354

Perceptions of Patient-Clinician Communication Among Adults With and Without Serious Illness

JAMA Netw Open. 2025 Mar 3;8(3):e250365. doi: 10.1001/jamanetworkopen.2025.0365.

ABSTRACT

IMPORTANCE: High-quality, person-centered patient-clinician communication is critical in health care and may be less effective for patients with serious illness. Little is understood about differences in patient-clinician communication experiences of adults with and without serious illness.

OBJECTIVES: To determine whether perceptions of patient-clinician communication experiences differ between adults with and without serious illness.

DESIGN, SETTING, AND PARTICIPANTS: This population-based cross-sectional survey was fielded from April 20 to May 31, 2021, and data were analyzed from January 27, 2023, to December 10, 2024. Participants included a nationally representative sample of US English- or Spanish-speaking adults, including people from historically marginalized groups (eg, Black and Hispanic or Latino individuals, people with low income), responding to an online or telephone survey.

EXPOSURE: Participants were categorized by serious illness status. Participants with serious illness replied yes to (1) having a diagnosis from a list of medical conditions and (2) reporting feeling sicker or having decreased functionality during the last year.

MAIN OUTCOMES AND MEASURES: The survey asked about community partner-derived measures of patient-clinician communication experiences, including trusting clinicians, feeling afraid to speak up, and being unsure about next steps. Multivariable logistic regression models were used to estimate the association of serious illness with these communication experiences, adjusting for sociodemographic characteristics. Percentages were weighted according to the National Opinion Research Center’s statistical weighting methods to account for differences in nonresponse.

RESULTS: Of 6126 individuals invited, 1847 (30.2%) completed the survey and were included in analysis (mean [SD] age, 48.4 [17.5] years); 950 (51.8%) identified as female; 191 (11.9%) identified as Black and 287 (16.7%) as Hispanic; and 434 (17.8%) had an annual income less than $30 000 (here called low income). Among respondents, 363 participants (18.5%) had serious illness (mean [SD] age, 50.2 [18.1] years; 218 [64.5%] female; 34 [12.4%] Black; 54 [16.4%] Hispanic; 131 [27.3%] with low income), and 1484 (81.5%) had no serious illness (mean [SD] age, 48.0 [17.4] years; 732 [48.9%] female; 157 [11.8%] Black; 233 [16.7%] Hispanic; 303 [15.6%] with low income). Compared with adults without serious illness, adults with serious illness were more likely to report leaving a visit unsure about next steps (adjusted odds ratio [AOR], 2.30; 95% CI, 1.62-3.27); being afraid to ask questions or speak up (AOR, 2.18; 95% CI, 1.55-3.08); believing they were talked down to or made to feel inferior (AOR, 1.90; 95% CI, 1.24-2.91); and believing that they were treated unfairly by clinicians (AOR, 3.26; 95% CI, 2.43-4.38).

CONCLUSIONS AND RELEVANCE: In this cross-sectional study, adults with serious illness more often had worse patient-clinician communication experiences. Further research is needed to better understand and develop interventions to improve perceptions of patient-clinician communication experiences for adults with serious illness.

PMID:40063024 | DOI:10.1001/jamanetworkopen.2025.0365