Categories
Nevin Manimala Statistics

Relationship between body composition and the risk of non-communicable chronic diseases in active older women from Chillán (Chile).

Rev Esp Salud Publica. 2023 Jun 22;97:e202306045.

ABSTRACT

OBJECTIVE: In Chile, the elderly represent 18% of the population. In women, aging alters body composition and often coexists with other pathologies such as chronic noncommunicable diseases (NCDs). The aim of the study was to relate body composition to the presence of chronic noncommunicable diseases in active older women in the city of Chillán.

METHODS: The sample consisted of 284 women belonging to senior centers in Chillán. Body composition was determined by bioelectrical impedance analysis. Sociodemographic information, prevalent pathologies, geriatric syndromes and physical activity were assessed by means of a validated questionnaire. Data were analyzed with descriptive and inferential statistics in STATA 15.0 at a significance level of α=0.05.

RESULTS: Of the sample, 63% were under seventy-five years of age, 77.5% had less than twelve years of schooling, the predominant socioeconomic level was low, and participants mainly reported a poor perception of their health and regular use of medication. Arterial hypertension (AHT) and hypercholesterolemia were prevalent, at 70.4% and 48.2% respectively. Mean body mass index (BMI) was 29.7±4.8 kg/m² and 71.8% had malnutrition by excess (overweight or obesity). The group older than seventy-five years presented more body fat mass (BMF) and extracellular water (ECW). AHT was related to higher BMI, total fat mass (MGT), mid-arm circumference (CMB), calf circumference (CP) and ECW (p<0.05), while diabetes mellitus was related to BMI and CMB.

CONCLUSIONS: Hypertension is the most frequent pathology and is related to higher BMI, total fat mass (MGT), mid-arm circumference (CMB), calf circumference (CP) and ECW, followed by type 2 diabetes mellitus (DM2), which is related to BMI and CMB.

PMID:37387225


Study design of National Surveillance of the Work Environment of Employees in Denmark (NASWEED): Prospective cohort with register follow-up

Scand J Public Health. 2023 Jun 30:14034948231182022. doi: 10.1177/14034948231182022. Online ahead of print.

ABSTRACT

AIM: The aim is to report the design and baseline data of the ‘National Surveillance of the Work Environment of Employees in Denmark’ study (NASWEED).

METHODS: NASWEED consists of (a) biennial cross-sectional samples, based on probability samples of wage earners of the general working population in Denmark from 2021 onwards (surveillance), (b) a prospective cohort of all previous respondents every two years (epidemiology, questionnaire follow-up) and (c) longitudinal follow-up in Danish registers about work and health (epidemiology, register follow-up). Between February and May 2021, a stratified (38 occupational industries) probability sample of 63,391 Danish residents aged 15-69 years who were employed for at least 34 hours per month received an invitation to participate, of whom 30,099 (47.5%) completed the questionnaire, 897 (1.4%) partially completed the questionnaire and 32,395 (51.1%) did not respond. Baseline was completed in June 2021. NASWEED covers various topics about the work environment (psychosocial, ergonomic, chemical, biological, safety, accidents, working from home, etc.), health behaviours and somatic and mental health-related conditions. Statistical analyses will mainly build on survey procedures with model-assisted weights to ensure that the sample will yield representative estimates of the general working population.
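The response figures quoted above can be checked with a few lines of arithmetic. The sketch below only recomputes the percentages from the counts given in the abstract; the variable names are illustrative and not taken from the study.

```python
# Counts as reported in the abstract; names are illustrative only.
invited = 63_391
outcomes = {
    "completed": 30_099,
    "partially_completed": 897,
    "no_response": 32_395,
}

# The three outcome groups should account for every invited person.
assert sum(outcomes.values()) == invited

# Share of invitees in each group, in percent, rounded to one decimal.
rates = {name: round(100 * count / invited, 1) for name, count in outcomes.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The rounded shares reproduce the 47.5%, 1.4% and 51.1% stated in the text.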

DISCUSSION: NASWEED will monitor the development of the work environment and health in Denmark until 2030. The survey data will be included in epidemiological studies with repeated measurement of the work environment, health variables and covariates, and follow-ups in national registers to investigate the prospective association in the years and decades to come between the work environment and workers’ health and labour market participation.

PMID:37387222 | DOI:10.1177/14034948231182022


Getting ‘ϕψχal’ with proteins: minimum message length inference of joint distributions of backbone and sidechain dihedral angles

Bioinformatics. 2023 Jun 30;39(Supplement_1):i357-i367. doi: 10.1093/bioinformatics/btad251.

ABSTRACT

The tendency of an amino acid to adopt certain configurations in folded proteins is treated here as a statistical estimation problem. We model the joint distribution of the observed mainchain and sidechain dihedral angles (〈ϕ,ψ,χ1,χ2,…〉) of any amino acid by a mixture of products of von Mises probability distributions. This mixture model maps any vector of dihedral angles to a point on a multi-dimensional torus. The continuous space it uses to specify the dihedral angles provides an alternative to the commonly used rotamer libraries. These rotamer libraries discretize the space of dihedral angles into coarse angular bins, and cluster combinations of sidechain dihedral angles (〈χ1,χ2,…〉) as a function of backbone 〈ϕ,ψ〉 conformations. A ‘good’ model is one that is both concise and explains (compresses) observed data. Competing models can be compared directly and in particular our model is shown to outperform the Dunbrack rotamer library in terms of model complexity (by three orders of magnitude) and its fidelity (on average 20% more compression) when losslessly explaining the observed dihedral angle data across experimental resolutions of structures. Our method is unsupervised (with parameters estimated automatically) and uses information theory to determine the optimal complexity of the statistical model, thus avoiding under/over-fitting, a common pitfall in model selection problems. Our models are computationally inexpensive to sample from and are geared to support a number of downstream studies, ranging from experimental structure refinement and de novo protein design to protein structure prediction. We call our collection of mixture models PhiSiCal (ϕψχal).
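To make the model form concrete, here is a minimal sketch of a mixture whose components are products of independent von Mises densities, one per dihedral angle. It only evaluates a density; it does not perform the minimum message length inference described above. The component weights, means and concentrations in EXAMPLE_COMPONENTS are made-up values for illustration, not parameters from PhiSiCal.

```python
import math


def bessel_i0(x, terms=50):
    """Modified Bessel function I0(x), summed from its power series."""
    total, term = 0.0, 1.0
    q = (x / 2.0) ** 2
    for k in range(terms):
        total += term
        term *= q / ((k + 1) ** 2)
    return total


def von_mises_pdf(theta, mu, kappa):
    """Density of a von Mises distribution on the circle (angles in radians)."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))


def mixture_pdf(angles, components):
    """Density of a mixture of products of von Mises distributions.

    components: list of (weight, [(mu, kappa), ...]) with one (mu, kappa)
    pair per angle; the weights are assumed to sum to 1.
    """
    total = 0.0
    for weight, params in components:
        product = 1.0
        for theta, (mu, kappa) in zip(angles, params):
            product *= von_mises_pdf(theta, mu, kappa)
        total += weight * product
    return total


# Hypothetical two-component model over (phi, psi); values are illustrative.
EXAMPLE_COMPONENTS = [
    (0.6, [(-1.0, 8.0), (2.5, 8.0)]),
    (0.4, [(-2.0, 5.0), (-0.7, 5.0)]),
]

print(mixture_pdf([-1.0, 2.5], EXAMPLE_COMPONENTS))
```

Because each angle is handled by a periodic density, the model lives on a torus rather than in binned angular space, which is the contrast with rotamer libraries drawn above.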

AVAILABILITY AND IMPLEMENTATION: PhiSiCal mixture models and programs to sample from them are available for download at http://lcb.infotech.monash.edu.au/phisical.

PMID:37387189 | DOI:10.1093/bioinformatics/btad251


Phylogenetic diversity statistics for all clades in a phylogeny

Bioinformatics. 2023 Jun 30;39(Supplement_1):i177-i184. doi: 10.1093/bioinformatics/btad263.

ABSTRACT

The classic quantitative measure of phylogenetic diversity (PD) has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximizes PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology. The software is available at https://github.com/flu-crew/PD_stats.
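The definition of PD and the descriptive statistics over all taxon sets of size k can be illustrated on a toy tree. The sketch below is a brute-force enumeration, not the efficient algorithms the authors introduce, and it assumes the reading of PD as the length of the smallest subtree connecting the chosen taxa (the path up to the root is not counted). The tree, its branch lengths and the function names are invented for the example.

```python
from itertools import combinations
from statistics import mean

# Toy rooted tree as child -> (parent, branch length); values are illustrative.
TREE = {
    "X": ("root", 2.0),
    "a": ("X", 1.0),
    "b": ("X", 1.0),
    "c": ("root", 3.0),
}
LEAVES = ["a", "b", "c"]


def phylogenetic_diversity(tree, taxa):
    """Total branch length of the minimal subtree connecting `taxa`."""
    taxa = set(taxa)
    below = {}  # node -> number of selected taxa at or below the node
    for leaf in taxa:
        node = leaf
        while node in tree:
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    # A branch is needed only if it separates some selected taxa from others.
    return sum(tree[node][1] for node, count in below.items() if count < len(taxa))


def pd_statistics(tree, leaves, k):
    """Minimum, maximum and mean PD over all taxon sets of size k (brute force)."""
    values = [phylogenetic_diversity(tree, subset) for subset in combinations(leaves, k)]
    return min(values), max(values), mean(values)


print(pd_statistics(TREE, LEAVES, 2))
```

Enumerating every subset grows combinatorially with the number of taxa, which is why dedicated algorithms are needed for large phylogenies and per-clade statistics.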

PMID:37387175 | DOI:10.1093/bioinformatics/btad263


Higher-order genetic interaction discovery with network-based biological priors

Bioinformatics. 2023 Jun 30;39(Supplement_1):i523-i533. doi: 10.1093/bioinformatics/btad273.

ABSTRACT

MOTIVATION: Complex phenotypes, such as many common diseases and morphological traits, are controlled by multiple genetic factors, namely genetic mutations and genes, and are influenced by environmental conditions. Deciphering the genetics underlying such traits requires a systemic approach, where many different genetic factors and their interactions are considered simultaneously. Many association mapping techniques available nowadays follow this reasoning, but have some severe limitations. In particular, they require binary encodings for the genetic markers, forcing the user to decide beforehand whether to use, e.g. a recessive or a dominant encoding. Moreover, most methods cannot include any biological prior or are limited to testing only lower-order interactions among genes for association with the phenotype, potentially missing a large number of marker combinations.
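The encoding choice mentioned above can be shown in a few lines. This sketch assumes genotypes are stored as minor-allele counts (0, 1 or 2), a common convention that the abstract itself does not specify; the function name is hypothetical and unrelated to the HOGImine code base.

```python
def encode_genotypes(genotypes, mode):
    """Binarise minor-allele counts (0, 1, 2) under a dominant or recessive model."""
    if mode == "dominant":
        # One copy of the minor allele is enough to set the marker.
        return [int(g >= 1) for g in genotypes]
    if mode == "recessive":
        # Both copies must be the minor allele.
        return [int(g == 2) for g in genotypes]
    raise ValueError(f"unknown encoding: {mode!r}")


sample = [0, 1, 2, 1]
print(encode_genotypes(sample, "dominant"))
print(encode_genotypes(sample, "recessive"))
```

The same genotypes yield different binary markers under the two models, so committing to one encoding in advance can hide associations that only appear under the other.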

RESULTS: We propose HOGImine, a novel algorithm that expands the class of discoverable genetic meta-markers by considering higher-order interactions of genes and by allowing multiple encodings for the genetic variants. Our experimental evaluation shows that the algorithm has a substantially higher statistical power compared to previous methods, allowing it to discover genetic mutations statistically associated with the phenotype at hand that could not be found before. Our method can exploit prior biological knowledge on gene interactions, such as protein-protein interaction networks, genetic pathways, and protein complexes, to restrict its search space. Since computing higher-order gene interactions poses a high computational burden, we also develop a more efficient search strategy and support computation to make our approach applicable in practice, leading to substantial runtime improvements compared to state-of-the-art methods.

AVAILABILITY AND IMPLEMENTATION: Code and data are available at https://github.com/BorgwardtLab/HOGImine.

PMID:37387173 | DOI:10.1093/bioinformatics/btad273


The impossible challenge of estimating non-existent moments of the Chemical Master Equation

Bioinformatics. 2023 Jun 30;39(Supplement_1):i440-i447. doi: 10.1093/bioinformatics/btad205.

ABSTRACT

MOTIVATION: The Chemical Master Equation (CME) is a set of linear differential equations that describes the evolution of the probability distribution on all possible configurations of a (bio-)chemical reaction system. Since the number of configurations and therefore the dimension of the CME rapidly increases with the number of molecules, its applicability is restricted to small systems. A widely applied remedy for this challenge is the use of moment-based approaches, which consider the evolution of the first few moments of the distribution as summary statistics for the complete distribution. Here, we investigate the performance of two moment-estimation methods for reaction systems whose equilibrium distributions are fat-tailed and do not possess statistical moments.

RESULTS: We show that estimation via stochastic simulation algorithm (SSA) trajectories loses consistency over time and estimated moment values span a wide range of values even for large sample sizes. In comparison, the method of moments returns smooth moment estimates but is not able to indicate the non-existence of the allegedly predicted moments. We furthermore analyze the negative effect of a CME solution’s fat-tailedness on SSA run times and explain inherent difficulties. While moment-estimation techniques are a commonly applied tool in the simulation of (bio-)chemical reaction networks, we conclude that they should be used with care, as neither the system definition nor the moment-estimation techniques themselves reliably indicate the potential fat-tailedness of the CME’s solution.
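For readers unfamiliar with SSA-based moment estimation, the following is a minimal sketch of the idea on a simple birth-death system, chosen here purely for illustration. This system has a Poisson stationary distribution with finite moments, so the sample mean behaves well; it is not one of the fat-tailed systems studied in the paper, and the rate values and function names are assumptions.

```python
import random


def ssa_birth_death(birth, death, n0, t_end, rng):
    """Simulate one Gillespie trajectory and return the copy number at t_end.

    Reactions: 0 -> X at rate `birth`, X -> 0 at rate `death * n`.
    """
    t, n = 0.0, n0
    while True:
        a_birth = birth
        a_death = death * n
        a_total = a_birth + a_death
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            return n
        if rng.random() * a_total < a_birth:
            n += 1
        else:
            n -= 1


def estimate_mean(birth, death, n0, t_end, runs, rng):
    """Estimate the first moment at t_end as the average over SSA trajectories."""
    samples = [ssa_birth_death(birth, death, n0, t_end, rng) for _ in range(runs)]
    return sum(samples) / len(samples)


rng = random.Random(0)
print(estimate_mean(birth=10.0, death=1.0, n0=0, t_end=10.0, runs=500, rng=rng))
```

With these rates the estimate settles near birth/death = 10. The point made in the abstract is that for systems whose stationary distribution has no finite moments, the same averaging procedure does not settle, however many trajectories are drawn.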

PMID:37387158 | DOI:10.1093/bioinformatics/btad205


Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function

Bioinformatics. 2023 Jun 30;39(Supplement_1):i318-i325. doi: 10.1093/bioinformatics/btad208.

ABSTRACT

MOTIVATION: Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently.

RESULTS: We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.

AVAILABILITY AND IMPLEMENTATION: The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.

PMID:37387145 | DOI:10.1093/bioinformatics/btad208


Deep statistical modelling of nanopore sequencing translocation times reveals latent non-B DNA structures

Bioinformatics. 2023 Jun 30;39(Supplement_1):i242-i251. doi: 10.1093/bioinformatics/btad220.

ABSTRACT

MOTIVATION: Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures.

RESULTS: We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed, and optimizing Gaussian GoF tests allows for the computation of P-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared with B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable.

AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.

PMID:37387144 | DOI:10.1093/bioinformatics/btad220


Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Bioinformatics. 2023 Jun 30;39(Supplement_1):i260-i269. doi: 10.1093/bioinformatics/btad233.

ABSTRACT

MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

RESULTS: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets.
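The idea of a colored k-mer index and of pseudoalignment against it can be sketched with a plain dictionary. This is a toy stand-in under stated assumptions: the abstract does not describe Themisto's internal data structures or its exact pseudoalignment rules, so the dict-of-sets index, the rule of skipping k-mers absent from the index, the omission of reverse complements, and the example sequences are all illustrative choices.

```python
def build_index(genomes, k):
    """Map every k-mer to the set of genome identifiers (colors) containing it."""
    index = {}
    for color, sequence in genomes.items():
        for i in range(len(sequence) - k + 1):
            index.setdefault(sequence[i:i + k], set()).add(color)
    return index


def pseudoalign(index, read, k):
    """Return the colors shared by all indexed k-mers of the read.

    Assumption: k-mers that do not occur in the index are skipped rather
    than forcing an empty result.
    """
    result = None
    for i in range(len(read) - k + 1):
        colors = index.get(read[i:i + k])
        if colors is None:
            continue
        result = set(colors) if result is None else result & colors
    return result or set()


# Made-up reference sequences standing in for genomes.
GENOMES = {
    "g1": "ACGTACGTGA",
    "g2": "ACGTTTGACC",
    "g3": "GGGGCCCCAA",
}
INDEX = build_index(GENOMES, k=4)

print(pseudoalign(INDEX, "ACGTAC", k=4))
```

A dictionary of Python sets would not scale to hundreds of thousands of genomes, which is the gap that purpose-built indexes such as Themisto are designed to close.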

AVAILABILITY AND IMPLEMENTATION: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.

PMID:37387143 | DOI:10.1093/bioinformatics/btad233


SpatialSort: a Bayesian model for clustering and cell population annotation of spatial proteomics data

Bioinformatics. 2023 Jun 30;39(Supplement_1):i131-i139. doi: 10.1093/bioinformatics/btad242.

ABSTRACT

MOTIVATION: Recent advances in spatial proteomics technologies have enabled the profiling of dozens of proteins in thousands of single cells in situ. This has created the opportunity to move beyond quantifying the composition of cell types in tissue, and instead probe the spatial relationships between cells. However, most current methods for clustering data from these assays only consider the expression values of cells and ignore the spatial context. Furthermore, existing approaches do not account for prior information about the expected cell populations in a sample.

RESULTS: To address these shortcomings, we developed SpatialSort, a spatially aware Bayesian clustering approach that allows for the incorporation of prior biological knowledge. Our method is able to account for the affinities of cells of different types to neighbour in space, and by incorporating prior information about expected cell populations, it is able to simultaneously improve clustering accuracy and perform automated annotation of clusters. Using synthetic and real data, we show that by using spatial and prior information SpatialSort improves clustering accuracy. We also demonstrate how SpatialSort can perform label transfer between spatial and nonspatial modalities through the analysis of a real-world diffuse large B-cell lymphoma dataset.

AVAILABILITY AND IMPLEMENTATION: Source code is available on GitHub at: https://github.com/Roth-Lab/SpatialSort.

PMID:37387130 | DOI:10.1093/bioinformatics/btad242