Categories
Nevin Manimala Statistics

Phylogenetic diversity statistics for all clades in a phylogeny

Bioinformatics. 2023 Jun 30;39(Supplement_1):i177-i184. doi: 10.1093/bioinformatics/btad263.

ABSTRACT

The classic quantitative measure of phylogenetic diversity (PD) has been used to address problems in conservation biology, microbial ecology, and evolutionary biology. PD is the minimum total length of the branches in a phylogeny required to cover a specified set of taxa on the phylogeny. A general goal in the application of PD has been identifying a set of taxa of size k that maximizes PD on a given phylogeny; this has been mirrored in active research to develop efficient algorithms for the problem. Other descriptive statistics, such as the minimum PD, average PD, and standard deviation of PD, can provide invaluable insight into the distribution of PD across a phylogeny (relative to a fixed value of k). However, there has been limited or no research on computing these statistics, especially when required for each clade in a phylogeny, enabling direct comparisons of PD between clades. We introduce efficient algorithms for computing PD and the associated descriptive statistics for a given phylogeny and each of its clades. In simulation studies, we demonstrate the ability of our algorithms to analyze large-scale phylogenies with applications in ecology and evolutionary biology. The software is available at https://github.com/flu-crew/PD_stats.
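
To make the PD definition in the abstract concrete, here is a minimal brute-force sketch. It is not the algorithm from the paper or from PD_stats: it enumerates every k-subset, which is only feasible for tiny trees. The toy tree, the function names, and the choice of the unrooted definition (the smallest subtree connecting the chosen taxa, without the path to the root) are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pstdev

# Hypothetical toy tree: child -> (parent, branch length). "root" has no entry.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 1.5),
    "C": ("root", 3.0),
    "D": ("root", 0.5),
}

def leaves(tree):
    """Return the taxa, i.e. nodes that are never a parent."""
    parents = {parent for parent, _ in tree.values()}
    return sorted(node for node in tree if node not in parents)

def phylo_diversity(tree, taxa):
    """Total branch length of the smallest subtree connecting `taxa`."""
    taxa = set(taxa)
    below = {}  # node -> number of selected taxa under the edge above node
    for taxon in taxa:
        node = taxon
        while node in tree:
            below[node] = below.get(node, 0) + 1
            node = tree[node][0]
    # An edge is needed only if it separates some selected taxa from others.
    return sum(tree[n][1] for n, count in below.items() if 0 < count < len(taxa))

def pd_stats(tree, k):
    """Min, max, mean and standard deviation of PD over all taxon sets of size k."""
    values = [phylo_diversity(tree, s) for s in combinations(leaves(tree), k)]
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "sd": pstdev(values)}

print(pd_stats(TREE, 2))
```

Running the same function on the leaves of each clade would give per-clade statistics, although the number of subsets grows combinatorially, which is the cost the paper's algorithms are designed to avoid.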

PMID:37387175 | DOI:10.1093/bioinformatics/btad263

Higher-order genetic interaction discovery with network-based biological priors

Bioinformatics. 2023 Jun 30;39(Supplement_1):i523-i533. doi: 10.1093/bioinformatics/btad273.

ABSTRACT

MOTIVATION: Complex phenotypes, such as many common diseases and morphological traits, are controlled by multiple genetic factors, namely genetic mutations and genes, and are influenced by environmental conditions. Deciphering the genetics underlying such traits requires a systemic approach, where many different genetic factors and their interactions are considered simultaneously. Many association mapping techniques available nowadays follow this reasoning, but have some severe limitations. In particular, they require binary encodings for the genetic markers, forcing the user to decide beforehand whether to use, e.g. a recessive or a dominant encoding. Moreover, most methods cannot include any biological prior or are limited to testing only lower-order interactions among genes for association with the phenotype, potentially missing a large number of marker combinations.

RESULTS: We propose HOGImine, a novel algorithm that expands the class of discoverable genetic meta-markers by considering higher-order interactions of genes and by allowing multiple encodings for the genetic variants. Our experimental evaluation shows that the algorithm has a substantially higher statistical power compared to previous methods, allowing it to discover genetic mutations statistically associated with the phenotype at hand that could not be found before. Our method can exploit prior biological knowledge on gene interactions, such as protein-protein interaction networks, genetic pathways, and protein complexes, to restrict its search space. Since computing higher-order gene interactions poses a high computational burden, we also develop a more efficient search strategy and support computation to make our approach applicable in practice, leading to substantial runtime improvements compared to state-of-the-art methods.

AVAILABILITY AND IMPLEMENTATION: Code and data are available at https://github.com/BorgwardtLab/HOGImine.
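
The binary encodings mentioned in the motivation can be illustrated with a short sketch. It assumes genotypes are stored as minor-allele counts (0, 1, or 2) per sample, which is a common convention but is not stated in the abstract. The function names and data are hypothetical and this is not HOGImine code.

```python
def dominant(genotypes):
    """Dominant encoding: 1 if at least one copy of the minor allele is present."""
    return [int(g >= 1) for g in genotypes]

def recessive(genotypes):
    """Recessive encoding: 1 only if both copies are the minor allele."""
    return [int(g == 2) for g in genotypes]

# Hypothetical minor-allele counts for five samples at one marker.
counts = [0, 1, 2, 1, 0]
print(dominant(counts))   # heterozygous samples count as carriers
print(recessive(counts))  # only homozygous minor samples count
```

The two encodings disagree exactly on heterozygous samples, which is why committing to one of them before testing can hide an association that the other would reveal.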

PMID:37387173 | DOI:10.1093/bioinformatics/btad273

The impossible challenge of estimating non-existent moments of the Chemical Master Equation

Bioinformatics. 2023 Jun 30;39(Supplement_1):i440-i447. doi: 10.1093/bioinformatics/btad205.

ABSTRACT

MOTIVATION: The Chemical Master Equation (CME) is a set of linear differential equations that describes the evolution of the probability distribution on all possible configurations of a (bio-)chemical reaction system. Since the number of configurations and therefore the dimension of the CME rapidly increases with the number of molecules, its applicability is restricted to small systems. A widely applied remedy for this challenge is moment-based approaches which consider the evolution of the first few moments of the distribution as summary statistics for the complete distribution. Here, we investigate the performance of two moment-estimation methods for reaction systems whose equilibrium distributions encounter fat-tailedness and do not possess statistical moments.

RESULTS: We show that estimation via stochastic simulation algorithm (SSA) trajectories loses consistency over time and that estimated moment values span a wide range even for large sample sizes. In comparison, the method of moments returns smooth moment estimates but is not able to indicate the non-existence of the allegedly predicted moments. We furthermore analyze the negative effect of a CME solution’s fat-tailedness on SSA run times and explain inherent difficulties. While moment-estimation techniques are a commonly applied tool in the simulation of (bio-)chemical reaction networks, we conclude that they should be used with care, as neither the system definition nor the moment-estimation techniques themselves reliably indicate the potential fat-tailedness of the CME’s solution.
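
As an illustration of moment estimation from SSA trajectories, here is a minimal sketch of the Gillespie direct method. The birth-death system, rate names, and parameter values are assumptions chosen for simplicity. This system has a Poisson stationary distribution, so all of its moments exist. It is not one of the fat-tailed systems studied in the paper and only shows the mechanics of the estimator.

```python
import random

def ssa_birth_death(k_birth, k_death, x0, t_end, rng):
    """Simulate 0 -> X (rate k_birth) and X -> 0 (rate k_death * x) up to t_end."""
    t, x = 0.0, x0
    while True:
        a_birth = k_birth
        a_death = k_death * x
        a_total = a_birth + a_death
        if a_total == 0:
            return x  # no reaction can fire any more
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            return x
        # Pick the reaction with probability proportional to its propensity.
        if rng.random() * a_total < a_birth:
            x += 1
        else:
            x -= 1

def estimate_moments(n_runs, seed=0, **system):
    """Sample-based estimates of the first two raw moments at t_end."""
    rng = random.Random(seed)
    samples = [ssa_birth_death(rng=rng, **system) for _ in range(n_runs)]
    m1 = sum(samples) / n_runs
    m2 = sum(s * s for s in samples) / n_runs
    return m1, m2

m1, m2 = estimate_moments(500, seed=1, k_birth=10.0, k_death=1.0, x0=0, t_end=10.0)
print(m1, m2)
```

For this well-behaved system the sample mean settles near k_birth / k_death. The point made in the abstract is that the same estimator returns a finite number for every sample even when the underlying moment does not exist, so the output alone does not reveal the problem.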

PMID:37387158 | DOI:10.1093/bioinformatics/btad205

Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function

Bioinformatics. 2023 Jun 30;39(Supplement_1):i318-i325. doi: 10.1093/bioinformatics/btad208.

ABSTRACT

MOTIVATION: Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, far fewer methods leverage protein structures in protein function prediction because there was a lack of accurate protein structures for most proteins until recently.

RESULTS: We developed TransFun, a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.

AVAILABILITY AND IMPLEMENTATION: The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.

PMID:37387145 | DOI:10.1093/bioinformatics/btad208

Deep statistical modelling of nanopore sequencing translocation times reveals latent non-B DNA structures

Bioinformatics. 2023 Jun 30;39(Supplement_1):i242-i251. doi: 10.1093/bioinformatics/btad220.

ABSTRACT

MOTIVATION: Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures.

RESULTS: We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed and optimizing Gaussian GoF tests allows for the computation of P-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared with B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable.

AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.

PMID:37387144 | DOI:10.1093/bioinformatics/btad220

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes

Bioinformatics. 2023 Jun 30;39(Supplement_1):i260-i269. doi: 10.1093/bioinformatics/btad233.

ABSTRACT

MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

RESULTS: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets.

AVAILABILITY AND IMPLEMENTATION: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
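
The idea of a colored k-mer index and pseudoalignment can be sketched in a few lines. This toy version uses a plain Python dictionary and bears no relation to Themisto's actual data structures or command-line interface. The function names, the rule of skipping k-mers absent from the index, and the lack of reverse-complement handling are simplifying assumptions.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield all overlapping substrings of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(references, k):
    """Map each k-mer to the set of reference names ("colors") containing it."""
    index = defaultdict(set)
    for color, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(color)
    return index

def pseudoalign(read, index, k):
    """Intersect the color sets of the read's k-mers that occur in the index."""
    result = None
    for kmer in kmers(read, k):
        colors = index.get(kmer)
        if colors is None:
            continue  # assumption: k-mers missing from the index are ignored
        result = set(colors) if result is None else result & colors
    return result or set()

# Hypothetical reference sequences.
refs = {"g1": "ACGTACGT", "g2": "ACGTTTTT"}
index = build_index(refs, k=4)
print(pseudoalign("CGTAC", index, k=4))
print(pseudoalign("ACGT", index, k=4))
```

A read is reported as compatible with every reference that contains all of its indexed k-mers, without computing base-level alignments. Scaling this idea to hundreds of thousands of genomes is what requires the compressed representations the paper describes.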

PMID:37387143 | DOI:10.1093/bioinformatics/btad233

SpatialSort: a Bayesian model for clustering and cell population annotation of spatial proteomics data

Bioinformatics. 2023 Jun 30;39(Supplement_1):i131-i139. doi: 10.1093/bioinformatics/btad242.

ABSTRACT

MOTIVATION: Recent advances in spatial proteomics technologies have enabled the profiling of dozens of proteins in thousands of single cells in situ. This has created the opportunity to move beyond quantifying the composition of cell types in tissue, and instead probe the spatial relationships between cells. However, most current methods for clustering data from these assays only consider the expression values of cells and ignore the spatial context. Furthermore, existing approaches do not account for prior information about the expected cell populations in a sample.

RESULTS: To address these shortcomings, we developed SpatialSort, a spatially aware Bayesian clustering approach that allows for the incorporation of prior biological knowledge. Our method is able to account for the affinities of cells of different types to neighbour in space, and by incorporating prior information about expected cell populations, it is able to simultaneously improve clustering accuracy and perform automated annotation of clusters. Using synthetic and real data, we show that by using spatial and prior information SpatialSort improves clustering accuracy. We also demonstrate how SpatialSort can perform label transfer between spatial and nonspatial modalities through the analysis of a real world diffuse large B-cell lymphoma dataset.

AVAILABILITY AND IMPLEMENTATION: Source code is available on GitHub at https://github.com/Roth-Lab/SpatialSort.

PMID:37387130 | DOI:10.1093/bioinformatics/btad242

Genome-wide scans for selective sweeps using convolutional neural networks

Bioinformatics. 2023 Jun 30;39(Supplement_1):i194-i203. doi: 10.1093/bioinformatics/btad265.

ABSTRACT

MOTIVATION: Recent methods for selective sweep detection cast the problem as a classification task and use summary statistics as features to capture region characteristics that are indicative of a selective sweep, thereby being sensitive to confounding factors. Furthermore, they are not designed to perform whole-genome scans or to estimate the extent of the genomic region that was affected by positive selection; both are required for identifying candidate genes and the time and strength of selection.

RESULTS: We present ASDEC (https://github.com/pephco/ASDEC), a neural-network-based framework that can scan whole genomes for selective sweeps. ASDEC achieves similar classification performance to other convolutional neural network-based classifiers that rely on summary statistics, but it is trained 10× faster and classifies genomic regions 5× faster by inferring region characteristics from the raw sequence data directly. Deploying ASDEC for genomic scans achieved up to 15.2× higher sensitivity, 19.4× higher success rates, and 4× higher detection accuracy than state-of-the-art methods. We used ASDEC to scan human chromosome 1 of the Yoruba population (1000Genomes project), identifying nine known candidate genes.

PMID:37387128 | DOI:10.1093/bioinformatics/btad265

Radiogenomics in NF2-Associated Schwannomatosis (Neurofibromatosis Type II): Exploratory Data Analysis

Stud Health Technol Inform. 2023 Jun 29;305:588-591. doi: 10.3233/SHTI230565.

ABSTRACT

Our pilot study aimed at exploratory radiogenomic data analysis in patients with NF2-associated schwannomatosis (formerly neurofibromatosis type II) to assess the potential of image biomarkers in this pathology. Fifty-three unrelated patients (37 (69.8%) women, mean age 30.2 ± 11.2 years) were enrolled in the study. First-order, gray-level co-occurrence matrix (GLCM), gray-level run length matrix (GLRLM), and geometry-based statistics were calculated (3718 features per region of interest). We demonstrated imaging patterns and statistically significant differences in radiomic features potentially related to the genotype and clinical phenotype of the disease. However, the clinical utility of these patterns should be further evaluated. The study was supported by the Russian Science Foundation grant 21-15-00262.
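
For readers unfamiliar with GLCM features, the following minimal sketch computes a gray-level co-occurrence matrix and one texture feature (contrast) in pure Python. The image, the single horizontal offset, and the function names are assumptions for illustration. The study's actual feature-extraction pipeline is not described in the abstract and may differ, for example by using symmetric matrices or several offsets.

```python
def glcm(image, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for pixel pairs at offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def contrast(p):
    """GLCM contrast: co-occurrence probability weighted by squared level difference."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

# Hypothetical 2 x 3 region of interest quantized to three gray levels.
roi = [[0, 0, 1],
       [1, 2, 2]]
p = glcm(roi, levels=3)
print(contrast(p))
```

Contrast is low when neighbouring pixels tend to share a gray level and grows as adjacent intensities diverge, which is how such features summarize texture within a region of interest.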

PMID:37387099 | DOI:10.3233/SHTI230565

Predicting In-Hospital Mortality During the COVID-19 Pandemic in Patients with Heart Failure: A Single-Center Exploratory Study

Stud Health Technol Inform. 2023 Jun 29;305:487-490. doi: 10.3233/SHTI230539.

ABSTRACT

The aim of this study was to investigate whether exposure to the pandemic was associated with increased in-hospital mortality for heart failure. We collected data from patients hospitalized between 2019 and 2020 and assessed the likelihood of in-hospital death. Although the positive association of exposure to the COVID period with an increased in-hospital mortality is not statistically significant, this may underscore other factors that may influence mortality. Our study was designed to contribute to a better understanding of the impact of the pandemic on in-hospital mortality and to identify potential areas for intervention in patient care.

PMID:37387073 | DOI:10.3233/SHTI230539