Categories
Nevin Manimala Statistics

Statistical approaches for differential expression analysis in metatranscriptomics

Bioinformatics. 2021 Jul 12;37(Supplement_1):i34-i41. doi: 10.1093/bioinformatics/btab327.

ABSTRACT

MOTIVATION: Metatranscriptomics (MTX) has become an increasingly practical way to profile the functional activity of microbial communities in situ. However, MTX remains underutilized due to experimental and computational limitations. The latter are complicated by non-independent changes in both RNA transcript levels and their underlying genomic DNA copies (as microbes simultaneously change their overall abundance in the population and regulate individual transcripts), genetic plasticity (as whole loci are frequently gained and lost in microbial lineages) and measurement compositionality and zero-inflation. Here, we present a systematic evaluation of and recommendations for differential expression (DE) analysis in MTX.

RESULTS: We designed and assessed six statistical models for DE discovery in MTX that incorporate different combinations of DNA and RNA normalization and assumptions about the underlying changes of gene copies or species abundance within communities. We evaluated these models on multiple simulated and real multi-omic datasets. Models adjusting transcripts relative to their encoding gene copies as a covariate were significantly more accurate in identifying DE from MTX in both simulated and real datasets. Moreover, we show that when paired DNA measurements (metagenomic data) are not available, models normalizing MTX measurements within-species while also adjusting for total-species RNA balance sensitivity, specificity and interpretability of DE detection, as does filtering likely technical zeros. The efficiency and accuracy of these models pave the way for more effective MTX-based DE discovery in microbial communities.

AVAILABILITY AND IMPLEMENTATION: The analysis code and synthetic datasets used in this evaluation are available online at http://huttenhower.sph.harvard.edu/mtx2021.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252963 | DOI:10.1093/bioinformatics/btab327

Categories
Nevin Manimala Statistics

DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays

Bioinformatics. 2021 Jul 12;37(Supplement_1):i280-i288. doi: 10.1093/bioinformatics/btab283.

ABSTRACT

MOTIVATION: Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variations may influence diseases. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have implemented enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations with superfluous regions and reducing the statistical power for downstream analyses (e.g. causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we employed direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (to a 10 bp resolution) using Gradient-weighted Class Activation Mapping.

RESULTS: Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization.

AVAILABILITY AND IMPLEMENTATION: DECODE source code and pre-processing scripts are available at decode.gersteinlab.org.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252960 | DOI:10.1093/bioinformatics/btab283

Categories
Nevin Manimala Statistics

Disease gene prediction with privileged information and heteroscedastic dropout

Bioinformatics. 2021 Jul 12;37(Supplement_1):i410-i417. doi: 10.1093/bioinformatics/btab310.

ABSTRACT

MOTIVATION: Recently, machine learning models have achieved tremendous success in prioritizing candidate genes for genetic diseases. These models are able to accurately quantify the similarity among disease and genes based on the intuition that similar genes are more likely to be associated with similar diseases. However, the genetic features these methods rely on are often hard to collect due to high experimental cost and various other technical limitations. Existing solutions of this problem significantly increase the risk of overfitting and decrease the generalizability of the models.

RESULTS: In this work, we propose a graph neural network (GNN) version of the Learning under Privileged Information paradigm to predict new disease gene associations. Unlike previous gene prioritization approaches, our model does not require the genetic features to be the same at training and test stages. If a genetic feature is hard to measure and therefore missing at the test stage, our model could still efficiently incorporate its information during the training process. To implement this, we develop a Heteroscedastic Gaussian Dropout algorithm, where the dropout probability of the GNN model is determined by another GNN model with a mirrored GNN architecture. To evaluate our method, we compared our method with four state-of-the-art methods on the Online Mendelian Inheritance in Man dataset to prioritize candidate disease genes. Extensive evaluations show that our model could improve the prediction accuracy when all the features are available compared to other methods. More importantly, our model could make very accurate predictions when >90% of the features are missing at the test stage.

AVAILABILITY AND IMPLEMENTATION: Our method is realized with Python 3.7 and Pytorch 1.5.0 and method and data are freely available at: https://github.com/juanshu30/Disease-Gene-Prioritization-with-Privileged-Information-and-Heteroscedastic-Dropout.

PMID:34252957 | DOI:10.1093/bioinformatics/btab310

Categories
Nevin Manimala Statistics

Modeling drug combination effects via latent tensor reconstruction

Bioinformatics. 2021 Jul 12;37(Supplement_1):i93-i101. doi: 10.1093/bioinformatics/btab308.

ABSTRACT

MOTIVATION: Combination therapies have emerged as a powerful treatment modality to overcome drug resistance and improve treatment efficacy. However, the number of possible drug combinations increases very rapidly with the number of individual drugs in consideration, which makes the comprehensive experimental screening infeasible in practice. Machine-learning models offer time- and cost-efficient means to aid this process by prioritizing the most effective drug combinations for further pre-clinical and clinical validation. However, the complexity of the underlying interaction patterns across multiple drug doses and in different cellular contexts poses challenges to the predictive modeling of drug combination effects.

RESULTS: We introduce comboLTR, highly time-efficient method for learning complex, non-linear target functions for describing the responses of therapeutic agent combinations in various doses and cancer cell-contexts. The method is based on a polynomial regression via powerful latent tensor reconstruction. It uses a combination of recommender system-style features indexing the data tensor of response values in different contexts, and chemical and multi-omics features as inputs. We demonstrate that comboLTR outperforms state-of-the-art methods in terms of predictive performance and running time, and produces highly accurate results even in the challenging and practical inference scenario where full dose-response matrices are predicted for completely new drug combinations with no available combination and monotherapy response measurements in any training cell line.

AVAILABILITY AND IMPLEMENTATION: comboLTR code is available at https://github.com/aalto-ics-kepaco/ComboLTR.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252952 | DOI:10.1093/bioinformatics/btab308

Categories
Nevin Manimala Statistics

Metaball skinning of synthetic astroglial morphologies into realistic mesh models for visual analytics and in silico simulations

Bioinformatics. 2021 Jul 12;37(Supplement_1):i426-i433. doi: 10.1093/bioinformatics/btab280.

ABSTRACT

MOTIVATION: Astrocytes, the most abundant glial cells in the mammalian brain, have an instrumental role in developing neuronal circuits. They contribute to the physical structuring of the brain, modulating synaptic activity and maintaining the blood-brain barrier in addition to other significant aspects that impact brain function. Biophysically, detailed astrocytic models are key to unraveling their functional mechanisms via molecular simulations at microscopic scales. Detailed, and complete, biological reconstructions of astrocytic cells are sparse. Nonetheless, data-driven digital reconstruction of astroglial morphologies that are statistically identical to biological counterparts are becoming available. We use those synthetic morphologies to generate astrocytic meshes with realistic geometries, making it possible to perform these simulations.

RESULTS: We present an unconditionally robust method capable of reconstructing high fidelity polygonal meshes of astroglial cells from algorithmically-synthesized morphologies. Our method uses implicit surfaces, or metaballs, to skin the different structural components of astrocytes and then blend them in a seamless fashion. We also provide an end-to-end pipeline to produce optimized two- and three-dimensional meshes for visual analytics and simulations, respectively. The performance of our pipeline has been assessed with a group of 5000 astroglial morphologies and the geometric metrics of the resulting meshes are evaluated. The usability of the meshes is then demonstrated with different use cases.

AVAILABILITY AND IMPLEMENTATION: Our metaball skinning algorithm is implemented in Blender 2.82 relying on its Python API (Application Programming Interface). To make it accessible to computational biologists and neuroscientists, the implementation has been integrated into NeuroMorphoVis, an open source and domain specific package that is primarily designed for neuronal morphology visualization and meshing.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252950 | DOI:10.1093/bioinformatics/btab280

Categories
Nevin Manimala Statistics

Build a better bootstrap and the RAWR shall beat a random path to your door: phylogenetic support estimation revisited

Bioinformatics. 2021 Jul 12;37(Supplement_1):i111-i119. doi: 10.1093/bioinformatics/btab263.

ABSTRACT

MOTIVATION: The standard bootstrap method is used throughout science and engineering to perform general-purpose non-parametric resampling and re-estimation. Among the most widely cited and widely used such applications is the phylogenetic bootstrap method, which Felsenstein proposed in 1985 as a means to place statistical confidence intervals on an estimated phylogeny (or estimate ‘phylogenetic support’). A key simplifying assumption of the bootstrap method is that input data are independent and identically distributed (i.i.d.). However, the i.i.d. assumption is an over-simplification for biomolecular sequence analysis, as Felsenstein noted.

RESULTS: In this study, we introduce a new sequence-aware non-parametric resampling technique, which we refer to as RAWR (‘RAndom Walk Resampling’). RAWR consists of random walks that synthesize and extend the standard bootstrap method and the ‘mirrored inputs’ idea of Landan and Graur. We apply RAWR to the task of phylogenetic support estimation. RAWR’s performance is compared to the state-of-the-art using synthetic and empirical data that span a range of dataset sizes and evolutionary divergence. We show that RAWR support estimates offer comparable or typically superior type I and type II error compared to phylogenetic bootstrap support. We also conduct a re-analysis of large-scale genomic sequence data from a recent study of Darwin’s finches. Our findings clarify phylogenetic uncertainty in a charismatic clade that serves as an important model for complex adaptive evolution.

AVAILABILITY AND IMPLEMENTATION: Data and software are publicly available under open-source software and open data licenses at: https://gitlab.msu.edu/liulab/RAWR-study-datasets-and-scripts.

PMID:34252944 | DOI:10.1093/bioinformatics/btab263

Categories
Nevin Manimala Statistics

CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data

Bioinformatics. 2021 Jul 12;37(Supplement_1):i51-i58. doi: 10.1093/bioinformatics/btab286.

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) technology has been widely applied to capture the heterogeneity of different cell types within complex tissues. An essential step in scRNA-seq data analysis is the annotation of cell types. Traditional cell-type annotation is mainly clustering the cells first, and then using the aggregated cluster-level expression profiles and the marker genes to label each cluster. Such methods are greatly dependent on the clustering results, which are insufficient for accurate annotation.

RESULTS: In this article, we propose a semi-supervised learning method for cell-type annotation called CALLR. It combines unsupervised learning represented by the graph Laplacian matrix constructed from all the cells and supervised learning using sparse logistic regression. By alternately updating the cell clusters and annotation labels, high annotation accuracy can be achieved. The model is formulated as an optimization problem, and a computationally efficient algorithm is developed to solve it. Experiments on 10 real datasets show that CALLR outperforms the compared (semi-)supervised learning methods, and the popular clustering methods.

AVAILABILITY AND IMPLEMENTATION: The implementation of CALLR is available at https://github.com/MathSZhang/CALLR.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252936 | DOI:10.1093/bioinformatics/btab286

Categories
Nevin Manimala Statistics

‘Single-subject studies’-derived analyses unveil altered biomechanisms between very small cohorts: implications for rare diseases

Bioinformatics. 2021 Jul 12;37(Supplement_1):i67-i75. doi: 10.1093/bioinformatics/btab290.

ABSTRACT

MOTIVATION: Identifying altered transcripts between very small human cohorts is particularly challenging and is compounded by the low accrual rate of human subjects in rare diseases or sub-stratified common disorders. Yet, single-subject studies (S3) can compare paired transcriptome samples drawn from the same patient under two conditions (e.g. treated versus pre-treatment) and suggest patient-specific responsive biomechanisms based on the overrepresentation of functionally defined gene sets. These improve statistical power by: (i) reducing the total features tested and (ii) relaxing the requirement of within-cohort uniformity at the transcript level. We propose Inter-N-of-1, a novel method, to identify meaningful differences between very small cohorts by using the effect size of ‘single-subject-study’-derived responsive biological mechanisms.

RESULTS: In each subject, Inter-N-of-1 requires applying previously published S3-type N-of-1-pathways MixEnrich to two paired samples (e.g. diseased versus unaffected tissues) for determining patient-specific enriched genes sets: Odds Ratios (S3-OR) and S3-variance using Gene Ontology Biological Processes. To evaluate small cohorts, we calculated the precision and recall of Inter-N-of-1 and that of a control method (GLM+EGS) when comparing two cohorts of decreasing sizes (from 20 versus 20 to 2 versus 2) in a comprehensive six-parameter simulation and in a proof-of-concept clinical dataset. In simulations, the Inter-N-of-1 median precision and recall are > 90% and >75% in cohorts of 3 versus 3 distinct subjects (regardless of the parameter values), whereas conventional methods outperform Inter-N-of-1 at sample sizes 9 versus 9 and larger. Similar results were obtained in the clinical proof-of-concept dataset.

AVAILABILITY AND IMPLEMENTATION: R software is available at Lussierlab.net/BSSD.

PMID:34252934 | DOI:10.1093/bioinformatics/btab290

Categories
Nevin Manimala Statistics

Predicting MHC-peptide binding affinity by differential boundary tree

Bioinformatics. 2021 Jul 12;37(Supplement_1):i254-i261. doi: 10.1093/bioinformatics/btab312.

ABSTRACT

MOTIVATION: The prediction of the binding between peptides and major histocompatibility complex (MHC) molecules plays an important role in neoantigen identification. Although a large number of computational methods have been developed to address this problem, they produce high false-positive rates in practical applications, since in most cases, a single residue mutation may largely alter the binding affinity of a peptide binding to MHC which cannot be identified by conventional deep learning methods.

RESULTS: We developed a differential boundary tree-based model, named DBTpred, to address this problem. We demonstrated that DBTpred can accurately predict MHC class I binding affinity compared to the state-of-art deep learning methods. We also presented a parallel training algorithm to accelerate the training and inference process which enables DBTpred to be applied to large datasets. By investigating the statistical properties of differential boundary trees and the prediction paths to test samples, we revealed that DBTpred can provide an intuitive interpretation and possible hints in detecting important residue mutations that can largely influence binding affinity.

AVAILABILITY AND IMPLEMENTATION: The DBTpred package is implemented in Python and freely available at: https://github.com/fpy94/DBT.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252932 | DOI:10.1093/bioinformatics/btab312

Categories
Nevin Manimala Statistics

scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

Bioinformatics. 2021 Jul 12;37(Supplement_1):i358-i366. doi: 10.1093/bioinformatics/btab273.

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.

RESULTS: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.

AVAILABILITY AND IMPLEMENTATION: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252925 | DOI:10.1093/bioinformatics/btab273