Categories
Nevin Manimala Statistics

Investigation of REFINED CNN ensemble learning for anti-cancer drug sensitivity prediction

Bioinformatics. 2021 Jul 12;37(Supplement_1):i42-i50. doi: 10.1093/bioinformatics/btab336.

ABSTRACT

MOTIVATION: Anti-cancer drug sensitivity prediction using deep learning models for individual cell lines is a significant challenge in personalized medicine. Recently developed REFINED (REpresentation of Features as Images with NEighborhood Dependencies) CNN (Convolutional Neural Network)-based models have shown promising results in improving drug sensitivity prediction. The primary idea behind REFINED-CNN is representing high-dimensional vectors as compact images with spatial correlations that can benefit from CNN architectures. However, the mapping from a high-dimensional vector to a compact 2D image depends on the a priori choice of the distance metric and projection scheme, with limited empirical procedures guiding these choices.

RESULTS: In this article, we consider an ensemble of REFINED-CNN models built under different choices of distance metrics and/or projection schemes that can improve upon a single-projection-based REFINED-CNN model. Results, illustrated using NCI60 and NCI-ALMANAC databases, demonstrate that the ensemble approaches can provide significant improvement in prediction performance as compared to individual models. We also develop the theoretical framework for combining different distance metrics to arrive at a single 2D mapping. Results demonstrated that distance-averaged REFINED-CNN produced performance comparable to that obtained from the stacking REFINED-CNN ensemble but with significantly lower computational cost.
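The idea of combining several distance metrics before computing a single 2D mapping can be illustrated with a minimal sketch. The choices below (Euclidean and Manhattan metrics, scaling each matrix by its maximum, equal weights) are assumptions made for illustration only; the abstract does not specify how the paper combines metrics, and the subsequent projection to pixel coordinates is omitted.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def pairwise(features, dist):
    """Distance matrix between feature profiles (one row per feature)."""
    n = len(features)
    return [[dist(features[i], features[j]) for j in range(n)] for i in range(n)]

def scale_to_unit(matrix):
    """Divide by the largest entry so different metrics share a common scale."""
    peak = max(max(row) for row in matrix)
    if peak == 0:
        return [row[:] for row in matrix]
    return [[value / peak for value in row] for row in matrix]

def average_distance(matrices):
    """Equal-weight average of the scaled distance matrices (illustrative choice)."""
    scaled = [scale_to_unit(m) for m in matrices]
    n = len(scaled[0])
    return [[sum(m[i][j] for m in scaled) / len(scaled) for j in range(n)]
            for i in range(n)]

# Three hypothetical features, each observed across three samples.
features = [[0.0, 1.0, 2.0], [0.0, 1.0, 4.0], [3.0, 1.0, 0.0]]
averaged = average_distance([pairwise(features, euclidean),
                             pairwise(features, manhattan)])
```

The averaged matrix could then be passed to any projection scheme once, instead of training one CNN per metric, which is consistent with the lower computational cost reported for the distance-averaged variant.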

AVAILABILITY AND IMPLEMENTATION: The source code, scripts, and data used in the paper have been deposited in GitHub (https://github.com/omidbazgirTTU/IntegratedREFINED).

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252971 | DOI:10.1093/bioinformatics/btab336

Haplotype-based membership inference from summary genomic data

Bioinformatics. 2021 Jul 12;37(Supplement_1):i161-i168. doi: 10.1093/bioinformatics/btab305.

ABSTRACT

MOTIVATION: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g. the allele frequencies, or even the presence/absence of genetic variants (e.g. shared by the Beacon project) in the group. These methods rely on the availability of the target’s genome, i.e. the DNA profile of a target human subject, and thus are often referred to as membership inference methods.
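To make the SNV-based setting concrete, here is a minimal sketch of a simple allele-frequency distance statistic for membership inference. It is an illustration of the general idea only, not the haplotype-based algorithm of this paper; the function name, the genotype encoding (0, 0.5 or 1 per SNV) and the example frequencies are all hypothetical.

```python
def membership_score(genotype, pool_freq, reference_freq):
    """Sum over SNVs of |g - reference| - |g - pool|.

    A positive score means the target's genotype lies closer to the case
    group's published allele frequencies than to a reference population,
    which hints that the target may be part of the case group.
    """
    return sum(abs(g - r) - abs(g - p)
               for g, p, r in zip(genotype, pool_freq, reference_freq))

# Hypothetical allele frequencies at four SNVs.
pool_freq = [0.8, 0.1, 0.5, 0.9]        # summary statistics shared by the case group
reference_freq = [0.5, 0.4, 0.5, 0.6]   # background population

likely_member = [1.0, 0.0, 0.5, 1.0]    # genotype resembling the pool
unlikely_member = [0.0, 1.0, 0.5, 0.0]  # genotype resembling neither

score_member = membership_score(likely_member, pool_freq, reference_freq)
score_other = membership_score(unlikely_member, pool_freq, reference_freq)
```

In practice such a score would be calibrated against a null distribution over many SNVs before any decision is made; the sketch only shows the direction of the signal.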

RESULTS: In this article, we demonstrate that haplotypes, i.e. the sequences of single nucleotide variations (SNVs) showing strong genetic linkages in human genome databases, may be inferred from the summary of genomic data without using a target’s genome. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies from genomic datasets. These reconstructed haplotypes can be used for a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs.

AVAILABILITY AND IMPLEMENTATION: The implementation of the membership inference algorithms is available at https://github.com/diybu/Haplotype-based-membership-inferences.

PMID:34252973 | DOI:10.1093/bioinformatics/btab305

PathCNN: interpretable convolutional neural networks for survival prediction and pathway analysis applied to glioblastoma

Bioinformatics. 2021 Jul 12;37(Supplement_1):i443-i450. doi: 10.1093/bioinformatics/btab285.

ABSTRACT

MOTIVATION: Convolutional neural networks (CNNs) have achieved great success in the areas of image processing and computer vision, handling grid-structured inputs and efficiently capturing local dependencies through multiple levels of abstraction. However, a lack of interpretability remains a key barrier to the adoption of deep neural networks, particularly in predictive modeling of disease outcomes. Moreover, because biological array data are generally represented in a non-grid structured format, CNNs cannot be applied directly.

RESULTS: To address these issues, we propose a novel method, called PathCNN, that constructs an interpretable CNN model on integrated multi-omics data using a newly defined pathway image. PathCNN showed promising predictive performance in differentiating between long-term survival (LTS) and non-LTS when applied to glioblastoma multiforme (GBM). The adoption of a visualization tool coupled with statistical analysis enabled the identification of plausible pathways associated with survival in GBM. In summary, PathCNN demonstrates that CNNs can be effectively applied to multi-omics data in an interpretable manner, resulting in promising predictive power while identifying key biological correlates of disease.

AVAILABILITY AND IMPLEMENTATION: The source code is freely available at: https://github.com/mskspi/PathCNN.

PMID:34252964 | DOI:10.1093/bioinformatics/btab285

EnHiC: learning fine-resolution Hi-C contact maps using a generative adversarial framework

Bioinformatics. 2021 Jul 12;37(Supplement_1):i272-i279. doi: 10.1093/bioinformatics/btab272.

ABSTRACT

MOTIVATION: The high-throughput chromosome conformation capture (Hi-C) technique has enabled genome-wide mapping of chromatin interactions. However, high-resolution Hi-C data requires costly, deep sequencing; therefore, it has only been achieved for a limited number of cell types. Machine learning models based on neural networks have been developed as a remedy to this problem.

RESULTS: In this work, we propose a novel method, EnHiC, for predicting high-resolution Hi-C matrices from low-resolution input data based on a generative adversarial network (GAN) framework. Inspired by non-negative matrix factorization, our model fully exploits the unique properties of Hi-C matrices and extracts rank-1 features from multi-scale low-resolution matrices to enhance the resolution. Using three human Hi-C datasets, we demonstrated that EnHiC accurately and reliably enhanced the resolution of Hi-C matrices and outperformed other GAN-based models. Moreover, EnHiC-predicted high-resolution matrices facilitated the accurate detection of topologically associated domains and fine-scale chromatin interactions.
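As a rough illustration of what a rank-1 feature of a symmetric, non-negative contact matrix looks like, the sketch below uses plain power iteration to find a vector v such that the matrix is approximately the outer product of v with itself. This is an assumption-laden stand-in chosen for clarity; it is not the EnHiC network, and the function name and toy matrix are hypothetical.

```python
import math

def rank1_feature(matrix, iterations=50):
    """Return v such that matrix is approximated by the outer product v v^T.

    Uses power iteration to find the leading eigenvector of a symmetric,
    non-negative matrix, then scales it by the square root of its eigenvalue.
    """
    n = len(matrix)
    u = [1.0] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * u[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0:
            return [0.0] * n
        u = [x / norm for x in product]
    # Rayleigh quotient gives the eigenvalue for the unit vector u.
    eigenvalue = sum(u[i] * matrix[i][j] * u[j] for i in range(n) for j in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return [scale * x for x in u]

# Toy "contact map" that is exactly rank 1: the outer product of [1, 2, 3].
contacts = [[1.0, 2.0, 3.0],
            [2.0, 4.0, 6.0],
            [3.0, 6.0, 9.0]]
feature = rank1_feature(contacts)
```

For a real Hi-C matrix the approximation would not be exact; the vector summarizes the dominant per-bin contact propensity at a given scale.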

AVAILABILITY AND IMPLEMENTATION: EnHiC is publicly available at https://github.com/wmalab/EnHiC.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252966 | DOI:10.1093/bioinformatics/btab272

DECODE: a Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays

Bioinformatics. 2021 Jul 12;37(Supplement_1):i280-i288. doi: 10.1093/bioinformatics/btab283.

ABSTRACT

MOTIVATION: Mapping distal regulatory elements, such as enhancers, is a cornerstone for elucidating how genetic variations may influence diseases. Previous enhancer-prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have implemented enhancer discovery as a binary classification problem without accurate boundary detection, producing low-resolution annotations with superfluous regions and reducing the statistical power for downstream analyses (e.g. causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called Deep-learning framework for Condensing enhancers and refining boundaries with large-scale functional assays (DECODE). First, we employed direct enhancer-activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution, we implemented a weakly supervised object detection framework for enhancer localization with precise boundary detection (to a 10 bp resolution) using Gradient-weighted Class Activation Mapping.
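The boundary-refinement step can be pictured with a small sketch. It assumes that per-bin importance scores (for example, from a class activation map) are already available at 10 bp resolution, and it trims the region to the contiguous run of bins around the peak whose scores stay above a fraction of the peak. The thresholding rule, the `fraction` parameter and the function name are illustrative assumptions, not DECODE's actual procedure.

```python
def refine_boundaries(scores, region_start, bin_size=10, fraction=0.5):
    """Trim a candidate region to the bins around the peak importance score.

    scores       : importance per bin, left to right
    region_start : genomic coordinate of the first bin
    Returns (start, end) in base pairs, end exclusive.
    """
    peak = max(range(len(scores)), key=scores.__getitem__)
    cutoff = scores[peak] * fraction
    left = peak
    while left > 0 and scores[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(scores) - 1 and scores[right + 1] >= cutoff:
        right += 1
    return region_start + left * bin_size, region_start + (right + 1) * bin_size

# Hypothetical 80 bp candidate region starting at position 1000.
scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1, 0.0]
start, end = refine_boundaries(scores, region_start=1000)
```

Here the 80 bp candidate is condensed to a 30 bp core, which mirrors in miniature the compaction of annotations described in the abstract.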

RESULTS: Our DECODE binary classifier outperformed a state-of-the-art enhancer prediction method by 24% in transgenic mouse validation. Furthermore, the object detection framework can condense enhancer annotations to only 13% of their original size, and these compact annotations have significantly higher conservation scores and genome-wide association study variant enrichments than the original predictions. Overall, DECODE is an effective tool for enhancer classification and precise localization.

AVAILABILITY AND IMPLEMENTATION: DECODE source code and pre-processing scripts are available at decode.gersteinlab.org.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252960 | DOI:10.1093/bioinformatics/btab283

Statistical approaches for differential expression analysis in metatranscriptomics

Bioinformatics. 2021 Jul 12;37(Supplement_1):i34-i41. doi: 10.1093/bioinformatics/btab327.

ABSTRACT

MOTIVATION: Metatranscriptomics (MTX) has become an increasingly practical way to profile the functional activity of microbial communities in situ. However, MTX remains underutilized due to experimental and computational limitations. The latter are complicated by non-independent changes in both RNA transcript levels and their underlying genomic DNA copies (as microbes simultaneously change their overall abundance in the population and regulate individual transcripts), genetic plasticity (as whole loci are frequently gained and lost in microbial lineages) and measurement compositionality and zero-inflation. Here, we present a systematic evaluation of and recommendations for differential expression (DE) analysis in MTX.

RESULTS: We designed and assessed six statistical models for DE discovery in MTX that incorporate different combinations of DNA and RNA normalization and assumptions about the underlying changes of gene copies or species abundance within communities. We evaluated these models on multiple simulated and real multi-omic datasets. Models adjusting transcripts relative to their encoding gene copies as a covariate were significantly more accurate in identifying DE from MTX in both simulated and real datasets. Moreover, we show that when paired DNA measurements (metagenomic data) are not available, models normalizing MTX measurements within-species while also adjusting for total-species RNA balance sensitivity, specificity and interpretability of DE detection, as does filtering likely technical zeros. The efficiency and accuracy of these models pave the way for more effective MTX-based DE discovery in microbial communities.
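The benefit of adjusting transcripts for their encoding gene copies can be seen in a minimal sketch. It fits an ordinary least-squares model of log RNA on a group indicator, with and without log DNA as a covariate. The data are synthetic and the plain linear model is an assumption for illustration; it is not one of the six models evaluated in the paper.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients via the normal equations (X^T X) beta = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Synthetic example: gene copies (DNA) rise in group 1, but transcription
# per copy is unchanged, so there is no true differential expression.
group = [0, 0, 0, 1, 1, 1]
log_dna = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
log_rna = [d + 0.5 for d in log_dna]

naive = ols([[1.0, g] for g in group], log_rna)
adjusted = ols([[1.0, g, d] for g, d in zip(group, log_dna)], log_rna)
```

The naive model attributes the whole abundance shift to the group (a spurious effect of 2.0 log units), whereas the DNA-adjusted model assigns it to gene copy number and leaves a group effect of essentially zero.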

AVAILABILITY AND IMPLEMENTATION: The analysis code and synthetic datasets used in this evaluation are available online at http://huttenhower.sph.harvard.edu/mtx2021.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252963 | DOI:10.1093/bioinformatics/btab327

Metaball skinning of synthetic astroglial morphologies into realistic mesh models for visual analytics and in silico simulations

Bioinformatics. 2021 Jul 12;37(Supplement_1):i426-i433. doi: 10.1093/bioinformatics/btab280.

ABSTRACT

MOTIVATION: Astrocytes, the most abundant glial cells in the mammalian brain, have an instrumental role in developing neuronal circuits. They contribute to the physical structuring of the brain, modulating synaptic activity and maintaining the blood-brain barrier in addition to other significant aspects that impact brain function. Biophysically detailed astrocytic models are key to unraveling their functional mechanisms via molecular simulations at microscopic scales. Detailed and complete biological reconstructions of astrocytic cells are sparse. Nonetheless, data-driven digital reconstructions of astroglial morphologies that are statistically identical to biological counterparts are becoming available. We use those synthetic morphologies to generate astrocytic meshes with realistic geometries, making it possible to perform these simulations.

RESULTS: We present an unconditionally robust method capable of reconstructing high fidelity polygonal meshes of astroglial cells from algorithmically-synthesized morphologies. Our method uses implicit surfaces, or metaballs, to skin the different structural components of astrocytes and then blend them in a seamless fashion. We also provide an end-to-end pipeline to produce optimized two- and three-dimensional meshes for visual analytics and simulations, respectively. The performance of our pipeline has been assessed with a group of 5000 astroglial morphologies and the geometric metrics of the resulting meshes are evaluated. The usability of the meshes is then demonstrated with different use cases.
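How implicit surfaces blend neighbouring components can be shown with a minimal sketch. It uses a classic inverse-square falloff and a threshold of 1 to decide whether a point is inside the surface; both are assumptions for illustration, and Blender's own metaball falloff and the NeuroMorphoVis implementation may differ.

```python
def field(point, balls):
    """Sum of inverse-square contributions r^2 / d^2 from every metaball."""
    total = 0.0
    for (cx, cy, cz), radius in balls:
        d2 = (point[0] - cx) ** 2 + (point[1] - cy) ** 2 + (point[2] - cz) ** 2
        if d2 == 0.0:
            return float("inf")  # exactly at a centre: certainly inside
        total += radius ** 2 / d2
    return total

def inside(point, balls, threshold=1.0):
    """A point belongs to the skinned surface when the field reaches the threshold."""
    return field(point, balls) >= threshold

# Two hypothetical unit metaballs whose individual spheres do not touch.
ball_a = ((0.0, 0.0, 0.0), 1.0)
ball_b = ((2.4, 0.0, 0.0), 1.0)
midpoint = (1.2, 0.0, 0.0)

alone_a = inside(midpoint, [ball_a])
alone_b = inside(midpoint, [ball_b])
blended = inside(midpoint, [ball_a, ball_b])
```

The midpoint lies outside each ball on its own but inside the combined field, so the two components merge through a smooth bridge rather than meeting at a hard seam.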

AVAILABILITY AND IMPLEMENTATION: Our metaball skinning algorithm is implemented in Blender 2.82 relying on its Python API (Application Programming Interface). To make it accessible to computational biologists and neuroscientists, the implementation has been integrated into NeuroMorphoVis, an open source and domain specific package that is primarily designed for neuronal morphology visualization and meshing.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252950 | DOI:10.1093/bioinformatics/btab280

Modeling drug combination effects via latent tensor reconstruction

Bioinformatics. 2021 Jul 12;37(Supplement_1):i93-i101. doi: 10.1093/bioinformatics/btab308.

ABSTRACT

MOTIVATION: Combination therapies have emerged as a powerful treatment modality to overcome drug resistance and improve treatment efficacy. However, the number of possible drug combinations increases very rapidly with the number of individual drugs in consideration, which makes the comprehensive experimental screening infeasible in practice. Machine-learning models offer time- and cost-efficient means to aid this process by prioritizing the most effective drug combinations for further pre-clinical and clinical validation. However, the complexity of the underlying interaction patterns across multiple drug doses and in different cellular contexts poses challenges to the predictive modeling of drug combination effects.

RESULTS: We introduce comboLTR, a highly time-efficient method for learning complex, non-linear target functions for describing the responses of therapeutic agent combinations in various doses and cancer cell-contexts. The method is based on a polynomial regression via powerful latent tensor reconstruction. It uses a combination of recommender system-style features indexing the data tensor of response values in different contexts, and chemical and multi-omics features as inputs. We demonstrate that comboLTR outperforms state-of-the-art methods in terms of predictive performance and running time, and produces highly accurate results even in the challenging and practical inference scenario where full dose-response matrices are predicted for completely new drug combinations with no available combination and monotherapy response measurements in any training cell line.

AVAILABILITY AND IMPLEMENTATION: comboLTR code is available at https://github.com/aalto-ics-kepaco/ComboLTR.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252952 | DOI:10.1093/bioinformatics/btab308

Disease gene prediction with privileged information and heteroscedastic dropout

Bioinformatics. 2021 Jul 12;37(Supplement_1):i410-i417. doi: 10.1093/bioinformatics/btab310.

ABSTRACT

MOTIVATION: Recently, machine learning models have achieved tremendous success in prioritizing candidate genes for genetic diseases. These models are able to accurately quantify the similarity among diseases and genes based on the intuition that similar genes are more likely to be associated with similar diseases. However, the genetic features these methods rely on are often hard to collect due to high experimental cost and various other technical limitations. Existing solutions to this problem significantly increase the risk of overfitting and decrease the generalizability of the models.

RESULTS: In this work, we propose a graph neural network (GNN) version of the Learning under Privileged Information paradigm to predict new disease gene associations. Unlike previous gene prioritization approaches, our model does not require the genetic features to be the same at training and test stages. If a genetic feature is hard to measure and therefore missing at the test stage, our model could still efficiently incorporate its information during the training process. To implement this, we develop a Heteroscedastic Gaussian Dropout algorithm, where the dropout probability of the GNN model is determined by another GNN model with a mirrored GNN architecture. To evaluate our method, we compared our method with four state-of-the-art methods on the Online Mendelian Inheritance in Man dataset to prioritize candidate disease genes. Extensive evaluations show that our model could improve the prediction accuracy when all the features are available compared to other methods. More importantly, our model could make very accurate predictions when >90% of the features are missing at the test stage.
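The mechanism of input-dependent Gaussian dropout can be sketched in a few lines. Here a tiny logistic function stands in for the second network that maps privileged features to a noise variance, and the main model's activations are multiplied by Gaussian noise with mean 1 and that variance during training only. The stand-in function, its weights and all names are hypothetical; the paper's mirrored GNN architecture is not reproduced.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def noise_variance(privileged, weights):
    """Hypothetical stand-in for the second network: privileged features -> variance in (0, 1)."""
    return sigmoid(sum(p * w for p, w in zip(privileged, weights)))

def gaussian_dropout(activations, variance, rng, training=True):
    """Multiply each activation by noise drawn from N(1, variance) during training.

    At test time the privileged features are unavailable, so no noise is applied.
    """
    if not training:
        return list(activations)
    std = math.sqrt(variance)
    return [a * (1.0 + std * rng.gauss(0.0, 1.0)) for a in activations]

rng = random.Random(0)
hidden = [0.5, -1.2, 2.0]          # activations of the main model
privileged = [1.0, 0.0, 2.0]       # features available only during training
variance = noise_variance(privileged, weights=[0.3, -0.5, 0.1])

train_out = gaussian_dropout(hidden, variance, rng, training=True)
test_out = gaussian_dropout(hidden, variance, rng, training=False)
```

Because the privileged features only shape the noise, the main model never needs them as inputs, which is why they can be missing at the test stage.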

AVAILABILITY AND IMPLEMENTATION: Our method is implemented with Python 3.7 and PyTorch 1.5.0; the method and data are freely available at: https://github.com/juanshu30/Disease-Gene-Prioritization-with-Privileged-Information-and-Heteroscedastic-Dropout.

PMID:34252957 | DOI:10.1093/bioinformatics/btab310

CALLR: a semi-supervised cell-type annotation method for single-cell RNA sequencing data

Bioinformatics. 2021 Jul 12;37(Supplement_1):i51-i58. doi: 10.1093/bioinformatics/btab286.

ABSTRACT

MOTIVATION: Single-cell RNA sequencing (scRNA-seq) technology has been widely applied to capture the heterogeneity of different cell types within complex tissues. An essential step in scRNA-seq data analysis is the annotation of cell types. Traditional cell-type annotation mainly clusters the cells first, and then uses the aggregated cluster-level expression profiles and the marker genes to label each cluster. Such methods are greatly dependent on the clustering results, which are insufficient for accurate annotation.

RESULTS: In this article, we propose a semi-supervised learning method for cell-type annotation called CALLR. It combines unsupervised learning represented by the graph Laplacian matrix constructed from all the cells and supervised learning using sparse logistic regression. By alternately updating the cell clusters and annotation labels, high annotation accuracy can be achieved. The model is formulated as an optimization problem, and a computationally efficient algorithm is developed to solve it. Experiments on 10 real datasets show that CALLR outperforms the compared (semi-)supervised learning methods, and the popular clustering methods.
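The role of the graph Laplacian as the unsupervised component can be illustrated with a minimal sketch. Given a cell-cell similarity matrix W, the Laplacian is L = D - W, and the quadratic form f^T L f penalizes label vectors that differ between strongly connected cells. The toy graph is hypothetical, and the sparse logistic regression part and the alternating optimization of CALLR are not shown.

```python
def graph_laplacian(weights):
    """Unnormalized Laplacian L = D - W for a symmetric similarity matrix W."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    return [[(degree[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
            for i in range(n)]

def smoothness_penalty(laplacian, labels):
    """f^T L f, which equals the sum over edges of w_ij * (f_i - f_j)^2."""
    n = len(labels)
    return sum(labels[i] * laplacian[i][j] * labels[j]
               for i in range(n) for j in range(n))

# Four hypothetical cells forming two tight pairs: {0, 1} and {2, 3}.
similarity = [[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]]
laplacian = graph_laplacian(similarity)

consistent = smoothness_penalty(laplacian, [1.0, 1.0, 0.0, 0.0])    # labels follow the graph
inconsistent = smoothness_penalty(laplacian, [1.0, 0.0, 0.0, 0.0])  # splits a tight pair
```

A labelling that respects the graph structure incurs no penalty, while splitting similar cells is penalized, which is how the cluster structure can guide the annotation of unlabelled cells.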

AVAILABILITY AND IMPLEMENTATION: The implementation of CALLR is available at https://github.com/MathSZhang/CALLR.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

PMID:34252936 | DOI:10.1093/bioinformatics/btab286