Categories
Nevin Manimala Statistics

Application of WGCNA and PloGO2 in the Analysis of Complex Proteomic Data

Methods Mol Biol. 2023;2426:375-390. doi: 10.1007/978-1-0716-1967-4_17.

ABSTRACT

In this protocol we describe our workflow for analyzing complex, multi-condition quantitative proteomic experiments, with the aim of extracting biological insights. The tool we use is an R package, PloGO2, contributed to Bioconductor, which can optionally be preceded by correlation network analysis with WGCNA. We describe the data required and the steps we take, including detailed code examples and explanations of the outputs. The package was designed to generate gene ontology or pathway summaries for many data subsets at the same time, visualize protein abundance summaries for each biological category examined, help determine enriched protein subsets by comparing them all to a reference set, and suggest key highly correlated hub proteins if the optional network analysis is employed.

PMID:36308698 | DOI:10.1007/978-1-0716-1967-4_17

Integrating Multiple Quantitative Proteomic Analyses Using MetaMSD

Methods Mol Biol. 2023;2426:361-374. doi: 10.1007/978-1-0716-1967-4_16.

ABSTRACT

MetaMSD is proteomic software that integrates the results of multiple quantitative mass spectrometry data analyses using statistical summary combination approaches. With this software, scientists can combine results from their pilot and main studies to maximize biomarker discovery while effectively controlling false discovery rates. It also works for combining proteomic datasets generated by different labeling techniques and/or different types of mass spectrometry instruments. With these advantages, MetaMSD enables biological researchers to explore various proteomic datasets in public repositories to discover new biomarkers and generate interesting hypotheses for future studies. In this protocol, we provide a step-by-step procedure for installing MetaMSD and performing a meta-analysis for quantitative proteomics with it.
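
To make "statistical summary combination" concrete, here is a minimal Python sketch of one textbook approach: Stouffer's Z-score method for combining per-study p-values, followed by Benjamini-Hochberg adjustment to control the false discovery rate. This is an illustration only; the abstract does not say which combination rule MetaMSD applies, and the function names stouffer_combine and benjamini_hochberg are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def stouffer_combine(p_values):
    """Combine one-sided p-values for one protein across studies (Stouffer's Z).

    Assumption: each p-value lies strictly between 0 and 1 and all studies
    test the same direction of change.
    """
    z_scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - STD_NORMAL.cdf(combined_z)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical per-protein p-values from a pilot study and a main study.
pilot = [0.05, 0.40, 0.01]
main = [0.05, 0.60, 0.02]
combined = [stouffer_combine([a, b]) for a, b in zip(pilot, main)]
adjusted = benjamini_hochberg(combined)
```

Two moderately significant results for the same protein reinforce each other: combining 0.05 with 0.05 gives a combined p-value well below 0.05, which is the intuition behind pooling a pilot and a main study.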

PMID:36308697 | DOI:10.1007/978-1-0716-1967-4_16

Multivariate Analysis with the R Package mixOmics

Methods Mol Biol. 2023;2426:333-359. doi: 10.1007/978-1-0716-1967-4_15.

ABSTRACT

The high-dimensional nature of proteomics data presents challenges for statistical analysis and biological interpretation. Multivariate analysis, combined with insightful visualization, can help to reveal the underlying patterns in complex biological data. This chapter introduces the R package mixOmics, which focuses on data exploration and integration. We first introduce methods for single data sets: Principal Component Analysis, which can identify the patterns of variance present in data, and sparse Partial Least Squares Discriminant Analysis, which aims to identify variables that can classify samples into known groups. We then present integrative methods with Projection to Latent Structures and further extensions for discriminant analysis. We illustrate each technique on a breast cancer multi-omics study and provide the R code and data as online supplementary material for readers interested in reproducing these analyses.

PMID:36308696 | DOI:10.1007/978-1-0716-1967-4_15

Statistical Analysis of Post-Translational Modifications Quantified by Label-Free Proteomics Across Multiple Biological Conditions with R: Illustration from SARS-CoV-2 Infected Cells

Methods Mol Biol. 2023;2426:267-302. doi: 10.1007/978-1-0716-1967-4_12.

ABSTRACT

Protein post-translational modifications (PTMs) are essential elements of cellular communication. Variations in their abundance can affect cellular pathways, leading to cellular disorders and diseases. A widely used method for revealing PTM-mediated regulatory networks is label-free quantitation (LFQ) by high-resolution mass spectrometry. The raw data resulting from such experiments are generally interpreted using specific software, such as MaxQuant, MassChroQ, or Proline. These tools provide data matrices containing quantified intensities for each modified peptide identified. Statistical analyses are then necessary (1) to ensure that the quantified data are of good enough quality and sufficiently reproducible, and (2) to highlight the modified peptides that are differentially abundant between the biological conditions under study. The objective of this chapter is therefore to provide a complete data analysis pipeline for analyzing the quantified values of modified peptides in the presence of two or more biological conditions using the R software. We illustrate our pipeline starting from MaxQuant outputs from the analysis of A549-ACE2 cells infected by SARS-CoV-2 at different time points, freely available on PRIDE (PXD020019).

PMID:36308693 | DOI:10.1007/978-1-0716-1967-4_12

Exploring Protein Interactome Data with IPinquiry: Statistical Analysis and Data Visualization by Spectral Counts

Methods Mol Biol. 2023;2426:243-265. doi: 10.1007/978-1-0716-1967-4_11.

ABSTRACT

Immunoprecipitation mass spectrometry (IP-MS) is a popular method for the identification of protein-protein interactions. This approach is particularly powerful when information is collected without a priori knowledge and has been successfully used as a first key step for the elucidation of many complex protein networks. IP-MS consists of the affinity purification of a protein of interest and its interacting proteins, followed by protein identification and quantification by mass spectrometry analysis. We developed an R package, named IPinquiry, dedicated to IP-MS analysis and based on the spectral count quantification method. The main purpose of this package is to provide a simple R pipeline with a limited number of processing steps to facilitate data exploration for biologists. This package allows users to perform differential analysis of protein accumulation between two groups of IP experiments, to retrieve protein annotations, to export results, and to create different types of graphics. Here we describe the step-by-step procedure for an interactome analysis using IPinquiry, from data loading to result export and plot production.

PMID:36308692 | DOI:10.1007/978-1-0716-1967-4_11

msmsEDA & msmsTests: Label-Free Differential Expression by Spectral Counts

Methods Mol Biol. 2023;2426:197-242. doi: 10.1007/978-1-0716-1967-4_10.

ABSTRACT

msmsTests is an R/Bioconductor package providing functions for statistical tests on label-free LC-MS/MS data by spectral counts. These functions aim at discovering differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial of the edgeR package. The three models admit blocking factors to control for nuisance variables. To ensure a good level of reproducibility, a post-test filter is available, in which (1) a minimum effect size considered biologically relevant and (2) a minimum expression in the most abundant condition may be set. A companion package, msmsEDA, provides functions to explore datasets based on MS/MS spectral counts. The provided graphics help in identifying outliers, detecting possible batch factors, and checking the effects of different normalizing strategies. This protocol illustrates the use of both packages on two examples: a purely spike-in experiment of 48 human proteins in a standard yeast cell lysate, and a cancer cell-line secretome dataset requiring a biological normalization.
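
The post-test filter described above can be sketched as a simple predicate: a protein is reported only if it is statistically significant, its effect size is large enough, and it is sufficiently expressed in the more abundant condition. The Python below is an illustrative sketch, not the msmsTests API; the field names, thresholds, and the function passes_post_test_filter are all hypothetical.

```python
def passes_post_test_filter(result, alpha=0.05, min_abs_log2fc=1.0, min_counts=2.0):
    """Return True if a test result survives the post-test filter.

    result is a dict with hypothetical keys:
      adj_p          multiple-testing adjusted p-value
      log2fc         log2 fold change between the two conditions
      mean_counts_a  mean spectral counts in condition A
      mean_counts_b  mean spectral counts in condition B
    """
    # Expression is checked in the more abundant of the two conditions.
    max_mean_counts = max(result["mean_counts_a"], result["mean_counts_b"])
    return (
        result["adj_p"] <= alpha
        and abs(result["log2fc"]) >= min_abs_log2fc
        and max_mean_counts >= min_counts
    )


# Hypothetical test results for three proteins.
results = [
    {"protein": "P1", "adj_p": 0.001, "log2fc": 2.1, "mean_counts_a": 12.0, "mean_counts_b": 2.8},
    {"protein": "P2", "adj_p": 0.010, "log2fc": 0.3, "mean_counts_a": 40.0, "mean_counts_b": 32.5},
    {"protein": "P3", "adj_p": 0.020, "log2fc": -1.6, "mean_counts_a": 0.3, "mean_counts_b": 1.0},
]
kept = [r["protein"] for r in results if passes_post_test_filter(r)]
```

Here P2 is significant but its effect is too small, and P3 has a large fold change built on very few spectral counts; both are the kind of result that tends not to reproduce.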

PMID:36308691 | DOI:10.1007/978-1-0716-1967-4_10

Statistical Analysis of Quantitative Peptidomics and Peptide-Level Proteomics Data with Prostar

Methods Mol Biol. 2023;2426:163-196. doi: 10.1007/978-1-0716-1967-4_9.

ABSTRACT

Prostar is a software tool dedicated to the processing of quantitative data resulting from mass spectrometry-based label-free proteomics. In practice, once biological samples have been analyzed by bottom-up proteomics, the raw mass spectrometer outputs are processed by bioinformatics tools to identify peptides and quantify them, notably by means of precursor ion chromatogram integration. From that point, the classical workflows aggregate these pieces of peptide-level information to infer protein-level identities and amounts. Finally, protein abundances can be statistically analyzed to identify proteins that are significantly differentially abundant between the compared conditions. Prostar's original workflow was developed based on this strategy. However, recent work has demonstrated that processing peptide-level information is often more accurate when searching for differentially abundant proteins, as the aggregation step tends to hide some of the variability and biases in the data. As a result, Prostar has been extended with workflows that manage peptide-level data, and this protocol details their use. The first one, deemed “peptidomics,” implies that the differential analysis is conducted at the peptide level, independently of the peptide-to-protein relationship. The second workflow aggregates the peptide abundances after their preprocessing (i.e., after filtering, normalization, and imputation), so as to minimize the amount of protein-level preprocessing prior to differential analysis.
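
As a rough illustration of the aggregation step in the second workflow, the Python sketch below sums already preprocessed peptide intensities per protein and per sample. Summation is only one possible aggregation rule and is an assumption here, as is the simplification that each peptide maps to exactly one protein; the function aggregate_to_proteins is hypothetical and is not part of Prostar.

```python
def aggregate_to_proteins(peptide_rows):
    """Sum peptide intensities per protein, sample by sample.

    peptide_rows is a list of (protein_id, intensities) pairs, where
    intensities holds one preprocessed value per sample. Shared peptides
    (mapping to several proteins) are not handled in this sketch.
    """
    protein_totals = {}
    for protein_id, intensities in peptide_rows:
        if protein_id not in protein_totals:
            protein_totals[protein_id] = [0.0] * len(intensities)
        totals = protein_totals[protein_id]
        for sample_index, value in enumerate(intensities):
            totals[sample_index] += value
    return protein_totals


# Hypothetical filtered, normalized, and imputed peptide intensities
# for two samples.
peptides = [
    ("P1", [1.0, 2.0]),
    ("P1", [3.0, 4.0]),
    ("P2", [5.0, 6.0]),
]
protein_abundances = aggregate_to_proteins(peptides)
```

Because the peptide values are cleaned before they are summed, little preprocessing remains to be done on the resulting protein table before differential analysis.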

PMID:36308690 | DOI:10.1007/978-1-0716-1967-4_9

Towards a More Accurate Differential Analysis of Multiple Imputed Proteomics Data with mi4limma

Methods Mol Biol. 2023;2426:131-140. doi: 10.1007/978-1-0716-1967-4_7.

ABSTRACT

Imputing missing values is a common practice in label-free quantitative proteomics. Imputation replaces a missing value with a user-defined one. However, the imputation itself is usually not accounted for downstream: imputed datasets are treated as if they had always been complete, so the uncertainty due to the imputation is not properly taken into account. The mi4p package therefore provides a more accurate statistical analysis of multiple-imputed datasets. A rigorous multiple imputation methodology is implemented, leading to a less biased estimation of parameters and their variability, thanks to Rubin’s rules. The imputation-based variance estimator of the peptide intensities is then moderated using Bayesian hierarchical models. This estimator is finally included in moderated t-test statistics to provide differential analysis results.
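
Rubin's rules, mentioned above, pool the results obtained on each imputed dataset so that the final variance includes the uncertainty added by imputation. The Python sketch below shows the textbook form for a single parameter; it is not the mi4p implementation, and the function name rubin_pool and the example numbers are hypothetical.

```python
from statistics import mean


def rubin_pool(estimates, variances):
    """Pool one parameter across m imputed datasets with Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error of each of those estimates
    Returns (pooled_estimate, total_variance). Requires m >= 2.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within_variance = mean(variances)
    # Between-imputation variance: how much the estimates disagree.
    between_variance = sum((q - pooled_estimate) ** 2 for q in estimates) / (m - 1)
    total_variance = within_variance + (1.0 + 1.0 / m) * between_variance
    return pooled_estimate, total_variance


# Hypothetical log2 fold change of one peptide, estimated on three
# imputed versions of the same dataset.
pooled, total_var = rubin_pool([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

In this example the within-imputation variance is 0.5, but the estimates disagree across imputations, so the total variance rises to about 1.83. Treating any single imputed dataset as complete would report only the 0.5.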

PMID:36308688 | DOI:10.1007/978-1-0716-1967-4_7

Left-Censored Missing Value Imputation Approach for MS-Based Proteomics Data with GSimp

Methods Mol Biol. 2023;2426:119-129. doi: 10.1007/978-1-0716-1967-4_6.

ABSTRACT

Missing values caused by the limit of detection or quantification (LOD/LOQ) are widely observed in mass spectrometry (MS)-based omics studies and can be recognized as missing not at random (MNAR). MNAR leads to biased statistical estimates and jeopardizes downstream analyses. Although a wide range of missing value imputation methods has been developed for omics studies, only a limited number were designed appropriately for the MNAR situation. To facilitate MS-based omics studies, we introduce GSimp, a Gibbs sampler-based missing value imputation approach, to deal with left-censored missing values in MS-proteomics datasets. In this chapter, we explain MNAR and describe the usage of GSimp for MNAR in detail.
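
To illustrate what "left-censored" means for imputation, the Python sketch below fills each missing value with a draw from a normal distribution truncated above at the limit of detection, so that imputed values never exceed it. This is deliberately much simpler than GSimp: it is a single truncated-normal draw with a fixed mean and standard deviation, not a Gibbs sampler, and the function impute_left_censored and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist


def impute_left_censored(values, lod, mean, sd, seed=0):
    """Replace None entries with draws from a normal truncated above at lod.

    mean and sd describe the assumed intensity distribution (assumed to be
    estimated elsewhere); sd must be positive.
    """
    rng = random.Random(seed)
    distribution = NormalDist(mean, sd)
    # Probability mass lying below the limit of detection.
    mass_below_lod = distribution.cdf(lod)
    imputed = []
    for value in values:
        if value is not None:
            imputed.append(value)
            continue
        # Inverse-CDF sampling restricted to the region below lod.
        u = rng.uniform(0.0, mass_below_lod)
        u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf in its valid domain
        imputed.append(min(distribution.inv_cdf(u), lod))
    return imputed


# Hypothetical log2 intensities for one protein; None marks values
# that fell below the limit of detection.
observed = [12.0, None, 15.5, None]
completed = impute_left_censored(observed, lod=10.0, mean=13.0, sd=2.0)
```

Filling such gaps with the mean of the observed values would instead place them above the detection limit, which is the bias that MNAR-aware methods are designed to avoid.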

PMID:36308687 | DOI:10.1007/978-1-0716-1967-4_6

Integrating Identification and Quantification Uncertainty for Differential Protein Abundance Analysis with Triqler

Methods Mol Biol. 2023;2426:91-117. doi: 10.1007/978-1-0716-1967-4_5.

ABSTRACT

Protein quantification for shotgun proteomics is a complicated process where errors can be introduced in each of the steps. Triqler is a Python package that estimates and integrates errors of the different parts of the label-free protein quantification pipeline into a single Bayesian model. Specifically, it weighs the quantitative values by the confidence we have in the correctness of the corresponding peptide-spectrum match (PSM). Furthermore, it treats missing values in a way that reflects their uncertainty relative to observed values. Finally, it combines these error estimates in a single differential abundance FDR that reflects the errors and uncertainties not only in quantification but also in identification. In this tutorial, we show how to (1) generate input data for Triqler from quantification packages such as MaxQuant and Quandenser, (2) run Triqler and choose among its options, (3) interpret the results, (4) investigate the posterior distributions of a protein of interest in detail, and (5) verify that the hyperparameter estimations are sensible.

PMID:36308686 | DOI:10.1007/978-1-0716-1967-4_5