BiocStyle::markdown()
knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE) options(width=110)
```{=html}
# __MSstats: Protein/Peptide significance analysis__ Package: MSstats Author: Anshuman Raina & Devon Kohler Date: 5th Semptember 2024 ## __Introduction__ `MSstats`, an R package in Bioconductor, supports protein differential analysis for statistical relative quantification of proteins and peptides in global, targeted and data-independent proteomics. It handles shotgun, label-free and label-based (universal synthetic peptide-based) SRM (selected reaction monitoring), and DIA (data independent acquisition) experiments. It can be used for experiments with complex designs (e.g. comparing more than two experimental conditions, or a repeated measure design, such as a time course). This vignette summarizes the introduction and various options of all functionalities in `MSstats`. More details are available in `User Manual`. For more information about the MSstats workflow, including a detailed description of the available processing options and their impact on the resulting differential analysis, please see the following publication: Kohler et al, Nature Protocols 19, 2915–2938 (2024). ## __Installation__ To install this package, start R (version “4.0”) and enter: ``` {r code Installation} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("MSstats") library(MSstats) library(ggplot2)
To begin with, we will load sample datasets, including both annotated and plain data. The dataset you need can be found here.
We will also load the Annotation Dataset using MSstatsConvert. You can access this dataset here.
``` {r code Load Dataset} library(MSstats)
pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv", package = "MSstatsConvert")
annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv", package = "MSstatsConvert")
pd = data.table::fread(pd_raw) annotation = data.table::fread(annotation_raw)
head(pd, 5) head(annotation, 5)
### __1.2 Loading PD Data to MSstats__ The imported data from Step 1.1. now must be converted through `MSstatsConvert` package's `PDtoMSstatsFormat` converter. This function converts the Proteome Discoverer output into the required input format for `MSstats`. Actual data modification can be seen below: ```r library(MSstatsConvert) pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, use_log_file = FALSE) head(pd_imported)
We have the following converters, which allow you to convert various types of
output reports which include the feature level data to the required input format
of MSstats
. Further information about the converters can be found in the
MSstatsConvert
package.
DIANNtoMSstatsFormat
DIAUmpiretoMSstatsFormat
FragPipetoMSstatsFormat
MaxQtoMSstatsFormat
OpenMStoMSstatsFormat
OpenSWATHtoMSstatsFormat
PDtoMSstatsFormat
ProgenesistoMSstatsFormat
SkylinetoMSstatsFormat
SpectronauttoMSstatsFormat
MetamorpheusToMSstatsFormat
We show an example of how to use the above said Converters. For more information about using the individual converters please see the coresponding documentation.
skyline_raw = system.file("tinytest/raw_data/Skyline/skyline_input.csv", package = "MSstatsConvert") skyline = data.table::fread(skyline_raw) head(skyline, 5)
msstats_format = MSstatsConvert::SkylinetoMSstatsFormat(skyline_raw, qvalue_cutoff = 0.01, useUniquePeptide = TRUE, removeFewMeasurements = TRUE, removeOxidationMpeptides = TRUE, removeProtein_with1Feature = TRUE)
head(msstats_format)
Once we import the dataset correctly with Converter, we need to pre-process the
data which is done by the dataProcess
function. This step involves data
processing and quality control of the measured feature intensities.
This function includes 5 main processing steps (with other additional small steps):
Log transformation - Transform the feature intensities from their original scale to the log scale. This step helps make the data closer to being normally distributed, requiring less replicates for the central limit theorem to kick in.
Normalization - There are three different normalization options supported. 'equalizeMedians' (default) represents constant normalization (equalizing the medians) based on reference signals is performed. 'quantile' represents quantile normalization based on reference signals is performed. 'globalStandards' represents normalization with global standards proteins. FALSE represents no normalization is performed.
Feature selection - This also has three options i.e. Select All features, Top-N features (by mean intensity) or “Best” features.
Missing value imputation - We impute plausible values in case of missing data points. The RunLevelData can be queried to show Number of imputed intensities (censored intensities) in a RUN and Protein.
Summarization - After data processing the individual features are summarized up to the protein-level using Tukey's Median Polish. Linear summarization is also available as an option.
``` {r code dataProcess} summarized = dataProcess( pd_imported, logTrans = 2, normalization = "equalizeMedians", featureSubset = "all", n_top_feature = 3, summaryMethod = "TMP", equalFeatureVar = TRUE, censoredInt = "NA", MBimpute = TRUE )
head(summarized$FeatureLevelData)
head(summarized$ProteinLevelData)
head(summarized$SummaryMethod)
### __1.4.1 Data Process Plots__ After processing the input data, `MSstats` provides multiple plots to analyze the results. Here we show the various types of plots we can use. By default, a pdf file will be downloaded with corresponding feature level data and the Plot generated. Alternatively, the `address` parameter can be set to `FALSE` which will output the plots directly. ```r # Profile plot dataProcessPlots(data=summarized, type="ProfilePlot", address = FALSE, which.Protein = "P0ABU9") # Quality control plot dataProcessPlots(data=summarized, type="QCPlot", address = FALSE, which.Protein = "P0ABU9") # Quantification plot for conditions dataProcessPlots(data=summarized, type="ConditionPlot", address = FALSE, which.Protein = "P0ABU9")
In this step we test for differential changes in protein abundance across conditions using a linear mixed-effects model. The model will be automatically adjusted based on your experimental design.
A contrast matrix must be provided to the model. Alternatively, all pairwise
comparisons can be made by passing pairwise
to the function. For more
information on creating contrast matrices, please see the citation linked
at the beginning of this document.
``` {r code groupComparison}
model = groupComparison("pairwise", summarized)
Model Details ``` {r Model } head(model$ModelQC) head(model$ComparisonResult)
Visualization for model-based analysis and summarizing differentially abundant
proteins. To summarize the results of log-fold changes and adjusted p-values
for differentially abundant proteins, groupComparisonPlots
takes testing
results from function groupComparison
as input and automatically generate
three types of figures in pdf files as output :
Volcano plot : For each comparison separately. It illustrates actual
log-fold changes and adjusted p-values for each comparison separately with all
proteins. The x-axis is the log fold change. The base of logarithm
transformation is the same as specified in “logTrans” from dataProcess
. The
y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed
line represents the FDR cutoff. The points below the FDR cutoff line are
non-significantly abundant proteins (colored in black). The points above the
FDR cutoff line are significantly abundant proteins (colored in red/blue for
up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific
value), the points above the FDR cutoff line but within the FC cutoff line are
non-significantly abundant proteins (colored in black).
Heatmap : For multiple comparisons. It illustrates up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance.
Comparison plot : For multiple comparisons per protein. It illustrates log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model.
``` {r GroupComparisonPlots}
groupComparisonPlots( model$ComparisonResult, type="Heatmap", sig = 0.05, FCcutoff = FALSE, logBase.pvalue = 10, ylimUp = FALSE, ylimDown = FALSE, xlimUp = FALSE, x.axis.size = 10, y.axis.size = 10, dot.size = 3, text.size = 4, text.angle = 0, legend.size = 13, ProteinName = TRUE, colorkey = TRUE, numProtein = 100, clustering = "both", width = 800, height = 600, which.Comparison = "all", which.Protein = "all", address = FALSE, isPlotly = FALSE )
groupComparisonPlots( model$ComparisonResult, type="VolcanoPlot", sig = 0.05, FCcutoff = FALSE, logBase.pvalue = 10, ylimUp = FALSE, ylimDown = FALSE, xlimUp = FALSE, x.axis.size = 10, y.axis.size = 10, dot.size = 3, text.size = 4, text.angle = 0, legend.size = 13, ProteinName = TRUE, colorkey = TRUE, numProtein = 100, clustering = "both", width = 800, height = 600, which.Comparison = "Condition2 vs Condition4", which.Protein = "all", address = FALSE, isPlotly = FALSE )
### __1.6 GroupComparisonQCPlots__ To check and verify that the resultant data of `groupComparison` offers a linear model for whole plot inference, `groupComparisonQC` plots take the fitted data and provide two ways of plotting: 1. Normal Q-Q plot : Quantile-Quantile plots represents normal quantile-quantile plot for each protein after fitting models 2. Residual plot : represents a plot of residuals versus fitted values for each protein in the dataset. Results based on statistical models for whole plot level inference are accurate as long as the assumptions of the model are met. The model assumes that the measurement errors are normally distributed with mean 0 and constant variance. The assumption of a constant variance can be checked by examining the residuals from the model. ``` {r GroupComparisonQCplots, results='hide', message=FALSE, warning=FALSE} source("..//R//groupComparisonQCPlots.R") groupComparisonQCPlots(data=model, type="QQPlots", address=FALSE, which.Protein = "P0ABU9") groupComparisonQCPlots(data=model, type="ResidualPlots", address=FALSE, which.Protein = "P0ABU9")
Calculate sample size for future experiments of a Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment based on intensity-based linear model. The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. Two options of the calculation:
sample_size_calc = designSampleSize(model$FittedModel, desiredFC=c(1.75,2.5), power = TRUE, numSample=5)
To illustrate the relationship of desired fold change and the calculated minimal number sample size which are
The input is the result from function designSampleSize
.
designSampleSizePlots(sample_size_calc, isPlotly=FALSE)
Model-based quantification for each condition or for each biological samples
per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent
Acquisition (DDA or shotgun), and Data-Independent Acquisition
(DIA or SWATH-MS) experiment. Quantification takes the processed data set by
dataProcess
as input and automatically generate the quantification results
(data.frame) with long or matrix format. The quantification for endogenous
samples is based on run summarization from subplot model, with TMP robust
estimation.
Sample quantification : individual biological sample quantification for
each protein. The label of each biological sample is a combination of the
corresponding group and the sample ID. If there are no technical replicates or
experimental replicates per sample, sample quantification is the same as run
summarization from dataProcess
(RunlevelData
from dataProcess
). If there
are technical replicates or experimental replicates, sample quantification is
median among run quantification corresponding MS runs.
Group quantification : quantification for individual group or individual condition per protein. It is median among sample quantification.
sample_quant_long = quantification(summarized, type = "Sample", format = "long") sample_quant_long sample_quant_wide = quantification(summarized, type = "Sample", format = "matrix") sample_quant_wide group_quant_long = quantification(summarized, type = "Group", format = "long") group_quant_long group_quant_wide = quantification(summarized, type = "Group", format = "matrix") group_quant_wide
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.