In Vitek-Lab/MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

BiocStyle::markdown()

knitr::opts_chunk$set(fig.width=10, fig.height=7, warning=FALSE, message=FALSE)
options(width=110)

```{=html}

# __MSstats: Protein/Peptide significance analysis__

Package: MSstats

Author: Anshuman Raina & Devon Kohler

Date: 5th Semptember 2024

## __Introduction__

`MSstats`, an R package in Bioconductor, supports protein differential analysis 
for statistical relative quantification of proteins and peptides in global, 
targeted and data-independent proteomics. It handles shotgun, label-free and 
label-based (universal synthetic peptide-based) SRM (selected reaction 
monitoring), and DIA (data independent acquisition) experiments. It can be used 
for experiments with complex designs (e.g. comparing more than two experimental 
conditions, or a repeated measure design, such as a time course).

This vignette summarizes the introduction and various options of all 
functionalities in `MSstats`. More details are available in `User Manual`.

For more information about the MSstats workflow, including a detailed 
description of the available processing options and their impact on the 
resulting differential analysis, please see the following publication:

Kohler et al, Nature Protocols 19, 2915–2938 (2024).

## __Installation__

To install this package, start R (version “4.0”) and enter:

``` {r code Installation}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("MSstats")
library(MSstats)
library(ggplot2)

1. Workflow

1.1 Raw Data

To begin with, we will load sample datasets, including both annotated and plain data. The dataset you need can be found here.

We will also load the Annotation Dataset using MSstatsConvert. You can access this dataset here.

``` {r code Load Dataset} library(MSstats)

Load data

pd_raw = system.file("tinytest/raw_data/PD/pd_input.csv", package = "MSstatsConvert")

annotation_raw = system.file("tinytest/raw_data/PD/annot_pd.csv", package = "MSstatsConvert")

pd = data.table::fread(pd_raw) annotation = data.table::fread(annotation_raw)

head(pd, 5) head(annotation, 5)

### __1.2 Loading PD Data to MSstats__

The imported data from Step 1.1. now must be converted through `MSstatsConvert` 
package's `PDtoMSstatsFormat` converter.

This function converts the Proteome Discoverer output into the required input 
format for `MSstats`.

Actual data modification can be seen below:

```r
library(MSstatsConvert)

pd_imported = MSstatsConvert::PDtoMSstatsFormat(pd, annotation, 
                                                use_log_file = FALSE)

head(pd_imported)

1.3 Converters

We have the following converters, which allow you to convert various types of output reports which include the feature level data to the required input format of MSstats. Further information about the converters can be found in the MSstatsConvert package.

DIANNtoMSstatsFormat
DIAUmpiretoMSstatsFormat
FragPipetoMSstatsFormat
MaxQtoMSstatsFormat
OpenMStoMSstatsFormat
OpenSWATHtoMSstatsFormat
PDtoMSstatsFormat
ProgenesistoMSstatsFormat
SkylinetoMSstatsFormat
SpectronauttoMSstatsFormat
MetamorpheusToMSstatsFormat

We show an example of how to use the above said Converters. For more information about using the individual converters please see the coresponding documentation.

skyline_raw = system.file("tinytest/raw_data/Skyline/skyline_input.csv", 
                    package = "MSstatsConvert")

skyline = data.table::fread(skyline_raw)
head(skyline, 5)

msstats_format = MSstatsConvert::SkylinetoMSstatsFormat(skyline_raw,
                                      qvalue_cutoff = 0.01,
                                      useUniquePeptide = TRUE,
                                      removeFewMeasurements = TRUE,
                                      removeOxidationMpeptides = TRUE,
                                      removeProtein_with1Feature = TRUE)

head(msstats_format)

1.4 Data Process

Once we import the dataset correctly with Converter, we need to pre-process the data which is done by the dataProcess function. This step involves data processing and quality control of the measured feature intensities.

This function includes 5 main processing steps (with other additional small steps):

Log transformation - Transform the feature intensities from their original scale to the log scale. This step helps make the data closer to being normally distributed, requiring less replicates for the central limit theorem to kick in.
Normalization - There are three different normalization options supported. 'equalizeMedians' (default) represents constant normalization (equalizing the medians) based on reference signals is performed. 'quantile' represents quantile normalization based on reference signals is performed. 'globalStandards' represents normalization with global standards proteins. FALSE represents no normalization is performed.
Feature selection - This also has three options i.e. Select All features, Top-N features (by mean intensity) or “Best” features.
Missing value imputation - We impute plausible values in case of missing data points. The RunLevelData can be queried to show Number of imputed intensities (censored intensities) in a RUN and Protein.
Summarization - After data processing the individual features are summarized up to the protein-level using Tukey's Median Polish. Linear summarization is also available as an option.

``` {r code dataProcess} summarized = dataProcess( pd_imported, logTrans = 2, normalization = "equalizeMedians", featureSubset = "all", n_top_feature = 3, summaryMethod = "TMP", equalFeatureVar = TRUE, censoredInt = "NA", MBimpute = TRUE )

head(summarized$FeatureLevelData)

head(summarized$ProteinLevelData)

head(summarized$SummaryMethod)

### __1.4.1 Data Processing Options__

Reference: [Kohler et al. 2024](https://www.nature.com/articles/s41596-024-01000-3#Sec20)

#### Normalization

Four options for normalization are included in MSstats: median, quantile, global standards and no normalization. There is no single best normalization for all experiments. Researchers must consider the assumptions underlying each normalization option and the appropriateness of the assumptions for their study. Below, we summarize the normalization options, their assumptions and the effect on downstream statistical analysis.

```r
library(kableExtra)

table_data <- data.frame(
  Name = c("Median", "", "Quantile", "", "Global standards", "", "", "None", ""),
  Description = c(
    "Equalize medians of all log feature intensities in each run", "",
    "Equalize the distributions of all log feature intensities in each run", "",
    "Equalize median log-intensities of endogenous or spiked-in reference peptides or proteins. Apply adjustment to the remainder of log feature intensities", "", "",
    "Do not apply any normalization", ""
  ),
  Assumption = c(
    "All steps of data collection and acquisition were randomized",
    "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide in a run by the same constant amount",
    "All steps of data collection and acquisition were randomized",
    "Most of the proteins in the experiment are the same and have the same concentration for all of the runs. The experimental artifacts affect every peptide non-linearly, as a function of its log intensity",
    "All steps of data collection and acquisition were randomized",
    "The reference peptides or proteins are present in each run and have the same concentration for all of the runs. All experimental artifacts occur only after standards were added.",
    "The experimental artifacts affect every protein in a run by the same constant amount",
    "All steps of data collection and acquisition were randomized",
    "The experiment has no systematic artifacts or has been normalized in another custom manner"
  ),
  Effect = c(
    "The normalization estimates the artifact deviations in each run with a single quantity, reducing overfitting",
    "The normalization reduces bias and variance of the estimated log fold change",
    "The normalization estimates the artifact deviations in each run with a complex non-linear function, potentially leading to overfitting",
    "The normalization reduces bias and variance of the estimated log fold change but may over-correct",
    "The normalization estimates the artifact deviations in each run with a single quantity, which reduces overfitting",
    "The normalization estimates the artifact deviations from a small number of peptides, which may increase overfitting. The normalization does not eliminate artifacts that occurred before adding spiked references",
    "The normalization reduces bias and variance of the estimated log fold change",
    "All patterns of variation of interest and of nuisance variation are preserved",
    ""
  )
)

tryCatch({
  kable(table_data, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
}, error = function(e) {
  stop(paste0("Error in rendering the normalization options table: ", e$message))
})

If the assumptions of the normalization are not verified, the normalization may, in fact, increase bias or variance of the estimated log fold change. For example, if the experiment is not randomized and the experimental artifacts are confounded with the conditions, the median and quantile normalizations will introduce bias.

Feature Selection

Feature selection is used to determine which protein features should be used to infer the overall protein abundance in a sample. The options here are:

Using all features
Using the top ‘N’ features
Removing uninformative features and outliers

Using all features will simply leverage all available information to infer the underlying protein abundance. Top ‘N’ features selects a pre-specified number of features with the highest average intensity across all runs for protein-level inference. This option is useful if you believe that the features with lower average intensity are less reliable, or in cases in which some of the proteins have a very large number of features (such as in DIA experiments). For any individual protein, it is usually possible to determine changes in abundance by looking at the peaks with highest intensity; in these cases, using all features results in redundancy while greatly increasing the computational processing time. Finally, removing uninformative features and outliers attempts to select the ‘best’ features by removing features that have too many missing values, that are too noisy or have outliers.

Missing Value Imputation

Missing value imputation attempts to infer feature intensities in runs in which they were not measured. MSstats imputes these values by using an accelerated failure time model

imputation_table <- data.frame(
  Name = c("Imputation", "No imputation"),
  Description = c(
    "Infer missing feature intensities by using an accelerated failure time model. It will not impute for runs in which all features are missing",
    "Do not apply imputation"
  ),
  Assumption = c(
    "Features are missing for reasons of low abundance (e.g., features are missing not at random)",
    "Assume no information about reasons for missingness or that features are missing at random"
  ),
  Effect = c(
    "If the assumption is true, imputation will remove bias toward high intensities in the summarization step. Otherwise, bias will be introduced via inaccurate imputation",
    "If the assumption is true, no new bias will be introduced. Otherwise, if features are missing for reasons of low abundance, summarized values will be biased toward high intensities"
  )
)

tryCatch({
  kable(imputation_table, "html", escape = FALSE, col.names = c("Name", "Description", "Assumption", "Effect")) %>%
  kable_styling(full_width = TRUE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(2:4, width = "30em")
}, error = function(e) {
  stop(paste0("Error in rendering the imputation options table: ", e$message))
})

1.4.2 Data Process Plots

After processing the input data, MSstats provides multiple plots to analyze the results. Here we show the various types of plots we can use. By default, a pdf file will be downloaded with corresponding feature level data and the Plot generated. Alternatively, the address parameter can be set to FALSE which will output the plots directly.

# Profile plot
dataProcessPlots(data=summarized, type="ProfilePlot", 
                 address = FALSE, which.Protein = "P0ABU9")

# Quality control plot
dataProcessPlots(data=summarized, type="QCPlot", 
                 address = FALSE, which.Protein = "P0ABU9")

# Quantification plot for conditions
dataProcessPlots(data=summarized, type="ConditionPlot", 
                 address = FALSE, which.Protein = "P0ABU9")

1.5 Modeling

In this step we test for differential changes in protein abundance across conditions using a linear mixed-effects model. The model will be automatically adjusted based on your experimental design.

A contrast matrix must be provided to the model. Alternatively, all pairwise comparisons can be made by passing pairwise to the function. For more information on creating contrast matrices, please see the citation linked at the beginning of this document.

``` {r code groupComparison}

model = groupComparison("pairwise", summarized)

Model Details

``` {r Model }

head(model$ModelQC)

head(model$ComparisonResult)

1.5.1 How to Account for Covariates

If you’re comparing disease vs. control, you may also want to account for other factors (covariates) like gender. Here’s how you can do that step-by-step:

Step 1: Set up your conditions

In your annotation file, define each condition as a combination of the disease status and the covariate. For example, you could label your samples like this:
- Control_Male
- Control_Female
- Disease_Male
- Disease_Female

Step 2: Create a contrast matrix

The contrast matrix defines how you want to compare the groups. The order of the conditions in the matrix should match the order you define:

Example 1: Compare Control vs. Disease
If you want to compare control vs. disease (averaging over gender), you can define the contrast matrix like this:

# Order of conditions: control_male, control_female, disease_male, disease_female
contrast.matrix = matrix(c(0.5, 0.5, -0.5, -0.5), nrow = 1)
colnames(contrast.matrix) = c("Control_Male", "Control_Female", "Disease_Male", "Disease_Female")
row.names(contrast.matrix) <- "Control - Disease"

0.5 for Control_Male and Control_Female means you’re averaging the control samples.
-0.5 for Disease_Male and Disease_Female means you’re averaging the disease samples and subtracting them from the control average.

Example 2: Compare Male vs. Female If you want to compare male vs. female (averaging over disease status), set up the contrast matrix like this:

# Order of conditions: control_male, control_female, disease_male, disease_female
contrast.matrix = matrix(c(0.5, -0.5, 0.5, -0.5), nrow = 1)
colnames(contrast.matrix) = c("Control_Male", "Control_Female", "Disease_Male", "Disease_Female")
row.names(contrast.matrix) <- "Male - Female"

0.5 for Control_Male and Disease_Male means you’re averaging the male samples.
-0.5 for Control_Female and Disease_Female means you’re averaging the female samples and subtracting them from the male average.

This setup allows you to control for gender while testing the main effect of disease, or vice versa.

1.5.2 groupComparisonPlot

Visualization for model-based analysis and summarizing differentially abundant proteins. To summarize the results of log-fold changes and adjusted p-values for differentially abundant proteins, groupComparisonPlots takes testing results from function groupComparison as input and automatically generate three types of figures in pdf files as output :

Volcano plot : For each comparison separately. It illustrates actual log-fold changes and adjusted p-values for each comparison separately with all proteins. The x-axis is the log fold change. The base of logarithm transformation is the same as specified in “logTrans” from dataProcess. The y-axis is the negative log2 or log10 adjusted p-values. The horizontal dashed line represents the FDR cutoff. The points below the FDR cutoff line are non-significantly abundant proteins (colored in black). The points above the FDR cutoff line are significantly abundant proteins (colored in red/blue for up-/down-regulated). If fold change cutoff is specified (FCcutoff = specific value), the points above the FDR cutoff line but within the FC cutoff line are non-significantly abundant proteins (colored in black).
Heatmap : For multiple comparisons. It illustrates up-/down-regulated proteins for multiple comparisons with all proteins. Each column represents each comparison of interest. Each row represents each protein. Color red/blue represents proteins in that specific comparison are significantly up-regulated/down-regulated proteins with FDR cutoff and/or FC cutoff. The color scheme shows the evidences of significance. The darker color it is, the stronger evidence of significance it has. Color gold represents proteins are not significantly different in abundance.
Comparison plot : For multiple comparisons per protein. It illustrates log-fold change and its variation of multiple comparisons for single protein. X-axis is comparison of interest. Y-axis is the log fold change. The red points are the estimated log fold change from the model. The error bars are the confidence interval with 0.95 significant level for log fold change. This interval is only based on the standard error, which is estimated from the model.

``` {r GroupComparisonPlots}

groupComparisonPlots( model$ComparisonResult, type="Heatmap", sig = 0.05, FCcutoff = FALSE, logBase.pvalue = 10, ylimUp = FALSE, ylimDown = FALSE, xlimUp = FALSE, x.axis.size = 10, y.axis.size = 10, dot.size = 3, text.size = 4, text.angle = 0, legend.size = 13, ProteinName = TRUE, colorkey = TRUE, numProtein = 100, clustering = "both", width = 800, height = 600, which.Comparison = "all", which.Protein = "all", address = FALSE, isPlotly = FALSE )

groupComparisonPlots( model$ComparisonResult, type="VolcanoPlot", sig = 0.05, FCcutoff = FALSE, logBase.pvalue = 10, ylimUp = FALSE, ylimDown = FALSE, xlimUp = FALSE, x.axis.size = 10, y.axis.size = 10, dot.size = 3, text.size = 4, text.angle = 0, legend.size = 13, ProteinName = TRUE, colorkey = TRUE, numProtein = 100, clustering = "both", width = 800, height = 600, which.Comparison = "Condition2 vs Condition4", which.Protein = "all", address = FALSE, isPlotly = FALSE )

### __1.6 GroupComparisonQCPlots__

To check and verify that the resultant data of `groupComparison` offers a linear 
model for whole plot inference, `groupComparisonQC` plots take the fitted data 
and provide two ways of plotting:

1. Normal Q-Q plot : Quantile-Quantile plots represents normal quantile-quantile
plot for each protein after fitting models
2. Residual plot : represents a plot of residuals versus fitted values for each 
protein in the dataset.

Results based on statistical models for whole plot level inference are accurate 
as long as the assumptions of the model are met. The model assumes that the 
measurement errors are normally distributed with mean 0 and constant variance. 
The assumption of a constant variance can be checked by examining the residuals 
from the model.


``` {r GroupComparisonQCplots, results='hide', message=FALSE, warning=FALSE}

source("..//R//groupComparisonQCPlots.R")

groupComparisonQCPlots(data=model, type="QQPlots", address=FALSE, 
                       which.Protein = "P0ABU9")


groupComparisonQCPlots(data=model, type="ResidualPlots", address=FALSE, 
                       which.Protein = "P0ABU9")

1.7 Sample Size Calculation

Calculate sample size for future experiments of a Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment based on intensity-based linear model. The function fits the model and uses variance components to calculate sample size. The underlying model fitting with intensity-based linear model with technical MS run replication. Estimated sample size is rounded to 0 decimal. Two options of the calculation:

number of biological replicates per condition
power

sample_size_calc = designSampleSize(model$FittedModel,
                                    desiredFC=c(1.75,2.5),
                                    power = TRUE,
                                    numSample=5)

1.7.1 Sample Size Calculation Plot

To illustrate the relationship of desired fold change and the calculated minimal number sample size which are

The input is the result from function designSampleSize.

designSampleSizePlots(sample_size_calc, isPlotly=FALSE)

1.8 Quantification from groupComparison Data

Model-based quantification for each condition or for each biological samples per protein in a targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition (DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS) experiment. Quantification takes the processed data set by dataProcess as input and automatically generate the quantification results (data.frame) with long or matrix format. The quantification for endogenous samples is based on run summarization from subplot model, with TMP robust estimation.

Sample quantification : individual biological sample quantification for each protein. The label of each biological sample is a combination of the corresponding group and the sample ID. If there are no technical replicates or experimental replicates per sample, sample quantification is the same as run summarization from dataProcess (RunlevelData from dataProcess). If there are technical replicates or experimental replicates, sample quantification is median among run quantification corresponding MS runs.
Group quantification : quantification for individual group or individual condition per protein. It is median among sample quantification.

sample_quant_long = quantification(summarized,
                             type = "Sample",
                             format = "long")
sample_quant_long
sample_quant_wide = quantification(summarized,
                              type = "Sample",
                              format = "matrix")
sample_quant_wide
group_quant_long = quantification(summarized,
                                  type = "Group",
                                  format = "long")
group_quant_long
group_quant_wide = quantification(summarized,
                                  type = "Group",
                                  format = "matrix")
group_quant_wide

Vitek-Lab/MSstats documentation built on April 14, 2025, 1:43 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Vitek-Lab/MSstats
Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

In Vitek-Lab/MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

1. Workflow

1.1 Raw Data

Load data

1.3 Converters

1.4 Data Process

Feature Selection

Missing Value Imputation

1.4.2 Data Process Plots

1.5 Modeling

1.5.1 How to Account for Covariates

Step 1: Set up your conditions

Step 2: Create a contrast matrix

1.5.2 groupComparisonPlot

1.7 Sample Size Calculation

1.7.1 Sample Size Calculation Plot

1.8 Quantification from groupComparison Data

R Package Documentation

Browse R Packages

We want your feedback!

Vitek-Lab/MSstats Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

In Vitek-Lab/MSstats: Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments

1. Workflow

1.1 Raw Data

Load data

1.3 Converters

1.4 Data Process

Feature Selection

Missing Value Imputation

1.4.2 Data Process Plots

1.5 Modeling

1.5.1 How to Account for Covariates

Step 1: Set up your conditions

Step 2: Create a contrast matrix

1.5.2 groupComparisonPlot

1.7 Sample Size Calculation

1.7.1 Sample Size Calculation Plot

1.8 Quantification from groupComparison Data

R Package Documentation

Browse R Packages

We want your feedback!

Vitek-Lab/MSstats
Protein Significance Analysis in DDA, SRM and DIA for Label-free or Label-based Proteomics Experiments