In statOmics/MSqRob: Robust statistical inference for quantitative LC-MS proteomics

A first notice

This vignette might be a bit overwhelming for a first-time user.

You might want to skim through this vignette and then take a look at the same two examples stripped to their essential analyses. These can be found at https://github.com/statOmics/MSqRobData/blob/master/inst/extdata/Francisella/analysis_Francisella.Rmd and https://github.com/statOmics/MSqRobData/blob/master/inst/extdata/CPTAC/analysis_CPTAC.Rmd for the Francisella and CPTAC example respectively.

Note that we mainly focus on MaxQuant output, but that any type type of search engine output can be used with MSqRob, see "1.2. Importing other data types" for more info.

If anything is unclear, please do not hesitate to contact me at ludger.goeminne@epfl.ch or to open an issue on GitHub. Your feedback is highly welcomed!

Introduction

This vignette will guide you through the main features of the MSqRob package. We will highlight two examples in order to help you familiarize with its main features.

The MSqRob package is designed for the analysis of differential abundance in label-free mass spectrometry-based shotgun proteomics experiments. These experiments typically output large spectral files (e.g. ".mzML" files) that need to be searched for identifications by specialized peak detection and peptide identification software such as Andromeda (used in the MaxQuant software package) and Mascot Distiller. These algorithms output lists of putative peptides accompanied by their corresponding overall intensity estimates.

Often, researchers further summarize these peptide intensities to protein intensities. However, as we have shown before, this summarization-based approaches tend to perform suboptimal compared to peptide-based models, wherein the peptide intensities are directly provided as input to the statistical model used for differential quantification (Goeminne et al., 2015). Nonetheless, most of these models still suffer from overfitting issues.

This package implements the robust ridge approach that prevents overfitting in sparse data but also accurately fits highly abundant proteins as described in Goeminne et al. (2016).

MSqRob is completely free and open source, but please make a reference to Goeminne et al. (2016) when using our package in any kind of publication.

Downloading, installing and loading MSqRob

Installing the package

On Windows, make sure that RTools is installed. Go to: https://cran.r-project.org/bin/windows/Rtools/ to download RTools. A user guide on how to install RTools on Windows can be found at: https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows. Errors in MSqRob on Windows related to unable to zip the results Excel file might be related to errors in configuring RTools.

First, we need to install the package devtools:

#Change to eval=TRUE to execute this piece of code.
install.packages("devtools", repos = "http://cran.us.r-project.org")
library(devtools)

Then, we call this to install MSqRob from GitHub:

#Change to eval=TRUE to execute this piece of code.
devtools::install_github("statOmics/MSqRob")
library(MSqRob)

Note that the latest version of MSqRob can always be found at: https://github.com/statOmics/MSqRob.

The latest version of this vignette can be found at: https://github.com/statOmics/MSqRob/blob/master/vignettes/MSqRob.Rmd.

Loading MSqRob

Loading the package is done via the library command, as with any other package.

library(MSqRob)

Additionally, we load the packages Biobase, MSnbase and limma, as we will use a few of their functions in this vignette.

library(Biobase)
library(MSnbase)
library(limma)

The MSqRobData package

R packages uploaded on CRAN have a maximum size limit of 5 MB. As proteomics datasets are typically several hunderds of GB in size and even an output file of a peptide search can be tens of MBs in size, we uploaded our example data in a separate data package called MSqRobData. If you want to run the examples in this vignette, you also need the data package. This can also be downloaded from GitHub in the following way:

#Change to eval=TRUE to execute this piece of code.
devtools::install_github("statOmics/MSqRobData")
library(MSqRobData)

The Shiny App

If you want to work with the Shiny graphical user interface (MaxQuant output only), use the following command:

#Change to eval=TRUE to execute this piece of code.
shiny::runApp(system.file('App-MSqRob', package='MSqRob'))

Francisella tularensis dataset (design with pseudoreplication)

The bacterium Francisella tularensis is a facultative intracellular parasite of macrophages and must import host cell arginine in order to be able to multiply inside the host cell. Ramond et al. (2015) investigated the effect of gene deletion of a newly identified arginine transporter called ArgP. Their data is deposited in the PRIDE repository at http://www.ebi.ac.uk/pride/archive/projects/PXD001584. We used the authors' supplied peptides.txt file which is the result of preprocessing with MaxQuant version 1.4.1.2. As the authors used now outdated GI numbers, we modified this file to incorporate NCBI accessions and protein names (September 25, 2016).

1. Importing data

MSqRob data input can come from any kind of source, as long as feature-level or peptide-level data is available. I will here focus on a MaxQuant example, but in fact, output from any search engine can be used.

1.1. Importing MaxQuant data

First, specify the location of the peptides.txt MaxQuant output file. By default, MaxQuant creates a "combined" folder in the same folder as the raw files are (or if the raw files are in different directories, in the directory of the first raw file). The peptides.txt file can then be found in "path_to_raw_files/combined/txt/peptides.txt". Just provide the path to where the file is located on your computer. The system.file function is used here only to access the example tab-delimited peptides.txt file that is delivered with our package.

file_peptides_txt <- system.file("extdata/Francisella", "peptides.txt", package = "MSqRobData")
file_peptides_txt

Import the data using the import2MSnSet function. The pattern argument should contain a text string that is present in all column names that contain peptide intensities but not in the other column names. This defaults to "Intensity\ " for MaxQuant data. The default setting remove_pattern=TRUE then removes the pattern from these column names so that the names equal the names that were entered in MaxQuant's "Experiment" column when conducting the search. In this particular Francisella experiment, our column names start with a number (either "1" or "3"), which is not a valid column name in R. Therefore, an "X" is prepended to these names.

peptidesFranc <- import2MSnSet(file_peptides_txt, filetype = "MaxQuant", remove_pattern = TRUE)

import2MSnSet creates an MSnSet object with 3 slots:

exprs: contains a matrix of peptide intensities of which the rows correspond to the peptides present in the dataset and the columns correspond to the different mass spec runs.
fData: contains a data frame containing all additional information (such as peptide sequences, protein names, scores, number of missed cleavages, etc.) present in the input file.
pData: contains a data frame with 0 columns and as many rows as there are mass spec runs. This slot will be used to store the experimental annotation.

If you want to assess whether importing happened in a correct way, you can always check the exprs, fData and pData slots.

The exprs slot should contain all the peptide intensities grouped per column.

head(exprs(peptidesFranc))

The fData slot contains all other columns.

head(fData(peptidesFranc))

The pData slot will contain the experimental annotation. Right now, it has 18 rows (equal to the number of mass spec runs) and 0 columns (as the experimental annotation is not yet loaded).

pData(peptidesFranc)

1.2. Importing other data types

The import2MSnSet has built-in support for moFF (Argentini et al., 2016), mzTab (Griss et al., 2014) and Progenesis (Nonlinear Dynamics, Newcastle upon Tyne, U.K.) output. You should then simply set filetype to "moFF", "mzTab" or "Progenesis" respectively.

If you are using yet another data type, you can use MSqRob's read2MSnSet function to import your data. Hereby, you have to specify (1) your file, (2) a regular expression pattern that indicates the columns with the peptide intensities or simply the indices of these columns and (3) the separator (e.g. "\t", ",", ...).

2. Preprocessing

2.1. Create an experimental annotation data frame.

Now, we are going to create a data frame that contains the experimental annotation. This data frame should contain all explanatory variables that groups different runs together. This is important in order to be able to include factors on which you will do inference (such as treatment effects or effects over time, or, in our case: the effect of the genetic knock-out) as well as other factors that might have an influence (such as biological repeats, technical repeats and potential confounders such as batch effects).

Note that the "experimental annotation" can also be given as an Excel file or a tab delimited file! This file should have the column names as a header (i.e. the first row in the file) and the same structure as the exp_annotation data frame we create below. If you have your experimental annotation in a file, just provide the filepath to this annotation file in the preprocess_MaxQuant function.

Also note that exactly one column in the experimental annotation should contain the names of the mass spec runs. These names should be equal to the names you entered in the "Experiment" column in MaxQuant (i.e. the column names of the exprs slot of the peptidesFranc MSnSet object). In our case, we will call this column "run" (see below). By default, MSqRob guesses this column by taking the only column of which its elements correspond to the column names of the exprs slot. If there is no such a column, or there are multiple such columns, MSqRob throws an error.

In this example, we create the experimental annotation data frame ourselves.

Extract the names of the mass spec runs. The mass spec run names are equal to the column names of the exprs slot of the peptidesFranc MSnSet object. Other columns in the experimental annotation data frame will typically be derived from this column.

runs <- sampleNames(peptidesFranc)
runs
colnames(exprs(peptidesFranc))

Note again that the sample names have a prepended "X" because a syntactically valid name in R cannot start with a number.

Add a factor for the genetic knock-out, which we will call "genotype" (i.e. wild type (WT) or knock-out (KO)).

#Create the "genotype" vector
genotype <- factor(c(rep("WT",9),rep("KO",9)), levels=c("WT","KO"))
genotype

Add a factor for each of the 6 biological replicates (i.e. 3 biological repeats for each genotype).

#Create a factor for the biological replicates
n <- nchar(runs)
biorep <- as.factor(paste0("b_",rep(1:6,each=3)))
biorep

#Alternative: extract the biological repeat based on the mass spec run names.
# biorep <- as.factor(paste0("b_",factor(as.numeric(factor(paste0(substr(runs, 1,1),substr(runs, n-2,n-2))))+c(rep(0,9),rep(3,9)))))

Each of the 18 technical repeats are here represented by the 18 mass spec runs.

Create an experimental annotation data frame.

exp_annotation <- data.frame(run=runs, genotype=genotype, biorep=biorep)
exp_annotation

Note that you can also import the experimental annotation from a tab-delimited or Excel file. As an example, we show you here how you can import data from an experimental annotation file delivered with our package. First, provide the path to where the data is saved.

file_exp_annotation <- system.file("extdata/Francisella", "label-free_Francisella_annotation.xlsx", package = "MSqRobData")
file_exp_annotation

Then, import the data using the read.xlsx function from the openxlsx package and convert the columns of the data frame to factors.

#Import Excel file
exp_annotation_xlsx <- openxlsx::read.xlsx(file_exp_annotation)
#Convert columns of data frame to factors
exp_annotation_xlsx <- as.data.frame(unclass(exp_annotation_xlsx))
exp_annotation_xlsx

Note that here, there is no "X" prepended to the run names. However, names will be changed automatically to syntactically valid names by MSqRob during preprocessing.

2.2. Preprocess the data using the preprocess_MaxQuant function.

The identification of peptides that carry one or more chemical modifications is often not as reliable as the identification of unmodified peptides. Therefore, a common step during the preprocessing of proteomics data consists of removing the proteins of which all identified peptides have one or more modification sites. If we want to remove these proteins that are only identified by modified peptides from the dataset, we also need the proteinGroups.txt file. This file can be found in the same location as the peptides.txt file ("path_to_raw_files/combined/txt/proteinGroups.txt"). Here, we make use of the proteinGroups.txt file that is delivered with the package.

Note: if you don't want to remove proteins that are only identified by modified peptides, set the remove_only_site argument of the preprocess_MaxQuant function to FALSE and leave the file_proteinGroups argument at its default value NULL.

Define the path to the proteinGroups.txt file.

file_proteinGroups <- system.file("extdata/Francisella", "proteinGroups.txt", package = "MSqRobData")
file_proteinGroups

Add to the useful_properties argument all column names of the fData slot of the peptidesFranc MSnSet object that you would like to keep (such as gene names, protein names, ontologies, etc.). Other columns will be removed for improved efficiency and memory usage. Here, we would like to keep the GI number and the protein names. These 2 columns are present in our peptides.txt file, but note that you can use basic R manipulations to add other columns to the fData slot if you want to. We certainly want to keep the "Sequence" slot, as peptide-specific effects must be included in our model.

peptidesFranc2 <- preprocess_MaxQuant(peptidesFranc, accession="Proteins", exp_annotation=exp_annotation, logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("GI number","Protein.names","Sequence"), filter=c("Contaminant","Reverse"), remove_only_site=TRUE, file_proteinGroups=file_proteinGroups,  filter_symbol="+", minIdentified=2)

preprocess_MaxQuant internally executes the following preprocessing steps in this order:

Peptide intensities are (log2-)transformed (logtransform=TRUE, base=2).
As a normalization approach, we choose to quantile normalize the peptide intensities (normalisation="quantiles"). Other options are "sum", "max", "center.mean", "center.median", "quantiles.robust", "vsn" or "none" (see MSnbase::normalize for more information).
Handling overlapping protein groups (smallestUniqueGroups=TRUE): in our approach a peptide can map to multiple accessions (accession="Proteins"), as long as there is none of these proteins present in a smaller subgroup. Peptides that map to protein groups for wich also subgroups are present in the dataset are removed from the dataset. (Proteins in a protein group are separated by a ;).
Remove contaminants and reverse sequences (filter=c("Contaminant","Reverse"), filter_symbol="+"): each row with the filter symbol ("+" in our case) in the columns "Contaminant" and "Reverse" will be removed from the data. Note that the "Contaminant" column in older MaxQuant peptides.txt files can also be called "Potential.contaminant".
Remove all proteins that are only identified by peptides carrying a modification site (remove_only_site=TRUE, file_proteinGroups=file_proteinGroups).
Remove all superfluous columns in the fData slot, except the "Proteins", "GI.number", "Protein.names" and "Sequence" columns (useful_properties=c("GI.number","Protein.names","Sequence")). Note that the "Proteins" column does not have to be provided: the accession column is always retained.
Remove all peptides that are only identified once in the dataset (minIdentified=2).
Add experimental annotation (exp_annotation=exp_annotation).

We can now make density plots to inspect the effect of the preprocessing on the distribution of our log2-transformed peptide intensities.

3. Make diagnostic plots

Here we take a closer look at how you can make the diagnostic plots from the Shiny app. We will plot both the density plots and the MDS plot.

#Make sure the order of the rows in your exp_annotation is the same as the columns of the exprs slot in your peptides object.
all(exp_annotation[,"run"]==colnames(exprs(peptidesFranc)))

#Color by biological repeat
colors <- as.numeric(exp_annotation[,"biorep"])

### Density plot before preprocessing ###

densAll <- apply(log2(exprs(peptidesFranc)),2,density,na.rm=TRUE)

#Automatically determine the plot window size
ymax=max(vapply(densAll,function(d) max(d$y),1))
rangematrix <- vapply(densAll,function(d) range(d$x, na.rm=TRUE), c(1,1)) #no longer
xlim=range(rangematrix,na.rm=TRUE)
ylim=c(0,ymax)

plot(densAll[[1]],col=colors[1],xlim=xlim,ylim=ylim, las=1, frame.plot=FALSE, main="Density before preprocessing")
for (i in 2:length(densAll)) lines(densAll[[i]],col=colors[i])

######

### Density plot after preprocessing ###

densAll <- apply(log2(exprs(peptidesFranc2)),2,density,na.rm=TRUE)

#Automatically determine the plot window size
ymax=max(vapply(densAll,function(d) max(d$y),1))
rangematrix <- vapply(densAll,function(d) range(d$x, na.rm=TRUE), c(1,1)) #no longer
xlim=range(rangematrix,na.rm=TRUE)
ylim=c(0,ymax)

plot(densAll[[1]],col=colors[1],xlim=xlim,ylim=ylim, las=1, frame.plot=FALSE, main="Density after preprocessing")
for (i in 2:length(densAll)) lines(densAll[[i]],col=colors[i])

######

We can also construct an MDS plot. This plot will cluster similar runs together.

plotMDS(exprs(peptidesFranc2), col=colors, las=1)

4. Convert the MSnSet object peptidesFranc2 into a protdata object proteinsFranc.

The accession argument denotes by which column the individual peptide observations should be grouped. The annotations argument indicates which columns in peptidesFranc2 are annotations linked to the accessions (in our case: the GI number and the protein names). These annotations will be added to a separate annotation slot.

proteinsFranc <- MSnSet2protdata(peptidesFranc2, accession="Proteins", annotations = c("GI.number", "Protein.names"))

Advanced: add information on the design after creating your protdata object

Imagine you forgot a factor in your experimental annotation. Should you redo the whole preprocessing part? No, often, you can still add this factor thanks to the addVarFromVar function. We will now show how we can add an extra (redundant) factor called techrep.

Create a factor for each of the 18 technical replicates (each sample was analyzed in triplicate). This factor is completely redundant, as the "run" column in the experimental annotation already has 18 different levels and is thus already equal to the factor for the technical replicates.

techrep <- as.factor(paste0("t_",1:18))
names(techrep) <- levels(exp_annotation$run)
techrep
proteinsFranc2 <- addVarFromVar(proteinsFranc, basecol="run", name="techrep", vector=techrep)
proteinsFranc2

5. Inspect the data for a specific protein via a detailed protein plot

Here, we illustrate how you can reproduce the detail plots from the Shiny app. We will take a closer look at the protein "WP_003039212".

protein_name <- "WP_003039212"
data <- getData(proteinsFranc[protein_name])

#We plot the peptide sequences as different point symbols
points <- as.numeric(droplevels(as.factor(data$Sequence)))
#We don't want to use repeated plot symbols or invisible symbols
pch_vals <- c(0:25,32:127)
#Repeat pch_vals if there would be more than 122 unique levels
pch_vals <- rep_len(pch_vals, length(unique(points)))
pch_vals <- pch_vals[points]

#We color by mass spec run
colors <- grDevices::colorRampPalette(RColorBrewer::brewer.pal(8,"Spectral"))(length(unique(droplevels(as.factor(data$run)))))
colors <- colors[as.numeric(droplevels(as.factor(data$run)))]

#We make a different boxplot for each genotype
boxplot(data$quant_value~as.factor(data$genotype), outline=FALSE, ylim=c(min(data$quant_value)-0.2,max(data$quant_value)+0.2), ylab="preprocessed peptide intensity", xlab="", main=paste0("Detail plot for protein ", protein_name), las=2, frame.plot=FALSE, frame=FALSE, col="grey", pars=list(boxcol="white")) 

#Add the data as separate points     
points(jitter((as.numeric(data$genotype)), factor=2),data$quant_value, col=colors, pch=pch_vals)

6. Fit the model

Fit the robust ridge models for each protein. The fit.model function will fit a model to each protein in the protdata object proteinsFranc. The default setting reg="ridge" indicates that ridge regression should be used. The default setting weights="Huber" indicates that an M estimation algorithm should be used that assigns Huber weights to each peptide observation based on their residuals.

Caution: fitting the models takes some time (less than 15 minutes for the Francisella tularensis dataset on our system).

system.time(protLMFranc <- fit.model(proteinsFranc, response="quant_value", fixed=c("genotype"),  random=c("biorep","run","Sequence"), add.intercept=TRUE, weights="Huber"))

The system.time function was only added to output the duration of the model fitting. If you are not interested in this, you can of course omit it.

5.1. Inspect the model

protLMFranc is a protLM object.

class(protLMFranc)

Printing out a protLM object returns an overview of the first 6 accessions with their annotations and their corresponding models.

protLMFranc

The numbers in the Std.Dev. column give the estimated standard deviations of the random effects. Effects starting with ridgeGroup indicate shrunken fixed effects (we make use of the random effects framework of the lm4 package in order to estimate Ridge penalties). However, as the fixed effects in the ridgeGroups are orthogonalized using the Gramm-Schmidt procedure, these numbers are not very informative.

A protLM object has three slots:

an accession slot: a vector containing all the accessions
a model slot: a list of all fitted models (one for each accession)
an annotation slot: data frame of which each row (one for each accession) corresponds to the annotations of a different accession.

These slots can be accessed using the getAccessions, getModels and getAnnotations functions respectively. Here we give the accessions, models and annotations for the first 5 models.

getAccessions(protLMFranc)[1:5]
getModels(protLMFranc)[1:5]
getAnnotations(protLMFranc)[1:5,]

protLM objects can be subsetted using either their numeric index or their corresponding accession.

Retain only the part corresponding to model 24 and model 50.

protLMFranc[c(24,50)]

Inspect the protLM object corresponding to accession "WP_003033643".

protLMFranc["WP_003033643"]

Say we want to inspect cell division protein "WP_003040227". We can then select protein "WP_003040227" from the protLM object. We can extract the model via the getModels function. simplify=TRUE is the default setting and indicates that if you select one model, you only want this model and not a list with as only element this model. Note that if you select multiple models, you automatically receive a list of models.

modelWP_003040227 <- getModels(protLMFranc["WP_003040227"], simplify=TRUE)
modelWP_003040227

The betaBVcovDf function extracts a list from a model containing the following four elements:

beta: the estimated parameters: both fixed effect sizes and BLUPs (best linear unbiased predictors) for the random effects
vcov: the variance-covariance matrix
df: the residual degrees of freedom
sigma: the residual standard deviation
df_exp: if the number of experimental units would be specified: a more conservative estimate for the degrees of freedom based on the number of experimental units minus the degrees of freedom lost due to between-treatment effects. Otherwise: NULL.

betaBVcovDf <- getBetaVcovDf(modelWP_003040227)

betaBVcovDf$beta
betaBVcovDf$vcov
betaBVcovDf$df
betaBVcovDf$df_exp
betaBVcovDf$sigma

The getBetaVcovDfList function can be applied on (a subset of) the protLM object protLMFranc. This returns a list containing one or more lists similar to the betaBVcovDf list.

betaVcovDfList <- getBetaVcovDfList(protLMFranc[1:2])
#Inspect the structure of the list without printing it out compeletely:
str(betaVcovDfList)

6.2 Save your results

You can save your proteinsFranc and protLMFranc objects to a .RDatas object that can be loaded in R or in the Shiny graphical user interface. If you want your .RDatas object to be loaded into the Shiny App, you should also specify the levelOptions, random and plot2DependentVars objects. These objects are settings that are saved automatically when using the MSqRob Shiny App. In this example, I save the .RDatas object to my desktop.

#Change to eval=TRUE to execute this piece of code.

result_files_new <- list()
result_files_new$proteins <- proteinsFranc
result_files_new$models <- protLMFranc

### This part is only needed if you want to load your .RDatas file in the MSqRob Shiny App ###

#Should contain all contrast options for the fixed effects (see also 5. Statistical inference)
result_files_new$levelOptions <- c("genotypeWT","genotypeKO")
#Should contain all random effects
result_files_new$random <- c("Sequence","biorep","run")
#Should contain a list with all fixed and random effects
result_files_new$plot2DependentVars <- as.list(c("genotype","Sequence","biorep","run"))

#############################################################################################

saves_MSqRob(result_files, file="/Users/lgoeminn/Desktop/project_Francisella.RDatas") #Change file path to your desired save location!

If you want to load your saved file, first specify the folder where it is saved.

#Change to eval=TRUE to execute this piece of code.
result_path <- "/Users/lgoeminn/Desktop/project_Francisella.RDatas"

To inspect the different objects in the file, you can use the inspect_loads_MSqRob function.

#Change to eval=TRUE to execute this piece of code.
inspect_loads_MSqRob(result_path)

To load this file and assign new protdata and protLM object to it, execute the following piece of code.

#Change to eval=TRUE to execute this piece of code.

#To load the "project_Francisella.RDatas" file
result_files <- loads_MSqRob(result_path)

#assign a new protdata object to the saved proteins
proteinsFranc_result <- result_files$proteins
#assign a new protLM object to the saved models
protLMFranc_result <- result_files$models

7. Statistical inference

We are interested in which proteins differ significantly in abundance between ArgP mutant and wild type Francisella tularensis. Via the "MSqRob_levels" attribute, you can check which factor levels are present in each model. Pick a model for which there are no missing levels for the factors for which you want to test contrasts and inspect the names. Note that these names are always the combination of the parameter name and the factor level.

attr(modelWP_003040227,"MSqRob_levels")

Here, we notice that the mutant is encoded as "genotypeKO" and the wild type as "genotypeWT". We construct a contrast matrix L for inferring on the fold change of interest. We are interested in the log2 fold change between mutant and wild type. As all intensities are log2-transformed, we should take the difference between "genotypeKO" and "genotypeWT" (because of the following property of logarithms: log2(a/b)=log2(a)-log2(b)).

L <- makeContrast(contrasts=c("genotypeKO-genotypeWT"),
                  levels=c("genotypeWT","genotypeKO"))
L

Imagine we would want to test the average log2 protein intensity between mutant and wild type, then L would be constructed as follows:

L2 <- makeContrast(contrasts=c("genotypeKO/2+genotypeWT/2"),
                  levels=c("genotypeWT","genotypeKO"))
L2

Important notification: do not use the estimated parameters beta to determine the factor levels! You will notice for example that the level "genotypeWT" is missing in the column names of the betas. This is because "genotypeWT" is the reference class, giving the parameter corresponding to "genotypeKO" the interpretation of the effect of "genotypeKO" relative to "genotypeWT". It is important however to include "genotypeWT" in the contrast matrix because for proteins where factor levels are missing, these parameters can get different interpretations. The makeContrast function takes this into account and will return NA when a contrast cannot be properly estimated.

betaBVcovDf$beta

Build a data frame containing estimate, standard error, degrees of freedom, T-value and p-value for each protein in the protLM object protLMFranc, by using the test.protLMcontrast function.

resultFranc <- test.protLMcontrast(protLMFranc, L)

The prot.p.adjust function adds a new column "qval" to the data frame based on an existing "pval" column. The new column contains adjusted p-values using one of the methods in the p.adjust.methods vector.

resultFranc <- prot.p.adjust(resultFranc, method="fdr")

The prot.signif function needs a matrix (or a list of matrices) containing a column named "pval" and a column named "qval". The "pval" column will be used to sort the matrix. The function generates a new column "signif" containing 1 for all values of "qval" with a value lower than the specified FDR level (default: 0.05) and 0 otherwise.

resultFranc <- prot.signif(resultFranc, level = 0.05)

Inspect the first 6 significant proteins:

head(resultFranc,6)

Advanced: importing as a data frame

It is also possible to import your data as a simple data frame and make use of some custom preprocessing pipeline. Here, we will show the same preprocessing pipeline for the Francisella dataset as outlined above, but note that adding, removing or modifying steps in the preprocessing pipeline is now very simple as the data is in data frame format.

First, we import the Francisella dataset as a data frame via the base R function read.table.

pepFrancDf <- read.table(file = file_peptides_txt, sep = "\t", header = TRUE, quote="", comment.char = "")

Of course, other filetypes can also be imported. Say for example your input file is a comma separated file stored in My Documents on a Windows computer, you could use for example:

#Change to eval=TRUE to execute this piece of code.
pepFrancDf <- read.table("C:/Users/Ludger/Documents/peptides.csv", sep = ",", dec=".", header = TRUE, quote="", comment.char = "")

The data frame pepFrancD is in wide format.

head(pepFrancDf)

If we want to mimick the MaxQuant preprocessing pipeline, we need to add a factor that indicates whether a protein is only identified by modified peptides. This requires some basic R programming. The information about which proteins are only identified by modified peptides can be found in the proteinGroups.txt file. We again make use of the proteinGroups.txt file that is delivered with the package.

#Import the proteinGroups.txt file delivered with the MSqRobData package 
file_proteinGroups <- system.file("extdata/Francisella", "proteinGroups.txt", package = "MSqRobData")

  proteinGroups <- read.table(file_proteinGroups, sep="\t", header=TRUE, quote="", comment.char = "")

  #Extract the column that indicates which proteins are only identified by a modification site
  only_site <- proteinGroups$Only.identified.by.site

  #If there are no such proteins (as is the case here), the column will be completely empty and R will import this by default as "NA". However, we want the an empty value instead of "NA" to keep this column consistent with the "Contaminant" and "Reverse" columns.
  only_site[is.na(only_site)] <- ""
  #Select the protein accessions that are only identified by one or more modified peptides
  filter_symbol <- "+"
  removed_proteins <- proteinGroups$Protein.IDs[only_site==filter_symbol]

  #Create a logical variable "rem" that holds "TRUE" if a row in the pepFrancDf data frame should be removed and FALSE otherwise.
  accession <- "Proteins"
  rem <- as.character(pepFrancDf[,accession]) %in% as.character(removed_proteins)

  #Create a new column "only_site" in "pepFrancDf" that indicates with a "+" which rows should be removed.
  pepFrancDf$only_site <- ""
  pepFrancDf$only_site[rem] <- "+"

Note that in this particular case, the previous chunck of code was superfluous, as there are no proteins in the dataset that are only identified by modified peptides.

We can easily preprocess the data frame using the preprocess_wide function (for data in "long" format, use the preprocess_long function). This function uses our standard preprocessing pipeline on data frames in "wide" format (i.e. the data frame has one observation row per subject with each measurement present as a different variable).

We already made the exp_annotation object before (see "2.1. Create an experimental annotation data frame."), so we will not repeat this code again here. Importantly, however, contrary to the exprs slot in the peptidesFranc2 object created before, the names of all intensity columns in data frame pepFrancDf start with "Intensity.".

This is because the default setting remove_pattern=TRUE in the import2MSnSet function removes the prefix "Intensity\ " from the columns containing the intensities as this allows users that are unfamiliar with R to just copy and paste the names they provided in MaxQuant's "Experiment" column as an input. Here, we just imported the peptides.txt file as it is. We thus create a new experimental annotation data frame called exp_annotation2 with run names starting with "Intensity." (a space is not syntactically valid in R, so it is automatically converted to a dot).

runs2 <- colnames(pepFrancDf)[grepl("Intensity.",colnames(pepFrancDf))]
runs2
#Extract the relevant part of the mass spec run names that points to the genotype.
genotype2 <- factor(substr(runs2, 11,13))
genotype2
n <- nchar(runs2)
biorep2 <- as.factor(paste0("b_",factor(as.numeric(factor(paste0(substr(runs2, 10,10),substr(runs2, n-2,n-2))))+c(rep(0,9),rep(3,9)))))
biorep
exp_annotation2 <- data.frame(run=runs2, genotype=genotype2, biorep=biorep2)
exp_annotation2

The preprocess_wide function allows for the same preprocessing as the preprocess_MaxQuant function on a regular data frame. If you inspect this function, you will notice that the same 8 preprocessing steps are executed. Remember that you can do any preprocessing you like using basic R data frame manipulations. The split argument indicates which character string is used to separate the accession groups. The exp_annotation argument, like before, contains the experimental annotation data frame or a character string indicating the filepath where the experimental annotation is saved as an Excel file or a tab-delimited file. The quant_cols argument contains a vector of names or a vector of numeric indices pointing to the columns containing the quantitative values of interest (mostly intensities or areas under the curve). If quant_cols contains only one character string, each column containing this pattern will be regarded as a column containing the quantitative values of interest. aggr_by indicates the column by which the data should be aggregated. Here, the data is already aggregated to the peptide level (i.e. over different charge states and modification statuses). We need no further aggregation, thus we provide the "Sequence" column (the level to which aggregation has already been done). aggr_function will be superfluous here, but if further aggregation would have been necessary, raw intensities would be summed up (aggr_function="sum"). The other parameters are similar to the preprocess_MaxQuant function.

pepFrancDf2 <- preprocess_wide(pepFrancDf, accession="Proteins", split=";", exp_annotation=exp_annotation2, quant_cols="Intensity.", aggr_by="Sequence", aggr_function="sum", logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("GI.number","Protein.names","Sequence"), filter=c("Contaminant","Reverse","only_site"), filter_symbol="+", minIdentified=2)

Note that the experimental annotation is now present as an attribute of the pepFrancDf2 data frame.

attr(pepFrancDf2,"MSqRob_exp_annotation")

After custom preprocessing, you can transform the data to a protdata object via the df2protdata function.

The acc_col argument in the df2protdata function indicates the column index or column name of the accessions (i.e. the protein (group) identifiers). The quant_cols argument is a vector of indices pointing to all columns containing peptide intensities. In the example below, the column named "Proteins" will be added to the accession slot while all columns with indices in the quant_cols vector will be added as a matrix to the intensities slot. quant_name indicates the name that will be given to the column containing the (preprocessed) intensities and run_name indicates the name that will be given to the column containing the mass spec run names in the new protdata object. Like before, we keep "GI.number" and "Protein.names" as annotations.

#Select columns with intensities.
quant_cols <- which(grepl("Intensity.", colnames(pepFrancDf2)))
quant_cols

#Call df2protdata function
proteinsFrancDf <- df2protdata(pepFrancDf2, acc_col="Proteins", quant_cols=quant_cols, quant_name="quant_value", run_name="run", annotations=c("GI.number","Protein.names"))
#Inspect proteinsFrancDf protdata object.
proteinsFrancDf

Note that the proteinsFrancDf object is the same as the proteinsFranc object, only with a different order of columns in their data slot.

CPTAC dataset (randomized complete block design)

library(MSqRob)
library(Biobase)

This example involves the CPTAC Study 6. In this dataset Sigma's UPS1 standard containing 48 human proteins was spiked into a reference yeast proteome and analyzed on 7 instruments in 5 different laboratories (Paulovich et al., 2010). Raw data files can be downloaded at: https://cptac-data-portal.georgetown.edu/cptac/dataPublic/list?currentPath=%2FPhase_I_Data%2FStudy6&nonav=true

Here, we limited ourselves to the data of LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56. The data was searched with MaxQuant version 1.5.2.8. Detailed search settings can be found in Goeminne et al. (2016).

1. Import MaxQuant's peptides.txt file

First, specify the location of the peptides.txt MaxQuant output file. By default, MaxQuant creates a "combined" folder in the same folder as the raw files are (or if the raw files are in different directories, in the directory of the first raw file). The peptides.txt file can then be found in "path_to_raw_files/combined/txt/peptides.txt". Here, we make use of the peptides.txt file that is delivered with the package.

file_peptides_txt <- system.file("extdata/CPTAC", "peptides.txt", package = "MSqRobData")

Import MaxQuant's peptides.txt file and convert it to an MSnSet object (see package MSnbase).

peptidesCPTAC <- import2MSnSet(file_peptides_txt, filetype="MaxQuant", remove_pattern=TRUE)

The "pattern" argument indicates which columns in the peptides.txt file contain peptide intensities. Currently, all columns containing "Intensity\ " in their column names will be treated as peptide intensities (default for MaxQuant). remove_pattern=TRUE (default) indicates that the name of the pattern is removed from the column names.

2. Preprocessing

2.1. Add information on the design: create an experimental annotation data frame.

Extract the names of the mass spec runs via the sampleNames function from the Biobase package.

runs <- sampleNames(peptidesCPTAC)

Note that as the names again start with a number, an "X" is prepended to make the names syntactically valid. Add a grouping factor for the spike-in condition (i.e. 6A, 6B, 6C, 6D and 6E).

condition <- factor(substr(runs, 2,3))

Add a grouping factor for the lab effect (i.e. LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56).

lab <- factor(rep(rep(1:3,each=3),5))
#Alternative:
#lab <- factor((as.numeric(substr(runs, 5,5))-1)%/%3+1)

Create an experimental annotation data frame.

exp_annotation <- data.frame(run=runs, condition=condition, lab=lab)
exp_annotation

Remember that the exp_annotation can also be given as an Excel file or a tab delimited file!

2.2. Preprocess the data using the preprocess_MaxQuant function.

If we want to remove proteins that are only identified by modified peptides from the dataset, we also need the proteinGroups.txt file. This file can be found in the same location as the peptides.txt file ("path_to_raw_files/combined/txt/proteinGroups.txt"). Here, we make use of the proteinGroups.txt file that is delivered with the package.

Note: if you don't want to remove proteins that are only identified by modified peptides, set remove_only_site=FALSE and leave the file_proteinGroups argument at its default value (NULL).

file_proteinGroups <- system.file("extdata/CPTAC", "proteinGroups.txt", package = "MSqRobData")

Important: in the CPTAC dataset, some human UPS proteins are NOT contaminants as these proteins were spiked in on purpose! We only want to remove those contaminants that are not human UPS proteins. We thus need to unmark these proteins as contaminant before preprocessing.

fData(peptidesCPTAC)[grepl("_HUMAN_UPS", fData(peptidesCPTAC)$Proteins),]$Potential.contaminant <- ""

Add to useful_properties all column names of the fData slot of the peptidesCPTAC MSnSet object that you would like to keep (such as gene names, protein names, ontologies, etc.). Other columns will be removed for improved efficiency and memory usage. We certainly want to keep the "Proteins" slot, as this will be our protein (group) identifier. You also want to keep the "Sequence" slot, as peptide-specific effects must be included in our model.

peptidesCPTAC2 <- preprocess_MaxQuant(peptidesCPTAC, accession="Proteins", exp_annotation=exp_annotation, logtransform=TRUE, base=2, normalisation="quantiles", smallestUniqueGroups=TRUE, useful_properties=c("Proteins","Sequence"), filter=c("Potential.contaminant","Reverse"), remove_only_site=TRUE, file_proteinGroups=file_proteinGroups, filter_symbol="+", minIdentified=2)

The following preprocessing steps are executed in this order:

Peptide intensities are log2-transformed (logtransform=TRUE, base=2).
As a normalization approach, we choose to quantile normalize the peptide intensities (normalisation="quantiles"). Other options are "sum", "max", "center.mean", "center.median", "quantiles.robust", "vsn" or "none" (see MSnbase::normalize for more information).
Handling overlapping protein groups (smallestUniqueGroups=TRUE): in our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup. Peptides that map to protein groups for wich also subgroups are present in the dataset are removed from the dataset.
Remove reverse sequences and contaminants (filter=c("Potential.contaminant","Reverse")).
Remove all proteins that are only identified by peptides carrying a modification site (remove_only_site=TRUE, file_proteinGroups=file_proteinGroups).
Remove all superfluous columns in the fData slot, except the "Proteins" and "Sequence" columns (useful_properties=c("Proteins","Sequence")).
Remove all peptides that are only identified once in the dataset (minIdentified=2).
Add experimental annotation.

2.3. Advanced alternative: do the preprocessing yourself.

It is possible to do all the preprocessing steps yourself. That way, you can do more advanced preprocessing or change the order of certain preprocessing steps if you like to.

Log2-transform the peptide intensities.

peptidesCPTAC3 <- log(peptidesCPTAC, base=2)
#Alternative: 
#peptidesCPTAC3 <- log2(peptidesCPTAC3)

After log2-transformation, we want to replace -Inf values in the peptide intensities by NA values as these values originate from zeros in the untransformed intensities matrix, which means these intensities were actually missing.

exprs <- exprs(peptidesCPTAC3)
exprs[is.infinite(exprs)] <- NA
exprs(peptidesCPTAC3) <- exprs

As a normalization approach, we choose to quantile normalize the peptide intensities.

peptidesCPTAC3 <- normalize(peptidesCPTAC3, "quantiles")

Handling overlapping protein groups: in our approach a peptide can map to multiple proteins, as long as there is none of these proteins present in a smaller subgroup.

groups2 <- smallestUniqueGroups(fData(peptidesCPTAC3)$Proteins)
sel <- fData(peptidesCPTAC3)$Proteins %in% groups2
peptidesCPTAC3 <- peptidesCPTAC3[sel]
head(fData(peptidesCPTAC3))

Remove reverse sequences and contaminants.

#Remove reverse sequences
sel <- fData(peptidesCPTAC3)$Reverse != "+"
peptidesCPTAC3 <- peptidesCPTAC3[sel]
head(exprs(peptidesCPTAC3))

#Remove contaminants
sel <- fData(peptidesCPTAC3)$Potential.contaminant != "+"
peptidesCPTAC3 <- peptidesCPTAC3[sel]
head(exprs(peptidesCPTAC3))

Small check:

length(unique(fData(peptidesCPTAC3)$Proteins[grepl('_HUMAN_UPS',fData(peptidesCPTAC3)$Proteins)]))

46 out of a total 48 spiked-in human UPS proteins are present in this dataset with at least one peptide sequence identified.

Remove proteins that are only identified by modified peptides.

  proteinGroups <- read.table(file_proteinGroups, sep="\t", header=TRUE, quote="", comment.char = "")
  only_site <- proteinGroups$Only.identified.by.site
  only_site[is.na(only_site)] <- ""
  removed_proteins <- proteinGroups$Protein.IDs[only_site=="+"]

  sel <- !(as.character(Biobase::fData(peptidesCPTAC3)[,"Proteins"]) %in% as.character(removed_proteins))
  peptidesCPTAC3 <- peptidesCPTAC3[sel]

Retain only those properties in the properties slot that are useful for our further analysis.

fData(peptidesCPTAC3) <- fData(peptidesCPTAC3)[,c("Proteins","Sequence")] #, "ups"

How many times shoud a peptide be identified in order to keep it in the dataset? We require at least 2 identifications of a peptide sequence, as with 1 identification, the models will be perfectly confounded.

minIdentified <- 2
keepers <- rowSums(!is.na(exprs(peptidesCPTAC3)))>=minIdentified
peptidesCPTAC3 <- peptidesCPTAC3[keepers]
head(exprs(peptidesCPTAC3))

#Remove unused levels
fData(peptidesCPTAC3) <- droplevels(fData(peptidesCPTAC3))

Add the experimental design to the MSnSet object.

rownames(exp_annotation) <- exp_annotation[,"run"]
peptidesCPTAC3 <- MSnbase::MSnSet(exprs=exprs(peptidesCPTAC3), fData=fData(peptidesCPTAC3), pData=exp_annotation)

Check whether our manually preprocessed peptidesCPTAC3 is equal to peptidesCPTAC2.

all(exprs(peptidesCPTAC2)==exprs(peptidesCPTAC3), na.rm=TRUE)
all(fData(peptidesCPTAC2)==fData(peptidesCPTAC3), na.rm=TRUE)
all(pData(peptidesCPTAC2)==pData(peptidesCPTAC3), na.rm=TRUE)

3. Convert the MSnSet object peptidesCPTAC2 into a protdata object proteinsCPTAC.

Caution: this step might take some time! (about 5 minutes on our system). The accession element denotes by which column the individual peptide observations should be grouped.

system.time(proteinsCPTAC <- MSnSet2protdata(peptidesCPTAC2, accession="Proteins"))

4. Main Analysis

4.1. Fit the model

When your goal is to analyze the CPTAC data in the recommended way (i.e. a robust ridge model with variance shrinkage and M estimation), your model should be constructed look as follows:

system.time(modelCPTAC_RR <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","lab"), random=c("Sequence","run"), shrinkage.fixed=NULL, weights="Huber", squeezeVar=TRUE))

Caution: model fitting takes some time with large datasets! On our computer, fitting the robust ridge model took about 13 - 18 minutes.

4.2. Save your results

Saving results is completely analogous to the Francisella example.

4.3. Comparing different types of models

In this section, we fit different types of models to the CPTAC data, analogous to what has been done in Goeminne et al. (2016) in order to compare their performances. We compare an ordinary least squares (OLS) model with and without M estimation (Huber weights) to a ridge model with and without M estimation. As we also want to assess the impact of the empirical Bayes variance estimation, we set squeezeVar=FALSE so that no variance squeezing is performed yet.

In order to do an objective comparison between different approaches, we chose here not to include the "run" effect, because this leads to massive overfitting in the ordinary least squares approaches. This effect will mostly be equal to zero anyways because each repeat in the CPTAC dataset is in fact a technical repeat and therefore the variability in peptide intensities between samples will be very small. When doing a standard robust ridge analysis on true biological data, we advise however to include the "run" effect as a random effect. As the simultanous shrinkage of fixed effects was not yet implemented at the time, we provide "lab" as a random effect in order to get results that are as close as possible to those of Goeminne et al. (2016).

Fit an OLS model (shrinkage.fixed=c(0,0,0,0)) without M estimation (weights=NULL).

system.time(modelCPTAC_Lm <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","Sequence","lab"), random=NULL, shrinkage.fixed=c(0,0,0,0), weights=NULL, squeezeVar=FALSE))

Fit an OLS model (shrinkage.fixed=c(0,0,0,0)) with M estimation (weights=Huber).

system.time(modelCPTAC_Lm_M <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed=c("condition","Sequence","lab"), random=NULL, shrinkage.fixed=c(0,0,0,0), weights="Huber", squeezeVar=FALSE))

Fit a ridge model (shrinkage.fixed=c(0,1)) without M estimation (weights=NULL).

system.time(modelCPTAC_Ridge <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed="condition", random=c("Sequence","lab"), shrinkage.fixed=c(0,1), weights=NULL, squeezeVar=FALSE))

Fit a ridge model (shrinkage.fixed=c(0,1)) with M estimation (weights=Huber).

system.time(modelCPTAC_Ridge_M <- fit.model(protdata=proteinsCPTAC, response="quant_value", fixed="condition", random=c("Sequence","lab"), shrinkage.fixed=c(0,1), weights="Huber", squeezeVar=FALSE))

As you noticed, an OLS model can be fit by setting shrinkage.fixed=c(0,0,0,0). This indicates that neither the intercept and none of our three fixed effects should be shrunken. The ridge model is here specified by setting shrinkage.fixed=c(0,1). This indicates that the intercept should not be shrunken, but the fixed effect "condition" should be shrunken. Alternatively, one could also set shrinkage.fixed=NULL as the shrinkage of all fixed effects except for the intercept is the default behaviour of MSqRob.

4.4. Inspect the models

attr(getModels(modelCPTAC_RR[1]),"MSqRob_levels")

MSqRob always tries to fit a model, but some models are overparameterized because too many parameters are fit on too few observations. These models have convergence problems and can be removed from the data prior to estimating p-values. This is only relevant for models that perform shrinkage or use any kind of random effect.

Extract the data as a list from the protdata object proteinsCPTAC.

data <- getData(proteinsCPTAC)

4.5. Avanced and optional: check which models would fail to converge by building a linear mixed effects model.

convergence <- lapply(data, function(x){return(tryCatch(lme4::lFormula(formula("quant_value~1+(1|condition)+(1|lab)+(1|Sequence)"),x,control = lme4::lmerControl()), error=function(e){
  return(NA)
}))})

Get the indices of these overparameterized models.

na_indices <- which(is.na(convergence))

See which proteins you removed.

getAccessions(proteinsCPTAC)[na_indices]

Both UPS proteins in this list are only identified by one peptide.

getData(proteinsCPTAC["P69905ups|HBA_HUMAN_UPS"])
getData(proteinsCPTAC["P41159ups|LEP_HUMAN_UPS"])

Remove the overparameterized models from the protLM object modelCPTAC_RR and from the data.

modelCPTAC_RR_cleaned <- modelCPTAC_RR[-na_indices]
data <- data[-na_indices]

5. Statistical inference

Construct the contrast matrix L for inferring on the fold changes of interest.

L <- makeContrast(contrasts=c("condition6B-condition6A","condition6C-condition6A","condition6D-condition6A","condition6E-condition6A","condition6C-condition6B","condition6D-condition6B","condition6E-condition6B","condition6D-condition6C","condition6E-condition6C","condition6E-condition6D"),
                  levels=c("condition6A","condition6B","condition6C","condition6D","condition6E"))

Make the DA rankings for the ordinary least squares model (no ridge, no squeezed variances, no M estimation).

#modelCPTAC_Lm <- squeezePars(modelCPTAC_Lm, squeezeVar=FALSE)
lm_normal <- test.protLMcontrast(modelCPTAC_Lm, L)
lm_normal <- prot.p.adjust(lm_normal)
lm_normal <- prot.signif(lm_normal)

Make the DA rankings for the ordinary least squares model with squeezed variances (no ridge, no M estimation).

modelCPTAC_Lm_Sq <- squeezePars(modelCPTAC_Lm, squeezeVar=TRUE)
lm_Sq <- test.protLMcontrast(modelCPTAC_Lm_Sq, L)
lm_Sq <- prot.p.adjust(lm_Sq)
lm_Sq <- prot.signif(lm_Sq)

Make the DA rankings for the ordinary least squares model with squeezed variances and M estimation (no ridge).

modelCPTAC_Lm_M_Sq <- squeezePars(modelCPTAC_Lm_M, squeezeVar=TRUE)
lm_SqM <- test.protLMcontrast(modelCPTAC_Lm_M_Sq, L)
lm_SqM <- prot.p.adjust(lm_SqM)
lm_SqM <- prot.signif(lm_SqM)

Make the DA rankings for the ridge regression model (no squeezed variances, no M estimation).

#modelCPTAC_Ridge <- squeezePars(modelCPTAC_Ridge, squeezeVar=FALSE)
ridge <- test.protLMcontrast(modelCPTAC_Ridge, L)
ridge <- prot.p.adjust(ridge)
ridge <- prot.signif(ridge)

Make the DA rankings for the ridge regression model with squeezed variances (no M estimation).

modelCPTAC_Ridge_Sq <- squeezePars(modelCPTAC_Ridge, squeezeVar=TRUE)
ridgeSq <- test.protLMcontrast(modelCPTAC_Ridge_Sq, L)
ridgeSq <- prot.p.adjust(ridgeSq)
ridgeSq <- prot.signif(ridgeSq)

Make the DA rankings for the ridge regression model with squeezed variances and M estimation. This is our robust ridge (RR) approach (the default MSqRob approach, although we included "lab" here as a random effect and we did not include a "run" effect).

modelCPTAC_Ridge_M_Sq <- squeezePars(modelCPTAC_Ridge_M, squeezeVar=TRUE)
ridgeSqM <- test.protLMcontrast(modelCPTAC_Ridge_M, L)
ridgeSqM <- prot.p.adjust(ridgeSqM)
ridgeSqM <- prot.signif(ridgeSqM)

Our standard approach is to calculate the degrees of freedom via the trace of the Hat matrix. One could as well use custom degrees of freedom, for example in order to work more conservatively. Indeed, if we consider the peptides as pseudo-replicates, our experimental units are the different spike-in conditions in each lab. The number of degrees of freedom is then the number of combinations lab x condition minus the number of parameters estimated for lab (i.e. the number of labs minus 1) minus the number of parameters estimated for condition (i.e. the number of conditions minus 1) minus 1 for the intercept. We show an example for the modelCPTAC_RR model.

custom_dfs <- sapply(data, function(x){return(length(unique(paste0(x$condition,x$lab)))-(length(unique(x$condition))-1)-(length(unique(x$lab))-1)-1)})
custom_dfs

ridgeSqM_custom <- test.protLMcontrast(modelCPTAC_RR, L, custom_dfs = custom_dfs)
ridgeSqM_custom <- prot.p.adjust(ridgeSqM_custom)
ridgeSqM_custom <- prot.signif(ridgeSqM_custom)

6. Diagnostic plots and tables

In order to be able to make diagnostic plots and tables, load the required functions from our GitHub page.

source("https://raw.githubusercontent.com/statOmics/MSqRob/master/vignettes/plotfunctions.R")

6.1. Make diagnostic tables

Extra: add a column containing TRUE if the corresponding protein is a spiked-in UPS1 protein and FALSE if it is a yeast protein. For this, we use the addTPcol function. pattern indicates for the pattern in the protein name that points to a true positive and TPcolname is the name to be given to the added column.

lm_normal <- addTPcol(lm_normal, pattern="UPS", TPcolname="ups")
lm_Sq <- addTPcol(lm_Sq, pattern="UPS", TPcolname="ups")
lm_SqM <- addTPcol(lm_SqM, pattern="UPS", TPcolname="ups")
ridge <- addTPcol(ridge, pattern="UPS", TPcolname="ups")
ridgeSq <- addTPcol(ridgeSq, pattern="UPS", TPcolname="ups")
ridgeSqM <- addTPcol(ridgeSqM, pattern="UPS", TPcolname="ups")

The makeSummary function allows to evaluate the performance of all methods on the CPTAC dataset. For this, we need to create a vector that contains the theoretical log2 fold changes of the spiked-in UPS1 proteins.

upratio <- c(1.565597,3.137504,4.744161,6.321928,1.571906,
             3.178564,4.756331,1.60665,3.18442,1.577767)

Create the summary objects.

summary_lm <- makeSummary(lm_normal, upratio, TPcol="ups")
summary_lmSq <- makeSummary(lm_Sq, upratio, TPcol="ups")
summary_lmSqM <- makeSummary(lm_SqM, upratio, TPcol="ups")
summary_ridge <- makeSummary(ridge, upratio, TPcol="ups")
summary_ridgeSq <- makeSummary(ridgeSq, upratio, TPcol="ups")
summary_ridgeSqM <- makeSummary(ridgeSqM, upratio, TPcol="ups")

Results

summary_lm
summary_lmSq
summary_lmSqM
summary_ridge
summary_ridgeSq
summary_ridgeSqM

6.2. Make ROC curves

colors <- c(1,2,3,4,5,6)
resultobjectlist <- list(lm_normal,lm_Sq,lm_SqM,ridge,ridgeSq,ridgeSqM)

truth <- lm_normal[[1]][,"ups"]
names(truth) <- rownames(lm_normal[[1]])
summary_objectlist <- lapply(resultobjectlist, function(x){return(makeSummary(x, upratio))})

plotROC(resultobjectlist, summary_objectlist=summary_objectlist, proteins=proteinsCPTAC, truth=truth, colors=colors, plotSVG=FALSE)

References

Argentini, A., Goeminne, L. J. E., Verheggen, K., Hulstaert, N., Staes, A., Clement, L. and Martens, L. (2016) moFF: a robust and automated approach to extract peptide ion intensities. Nature Methods 13, pp 964–966.

Bolstad, B. M., Irizarry, R. A., Astrand, M., and Speed, T. P. (2003) A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2), pp 185-193. http://bmbolstad.com/misc/normalize/normalize.html

Goeminne, L. J. E., Argentini, A., Martens, L., and Clement, L. (2015) Summarization vs Peptide-Based Models in Label-Free Quantitative Proteomics: Performance, Pitfalls, and Data Analysis Guidelines. Journal of Proteome Research 14(6), pp 2457-2465.

Goeminne, L. J. E., Gevaert, K., and Clement, L. (2016) Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics. Molecular & Cellular Proteomics 15(2), pp 657-668.

Griss, J., Jones, A.R., Sachsenberg, T., Walzer, M., Gatto, L., Hartler, J., Thallinger, G.G., Salek, R.M., Steinbeck, C., Neuhauser, N., Cox, J., Neumann, S., Fan, J., Reisinger, F., Xu, Q.W., Del Toro, N., Pérez-Riverol, Y., Ghali, F., Bandeira, N., Xenarios, I., Kohlbacher, O., Vizcaíno, J.A. & Hermjakob, H. (2014) The mzTab data exchange format: communicating mass-spectrometry-based proteomics and metabolomics experimental results to a wider audience. Molecular and Cellular Proteomics 13(10), pp 2765-75.

Navarro P., Kuharev J., Gillet, L. C., Bernhardt, O. M., MacLean, B., Röst, H. L., Tate, S. A., Tsou, C., Reiter, L., Distler, U., Rosenberger, G., Perez-Riverol, Y., Nesvizhskii, A. I., Aebersold, R., and Tenzer, S. (2016) A multicenter study benchmarks software tools for label-free proteome quantification. Nature Biotechnology.

Paulovich AG, Billheimer D, Ham A-JL, et al. (2010) Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance. Molecular & Cellular Proteomic 9(2), pp 242-254.

Ramond, E., Gesbert, G., Guerrera, I. C., Chhuon, C., Dupuis, M., Rigard, M., Henry, T., Barel, M., and Charbit, A. (2015) Importance of host cell arginine uptake in Francisella phagosomal escape and ribosomal protein amounts. Molecular & Cellular Proteomics 14, pp 870–881.

statOmics/MSqRob documentation built on Dec. 8, 2022, 6 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

statOmics/MSqRob Robust statistical inference for quantitative LC-MS proteomics

In statOmics/MSqRob: Robust statistical inference for quantitative LC-MS proteomics

A first notice

Introduction

Downloading, installing and loading MSqRob

Installing the package

Loading MSqRob

The MSqRobData package

The Shiny App

Francisella tularensis dataset (design with pseudoreplication)

1. Importing data

1.1. Importing MaxQuant data

1.2. Importing other data types

2. Preprocessing

2.1. Create an experimental annotation data frame.

2.2. Preprocess the data using the preprocess_MaxQuant function.

3. Make diagnostic plots

4. Convert the MSnSet object peptidesFranc2 into a protdata object proteinsFranc.

Advanced: add information on the design after creating your protdata object

5. Inspect the data for a specific protein via a detailed protein plot

6. Fit the model

5.1. Inspect the model

6.2 Save your results

7. Statistical inference

Advanced: importing as a data frame

CPTAC dataset (randomized complete block design)

1. Import MaxQuant's peptides.txt file

2. Preprocessing

2.1. Add information on the design: create an experimental annotation data frame.

2.2. Preprocess the data using the preprocess_MaxQuant function.

2.3. Advanced alternative: do the preprocessing yourself.

3. Convert the MSnSet object peptidesCPTAC2 into a protdata object proteinsCPTAC.

4. Main Analysis

4.1. Fit the model

4.2. Save your results

4.3. Comparing different types of models

4.4. Inspect the models

4.5. Avanced and optional: check which models would fail to converge by building a linear mixed effects model.

5. Statistical inference

6. Diagnostic plots and tables

6.1. Make diagnostic tables

6.2. Make ROC curves

References

R Package Documentation

Browse R Packages

We want your feedback!

statOmics/MSqRob
Robust statistical inference for quantitative LC-MS proteomics