In bartongroup/Proteus: Downstream analysis of the MaxQuant output

library(proteus)
library(knitr)
library(ggplot2)
knitr::opts_chunk$set(echo = TRUE)

This tutorial demonstrates how to analyse data from label free MS/MS experiment in Proteus. We use an example data set from an unpublished experiment by Katharina Trunk, Sarah Coulthurst, Julien Peltier and Matthias Trost. The data are distributed in a package proteusLabelFree, which needs installing and loading first:

library(proteusLabelFree, warn.conflicts=FALSE)
data(proteusLabelFree)

devtools::install_github("bartongroup/proteusLabelFree")
library(proteusLabelFree)
data(proteusLabelFree)

Quick start

Here is a minimal example of data analysis in Proteus. You will need the evidence file from MaxQuant and a very simple text file with metadata (see below how to create it). Then, you can get from the evidence file to differential expression in just a few steps.

library(proteus)
evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
metadataFile <- system.file("extdata", "metadata.txt", package="proteusLabelFree")

evi <- readEvidenceFile(evidenceFile)
meta <- read.delim(metadataFile, header=TRUE, sep="\t")
pepdat <- makePeptideTable(evi, meta)
prodat <- makeProteinTable(pepdat)
prodat.med <- normalizeData(prodat)
res <- limmaDE(prodat.med)
plotVolcano_live(prodat.med, res)

Input data

Files

Proteus needs one input file: the evidence file from MaxQuant.

In this tutorial we are also going to use a metadata file. It should be created manually as a tab-delimited text file with a header containing several columns of metadata. The mandatory columns are "experiment", "measure", "sample" and "condition" (see section [Metadata]).

Alternatively, a metadata data frame (with the same columns) can be created within the R session.

Read evidence file

Function readEvidenceFile() reads MaxQuant evidence file, filters out contaminants and reverse sequences. In this example we use the evidence file from the package proteusLabelFree (you need to install it if you want to follow this tutorial):

evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
evi <- readEvidenceFile(evidenceFile)

evi is a data frame with selected columns from the evidence file:

head(evi)

It can be quite large:

dim(evi)
format(object.size(evi), units="Mb")

Columns in evidence file

MaxQuant evidence files come with column names that can differ, depending on the MaxQuant version and the parameters used in its run. Many of these column names contain spaces and other characters (e.g. "/") that make R data processing awkward. Therefore, upon reading the evidence file, Proteus renames its columns to simpler names, conforming with R variable name restrictions. Also, many of the evidence columns are not needed in Proteus processing, so these are not kept to save memory.

The selection and naming of columns is controlled by two parameters in function readEvidenceFile(): measure.cols and data.cols. Proteus comes with two predefined lists called measureColumns and evidenceColumns, which are used as defaults. They can be modified, as necessary.

Measurements

The default named list measureColumns describes evidence file columns that contain measurements. In case of label-free experiment there is only one such column, called Intensity.

str(measureColumns)

Different measure columns are needed for TMT and SILAC experiments. See the appropriate vignettes (TMT and SILAC) for more details.

Other required columns

The default named list evidenceColumns describes the minimal set of columns needed for further processing (in addition to measure columns).

str(evidenceColumns)

The names (sequence, modified_sequence, and so on) are used internally in the package and should not be changed. The values (Sequence, Modified sequence) are the actual column names in the evidence file and can be adjusted if different naming convention is used.

Updating column names

We suggest not to change the default list, but create a copy, which can be modified. Let's say we want to read column m/z as well:

myColumns <- c(evidenceColumns, mz="m/z")

If, for example, your evidence file contains a column named Leading Razor Protein (each word starting with a capital), we will have to change this value in the list:

myColumns$protein <- "Leading Razor Protein"

The new column list can be then used with readEvidenceFile:

evi_mz <- readEvidenceFile(evidenceFile, data.cols=myColumns)

To quickly check what the column names are in a given evidence file, we provide a simple function to do this:

evidenceFile <- system.file("extdata", "evidence.txt.gz", package="proteusLabelFree")
evidence.columns <- readColumnNames(evidenceFile)
evidence.columns

Metadata

We also need metadata. In our example it is stored in the file in the proteusLabelFree package:

metadataFile <- system.file("extdata", "metadata.txt", package="proteusLabelFree")
meta <- read.delim(metadataFile, header=TRUE, sep="\t")

It contains the design of our experiment:

meta

This metadata information will be attached to every peptide and protein object used by Proteus.

Metadata columns

Metadata object should be a data frame with at least these columns: experiment (not always necessary), measure, sample and condition.

Column experiment should contain the same values as in the Experiment column in the evidence file. In case of multiplexed experiments it is possible to have an evidence file without the experiment column. In such case it should not be included in metadata (see TMT vignette for more details). Here, there are 14 unique experiments. We can see all unique experiment names in the evidence data using unique function:

unique(evi$experiment)

Column measure refers to columns with measurements in the evidence file. In case of the label-free experiment there is only one measure column, "Intensity". In multiplexed data (e.g. TMT) there can be several measurements columns.
Column sample should contain (short) names corresponding to a given experiment and measure. For clarity, we recommend a short name consisting of a condition and replicate (in a simple design where such decomposition is possible). Sample identifies a given experiment/measure uniquely, so sample names must be unique.
Column condition contains condition names. This information will be used in differential expression. In our example there are two conditions, hence two values in this column.

Other columns can be added and used for the downstream analysis. Here, we added a replicate column, but other information, in particular describing batch effects, can be very useful.

Peptide data

Create a peptide dataset

We can now create a peptide data object from the evidence and metadata. The function makePeptideTable() aggregates peptide data form the evidence into a single table, where rows correspond to peptide sequences and columns correspond to samples (as provided in metadata). Where there are multiple peptides corresponding to the same sequence/sample (e.g., with different charges) their intensities are added in label-free data (we note that this might not be the best approach due to possible high variance in peptide data - more to come soon).

pepdat <- makePeptideTable(evi, meta)

pepdat is an object of class proteusData. It consists of the intensity table (pepdat$tab) and additional information. The first fife rows and columns from the intensity table are

pepdat$tab[1:5, 1:5]

We can use generic summary function to see more information about pepdat.

summary(pepdat)

Number of peptides

Function plotCount() plots, as the name suggests, peptide count in each sample. This is the number of non-zero peptide intensities per sample.

plotCount(pepdat)

Jaccard similarity

Function plotDetectionSimilarity() calculates Jaccard similarity between each pair of samples and plots its distribution. The similarity is based on detection. For a pair of samples it compares the number of detected peptides in both samples (intersection) divided by the total number of peptides detected in both samples (union). This measure of similarity is a number between 0 and 1.

plotDetectionSimilarity(pepdat, bin.size = 0.02)

Distance matrix

Function plotDistanceMatrix() calculates the Pearson's correlation coefficient for each pair of samples and plots a heatmap:

plotDistanceMatrix(pepdat)

Clustering

We can use plotClustering() to see a dendrogram of the peptide data set:

plotClustering(pepdat)

PCA

Function plotPCA() creates a principal-component analysis plot.

plotPCA(pepdat)

Removing bad data

The last sample (B-7) in the example data looks odd. Lets say we call it "bad" and want to remove from analysis. The easiest way to do this is to modify the metadata.

Warning: this is only an example. You should exercise caution when removing any of your data.

meta.clean <- meta[which(meta$sample  != 'B-7'),]

Now, we create a new "clean" peptide data set from evidence data:

pepdat.clean <- makePeptideTable(evi, meta.clean)

The new data pepdat.clean will contain only samples included in the metadata, that is

as.character(meta.clean$sample)

Clustering confirms that the offending sample is gone.

plotClustering(pepdat.clean)

Protein data

There are many approaches to aggregating peptides into proteins. For simplicity, we assign peptides to proteins based on the Leading Razor Protein. For label-free data, we quantify protein abundances using a simple, but robust method of high-flyers (Silva et al. 2006).

Create protein dataset

makeProteinTable() creates a protein data set from the peptide data. The result is a proteusData object containing protein intensity table and other information.

prodat <- makeProteinTable(pepdat.clean)

Again, we can use a generic summary function to see its properties.

summary(prodat)

Normalization

Finally, we need to normalize data to account for variation of intensity between samples. The function normalizeData() can normalize peptide or protein data. Up to this point, we haven't applied any normalization. The default normalization is to the median. After this step, median sample intensities will be equal.

prodat.med <- normalizeData(prodat)

The second parameter to normalizeData() is norm.fun and it points to a normalizing function. By default, this is normalizeMedian, but other normalizations can be used. For example, it works with normalizeQuantiles from limma package:

prodat.quant <- normalizeData(prodat, norm.fun=limma::normalizeQuantiles)

The function plotSampleDistributions() can be used to compare intensity distributions for each normalization.

plotSampleDistributions(prodat, title="Not normalized", fill="condition", method="violin")

plotSampleDistributions(prodat.med, title="Median normalization", fill="condition", method="violin")

plotSampleDistributions(prodat.quant, title="Quantile normalization", fill="condition", method="violin")

Mean-variance relationship

Some statistics (like mean and variance across replicate in each condition) are stored directly in the prodat object at the moment of creation. We can apply plotMV() function to plot the mean-variance relationship.

plotMV(prodat.med, with.loess=TRUE)

Protein clustering

We can use the same function plotClustering() to see the dendrogram for the proteins.

plotClustering(prodat.med)

Protein annotations

So far, we have analysed data using protein identifiers as provided in the evidence file. We need to annotate proteins and add more information, for example, UniProt ID, gene name, protein description. To do this, we need to prepare a data frame containing a column called protein with the original protein identifiers, as in the evidence file. Additional columns in this data frame will contain all necessary annotations. The function annotateProteins can merge such data frame into a proteusData object and the annotations can be used later to identify proteins and perform further downstream analysis (e.g. functional enrichment).

In the case of our example data protein identifiers look like sp|P00546|CDK1_YEAST. Hence, they contain a UniProt ID. We can use this information to grab annotations from UniProt servers.

First, we use strsplit to extract UniProt IDs:

luni <- lapply(as.character(prodat$proteins), function(prot) {
 if(grepl("sp\\|", prot)) {
   uniprot <- unlist(strsplit(prot, "|", fixed=TRUE))[2]
   c(prot, uniprot)
 }
})
ids <- as.data.frame(do.call(rbind, luni))
names(ids) <- c("protein", "uniprot")

This creates a data frame with two columns:

head(ids)

So, for each protein we have found its UniProt ID. A helper function fetchFromUniProt gets basic annotations from UniProt servers. Here we use the column uniprot containing r nrow(ids) UniProt IDs to fetch basic annotations (gene names and protein names) from UniProt. This might take a while!

annotations <- fetchFromUniProt(ids$uniprot, verbose=TRUE)

This is what the content of this data frame looks like:

head(annotations)

In order to feed these data into a proteusData object we need a column with the original protein identifiers. We can get this by merging the two data frames:

annotations.id <- merge(ids, annotations, by.x="uniprot", by.y="id")
annotations.id <- unique(annotations.id)

The second command is to remove duplicated entries. This is the data frame we need. The function annotateProteins will add it to the prodat.med object.

prodat.med <- annotateProteins(prodat.med, annotations.id)

The annotations are merged into prodat based on the protein column in the annotation data frame. They are stored in prodat$annotation and rows of this data frame correspond to the order of proteins in the intensity table.

Obviously, not every data set follows the identifier convention in this example. For any data set the user will need to create a data frame with the appropriate annotations, using available resources. The only requirement is that the annotation data frame contains a column called protein and this column contains protein identifiers as provided in the evidence file and available from the proteusData object in the field proteins.

Alternative processing

Modified sequence

Peptide intensities are aggregated based on the Sequence column in the evidence file. That is, the intensity for a peptide (for a given sample) is the sum (or other user-specified function) of the intensities for all entries for this sequence and experiment in the evidence file. Instead, we can use the Modified sequence column and focus on protein modifications. Then, the aggregation will be done for a given modified sequence.

Let us recall, that the evidence object in Proteus package is a data frame with column names as defined by the data.cols parameter in readEvidenceFile function. These column names are different from the original names in the evidence file, so they conform with R object naming standards. Here are the column names we have in evi:

names(evi)

Therefore, if we want to base peptide aggregation on the modified sequence, we need to specify the appropriate column name when creating peptide table, by the parameter sequence.col. The usual protein aggregation will follow:

pepdat.mod <- makePeptideTable(evi, meta, sequence.col = "modified_sequence")
prodat.mod <- makeProteinTable(pepdat.mod)

The default value of sequence.col is "sequence", so if no parameter is given, peptides will be aggregated based on their (unmodified) sequence.

Protein groups

By default protein intensities are aggregated from peptide intensities based on the Leading razor protein column from the evidence file (this column is renamed to protein by readEvidenceFile for simplicity). This means that for a given protein makeProteinTable would aggregate data from all peptides having this protein ID in the Leading razor protein column. We can change this default approach and, instead of razor proteins, we can use protein groups, as listed in Proteins column of the evidence file (this column is renamed to protein_group by readEvidenceFile to avoid confusion). The choice of column to aggregate protein data is controlled by parameter protein.col in the makePeptideTable function.

It might look surprising that protein/group selection is done at the peptide level, but this is where peptide-to-protein conversion table is build and stored in the output object (have a look at pepdat$pep2prot - it's a data frame with two columns, peptide and protein). We want to have it in the peptide object, in case peptide-level analysis needs to find out which proteins or protein groups these peptides belong to. The peptide-to-protein table is then used by makeProteinTable to aggregate proteins. Hence, if we want to aggregate proteins by protein group rather than razor proteins (which is the default), we need to do the following:

pepdat.group <- makePeptideTable(evi, meta, protein.col="protein_group")
prodat.group <- makeProteinTable(pepdat.group)

As a result, the object prodat.group will contain not individual proteins, but protein groups.

Evidence-to-peptide aggregation

Evidence file often contains multiple entries for the same peptide sequence and experiment. By default (and this is also done by MaxQuant when creating peptides table) these entries are summed. In proteus the aggregation method is controlled by the parameter aggregate.fun and can be easily modified. aggregate.fun points to a function that performs aggregation across all entries. The default function is aggregateSum, but it can be replaced with any user-provided function, if necessary.

Aggregator function

The aggregator function should have the following form:

function(wp, ...)

The input for this function is a matrix wp with peptide intensities. The ellipsis (...) indicates additional parameters passed to the function. Columns of the matrix wp correspond to samples and rows correspond to different entries from the evidence file. For example, for a peptide with sequence AASESIKVGDPFDESTFQGAQTSQMQLNK we have the following matrix:

evitab.example

There are up to 6 evidence entries per sample for this sequence. The output of the aggregator function should be a vector of aggregated intensities. In case of the default aggregator, the output is the sum of columns of the above matrix:

aggregateSum(evitab.example)

The elements of this vector correspond to the samples - columns of the input matrix. The peptide creator function, makePeptideTable uses the aggregator function to combine data for each peptide.

The other peptide aggregator function provided in the package is aggregateMedian. It calculates median of each column. We recommend using this function with SILAC data. You can do this by specifying aggregate.fun=aggregateMedian in the makePeptideTable call. See SILAC vignette for details.

User-defined aggregation function

Let us assume that instead of the default sums we want to find the maximum intensity across all entries for each peptide. We are not saying that this is a better approach, we just want to show an example of how to create a user-define peptide aggregator. The function we need takes a two-dimensional matrix as an input and returns the maximum of each column. It can be encoded as

aggregateMax <- function(wp) {
  s <- apply(wp, 2, function(x) max(x, na.rm=TRUE))
  return(as.vector(s))
}

Please note that na.rm=TRUE is necessary to ignore data gaps. Otherwise the returned vector would consist mostly of NAs. When applied to our example matrix the function returns

aggregateMax(evitab.example)

Now, we can use it to create an alternative peptide table:

pepdat.max <- makePeptideTable(evi, meta, aggregate.fun=aggregateMax)

The result will be somehow different from the default summing approach.

Peptide-to-protein aggregation

By default makeProteinTable aggregates peptides to proteins using the high-flyer method of (Silva et al. 2006). The protein intensity is the mean intensity of 3 top-intensity peptides associated with the given protein (or group of proteins). The peptide-to-protein aggregation method is controlled by the parameter aggregate.fun in makeProteinTable. It points to an aggregator function.

The aggregator function has exactly the same format as in evidence-to-peptide aggregator (see [above][Evidence-to-peptide aggregation]). That is, it takes a two-dimensional matrix (columns are samples, rows are peptides) and calculates an aggregated vector. There are three aggregators in the package: the default aggregateHifly plus the same functions we used for evidence-to-peptide aggregation, aggregateSum and aggregateMedian.

A user-defined function can be used to aggregate proteins. Please see [section on evidence-to-peptide aggregation][Evidence-to-peptide aggregation] for details.

Reading MaxQuant's protein groups file

MaxQuant output contains a file with aggregated protein data, usually called proteinGroups.txt. Protein intensities in this file are summed over protein groups. It is possible to read these data directly into Proteus and skip peptide and protein aggregation steps.

proteinGroupsFile <- system.file("extdata", "proteinGroups.txt.gz", package="proteusLabelFree")
prot.MQ <- readProteinGroups(proteinGroupsFile, meta)

This creates a proteusData object with all necessary information: that is, intensity table, metadata, protein names. This object can be visualized and analysed using Proteus functions.

Just like with evidence data, we might have to specify which columns need to be read from the file. There are two arguments in readProteinGroups that control this: measure.cols and data.cols.

measure.cols specifies columns with measurements. The default value for this argument is a named list or vector:

setNames(paste("Intensity", meta$sample), meta$sample)

Values of this list are actual column names in the file (Intensity A-1, Intensity A-2, and so on), names of the list are sample names. If columns with measurements in the protein groups file are named following a different convention, this argument will have to be changed. Make sure this is a named list with names corresponding to sample names.

data.cols specifies additional columns to be read. It defaults to a global list proteinColumn. This is a another named list, containing the following elements:

str(proteinColumns)

Again, if the majority protein ID column is named differently, this will have to be changed in data.cols argument. The three names of this list (protein, reverse and contaminant) are used internally and should not be changed or removed.

Differential expression

We suggest using limma package to do differential expression. Package Proteus contains a simple wrapper to limma, that takes a proteusData object as input.

res <- limmaDE(prodat.med, sig.level=0.05)

This function creates a data frame with DE results.

head(res)

Before limma is called, intensity data are transformed using the transform.fun function (a parameter of limmaDE). The default value for this transformation is log10. Therefore, by default, the column logFC in the output data frame contains $\log_{10}(M_1/M_2)$, where $M_k$ represent the mean of condition $k$. If you need log2-based fold change, you can use transform.fun=log2.

The significant column indicates significantly differentially expressed proteins, based on the Benjamini-Hochberg corrected p-values (column adj.P.Val) and the significance level defined when calling limmaDE (sig.level=0.05). It allows for simple filtering of the significant results (below we show only the most important columns, for simplicity):

res[which(res$significant), c("protein", "logFC", "adj.P.Val")]

Note: limmaDE requires exactly two conditions to do a differential expression on. When data contain more conditions, a parameter conditions is required to select a pair:

res <- limmaDE(prodat.med, conditions=c("A", "B"))

For more complicated designs we recommend using limma functions directly.

Proteins present in only one condition

Sometimes data for a given protein is missing entirely from a condition (that is all replicates are NA in the intensity table). In such cases, differential expression with limma returns NA log-fold-change and p-value. However, a protein detected in one condition and not detected in the other condition might be interesting if a non-detection is due to very low abundance. We can easily find such proteins using the look-up table detect in an proteusData object. It contains logical columns for each condition, with TRUE indicating that the protein was detected (in at least one replicate) and FALSE when it was not detected (in any replicate).

head(prodat$detect)

The first protein, mut-yEGFP is not detected in B condition. The proteins detected in only one condition can be found using a logical expression:

only.A <- which(prodat$detect$`A` & !prodat$detect$B)
only.B <- which(!prodat$detect$`A` & prodat$detect$B)

We can list their identifiers from the proteins field in the protein object, as in this example for condition A:

as.character(prodat$proteins[only.A])

Visualization

Proteus provides with several functions to visualize protein data and the results of the differential expression.

Fold-change-intensity plot:

plotFID(prodat.med)

Volcano plot:

plotVolcano(res)

P-value distribution plot:

plotPdist(res)

Individual proteins

We can also look at individual protein. The function plotIntensities plots intensities of individual samples per condition, or other selected quantity (e.g. batch). Here is one of the up-regulated proteins.

plotIntensities(prodat.med, id='sp|P26263|PDC6_YEAST', log=TRUE)

A better understanding of the protein's behaviour might be gained via function plotProtPeptides, which shows intensities of individual peptides and replicates for the given protein.

plotProtPeptides(pepdat.clean, 'sp|P26263|PDC6_YEAST', prodat.med)

Interactive plots

Proteus offers interactive versions of volcano and fold-change/intensity plots. They use Shiny framework to build an interactive local web page in your browser. They are called (from command line):

plotVolcano_live(prodat.med, res)

and

plotFID_live(prodat.med, res)

We strongly recommend to build protein annotations before running live functions.

bartongroup/Proteus documentation built on April 22, 2023, 5:33 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

bartongroup/Proteus Downstream analysis of the MaxQuant output