Processing quantitative proteomics data with QFeatures


Reading data as QFeatures

We are going to use a subset of the CPTAC study 6 containing conditions A and B [@Paulovich:2010]. The peptide-level data, as processed by MaxQuant [@Cox:2008] is available in the msdata package:

basename(f <- msdata::quant(pattern = "cptac", full.names = TRUE))

From the names of the columns, we see that the quantitative columns, starting with "Intensity." (note the space!) are at positions 56 to 61.

(i <- grep("Intensity\\.", names(read.delim(f))))

We now read these data using the readQFeatures function. The peptide level expression data will be imported into R as an instance of class QFeatures named cptac with an assay named peptides. We also use the fnames argument to set the row-names of the peptides assay to the peptide sequences.

cptac <- readQFeatures(f, ecol = i, sep = "\t", name = "peptides", fnames = "Sequence")

Encoding the experimental design

Below we update the sample (column) annotations to encode the two groups, 6A and 6B, and the original sample numbers.

cptac$group <- rep(c("6A", "6B"), each = 3)
cptac$sample <- rep(7:9, 2)

Filtering out contaminants and reverse hits

filterFeatures(cptac, ~ Reverse == "")
filterFeatures(cptac, ~ Potential.contaminant == "")
cptac <- cptac %>%
    filterFeatures(~ Reverse == "") %>%
    filterFeatures(~ Potential.contaminant == "")

Removing up unneeded feature variables

The spreadsheet that was read above contained numerous variables that are returned by MaxQuant, but not necessarily necessary in the frame of a downstream statistical analysis.


The only ones that we will be needing below are the peptides sequences and the protein identifiers. Below, we store these variables of interest and filter them using the selectRowData function.

rowvars <- c("Sequence", "Proteins", "Leading.razor.protein")
cptac <- selectRowData(cptac, rowvars)

Managing missing values

Missing values can be very numerous in certain proteomics experiments and need to be dealt with carefully. The first step is to assess their presence across samples and features. But before being able to do so, we need to replace 0 by NA, given that MaxQuant encodes missing data with a 0 using the zeroIsNA function.

cptac <- zeroIsNA(cptac, i = seq_along(cptac))
nNA(cptac, i = seq_along(cptac))

The output of the nNA function tells us that

In this dataset, we have such a high number of peptides without any data because the 6 samples are a subset of a larger dataset, and these peptides happened to be absent in groups A and B. Below, we use filterNA to remove all the peptides that contain one or more missing values by using pNA = 0 (which also is the default value).

cptac <- filterNA(cptac, i = seq_along(cptac), pNA = 0)

I we wanted to keep peptides that have up to 90% of missing values, corresponsing in this case to those that have only one value (i.e 5/6 percent of missing values), we could have set pNA to 0.9.


The impute method can be used to perform missing value imputation using a variety of imputation methods. The method takes an instance of class QFeatures (or a SummarizedExperiment) as input, an a character naming the desired method (see ?impute for the complete list with details) and returns a new instance of class QFeatures (or SummarizedExperiment) with imputed data.

As described in more details in [@Lazar:2016], there are two types of mechanisms resulting in missing values in LC/MSMS experiments.

MAR and MCAR values can be reasonably well tackled by many imputation methods. MNAR data, however, requires some knowledge about the underlying mechanism that generates the missing data, to be able to attempt data imputation. MNAR features should ideally be imputed with a left-censor (for example using a deterministic or probabilistic minimum value) method. Conversely, it is recommended to use hot deck methods (for example nearest neighbour, maximum likelihood, etc) when data are missing at random.

x <- assay(impute(se_na2, "zero"))
x[x != 0] <- 1
heatmap.2(x, col = c("lightgray", "black"),
          scale = "none", dendrogram = "none",
          trace = "none", keysize = 0.5, key = FALSE,
          RowSideColors = ifelse(rowData(se_na2)$randna, "orange", "brown"),
          ColSideColors = rep(c("steelblue", "darkolivegreen"), each = 8))

It is anticipated that the identification of both classes of missing values will depend on various factors, such as feature intensities and experimental design. Below, we use perform mixed imputation, applying nearest neighbour imputation on the r sum(rowData(se_na2)$randna) features that are assumed to contain randomly distributed missing values (if any) (yellow on figure \@ref(fig:miximp)) and a deterministic minimum value imputation on the r sum(!rowData(se_na2)$randna) proteins that display a non-random pattern of missing values (brown on figure \@ref(fig:miximp)).

Data transformation

When analysing continuous data using parametric methods (such as t-test or linear models), it is often necessary to log-transform the data. The figure below (left) show that how our data is mainly composed of small values with a long tail of larger ones, which is a typical pattern of quantitative omics data.

Below, we use the logTransform function to log2-transform our data. This time, instead of overwriting the peptides assay, we are going to create a new one to contain the log2-transformed data.

cptac <- addAssay(cptac,
                  name = "peptides_log")
par(mfrow = c(1, 2))


Assays in QFeatures objects can be normalised with the normalize function. The type of normalisation is defined by the method argument; below, we use quantiles normalisation, store the normalised data into a new experiment, and visualise the resulting data.

cptac <- addAssay(cptac,
                  normalize(cptac[["peptides_log"]], method = "quantiles"),
                  name = "peptides_norm")
par(mfrow = c(1, 2))

Feature aggregation

At this stage, it is possible to directly use the peptide-level intensities to perform a statistical analysis [@Goeminne:2016], or aggregate the peptide-level data into protein intensities, and perform the differential expression analysis at the protein level.

To aggregate feature data, we can use the aggregateFeatures function that takes the following inputs:

cptac <- aggregateFeatures(cptac, i = "peptides_norm", fcol = "Proteins", name = "proteins")

We obtain a final 1125 quantified proteins in the new proteins assay. Below, we display the quantitation data for the first 6 proteins and their respective variables. The latter shown that number of peptides that were using during the aggregation step (.n column).


We can get a quick overview of this .n variable by computing the table below, that shows us that we have 405 proteins that are based on a single peptides, 230 that are based on two, 119 that are based on three, ... and a single protein that is the results of aggregating 44 peptides.


Let's choose P02787ups|TRFE_HUMAN_UPS and visualise its expression pattern in the 2 groups at the protein and peptide level.

longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ]) %>% %>%
    mutate(group = ifelse(grepl("A", colname), "A", "B")) %>%
    mutate(sample = sub("Intensity\\.", "", colname)) %>%
    ggplot(aes(x = sample, y = value, colour = rowname, shape = group)) +
    geom_point() +
    facet_grid(~ assay)


Session information {-}


References {-}

Try the QFeatures package in your browser

Any scripts or data that you put into this service are public.

QFeatures documentation built on Nov. 8, 2020, 6:51 p.m.