knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Label-Free Quantification (LFQ) is the simplest form of quantitative proteomics, in which different samples are quantified in separate MS runs.
Since each sample is run separately, different peptides will be quantified in each sample and the peptide intensities may not be directly comparable between samples. One solution to the higher burden of missing values is 'match-between-runs' [@http://zotero.org/users/5634351/items/87YIDK69], or the functionally equivalent 'Minora' algorithm employed by Proteome Discoverer (PD). These algorithms use the observed retention times of MS1 ions which were successfully spectrum matched in one sample to identify the likely peptide sequence of MS1 ions that could not be spectrum matched in another sample.
Despite the pitfalls of LFQ, the data analysis is still relatively straightforward, though there are steps that need some careful consideration and quality control assessment.
Load the required libraries.
library(ggplot2) library(MSnbase) library(biobroom) library(camprotR) library(Proteomics.analysis.data) library(dplyr) library(tidyr)
We start with the peptide-level output from PD. We recommend staring from peptide-level PD output for LFQ data, as this will allow you to perform QC at the peptide-level and then summarise to protein-level abundance in a more appropriate manner than PD does by default, which is to simply sum all PSMs passing filters.
The data we are using are from an experiment designed to identify RNA-binding proteins (RBPs) in the U-2 OS cell line using the OOPS method [@http://zotero.org/users/5634351/items/EJHWF46N]. A comparison of RNase +/- is used to separate RBPs from background non-specific proteins. Four replicate experiments were performed, with the RNase +/- experiments performed from the same OOPS interface. For each LFQ run, approximately the same quantity of peptides were injected, based on quantification of peptide concentration post trypsin digestion. This data has not published but the aim of the experiment is equivalent to Figure 2e in the original OOPS paper.
The data we will use is available through the Proteomics.analysis.data
package.
pep_data <- read.delim( system.file("extdata", 'OOPS_RNase_LFQ', 'LFQ_OOPS_RNase_PeptideGroups.txt', package = "Proteomics.analysis.data"))
We can explore the structure of the input data using str
. We see that we have a data.frame
with r nrow(pep_data)
rows and r ncol(pep_data)
columns. The most important columns to us are:
str(pep_data)
Discussion 1
Examine the column names and answer the following:
How do the numerical values in the Abundance.F*.Sample columns relate to the conditions (RNase +/-)?
Solution
# It's not possible to tell!
Solution end
The InputFiles.txt
file from PD can be used to determine the relationship between the Abundance.F*.Sample columns and the samples. If you are in any doubt, you can also check with the Proteomics Facility.
The accompanying LFQ_OOPS_RNase_InputFiles.txt
file is also available through the Proteomics.analysis.data
package.
inputfiles <- read.delim( system.file("extdata", 'OOPS_RNase_LFQ', 'LFQ_OOPS_RNase_InputFiles.txt', package = "Proteomics.analysis.data")) print(inputfiles %>% filter(Study.File.ID!=''))
Here, we see that F17-20 are 'R_neg', e.g RNase - and F21-24 are 'R_positive', e.g RNase +.
We could parse the InputFiles.txt
file to generate the sample information, but
for this experiment, it's small enough that it can be easily generated manually.
sample_data <- data.frame( File = paste0("F", 17:24), Sample = paste0(rep(c("RNase_neg", "RNase_pos"), each = 4), ".", 1:4), Condition = rep(c("RNase_neg", "RNase_pos"), each = 4), Replicate = rep(1:4, 2) ) # Displaying the table in a nicer format knitr::kable(sample_data, align = "cccc", format = "html", table.attr = "style='width:30%;'")
To simplify the process of reading in the data and performing initial filtering, we will use camprotR::parse_features
. This function will read in the data and remove contaminant proteins and features without quantification data. Contaminant proteins were defined using the cRAP database and provided to PD. We need to obtain their accessions and provide these to camprotR::parse_features
. Below, we parse the cRAP FASTA to extract the IDs for the cRAP proteins, in both 'cRAP' format and Uniprot IDs for these proteins.
The file we will use, cRAP_20190401.fasta.gz
, is again available through the Proteomics.analysis.data
package.
crap_fasta_inf <- system.file( "extdata", "cRAP_20190401.fasta.gz", package = "Proteomics.analysis.data" ) # Load the cRAP FASTA used for the PD search crap_fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA") # Extract the UniProt accessions associated with each cRAP protein crap_accessions <- crap_fasta %>% pull(desc) %>% stringr::str_extract_all(pattern="(?<=\\|).*?(?=\\|)") %>% unlist()
We can then supply these cRAP protein IDs to camprotR::parse_features()
which will remove features (i.e. peptides in this case) which may originate from contaminants, as well as features which don't have a unique master protein.
See ?parse_features
for further details, including the removal of 'associated cRAP' for conservative contaminants removal.
pep_data_flt <- camprotR::parse_features( pep_data, level = 'peptide', crap_proteins = crap_accessions, )
From the above, we can see that we have started with r nrow(pep_data)
'features' (peptides) from r length(unique(pep_data$Master.Protein.Accessions))
master proteins across all samples. After removal of contaminants and peptides that can't be assigned to a unique master protein, we have r nrow(pep_data_flt)
peptides remaining from r length(unique(pep_data_flt$Master.Protein.Accessions))
master proteins.
We now store the filtered peptide data in an MSnSet, the standard data object for proteomics in R.
This object contains 3 elements:
See the vignette("msnset", package="camprotR")
for more details.
# Create expression matrix with peptide abundances (exprs) and # human readable column names exprs_data <- pep_data_flt %>% select(matches("Abundance")) %>% setNames(sample_data$Sample) %>% as.matrix() # Create data.frame with sample metadata (pData) pheno_data <- sample_data %>% select(-File) %>% tibble::column_to_rownames(var = "Sample") # Create data.frame with peptide metadata (fData) feature_data <- pep_data_flt %>% select(-matches("Abundance")) # Create MSnSet pep <- MSnbase::MSnSet(exprs = exprs_data, fData = feature_data, pData = pheno_data)
First of all, we want to inspect the peptide intensity distributions.
Exercise 1
Plot the distributions of intensities for each sample. What do you conclude?
Hints:
- You can access the assay data using
exprs(pep)
- You will need to log-transform the abundances to make them interpretable
Solution
log(pep, base=2) %>% exprs() %>% boxplot()
Solution end
The above code give us a crude representation. We could make this prettier, but camprotR
already has a function to explore the quantification distributions, plot_quant
, which we can use instead. Below, we plot boxplots and density plot to inspect the abundance distributions.
pep %>% log(base = 2) %>% camprotR::plot_quant(method = 'box') pep %>% log(base = 2) %>% camprotR::plot_quant(method = 'density')
We expect these to be approximately equal and any very low intensity sample would be a concern that would need to be further explored. Here, we can see that there is some clear variability, but no sample with very low intensity.
Next, we consider the missing values, using MSnbase::plotNA
to give us a quick graphical overview. This function shows the number of features with each level of data completeness ('Individual features') and the overall proportion of missing values in the dataset (Full dataset). The number of features with an acceptable level of missingness (specified by pNA
) is also highlighted on the plot.
Note that MSnbase::plotNA
assumes the object contains protein-level data and names the x-axis accordingly. Here, we update the plot aesthetics and rename the x-axis.
p <- MSnbase::plotNA(pep, pNA = 0) + camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 10) + labs(x = 'Peptide index')
print(p)
Exercise 2
We have used
MSnbase::plotNA
to assess the missing values but it's straightforward to do this ourselves directly from theprot_res
object.
- How many values are missing in total?
- What fraction of values are missing?
- How many missing values are there in each sample?
- How many peptides have no missing values?
Hint: You can use
is.na
directly on theMSnSet
and it is equivalent to callingis.na(exprs(obj))
Solution
sum(is.na(pep)) #1 mean(is.na(pep)) #2 colSums(is.na(pep)) #3 sum(rowSums(is.na(pep))==0) #4
Solution end
So, from the r nrow(pep)
peptides, just r sum(rowSums(is.na(pep))==0)
have quantification values in all 8 samples. This is not a surprise for LFQ, since each sample is prepared and run separately.
We can also explore the structure of the missing values further using an 'upset' plot.
Here, we use the naniar
package for this.
missing_data <- pep %>% exprs() %>% data.frame() naniar::gg_miss_upset(missing_data, sets = paste0(colnames(pep), '_NA'), keep.order = TRUE, nsets = 10)
So in this case, we can see that the most common missing value patterns are:
RNase negative replicate 4 had slightly lower overall peptide intensities and appears to be somewhat of an outlier. In this case, we will retain the sample but in other cases, this may warrant further exploration and potentially removal of a sample.
We don't have internal benchmark proteins we can normalise against, so we will only be able to assess protein abundances relative to the protein present in each sample. Since we injected the same quantity of peptides for each sample, your intuition may be that there should be no reason to normalise and to do so risks removing true biological variance.
Discussion 2
What technical explanations can you think of which would explain the differences in intensity?
Solution
# 1. Incorrect total peptide quantification leading to under/over-injection of peptides # 2. Differences between separate MS runs (especially when lots of samples are being processed)
Solution end
The technical explanations are compelling. Futhermore, we can only assess relative quantification since we have unknown losses of material in the sample processing. Thus, it's reasonable to normalise the abundances.
Here we will apply median normalisation such that all column (sample) medians match the grand median. In MSnbase::normalise
, this is called diff.median
. Since the peptide intensities are log-Gaussian distributed, we log~2~-transform them before performing the normalisation.
Median normalisation is a relatively naive form of normalisation, since we are only applying a transformation using a single correction factor for each sample. This is most likely to be appropriate when the samples being compared are similar to one another. Arguably, in this case our samples are more distinct since we are comparing OOPS samples +/- RNase and we could at least explore using a more sophisticated normalisation such as Variance Stabilising Normalisation (VSN). For a more complete discussion of proteomics normalisation, see [@http://zotero.org/users/5634351/items/ZG3ASMKX]. However, bear in mind that this paper is applying the normalisations to the protein-level abundances.
pep_norm <- pep %>% log(base = 2) %>% MSnbase::normalise('diff.median') pep_norm %>% camprotR::plot_quant(method = 'density')
Before we can summarise to protein-level abundances, we need to exclude peptides with too many missing values. Here, peptides with more than 4/8 missing values are discarded, using MSnbase::filterNA()
. We also need to remove proteins without at least three peptides. We will use camprotR::restrict_features_per_protein()
which will replace quantification values with NA
if the sample does not have two quantified peptides for a given protein. Note that this means we have to repeat the filtering process since we are adding missing values.
pep_restricted <- pep_norm %>% # Maximum 4/8 missing values MSnbase::filterNA(pNA = 4/8) %>% # At least two peptides per protein camprotR::restrict_features_per_protein(min_features = 3, plot = FALSE) %>% # Repeat the filtering since restrict_features_per_protein will replace some values with NA MSnbase::filterNA(pNA = 4/8) %>% camprotR::restrict_features_per_protein(min_features = 3, plot = FALSE)
We can then re-inspect the missing values. Note that we have reduced the overall number of peptides to r nrow(pep_restricted)
.
p <- MSnbase::plotNA(pep_restricted, pNA = 0) + camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 15) + labs(x = 'Peptide index')
print(p)
We can now summarise to protein-level abundance. Below, we use 'robust' summarisation [@http://zotero.org/users/5634351/items/FZN3QTTZ] with MSnbase::combineFeatures()
. This returns a warning about missing values that we can ignore here since the robust method is inherently designed to handle missing values. See MsCoreUtils::robustSummary()
and this publication for further details about the robust method.
prot_robust <- pep_restricted %>% MSnbase::combineFeatures( # group the peptides by their master protein id groupBy = fData(pep_restricted)$Master.Protein.Accessions, method = 'robust', maxit = 1000 # Ensures convergence for MASS::rlm )
We can then re-inspect the missing values at the protein level. So, we have quantification for r nrow(prot_robust)
proteins, of which r nrow(filterNA(prot_robust, pNA = 0))
are fully quantified across all 8 samples. The most common missing values pattern remains missing in just RNase negative replicate 4.
p <- MSnbase::plotNA(prot_robust, pNA = 0) + camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 15)
print(p)
naniar::gg_miss_upset(data.frame(exprs(prot_robust)), sets = paste0(colnames(prot_robust), '_NA'), keep.order = TRUE, nsets = 10)
We have now processed our peptide-level LFQ abundances and obtained protein-level abundances, from which we can perform our downstream analyses.
Below, we save the protein level objects to disk, so we can read them
back into memory in downstream analyses. We use saveRDS
to save them in compressed R binary format.
saveRDS(prot_robust, 'results/lfq_prot_robust.rds')
For a comparison between robust
and maxLFQ
for summarisation to protein-level
abundances, see the notebook
Alternatives for summarising to protein-level abundance - MaxLFQ
For a demonstration of an alternative normalisation approach where one has a strong prior belief, see Normalisation to a prior expectation
We will also save the pep_restricted
object, since this is used in Alternatives for summarising to protein-level abundance - MaxLFQ
saveRDS(pep_restricted, 'results/lfq_pep_restricted.rds')
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.