knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Preamble

Quantitative proteomics using isobaric tagging such as Tandem Mass Tags (TMT) has a considerable benefit over Label-Free Quantification (LFQ) in that up to 18 samples can be quantified for each Peptide Spectrum Match (PSM). This greatly reduces the extent of missing values that may be present between different MS runs due to the limited number of ions that can be fragmented in each run and the associated issue of peptides being identified in only a subset of runs [@http://zotero.org/users/5634351/items/ME322P2A]. As such, it standardises the features quantified in each sample, simplifying the comparison between samples and increasing quantification accuracy of summarised features such as proteins.

However, TMT does suffer from ratio compression, which should be avoiding by performing quantification with SPS MS3 [@http://zotero.org/users/5634351/items/ZK69WYZ2]. Since quantification is performed in MS3, the TMT ions being quantified are therefore relatively low intensity and as such, removal of low signal:noise PSMs is recommended prior to summarisation to protein-level abundances. In addition, PSMs with high estimated co-isolation/interference should be removed, since the quantification values will be less accurate.

Here, we will assess the following:

We will then:

library(camprotR)
library(MSnbase)
library(ggplot2)
library(tidyr)
library(dplyr)

Input data

We start by reading in a file containing PSM-level output from Proteome Discoverer (PD). This data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at 3 known amounts to provide ground truth fold changes (see below). For more details, see [@http://zotero.org/users/5634351/items/LG3W8G4T]

The data we will use is available through the Proteomics.analysis.data package.

psm_data <- read.delim(
  system.file("extdata", 'benchmark_TMT', 'benchmark_TMT_PSMs.txt.gz',
              package = "Proteomics.analysis.data"))

The first step is to remove contaminant proteins. These were defined using the cRAP database. Below, we parse the cRAP fasta to extract the IDs for the cRAP proteins, in both 'cRAP' format and Uniprot IDs for these proteins.

crap_fasta_inf <- system.file(
  "extdata", "cRAP_20190401.fasta.gz", 
  package = "Proteomics.analysis.data"
)

# Load the cRAP FASTA used for the PD search
crap.fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")

# Extract the non cRAP UniProt accessions associated with each cRAP protein
crap.accessions <- crap.fasta %>% 
  pull(desc) %>% 
  stringr::str_extract_all(pattern = "(?<=\\|).*?(?=\\|)") %>% 
  unlist()

We can then supply these cRAP protein IDs to parse_features which will remove features which may originate from contaminants, as well as features which don't have a unique master protein. See ?parse_features for further details, including the removal of 'associated cRAP'.

psm_data_flt <- parse_features(
  psm_data, 
  crap_proteins = crap.accessions, 
  TMT = TRUE, 
  level = 'PSM'
)

We now store the filtered PSM data in an MSnSet, the standard data object for proteomics in R. See the vignette("msnset", package="camprotR") for more details.

# Abundance columns for TMT PD-output start with Abundance 
abundance_cols <- colnames(psm_data_flt)[grepl('Abundance.', colnames(psm_data_flt))]

psm.e <- as.matrix(psm_data_flt[, abundance_cols])
psm.f <- psm_data_flt[, setdiff(colnames(psm_data_flt), abundance_cols)]

# update the column names to remove the 'Abundance.` prefix
colnames(psm.e) <- gsub('Abundance.', '', colnames(psm.e))

# we don't have 'phenotype' data to add so we just define the 
# 'expression' data and 'feature' data

psm.p <- data.frame(spike=rep(factor(c('x1', 'x2', 'x6')), c(4,3,3)), row.names=colnames(psm.e))

psm <- MSnbase::MSnSet(exprs = psm.e, fData = psm.f, pData=psm.p)

Plot intensity distributions

plot_quant(log(psm, base=2), method='density')

Removing low quality PSMs

We want to remove low Signal:Noise (S:N) PSMs, since the quantification values will be less accurate and there will be more missing values. We can inspect the relationship between S:N and missing values using the plot_missing_SN function.

Note that where the signal:noise > 5, there are far fewer missing values.

plot_missing_SN(psm, bins = 40)

We can also look into this relationship at the tag level using plot_missing_SN_per_sample. In this case, there is no tag which appears to have a high proportion of missing values when signal:noise > 5. If there were, this may warrant further exploration, e.g was one of the sample preparations inadequate such that fewer peptides were labeled?

plot_missing_SN_per_sample(psm, bins = 40)

Based on the above, we will filter the PSMs to only retain those with S:N > 5 using filter_TMT_PSMs. Using the same function, we will also remove PSMs with interference/co-isolation >50%.

psm_filt_sn_int <- filter_TMT_PSMs(psm, inter_thresh = 50, sn_thresh = 5)

Summarising to protein-level abundances

Now that we have inspected the PSM-level quantification and filtered the PSMs, we can summarise the PSMs to protein-level abundances.

Discussion

  • How do you think we should summarise the protein-level abundances from the PSM-level quantification?
  • What downsides do you perceive in this approach?

Solution

# PSM to protein-level summarisation is much less problematic for TMT quantification
# and many approaches are valid.
#
# In most cases, there are few missing values and summarisation by naive methods
# such as mean/median/sum is adequate
#
# - Mean/sum is is sensitive to outliers
# - Median is less sensitive to outliers, but sensitive to the intensities
# of PSMs around the median intensity.
# 
# PSM-level intensities can be across orders of magnitude, with the most intense
# likely to be the most accurate. Mean/sum summarisation is largely drive by a 
# subset of high intensity PSMs. Sum has a subtle added advantage that the it
# is also higher for proteins with more PSMs. This can be useful in the downstream
# statistical analysis, as shall see. Sum is a sensible default
# summarisation for TMT data
#
# In the (rare) cases where there are a lot of missing values, the 'robust'
# summarisation approach in MSnbase::combineFeatures() may be appropriate
# (following log-transformation) since this handles missing data appropriately

Solution end

For PSM to protein summarisation, we will use naive 'sum' summarisation (MSnbase::combineFeatures(method = 'sum')). This approach does not appropriately handle missing values, since it either returns NA if any value is missing, or, with na.rm=TRUE included, replaces NA with zero where there is at least one finite quantification value for a protein. As such, we will remove the few PSMs with any missing values

psm_filt_sn_int_missing <- psm_filt_sn_int %>% 
  MSnbase::filterNA()

Typically, one removes proteins with a single PSM, the so-called 'one-hit wonders', on the basis that these are more likely to be false positive identifications, and the quantification is only drawn from a single observation.

psm_filt_sn_int_missing_n_features <- psm_filt_sn_int_missing %>%
  camprotR::restrict_features_per_protein(min_features=2, plot=FALSE)

Below, we perform the summarisation.

protein <- psm_filt_sn_int_missing %>%
  MSnbase::combineFeatures(
    groupBy = fData(psm_filt_sn_int_missing)$Master.Protein.Accession,
    method = 'sum')

Finally, we assess the quantification distribution and normalise the protein-level abundances

plot_quant(log(protein, base=2), method='density')

For this dataset, we see that the protein-level intensity distributions are very similar. This benchmark experiment was performed by generating 10 samples from defined mixes of 2 peptide samples (human and yeast). As such, we expect near identical distributions for all samples from the same group. In a typical TMT experiment, you may see more variable distributions.

Regardless, the next step should be to normalise the protein-level intensities.

Here we will apply median normalisation such that all column (sample) medians match the grand median. In MSnbase::normalise, this is called diff.median. Since the intensities are log-Gaussian distributed, we log~2~-transform them before performing the normalisation.

Median normalisation is a relatively naive form of normalisation, since we are only applying a transformation using a single correction factor for each sample. This is most likely to be appropriate when the samples being compared are similar to one another, which is the case here.

protein_norm <- MSnbase::normalise(log(protein, base=2), method='diff.median')

plot_quant(protein_norm, method='density')

Remember that we can check the processing information for our MSnSet if we are in doubt about the processing. Here, it tells us that we log2 transformed and then used diff.median normalisation.

processingData(protein_norm)

Now we have filtered our PSM-level quantification, summarised to protein-level and normalised. We can use this object to perform downstream visualisation, data exploration and statistical analysis etc.

We save the object to disk so we can read it back into memory when we need it

saveRDS(psm_filt_sn_int_missing, './results/psm_filt.rds')
saveRDS(protein_norm, './results/tmt_protein.rds')
length(setdiff(fData(psm_filt_sn_int_missing)$Master.Protein.Accessions,
               fData(protein_norm)$Master.Protein.Accessions))

To add - TMM normalisation

References



MRCToxBioinformatics/Proteomics_data_analysis documentation built on July 7, 2024, 6:44 p.m.