In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Preamble

Label-Free Quantification (LFQ) is the simplest form of quantitative proteomics, in which different samples are quantified in separate MS runs.

Since each sample is run separately, different peptides will be quantified in each sample and the peptide intensities may not be directly comparable between samples. One solution to the higher burden of missing values is 'match-between-runs' [@http://zotero.org/users/5634351/items/87YIDK69], or the functionally equivalent 'Minora' algorithm employed by Proteome Discoverer (PD). These algorithms use the observed retention times of MS1 ions which were successfully spectrum matched in one sample to identify the likely peptide sequence of MS1 ions that could not be spectrum matched in another sample.

Despite the pitfalls of LFQ, the data analysis is still relatively straightforward, though there are steps that need some careful consideration and quality control assessment.

Load dependencies

Load the required libraries.

library(ggplot2)
library(MSnbase)
library(biobroom)
library(camprotR)
library(Proteomics.analysis.data)
library(dplyr)
library(tidyr)

Input data

We start with the peptide-level output from PD. We recommend staring from peptide-level PD output for LFQ data, as this will allow you to perform QC at the peptide-level and then summarise to protein-level abundance in a more appropriate manner than PD does by default, which is to simply sum all PSMs passing filters.

The data we are using are from an experiment designed to identify RNA-binding proteins (RBPs) in the U-2 OS cell line using the OOPS method [@http://zotero.org/users/5634351/items/EJHWF46N]. A comparison of RNase +/- is used to separate RBPs from background non-specific proteins. Four replicate experiments were performed, with the RNase +/- experiments performed from the same OOPS interface. For each LFQ run, approximately the same quantity of peptides were injected, based on quantification of peptide concentration post trypsin digestion. This data has not published but the aim of the experiment is equivalent to Figure 2e in the original OOPS paper.

The data we will use is available through the Proteomics.analysis.data package.

pep_data <- read.delim(
  system.file("extdata", 'OOPS_RNase_LFQ', 'LFQ_OOPS_RNase_PeptideGroups.txt',
              package = "Proteomics.analysis.data"))

We can explore the structure of the input data using str. We see that we have a data.frame with r nrow(pep_data) rows and r ncol(pep_data) columns. The most important columns to us are:

Sequence: The sequence of the peptide
Modifications: The detected peptide modifications, including variable, e.g induced modifications such as oxidation
Master.Protein.Accessions: The assigned master protein(s)
Abundance.F*.Sample: Columns with the peptide intensities

str(pep_data)

Discussion 1

Examine the column names and answer the following:

How do the numerical values in the Abundance.F*.Sample columns relate to the conditions (RNase +/-)?

Solution

# It's not possible to tell!

Solution end

The InputFiles.txt file from PD can be used to determine the relationship between the Abundance.F*.Sample columns and the samples. If you are in any doubt, you can also check with the Proteomics Facility.

The accompanying LFQ_OOPS_RNase_InputFiles.txt file is also available through the Proteomics.analysis.data package.

inputfiles <- read.delim(
  system.file("extdata", 'OOPS_RNase_LFQ', 'LFQ_OOPS_RNase_InputFiles.txt',
              package = "Proteomics.analysis.data"))
print(inputfiles %>% filter(Study.File.ID!=''))

Here, we see that F17-20 are 'R_neg', e.g RNase - and F21-24 are 'R_positive', e.g RNase +.

We could parse the InputFiles.txt file to generate the sample information, but for this experiment, it's small enough that it can be easily generated manually.

sample_data <- data.frame(
  File = paste0("F", 17:24),
  Sample = paste0(rep(c("RNase_neg", "RNase_pos"), each = 4), ".", 1:4),
  Condition = rep(c("RNase_neg", "RNase_pos"), each = 4),
  Replicate = rep(1:4, 2)
)

# Displaying the table in a nicer format
knitr::kable(sample_data,
             align = "cccc",
             format = "html",
             table.attr = "style='width:30%;'")

Parse PeptideGroups.txt file

To simplify the process of reading in the data and performing initial filtering, we will use camprotR::parse_features. This function will read in the data and remove contaminant proteins and features without quantification data. Contaminant proteins were defined using the cRAP database and provided to PD. We need to obtain their accessions and provide these to camprotR::parse_features. Below, we parse the cRAP FASTA to extract the IDs for the cRAP proteins, in both 'cRAP' format and Uniprot IDs for these proteins.

The file we will use, cRAP_20190401.fasta.gz, is again available through the Proteomics.analysis.data package.

crap_fasta_inf <- system.file(
  "extdata", "cRAP_20190401.fasta.gz",
  package = "Proteomics.analysis.data"
)

# Load the cRAP FASTA used for the PD search
crap_fasta <- Biostrings::fasta.index(crap_fasta_inf, seqtype = "AA")

# Extract the UniProt accessions associated with each cRAP protein
crap_accessions <- crap_fasta %>%
  pull(desc) %>%
  stringr::str_extract_all(pattern="(?<=\\|).*?(?=\\|)") %>%
  unlist()

We can then supply these cRAP protein IDs to camprotR::parse_features() which will remove features (i.e. peptides in this case) which may originate from contaminants, as well as features which don't have a unique master protein.

See ?parse_features for further details, including the removal of 'associated cRAP' for conservative contaminants removal.

pep_data_flt <- camprotR::parse_features(
  pep_data,
  level = 'peptide',
  crap_proteins = crap_accessions,
)

From the above, we can see that we have started with r nrow(pep_data) 'features' (peptides) from r length(unique(pep_data$Master.Protein.Accessions)) master proteins across all samples. After removal of contaminants and peptides that can't be assigned to a unique master protein, we have r nrow(pep_data_flt) peptides remaining from r length(unique(pep_data_flt$Master.Protein.Accessions)) master proteins.

Convert to MSnSet

We now store the filtered peptide data in an MSnSet, the standard data object for proteomics in R.

This object contains 3 elements:

A quantification data matrix (rows=features, e.g peptides/proteins, columns=samples)
Feature data (rows=features, columns=feature annotations, e.g peptide master protein assignment)
Experimental details (rows=samples, columns=experimental details, e.g treatment)

See the vignette("msnset", package="camprotR") for more details.

# Create expression matrix with peptide abundances (exprs) and
# human readable column names
exprs_data <- pep_data_flt %>%
  select(matches("Abundance")) %>%
  setNames(sample_data$Sample) %>%
  as.matrix()

# Create data.frame with sample metadata (pData)
pheno_data <- sample_data %>%
  select(-File) %>%
  tibble::column_to_rownames(var = "Sample")

# Create data.frame with peptide metadata (fData)
feature_data <- pep_data_flt %>%
  select(-matches("Abundance"))

# Create MSnSet
pep <- MSnbase::MSnSet(exprs = exprs_data,
                       fData = feature_data,
                       pData = pheno_data)

QC peptides

First of all, we want to inspect the peptide intensity distributions.

Exercise 1

Plot the distributions of intensities for each sample. What do you conclude?

Hints:

You can access the assay data using exprs(pep)

You will need to log-transform the abundances to make them interpretable

Solution

log(pep, base=2) %>% exprs() %>% boxplot()

Solution end

The above code give us a crude representation. We could make this prettier, but camprotR already has a function to explore the quantification distributions, plot_quant, which we can use instead. Below, we plot boxplots and density plot to inspect the abundance distributions.

pep %>%
  log(base = 2) %>%
  camprotR::plot_quant(method = 'box')

pep %>%
  log(base = 2) %>%
  camprotR::plot_quant(method = 'density')

We expect these to be approximately equal and any very low intensity sample would be a concern that would need to be further explored. Here, we can see that there is some clear variability, but no sample with very low intensity.

Missing values

Next, we consider the missing values, using MSnbase::plotNA to give us a quick graphical overview. This function shows the number of features with each level of data completeness ('Individual features') and the overall proportion of missing values in the dataset (Full dataset). The number of features with an acceptable level of missingness (specified by pNA) is also highlighted on the plot.

Note that MSnbase::plotNA assumes the object contains protein-level data and names the x-axis accordingly. Here, we update the plot aesthetics and rename the x-axis.

p <- MSnbase::plotNA(pep, pNA = 0) +
  camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 10) +
  labs(x = 'Peptide index')

print(p)

Exercise 2

We have used MSnbase::plotNA to assess the missing values but it's straightforward to do this ourselves directly from the prot_res object.

How many values are missing in total?

What fraction of values are missing?

How many missing values are there in each sample?

How many peptides have no missing values?

Hint: You can use is.na directly on the MSnSet and it is equivalent to calling is.na(exprs(obj))

Solution

sum(is.na(pep)) #1
mean(is.na(pep)) #2
colSums(is.na(pep)) #3
sum(rowSums(is.na(pep))==0) #4

Solution end

So, from the r nrow(pep) peptides, just r sum(rowSums(is.na(pep))==0) have quantification values in all 8 samples. This is not a surprise for LFQ, since each sample is prepared and run separately.

We can also explore the structure of the missing values further using an 'upset' plot. Here, we use the naniar package for this.

missing_data <- pep %>%
  exprs() %>%
  data.frame()

naniar::gg_miss_upset(missing_data,
                      sets = paste0(colnames(pep), '_NA'),
                      keep.order = TRUE,
                      nsets = 10)

So in this case, we can see that the most common missing value patterns are:

Missing in just RNase negative replicate 4
Missing in all samples.
Missing in all the other samples, except RNase negative replicate 4

RNase negative replicate 4 had slightly lower overall peptide intensities and appears to be somewhat of an outlier. In this case, we will retain the sample but in other cases, this may warrant further exploration and potentially removal of a sample.

Normalise peptide intensities

We don't have internal benchmark proteins we can normalise against, so we will only be able to assess protein abundances relative to the protein present in each sample. Since we injected the same quantity of peptides for each sample, your intuition may be that there should be no reason to normalise and to do so risks removing true biological variance.

Discussion 2

What technical explanations can you think of which would explain the differences in intensity?

Solution

# 1. Incorrect total peptide quantification leading to under/over-injection of peptides
# 2. Differences between separate MS runs (especially when lots of samples are being processed)

Solution end

The technical explanations are compelling. Futhermore, we can only assess relative quantification since we have unknown losses of material in the sample processing. Thus, it's reasonable to normalise the abundances.

Here we will apply median normalisation such that all column (sample) medians match the grand median. In MSnbase::normalise, this is called diff.median. Since the peptide intensities are log-Gaussian distributed, we log~2~-transform them before performing the normalisation.

Median normalisation is a relatively naive form of normalisation, since we are only applying a transformation using a single correction factor for each sample. This is most likely to be appropriate when the samples being compared are similar to one another. Arguably, in this case our samples are more distinct since we are comparing OOPS samples +/- RNase and we could at least explore using a more sophisticated normalisation such as Variance Stabilising Normalisation (VSN). For a more complete discussion of proteomics normalisation, see [@http://zotero.org/users/5634351/items/ZG3ASMKX]. However, bear in mind that this paper is applying the normalisations to the protein-level abundances.

pep_norm <- pep %>%
  log(base = 2) %>%
  MSnbase::normalise('diff.median')  

pep_norm %>%
  camprotR::plot_quant(method = 'density')

Summarising to protein-level abundance

Before we can summarise to protein-level abundances, we need to exclude peptides with too many missing values. Here, peptides with more than 4/8 missing values are discarded, using MSnbase::filterNA(). We also need to remove proteins without at least three peptides. We will use camprotR::restrict_features_per_protein() which will replace quantification values with NA if the sample does not have two quantified peptides for a given protein. Note that this means we have to repeat the filtering process since we are adding missing values.

pep_restricted <- pep_norm %>%
  # Maximum 4/8 missing values
  MSnbase::filterNA(pNA = 4/8) %>%

  # At least two peptides per protein
  camprotR::restrict_features_per_protein(min_features = 3, plot = FALSE) %>%

  # Repeat the filtering since restrict_features_per_protein will replace some values with NA
  MSnbase::filterNA(pNA = 4/8) %>%

  camprotR::restrict_features_per_protein(min_features = 3, plot = FALSE)

We can then re-inspect the missing values. Note that we have reduced the overall number of peptides to r nrow(pep_restricted).

p <- MSnbase::plotNA(pep_restricted, pNA = 0) +
  camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 15) +
  labs(x = 'Peptide index')

print(p)

Summarising to protein-level abundances

We can now summarise to protein-level abundance. Below, we use 'robust' summarisation [@http://zotero.org/users/5634351/items/FZN3QTTZ] with MSnbase::combineFeatures(). This returns a warning about missing values that we can ignore here since the robust method is inherently designed to handle missing values. See MsCoreUtils::robustSummary() and this publication for further details about the robust method.

prot_robust <- pep_restricted %>%
  MSnbase::combineFeatures(
    # group the peptides by their master protein id
    groupBy = fData(pep_restricted)$Master.Protein.Accessions,
    method = 'robust',
    maxit = 1000  # Ensures convergence for MASS::rlm
  )

We can then re-inspect the missing values at the protein level. So, we have quantification for r nrow(prot_robust) proteins, of which r nrow(filterNA(prot_robust, pNA = 0)) are fully quantified across all 8 samples. The most common missing values pattern remains missing in just RNase negative replicate 4.

p <- MSnbase::plotNA(prot_robust, pNA = 0) +
  camprotR::theme_camprot(border = FALSE, base_family = 'sans', base_size = 15)

print(p)

naniar::gg_miss_upset(data.frame(exprs(prot_robust)),
                      sets = paste0(colnames(prot_robust), '_NA'),
                      keep.order = TRUE,
                      nsets = 10)

We have now processed our peptide-level LFQ abundances and obtained protein-level abundances, from which we can perform our downstream analyses.

Below, we save the protein level objects to disk, so we can read them back into memory in downstream analyses. We use saveRDS to save them in compressed R binary format.

saveRDS(prot_robust, 'results/lfq_prot_robust.rds')

Optional extra sections on normalisation and summarisation

For a comparison between robust and maxLFQ for summarisation to protein-level abundances, see the notebook Alternatives for summarising to protein-level abundance - MaxLFQ

For a demonstration of an alternative normalisation approach where one has a strong prior belief, see Normalisation to a prior expectation

We will also save the pep_restricted object, since this is used in Alternatives for summarising to protein-level abundance - MaxLFQ

saveRDS(pep_restricted, 'results/lfq_pep_restricted.rds')

Session info {.unlisted .unnumbered}

sessionInfo()

References {.unlisted .unnumbered}

MRCToxBioinformatics/Proteomics_data_analysis documentation built on July 7, 2024, 6:44 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MRCToxBioinformatics/Proteomics_data_analysis
Basic proteomics data analysis tutorials

In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

Preamble

Load dependencies

Input data

Discussion 1

Parse PeptideGroups.txt file

Convert to MSnSet

QC peptides

Exercise 1

Missing values

Exercise 2

Normalise peptide intensities

Discussion 2

Summarising to protein-level abundance

Summarising to protein-level abundances

Optional extra sections on normalisation and summarisation

Session info {.unlisted .unnumbered}

References {.unlisted .unnumbered}

R Package Documentation

Browse R Packages

We want your feedback!

MRCToxBioinformatics/Proteomics_data_analysis Basic proteomics data analysis tutorials

In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

Preamble

Load dependencies

Input data

Discussion 1

Parse PeptideGroups.txt file

Convert to MSnSet

QC peptides

Exercise 1

Missing values

Exercise 2

Normalise peptide intensities

Discussion 2

Summarising to protein-level abundance

Summarising to protein-level abundances

Optional extra sections on normalisation and summarisation

Session info {.unlisted .unnumbered}

References {.unlisted .unnumbered}

R Package Documentation

Browse R Packages

We want your feedback!

MRCToxBioinformatics/Proteomics_data_analysis
Basic proteomics data analysis tutorials