In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete.

t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For LFQ, log-transformed abundances can be assumed to be Gaussian distributed. When we have one condition variable and we are comparing between two values variable in an LFQ experiment (e.g samples are treatment or control), a two-sample t-test is appropriate.
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions at the t-test.
limma [@http://zotero.org/users/5634351/items/6KTXTWME]: Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU]: limma assumes there is a relationship between protein abundance and variance. This is usually the case, although with LFQ the relationship between variance and the number of peptides may be stronger.

Here, we will perform statistical analyses on LFQ data.

These are examples only and the code herein is unlikely to be directly applicable to your own dataset.

Load dependencies

Load the required libraries.

library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(ggplot2)
library(broom)
library(biobroom)

Input data

Here, we will start with the LFQ data processed in Data processing and QC of LFQ data. Please see the previous notebook for details of the experimental design and aim and data processing.

First, we read in the protein-level ratios obtained in the above notebooks.

lfq_protein <- readRDS('./results/lfq_prot_robust.rds')

Testing for differential abundance

In brief, we wish to determine the proteins which are significantly depleted by RNase treatment.

We will use three approaches: - paired t-test - moderated paired t-test (limma) - moderated paired t-test (DEqMS)

T-test

To perform a t-test for each protein, we want to extract the quantification values in a long 'tidy' format and then re-structure so we have one column each for RNase +/-. We can do this using the biobroom package

We will also filter out proteins which are not present in both samples in at least 3/4 replicates.

lfq_protein_tidy <- lfq_protein %>%
  biobroom::tidy.MSnSet() %>%
  separate(sample.id, into=c(NA, 'RNase', 'replicate')) %>%
  pivot_wider(names_from=RNase, values_from=value) %>%
  filter(is.finite(neg), is.finite(pos)) %>%
  group_by(protein) %>%
  filter(length(protein)>=3)

As an example of how to run a single t-test, let's subset to a single protein. First, we extract the quantification values for this single protein

example_protein <- 'A5YKK6'

lfq_protein_tidy_example <- lfq_protein_tidy %>% filter(protein==example_protein)

print(lfq_protein_tidy_example)

Then we use t.test to perform the t-test. We are giving two vectors of values and the switch paired=TRUE so that a paired two sample t-test is performed.

t.test.example <- t.test(
  lfq_protein_tidy_example$pos,
  lfq_protein_tidy_example$neg,
  alternative='two.sided',
  var.equal=FALSE,
  paired=TRUE)

print(t.test.example)

We can use tidy from the broom package to return the t-test results in a tidy tibble. The value of this will be seen in the next code chunk.

head(broom::tidy(t.test.example))

We can now apply a t-test to every protein using dplyr group and do, making use of tidy.

t.test.all <- lfq_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(.$pos, .$neg, paired=TRUE, alternative='two.sided')))

Here are the results for the t-test for the example protein. As we can see, the 'estimate' column in t.text.res.all is the mean log2 ratio. The 'statistic' column is the t-statistic and the 'parameter' column is the degrees of freedom for the t-statistic. All the values are identical since have performed the exact same test with both approaches.

print(t.test.example)
t.test.all %>% filter(protein==example_protein)

When you are performing a lot of statistical tests at the same time, it's recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.

There is a clear peak for very low p-values (<0.05) and an approximately uniform distribution across the rest of the p-value range, which is what we want.

hist(t.test.all$p.value)

Since we have performed multiple tests, we want to calculate an adjusted p-value to avoid type I errors (false positives).

Here, are using the Benjamini, Y., and Hochberg, Y. (1995) method to estimate the False Discovery Rate, e.g the proportion of false positives among the rejected null hypotheses.

t.test.all$padj <- p.adjust(t.test.all$p.value, method='BH')

At an FDR of 1%, we have r sum(t.test.all$padj<0.01) proteins with a significant difference.

sum(t.test.all$padj<0.01)

Moderated t-test (limma)

Proteomics experiments are typically lowly replicated (e.g n << 10). Variance estimates are therefore inaccurate. limma [@http://zotero.org/users/5634351/items/6KTXTWME] is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.

We will first reconstruct an MSnset from our filtered data since it's easier to work with limma using this standard proteomics object.

filtered_exprs <- lfq_protein_tidy %>%
  pivot_longer(cols=c(neg, pos), names_to='RNase') %>%
  mutate(sample=paste0('RNase_', RNase, '.', replicate)) %>%
  pivot_wider(names_from=sample, values_from=value, id_cols=protein) %>%
  tibble::column_to_rownames('protein') %>%
  as.matrix()

filtered_lfq_protein <- MSnSet(exprs=filtered_exprs,
                               fData=fData(lfq_protein)[rownames(filtered_exprs),],
                               pData=pData(lfq_protein)[colnames(filtered_exprs),])

Next, we create the MArrayLM object and a design model. We then supply these to limma::lmFit to fit the linear model according to the design and then use limma::eBayes to compute moderated test statistics.

exprs_for_limma <- exprs(filtered_lfq_protein)

# Performing the equivalent of a two-sample t-test
condition <- pData(filtered_lfq_protein)$Condition
replicate <- pData(filtered_lfq_protein)$Replicate

limma_design <- model.matrix(formula(~replicate+condition))

limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)

We can visualise the relationship between the average abundance and the variance using the limma::plotSA function.

limma::plotSA(limma_fit)

Discussion 1

How would you interpret the plot above?

Solution

# Surprisingly, there's no clear relationship between protein abundance and variance

Solution end

Despite the lack of a strong relationship between protein abundance and variance, we will continue with limma regardless, since it will still increase the effective degrees of freedom with which the gene-wise variances are estimated. In this case, the variances will be shrunk towards a similar value, regardless of the mean protein abundance.

We can extract a results table like so

# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='conditionRNase_pos')

Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a 'volcano' plot to visualise this.

table(limma_results$adj.P.Val<0.01)


limma_results %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
  labs(x = 'RNase +/- (Log2)', y = '-log10(p-value)')

Discussion 2

Given the experimental design, would you expect proteins to have signficant changes in both directions?

How would you interpret the fold changes identified with respect to the absolute protein abundances?

Does your answer to question 2 affect your expectation in question 1?

Solution

# 1. We are starting from a single OOPS interface and adding +/- RNase, then recollecting the interface. RBPs should be depleted, but there is no way for any protein to be enriched by this process. Thus, we wouldn't expect proteins with a positive RNase +/- ratio!
# 2. The protein abundances are relative to the total amount of protein in each sample. Thus, the fold changes are relative to the overall fold-change in the amount of protein in each sample. For example, if there is a global loss of protein in RNase + samples, a positive RNase +/- ratio only represents a increase relative to the global loss, and could be a negative RNase +/- ratio in absolute terms!
# 3. We are seeing positive RNase +/- ratios because there is a global difference in the amount of protein and we are using a relative protein abundance quantification approach.

Solution end

We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it's easier to see which column comes from which test and then merge the two test results.

lfq_compare_tests <- merge(
  setNames(limma_results, paste0('limma.', colnames(limma_results))),
  setNames(t.test.all, paste0('t.test.', colnames(t.test.all))),
  by.x='row.names',
  by.y='t.test.protein')

Exercise

Compare the effect size estimates from t-test vs the moderated t-tests. Can you explain what you observe?

Hints:

For the t-test, you want the 't.test.estimate' column

For the moderated t-test, you can use the 'limma.logFC' column

Solution

ggplot(lfq_compare_tests) +
  aes(t.test.estimate, limma.logFC) +
  geom_point() +
  geom_abline(slope=1) +
  theme_camprot(border=FALSE) +
  labs(x='t-test logFC', y='limma logFC')

# The logFC are the same! Remember that limma is not changing the underlying data,
# just moderating the test statistics.

Solution end

We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with limma than the standard t-test.

p <- ggplot(lfq_compare_tests) +
  aes(log10(t.test.p.value), log10(limma.P.Value)) +
  geom_point() +
  geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
  theme_camprot(border=FALSE) +
  labs(x='T-test log10(p-value)', y='limma log10(p-value)')


print(p)

Finally, we can compare the number of proteins with a significant difference (Using 1% FDR threshold) according to each test. Using the t-test, there are r sum(lfq_compare_tests$t.test.padj<0.01) significant differences, but with limma r sum(lfq_compare_tests$limma.P.Value<0.01) proteins have a significant difference.

lfq_compare_tests %>%
  group_by(t.test.padj<0.01,
           limma.P.Value<0.01) %>%
  tally()

Moderated t-test (DEqMS)

limma assumes there is a relationship between protein abundance and variance. This is usually the case, although we have seen above that this isn't so with our data. For LFQ, the relationship between variance and the number of peptides may be stronger.

DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU], is an alternative to limma, which you can think of as an extension of limma [@http://zotero.org/users/5634351/items/6KTXTWME] specifically for proteomics, which uses the number of peptides rather than mean abundance to share information between proteins.

The analysis steps are taken from the DEqMS vignette. We start from the MArrayLM we created for limma analysis and then simply add a $count column to the MArrayLM object and use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column, which describes the number of pepitdes per protein. This is contrast to limma, which uses the $Amean column, which describes the mean protein abundance.

To define the $count column, we need to summarise the number of peptides per protein. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count column is the minimum number of peptides per protein.

filtered_lfq_protein_long <- filtered_lfq_protein %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols=-Master.Protein.Accessions, values_to='abundance', names_to='sample')

lfq_pep_res <- readRDS('results/lfq_pep_restricted.rds')

# Obtain the min peptide count across the samples and determine the minimum value across
# samples
min_pep_count <- camprotR::count_features_per_protein(lfq_pep_res) %>%
  merge(filtered_lfq_protein_long, by=c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(abundance)) %>%  # We only want to consider samples with a ratio quantified
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# add the min peptide count
limma_fit$count <- min_pep_count$min_pep_count

And now we run spectraCounteBayes from DEqMS to perform the statistical test.

# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))

Below, we inspect the peptide count vs variance relationship which DEqMS is using in the statistical test.

In this case the relationship between peptide count and variance is not clear at all. We press on regardless.As with the limma analysis, the variance will be shrunk towards a global mean rather than one informed by the number of peptides

# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")

Below, we summarise the number of proteins with statistically different abundance in RNase +/- and plot a 'volcano' plot to visualise this.

deqms_results <- outputResult(efit_deqms, coef_col=3)


table(deqms_results$sca.adj.pva<0.01)


deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
  labs(x = 'RNase +/- (Log2)', y = '-log10(p-value)')

We can compare the results of limma and DEqMS by considering the number of significant differences. Note that the limma results are also contained with the results from DEqMS. The $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca are the from DEqMS.

deqms_results %>%
  group_by(limma_sig=adj.P.Val<0.01,
           DEqMS_sig=sca.adj.pval<0.01) %>%
  tally()

We can compare the results of limma and DEqMS by considering the p-values. Note that here, they are very well correlated. This is because the methods failed to identify a strong trend between the mean abundance (limma) or number of peptides (DEqMS) and the variance. Thus, both shrunk the variance towards a global mean and similarly increased the effective degrees of freedom.

deqms_results %>%
  ggplot() +
  aes(P.Value, sca.P.Value) +
  geom_point() +
  geom_abline(slope=1) +
  theme_camprot(border=FALSE) +
  labs(x='limma p-value', y='DEqMS p-value') +
  scale_x_log10() +
  scale_y_log10()

Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (rds) to read back into a later R notebook, or a flatfile (.tsv) to read with e.g excel.

# These lines are not run and are examples only

saveRDS(deqms_results, 'filename_to_save_to.rds')
write.csv(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)