knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means exhaustive.
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For SILAC, log-transformed ratios can be assumed to be Gaussian distributed. When we have one condition variable in a SILAC experiment and the comparison is captured within the ratio itself (e.g. the SILAC ratio compares treatment and control), a one-sample t-test against zero is appropriate. If the SILAC ratio captures one condition variable, with another condition variable varying across samples (e.g. the ratio is treatment vs control and samples come from two different cell lines), a two-sample t-test is appropriate.
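For illustration, here is a toy sketch of the two cases with simulated log2 ratios (hypothetical values, not from this dataset):

set.seed(1)

# Simulated CL vs NC log2 ratios for four replicates in each of two cell lines
ratios_line_a <- rnorm(4, mean = 1)
ratios_line_b <- rnorm(4, mean = 0)

# One-sample t-test: is the mean log2 ratio different from zero?
t.test(ratios_line_a, mu = 0)

# Two-sample t-test: do the log2 ratios differ between the two cell lines?
t.test(ratios_line_a, ratios_line_b)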
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, under the same assumptions as the t-test.
limma [@http://zotero.org/users/5634351/items/6KTXTWME]: Proteomics experiments are typically lowly replicated (e.g. n << 10), so per-protein variance estimates are inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
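To see what this moderation does in practice, here is a minimal sketch on simulated data (hypothetical values, not from this experiment): eBayes() shrinks the per-protein variances towards a common prior.

library(limma)

set.seed(42)

# 100 'proteins' with 4 replicate log-ratios each, true mean zero
m <- matrix(rnorm(400), nrow = 100)

# Intercept-only design, as for a one-sample test
fit <- lmFit(m, design = cbind(Intercept = rep(1, 4)))
efit <- eBayes(fit)

# Raw per-protein variances vs moderated (posterior) variances:
# the moderated values are shrunk towards the common prior variance
summary(fit$sigma^2)
summary(efit$s2.post)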
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU]: limma assumes there is a relationship between protein abundance and variance. This is usually the case, although for SILAC data the relationship between variance and the number of peptides per protein is often stronger. As such, DEqMS, which models the variance with respect to the peptide count, should be preferred over limma.
Here, we will perform statistical analyses on SILAC data.
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)
Here, we will start with the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design and aim and data processing.
In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).
First, we read in the protein-level ratios obtained in the above notebooks. We will filter out proteins which are not present in at least 3/4 replicates.
silac_protein <- readRDS('results/prot_res.rds') %>%
  filterNA(pNA = 1/4) # max 1/4 missing
We will use two approaches:

1. A one-sample t-test for each protein
2. A moderated one-sample t-test for each protein, using DEqMS
To perform a t-test for each protein, we want to extract the quantification values in a long 'tidy' format. We can do this using the biobroom package.
silac_protein_tidy <- silac_protein %>%
  biobroom::tidy.MSnSet() %>%
  filter(is.finite(value))
As an example of how to run a single t-test, let's subset to a single protein. First, we extract the quantification values for this protein.

example_protein <- 'O00159'

silac_protein_tidy_example <- silac_protein_tidy %>%
  filter(protein == example_protein)

print(silac_protein_tidy_example)
Then we use t.test to perform the t-test. Since we are giving only a single vector of values, a one-sample t-test is performed.
t.test.res <- t.test(silac_protein_tidy_example$value, alternative = 'two.sided')
print(t.test.res)
We can use tidy
from the broom
package to return the t-test results in
a tidy tibble. The value of this will be seen in the next code chunk.
tidy(t.test.res)
We can now apply a t-test to every protein using dplyr's group_by and do, making use of tidy.
t.test.res.all <- silac_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(.$value, alternative = 'two.sided')))
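Note that do is superseded in recent versions of dplyr; here is an equivalent sketch using group_modify (assuming dplyr >= 1.0):

t.test.res.all.v2 <- silac_protein_tidy %>%
  group_by(protein) %>%
  group_modify(~ tidy(t.test(.x$value, alternative = 'two.sided'))) %>%
  ungroup()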
Here are the results of the t-test for the example protein. As we can see, the 'estimate' column in t.test.res.all is the mean log2 ratio. The 'statistic' column is the t-statistic and the 'parameter' column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.
print(t.test.res)

t.test.res.all %>%
  filter(protein == example_protein)
When you are performing many statistical tests at the same time, it's recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution between 0 and 1 for those tests where the null hypothesis is true. Statistically significant tests will show as a peak of very low p-values. If there is a clear skew in the otherwise uniform distribution, or peaks other than in the smallest p-value bin, that may indicate that the assumptions of the test are not valid, for some or all tests.
hist(t.test.res.all$p.value, 20)
Discussion 1
What would you conclude from the p-value distribution above?
Solution
# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.15.
# This may indicate insufficient statistical power to detect some proteins that
# are truly enriched upon CL.
Solution end
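For intuition, here is a minimal simulation (hypothetical data, not from this experiment) of the null case, where the p-value histogram should be approximately flat:

set.seed(1)

# 1000 one-sample t-tests on values drawn from the null (true mean of zero)
null_p <- replicate(1000, t.test(rnorm(4))$p.value)
hist(null_p, 20)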
Since we have performed multiple tests, we want to calculate adjusted p-values to avoid type I errors (false positives). Here, we are using the Benjamini & Hochberg (1995) method to control the False Discovery Rate, i.e. the proportion of false positives among the rejected null hypotheses.
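As a toy illustration of how the BH procedure rescales p-values (hypothetical values):

# Each p-value is scaled by (number of tests / rank), with monotonicity enforced
p.adjust(c(0.001, 0.01, 0.02, 0.04, 0.5), method = 'BH')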
t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method = 'BH')

table(t.test.res.all$padj < 0.01)
At an FDR of 1%, we have `r sum(t.test.res.all$padj<0.01)` proteins with a significant difference.
Finally, a differential abundance test wouldn't be complete without a volcano plot!
Note that all proteins with a statistically significant change are increased in CL, so it doesn't resemble the typical volcano plot.
t.test.res.all %>%
  ggplot(aes(x = estimate, y = -log10(p.value), colour = padj < 0.01)) +
  geom_point() +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Discussion 2
SILAC ratios are equivalent to the absolute abundance ratios between two or more samples. If this experiment had been performed using LFQ or TMT, which quantify the relative abundances in each sample, how would you expect the above volcano plot to look?
Solution
# It would likely be much more symmetrical, since the trend towards increased abundance
# in CL would be masked by the comparison of the relative abundances in CL and NC.
# Proteins which are highly enriched by CL would have reduced ratios and those with
# slighter enrichment may even appear to be relatively depleted!
Solution end
To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU], which you can think of as an extension of limma [@http://zotero.org/users/5634351/items/6KTXTWME] specifically for proteomics.
The analysis steps are taken from the
DEqMS vignette.
We first want to create an MArrayLM object, as per a standard limma analysis. We then add a $count column to the MArrayLM object and use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column, which describes the number of peptides per protein. This is in contrast to limma, which uses the $Amean column, which describes the mean CL:NC ratio.
First, we create the MArrayLM
object.
dat <- silac_protein %>%
  exprs()

# Performing the equivalent of a one-sample t-test, so the design matrix is
# just an intercept
design <- cbind(Intercept = rep(1, ncol(dat)))

fit <- lmFit(dat, design)
efit <- eBayes(fit)
Next, we define the $count column, which DEqMS will use. The DEqMS paper suggests that the best summarisation metric is the minimum value across the samples, so our count column is the minimum number of peptides per protein.
protein_ratio_long <- silac_protein %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols = -Master.Protein.Accessions, values_to = 'protein_ratio', names_to = 'sample')

pep_res <- readRDS('results/pep_res.rds')

# Obtain the peptide count per protein in each sample and determine the
# minimum count across samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
  merge(protein_ratio_long, by = c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(protein_ratio)) %>% # only consider samples with a quantified ratio
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# add the min peptide count
efit$count <- min_pep_count$min_pep_count
And now we run spectraCounteBayes
from DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))
Below, we inspect the peptide count vs variance relationship which DEqMS
is
using in the statistical test.
In this case the relationship between peptide count and variance is only really
apparent when the minimum number of peptides from which a protein-level ratio is
obtained is very low. Typically, one might remove quantification values where there is
just a single peptide for a protein. Here, we have kept them, and we can refer to
the count
column in the final results object if we want to check the minimum
number of peptides observed per sample.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a 'volcano' plot to visualise this.
Here, all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!
deqms_results <- outputResult(efit_deqms, coef_col = 1)

deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Finally, we can explore and, if we wish, export our results. Importantly, the $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca. are from DEqMS.
head(deqms_results)
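If you did want to exclude proteins quantified by a single peptide in some sample, as discussed above, here is a minimal sketch using the count column (the filtered object name is hypothetical):

# Retain only proteins with at least two peptides in every sample
deqms_results_filtered <- deqms_results %>%
  filter(count > 1)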
We can now compare the results from the t-test and the moderated t-test (DEqMS). Below, we update the column names so it's easier to see which column comes from which test.
colnames(t.test.res.all) <- paste0('t.test.', colnames(t.test.res.all))
colnames(deqms_results) <- paste0('deqms.', colnames(deqms_results))
And then merge the two test results.
silac_compare_tests <- merge(deqms_results,
                             t.test.res.all,
                             by.x = 'row.names',
                             by.y = 't.test.protein')
Exercise
Compare the effect size estimates from the two methods with a scatter plot. Can you explain what you observe?
Hints:
- For the t-test, you want the 't.test.estimate' column
- For the moderated t-test, you want the 'deqms.logFC' column
Solution
ggplot(silac_compare_tests) +
  aes(t.test.estimate, deqms.logFC) +
  geom_point() +
  geom_abline(slope = 1) +
  theme_camprot(border = FALSE) +
  labs(x = 't-test mean ratio', y = 'DEqMS mean ratio')

# The ratios are the same. DEqMS (and limma) are moderating the test statistics,
# but not the estimated fold-change.
Solution end
We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with DEqMS than the standard t-test.
p <- ggplot(silac_compare_tests) +
  aes(log10(t.test.p.value), log10(deqms.P.Value)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 2, colour = get_cat_palette(1), size = 1) +
  theme_camprot(border = FALSE) +
  labs(x = 'T-test log10(p-value)', y = 'DEqMS log10(p-value)')

print(p)
Finally, we can compare the number of proteins with a significant difference (using a 1% FDR threshold) according to each test. Using DEqMS more than doubles the number of null hypothesis rejections. Only 3 proteins are significant with the t-test but not the moderated t-test.
silac_compare_tests %>%
  group_by(t.test.padj < 0.01, deqms.sca.adj.pval < 0.01) %>%
  tally()
Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (rds) to read back into a later R notebook, or a flatfile (.tsv) to read with e.g. Excel.
# These lines are not run and are examples only

# Saving to R binary format. Can read back into memory with readRDS().
saveRDS(silac_compare_tests, 'filename_to_save_to.rds')

# Saving to tab-separated (human-readable) flatfile.
# Note: write.csv() always uses commas, so we use write.table() for a TSV
write.table(silac_compare_tests, 'filename_to_save_to.tsv', sep = '\t', row.names = FALSE)
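Alternatively, if the readr package is available, write_tsv handles the separator and row names for you (a sketch, assuming readr is installed):

readr::write_tsv(silac_compare_tests, 'filename_to_save_to.tsv')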
Extended Exercise (Optional)
Compare the number of significant differences (at 1% FDR) for each method, stratifying by the number of peptides per protein. An example of how to visualise this is below, though you do not need to replicate this visualisation precisely!
How do you interpret this?
Hints:
- For the number of peptides, you want the deqms.count column. You need to bin this into reasonable bins. See e.g. ?Hmisc::cut2 for this.
Solution
# Start by tallying as in the code chunk above
silac_compare_tests %>%
  group_by('min_peptides' = Hmisc::cut2(deqms.count, cuts = c(1:4, 10)),
           't-test' = t.test.padj < 0.01,
           'deqms' = deqms.sca.adj.pval < 0.01) %>%
  tally() %>%
  # make a single column describing yes/no significant with each method and
  # rename to make it more intuitive
  mutate(compare_tests = recode(interaction(`t-test`, deqms),
                                'FALSE.FALSE' = 'Neither',
                                'TRUE.TRUE' = 'Both',
                                'TRUE.FALSE' = 't-test',
                                'FALSE.TRUE' = 'DEqMS')) %>%
  mutate(compare_tests = factor(compare_tests, levels = c('t-test', 'DEqMS', 'Both', 'Neither'))) %>%
  ggplot(aes(min_peptides, n, fill = compare_tests)) +
  geom_bar(stat = 'identity', position = 'fill') +
  geom_text(position = position_fill(vjust = 0.5), aes(label = n)) +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_fill_manual(values = c(get_cat_palette(3), 'grey'), name = '') +
  labs(x = 'Minimum peptides per sample', y = 'Fraction of proteins')

# Note that the 3 proteins which are only significant with the t-test have a
# minimum of 1 or 2 peptides per sample. These are likely to be poorly quantified
# and falsely identified as CL-enriched.
Solution end
sessionInfo()