knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete:
- t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For TMT, log-transformed abundances can be assumed to be Gaussian distributed. When we have one condition variable and we are comparing between two values of that variable in a TMT experiment (e.g. samples are treatment or control), a two-sample t-test is appropriate.
- ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions as the t-test (a minimal sketch is shown after this list).
- limma [@http://zotero.org/users/5634351/items/6KTXTWME]: Proteomics experiments are typically lowly replicated (e.g. n << 10). Variance estimates are therefore inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
- DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU]: limma assumes there is a relationship between protein abundance and variance. This is usually the case, although the relationship with variance is sometimes stronger with the number of peptide spectrum matches (for TMT experiments) or peptides (for LFQ). DEqMS extends limma to share information across proteins using this count rather than the mean abundance.
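To make the ANOVA/linear model option above more concrete, below is a minimal sketch using simulated log abundances for a single protein measured across three conditions. The data and object names are hypothetical and are not part of the analysis in this notebook.

# A minimal sketch (simulated data, not part of this notebook's analysis):
# one-way ANOVA / linear model for a design with three conditions
set.seed(42)
toy <- data.frame(
  abundance = rnorm(9, mean = rep(c(10, 10.5, 11), each = 3)), # simulated log2 abundances
  condition = factor(rep(c('control', 'treatment_A', 'treatment_B'), each = 3))
)

# One-way ANOVA across the three conditions
summary(aov(abundance ~ condition, data = toy))

# Equivalent linear model; coefficients are differences relative to 'control'
summary(lm(abundance ~ condition, data = toy))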
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)
library(uniprotREST)
Here, we will start with the TMT data processed in DQC PSM-level quantification and summarisation to protein-level abundance. Please see the previous notebook for details of the experimental design, aims and data processing.
As a reminder, this data comes from a published benchmark experiment where yeast peptides were spiked into human peptides at 3 known amounts to provide ground truth fold changes (see below). For more details, see: [@http://zotero.org/users/5634351/items/LG3W8G4T]
First, we read in the protein-level quantification data.
tmt_protein <- readRDS('./results/tmt_protein.rds')
To keep things simple, we will just focus on the comparison between the 1x and 2x yeast spike-in samples (the first 7 TMT tags).
tmt_protein <- tmt_protein[, 1:7]

# this is needed to make sure the spike factor doesn't contain unused levels (x6)
pData(tmt_protein)$spike <- droplevels(pData(tmt_protein)$spike)
We will use three approaches to identify proteins with significant differences in abundance:
- two-sample t-test
- moderated two-sample t-test (limma)
- moderated two-sample t-test (DEqMS)
To perform a t-test for each protein, we want to extract the quantification values in a long 'tidy' format. We can do this using the biobroom package.
tmt_protein_tidy <- tmt_protein %>%
  biobroom::tidy.MSnSet(addPheno=TRUE) %>% # addPheno=TRUE adds the pData so we have the sample information too
  filter(is.finite(value))
As an example of how to run a single t-test, let's subset to a single protein. First, we extract the quantification values for this single protein.
example_protein <- 'P40414'

tmt_protein_tidy_example <- tmt_protein_tidy %>%
  filter(protein == example_protein) %>%
  select(-sample)

print(tmt_protein_tidy_example)
Then we use t.test
to perform the t-test.
t.test.res <- t.test(formula=value~spike, data=tmt_protein_tidy_example, alternative='two.sided')

print(t.test.res)
We can use tidy
from the broom
package to return the t-test results in
a tidy tibble. The value of this will be seen in the next code chunk.
tidy(t.test.res)
We can now apply a t-test to every protein using dplyr group_by
and do
, making use of tidy
.
t.test.res.all <- tmt_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(formula=value~spike, data=., alternative='two.sided')))
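Note that do() is superseded in recent versions of dplyr. A sketch of an equivalent approach using nest() and map() is shown below; this assumes the purrr package is installed (it is not loaded above) and the output object name is just for illustration.

# Equivalent to the group_by()/do() approach above, using nest() and purrr::map()
t.test.res.all.alt <- tmt_protein_tidy %>%
  nest(data = -protein) %>%
  mutate(res = purrr::map(data, ~ tidy(t.test(value ~ spike, data = .x, alternative = 'two.sided')))) %>%
  unnest(res) %>%
  select(-data)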
Here are the results for the t-test for the example protein. As we can see, the 'estimate' column in t.test.res.all is the difference in mean protein abundance between the two spike-in levels (with 'estimate1' and 'estimate2' being the group means). The 'statistic' column is the t-statistic and the 'parameter' column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.
print(t.test.res)

t.test.res.all %>% filter(protein == example_protein)
When you are performing a lot of statistical tests at the same time, it's recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.
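As a quick illustration of what a well-behaved null p-value distribution looks like, we can simulate data in which the null hypothesis is true for every test (a hypothetical example, not part of the analysis below).

# Hypothetical example: p-values from 1000 two-sample t-tests where the null
# hypothesis is true should be approximately uniform between 0 and 1
set.seed(1)
null_p <- replicate(1000, t.test(rnorm(5), rnorm(5))$p.value)
hist(null_p, 20)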
hist(t.test.res.all$p.value, 20)
Discussion 1
What would you conclude from the p-value distribution above?
Solution
# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.2.
# This may indicate insufficient statistical power to detect some proteins that
# are truly differentially abundant.
Solution end
Since we have performed multiple tests, we want to calculate an adjusted p-value to avoid type I errors (false positives).
Here, we are using the Benjamini & Hochberg (1995) method to control the False Discovery Rate, i.e. the proportion of false positives among the rejected null hypotheses.
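As a brief illustration of what the adjustment does (hypothetical p-values, not from our data), BH scales each p-value by the number of tests divided by its rank and then enforces monotonicity.

# Hypothetical p-values from five tests (already sorted for clarity)
p <- c(0.001, 0.01, 0.02, 0.4, 0.9)
p.adjust(p, method = 'BH')

# Manual equivalent for sorted p-values: scale by n/rank, then take the
# cumulative minimum from the largest p-value downwards (capped at 1)
pmin(1, rev(cummin(rev(p * length(p) / seq_along(p)))))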
t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method='BH')

table(t.test.res.all$padj < 0.01)
At an FDR of 1%, we have r sum(t.test.res.all$padj<0.01)
proteins with a significant difference.
Proteomics experiments are typically lowly replicated (e.g. n << 10).
Variance estimates are therefore inaccurate. limma
[@http://zotero.org/users/5634351/items/6KTXTWME] is an R package that extends
the t-test/ANOVA/linear model testing framework to enable sharing of information
across features (here, proteins) to update the variance estimates. This decreases
false positives and increases statistical power.
Next, we create a design matrix describing the model and supply it, together with the protein abundance matrix, to limma::lmFit to fit the linear model. This returns an MArrayLM object, on which we use limma::eBayes to compute moderated test statistics.
exprs_for_limma <- exprs(tmt_protein)

# Performing the equivalent of a two-sample t-test
spike <- pData(tmt_protein)$spike

limma_design <- model.matrix(formula(~spike))

limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)
We can visualise the relationship between the average abundance and the variance using the limma::plotSA
function.
limma::plotSA(limma_fit)
Discussion 2
How would you interpret the plot above?
Solution
# There's a really clear relationship between protein abundance and variance!
Solution end
In this case, the variances will be shrunk towards a value that depends on the mean protein abundance vs variance trend.
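One way to see the effect of this shrinkage is to compare the raw residual variances with the moderated (posterior) variances stored in the fit object. This is a quick sketch using base graphics and the sigma, s2.post and Amean fields that lmFit/eBayes populate.

# Compare raw residual variances (sigma^2) with the moderated (posterior)
# variances (s2.post) that eBayes has shrunk towards the abundance-dependent trend
plot(limma_fit$Amean, limma_fit$sigma^2, log = 'y', pch = 16, cex = 0.3,
     xlab = 'Mean abundance', ylab = 'Variance (log scale)')
points(limma_fit$Amean, limma_fit$s2.post, pch = 16, cex = 0.3, col = 'red')
legend('topright', legend = c('raw', 'moderated'), col = c('black', 'red'), pch = 16)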
We can extract a results table like so.
# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='spikex2')
Below, we summarise the number of proteins with statistically different abundance in 2x vs 1x and plot a 'volcano' plot to visualise this.
table(limma_results$adj.P.Val < 0.01)

limma_results %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)')
# compare the t-test and limma results for the example protein
t.test.res.all %>% filter(protein == example_protein)

limma_results[example_protein,]
Discussion 3
Given the experimental design, would you expect proteins to have significant changes in both directions?
Solution
# Yes! We are mixing known quantities of human and yeast proteins together such
# that yeast proteins increase 2-fold in abundance between 2x and 1x samples, and human
# proteins decrease in abundance (to balance out the total amount of protein in the samples)
Solution end
It would make more sense to split the volcano plot by the species from which the protein derives. We can obtain this information from UniProt like so.
species <- uniprot_map(
  ids = rownames(tmt_protein),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "organism_name"
) %>%
  rename(c('UNIPROTKB'='From'))
Exercise 1
Merge the species and limma_results data.frames and re-plot the volcano plot, with one panel for each species. An example of the desired output is shown below.
Hint: see ?facet_wrap
Solution
limma_results %>%
  merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
  facet_wrap(~(gsub('\\(.*', '', Organism)))
Solution end
We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it's easier to see which column comes from which test and then merge the two test results.
tmt_compare_tests <- merge(
  setNames(limma_results, paste0('limma.', colnames(limma_results))),
  setNames(t.test.res.all, paste0('t.test.', colnames(t.test.res.all))),
  by.x='row.names', by.y='t.test.protein')
Below, we can compare the p-values from the two tests. Note that the p-value is almost
always lower for the moderated t-test with limma
than the standard t-test.
p <- ggplot(tmt_compare_tests) +
  aes(log10(t.test.p.value), log10(limma.P.Value)) +
  geom_point(size=0.2, alpha=0.2) +
  geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
  theme_camprot(border=FALSE) +
  labs(x='T-test log10(p-value)', y='limma log10(p-value)')

print(p)
Finally, we can compare the number of proteins with a significant difference
(using a 1% FDR threshold) according to each test. Using the t-test, there are r sum(tmt_compare_tests$t.test.padj<0.01)
significant differences, but with limma r sum(tmt_compare_tests$limma.adj.P.Val<0.01)
proteins have a significant difference.
tmt_compare_tests %>%
  group_by(t.test.padj < 0.01, limma.adj.P.Val < 0.01) %>%
  tally()
limma assumes there is a relationship between protein abundance and variance and, as we saw in the plotSA plot above, this is the case with our data. However, the relationship between variance and the number of peptides (for LFQ) or PSMs (for TMT) may be stronger still.
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU] is an alternative to limma [@http://zotero.org/users/5634351/items/6KTXTWME], which you can think of as an extension of limma specifically for proteomics: it uses the number of peptides (or PSMs) rather than the mean abundance to share information between proteins.
The analysis steps are taken from the
DEqMS vignette.
We start from the MArrayLM
we created for limma
analysis and then simply
add a $count
column to the MArrayLM
object and use the spectraCounteBayes
function to perform the Bayesian shrinkage using the count column, which describes
the number of peptides per protein. This is in contrast to limma
, which uses the
$Amean
column, which describes the mean protein abundance.
To define the $count
column, we need to summarise the number of PSMs per protein.
In the DEqMS paper, they suggest that the best summarisation metric to use is the
minimum value across the samples, so our count
column is the minimum number of
PSMs per protein.
tmt_psm_res <- readRDS('./results/psm_filt.rds')

# Obtain the PSM count per protein in each sample and determine the minimum
# value across the samples
min_psm_count <- camprotR::count_features_per_protein(tmt_psm_res) %>%
  merge(tmt_protein_tidy,
        by.x=c('Master.Protein.Accessions', 'sample'),
        by.y=c('protein', 'sample')) %>%
  group_by(Master.Protein.Accessions) %>%
  summarise(min_psm_count = min(n))

# add the min PSM count (this assumes the proteins are in the same order as in limma_fit)
limma_fit$count <- min_psm_count$min_psm_count
And now we run spectraCounteBayes
from DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))
Below, we inspect the peptide count vs variance relationship which DEqMS
is
using in the statistical test.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "PSMs")
Below, we summarise the number of proteins with statistically different abundance in 2x vs 1x and plot a 'volcano' plot to visualise this.
deqms_results <- outputResult(efit_deqms, coef_col=2)

table(deqms_results$sca.adj.pval < 0.01)

deqms_results %>%
  merge(species, by.x='row.names', by.y='UNIPROTKB') %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point(size=0.5) +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = '2x vs 1x Sig.') +
  labs(x = '2x vs 1x (Log2)', y = '-log10(p-value)') +
  facet_wrap(~(gsub('\\(.*', '', Organism)))
We can compare the results of limma and DEqMS by considering the number of significant differences. Note that the limma results are also contained within the results from DEqMS. The $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca. are from DEqMS.
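For example, we can view both sets of statistics side by side for our example protein; a quick check using the column names produced by outputResult above.

# limma and DEqMS (sca.) statistics for the example protein
deqms_results[example_protein,
              c('logFC', 't', 'P.Value', 'adj.P.Val', 'sca.t', 'sca.P.Value', 'sca.adj.pval')]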
deqms_results %>%
  group_by(limma_sig = adj.P.Val < 0.01, DEqMS_sig = sca.adj.pval < 0.01) %>%
  tally()
We can compare the results of limma and DEqMS by considering the p-values.
deqms_results %>%
  ggplot() +
  aes(P.Value, sca.P.Value) +
  geom_point(size=0.5, alpha=0.5) +
  geom_abline(slope=1, colour=get_cat_palette(2)[2], size=1, linetype=2) +
  theme_camprot(border=FALSE) +
  labs(x='limma p-value', y='DEqMS p-value') +
  scale_x_log10() +
  scale_y_log10()
Discussion 4
The p-values from limma and DEqMS are very well correlated, despite the two methods using a different dependent variable to shrink the variance (limma = mean expression, DEqMS = the number of PSMs). Why might this be?
Solution
# This suggests that the dependent variables are likely to be highly correlated,
# which makes sense since we are using sum summarisation from PSM -> protein,
# so proteins with more PSMs will likely have a higher abundance.
# We can check this like so
deqms_results %>%
  ggplot() +
  aes(Hmisc::cut2(count, g=20), AveExpr) +
  geom_boxplot() +
  theme_camprot(border=FALSE) +
  labs(x='PSMs', y='Average abundance') +
  theme(axis.text.x=element_text(angle=45, vjust=1, hjust=1))
Solution end
Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (rds) to read back into a later R notebook, or a flatfile (.tsv) to read with e.g. Excel.
# These lines are not run and are examples only
saveRDS(deqms_results, 'filename_to_save_to.rds')
write.table(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)
sessionInfo()