knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means exhaustive.
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For SILAC, log-transformed ratios can be assumed to be Gaussian distributed. When we have one condition variable in a SILAC experiment and the comparison is captured within the ratio itself (e.g. the SILAC ratio compares treatment and control), a one-sample t-test against zero is appropriate. If the SILAC ratio captures one condition variable, with another condition variable varying across samples (e.g. the ratio is treatment vs control and samples come from two different cell lines), a two-sample t-test is appropriate.
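For illustration, here is a toy sketch of the two cases with simulated log2 ratios (hypothetical values, not from this dataset):

set.seed(1)

# Simulated CL vs NC log2 ratios for four replicates in each of two cell lines
ratios_line_a <- rnorm(4, mean = 1)
ratios_line_b <- rnorm(4, mean = 0)

# One-sample t-test: is the mean log2 ratio different from zero?
t.test(ratios_line_a, mu = 0)

# Two-sample t-test: do the log2 ratios differ between the two cell lines?
t.test(ratios_line_a, ratios_line_b)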
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, under the same assumptions as the t-test.
limma [@http://zotero.org/users/5634351/items/6KTXTWME]: Proteomics experiments are typically lowly replicated (e.g. n << 10), so per-protein variance estimates are inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
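To see what this moderation does in practice, here is a minimal sketch on simulated data (hypothetical values, not from this experiment): eBayes() shrinks the per-protein variances towards a common prior.

library(limma)

set.seed(42)

# 100 'proteins' with 4 replicate log-ratios each, true mean zero
m <- matrix(rnorm(400), nrow = 100)

# Intercept-only design, as for a one-sample test
fit <- lmFit(m, design = cbind(Intercept = rep(1, 4)))
efit <- eBayes(fit)

# Raw per-protein variances vs moderated (posterior) variances:
# the moderated values are shrunk towards the common prior variance
summary(fit$sigma^2)
summary(efit$s2.post)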
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU]: limma assumes there is a relationship between protein abundance and variance. This is usually the case, although for SILAC data the relationship between variance and the number of peptides per protein is often stronger. As such, DEqMS, which models the variance with respect to the peptide count, should be preferred over limma.
Here, we will perform statistical analyses on SILAC data.
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)
Here, we will start with the SILAC data processed in Processing and QC of SILAC data. Please see the previous notebook for details of the experimental design and aim and data processing.
In brief, we wish to determine the proteins which are significantly enriched by UV crosslinking (CL) vs non-CL control (NC).
First, we read in the protein-level ratios obtained in the above notebooks. We will filter out proteins which are not present in at least 3/4 replicates.
silac_protein <- readRDS('results/prot_res.rds') %>%
  filterNA(pNA = 1/4) # max 1/4 missing
We will use two approaches:

1. A one-sample t-test for each protein
2. A moderated one-sample t-test for each protein, using DEqMS
To perform a t-test for each protein, we want to extract the quantification values in a long 'tidy' format. We can do this using the biobroom package.
silac_protein_tidy <- silac_protein %>%
  biobroom::tidy.MSnSet() %>%
  filter(is.finite(value))
As an example of how to run a single t-test, let's subset to a single protein. First, we extract the quantification values for this protein.

example_protein <- 'O00159'

silac_protein_tidy_example <- silac_protein_tidy %>%
  filter(protein == example_protein)

print(silac_protein_tidy_example)
Then we use t.test to perform the t-test. Since we are giving only a single vector of values, a one-sample t-test is performed.
t.test.res <- t.test(silac_protein_tidy_example$value, alternative = 'two.sided')
print(t.test.res)
We can use tidy
from the broom
package to return the t-test results in
a tidy tibble. The value of this will be seen in the next code chunk.
tidy(t.test.res)
We can now apply a t-test to every protein using dplyr's group_by and do, making use of tidy.
t.test.res.all <- silac_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(.$value, alternative = 'two.sided')))
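Note that do is superseded in recent versions of dplyr; here is an equivalent sketch using group_modify (assuming dplyr >= 1.0):

t.test.res.all.v2 <- silac_protein_tidy %>%
  group_by(protein) %>%
  group_modify(~ tidy(t.test(.x$value, alternative = 'two.sided'))) %>%
  ungroup()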
Here are the results of the t-test for the example protein. As we can see, the 'estimate' column in t.test.res.all is the mean log2 ratio. The 'statistic' column is the t-statistic and the 'parameter' column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.
print(t.test.res)

t.test.res.all %>%
  filter(protein == example_protein)
When you are performing many statistical tests at the same time, it's recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution between 0 and 1 for those tests where the null hypothesis is true. Statistically significant tests will show as a peak of very low p-values. If there is a clear skew in the otherwise uniform distribution, or peaks other than in the smallest p-value bin, that may indicate that the assumptions of the test are not valid, for some or all tests.
hist(t.test.res.all$p.value, 20)
Discussion 1
What would you conclude from the p-value distribution above?
Solution
# Here, we have so many significant tests that the uniform distribution is hard to assess!
# Note that, beyond the clear peak for very low p-values (<0.05), we also have a
# slight skew towards low p-values in the range 0.05-0.15.
# This may indicate insufficient statistical power to detect some proteins that
# are truly enriched upon CL.
Solution end
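For intuition, here is a minimal simulation (hypothetical data, not from this experiment) of the null case, where the p-value histogram should be approximately flat:

set.seed(1)

# 1000 one-sample t-tests on values drawn from the null (true mean of zero)
null_p <- replicate(1000, t.test(rnorm(4))$p.value)
hist(null_p, 20)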
Since we have performed multiple tests, we want to calculate adjusted p-values to avoid type I errors (false positives). Here, we are using the Benjamini & Hochberg (1995) method to control the False Discovery Rate, i.e. the proportion of false positives among the rejected null hypotheses.
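As a toy illustration of how the BH procedure rescales p-values (hypothetical values):

# Each p-value is scaled by (number of tests / rank), with monotonicity enforced
p.adjust(c(0.001, 0.01, 0.02, 0.04, 0.5), method = 'BH')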
t.test.res.all$padj <- p.adjust(t.test.res.all$p.value, method = 'BH')

table(t.test.res.all$padj < 0.01)
At an FDR of 1%, we have `r sum(t.test.res.all$padj<0.01)` proteins with a significant difference.
Finally, a differential abundance test wouldn't be complete without a volcano plot!
Note that all proteins with a statistically significant change are increased in CL, so it doesn't resemble the typical volcano plot.
t.test.res.all %>%
  ggplot(aes(x = estimate, y = -log10(p.value), colour = padj < 0.01)) +
  geom_point() +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Discussion 2
SILAC ratios are equivalent to the absolute abundance ratios between two or more samples. If this experiment had been performed using LFQ or TMT, which quantify the relative abundances in each sample, how would you expect the above volcano plot to look?
Solution
# It would likely be much more symmetrical, since the trend towards increased abundance
# in CL would be masked by the comparison of the relative abundances in CL and NC.
# Proteins which are highly enriched by CL would have reduced ratios and those with
# slighter enrichment may even appear to be relatively depleted!
Solution end
To identify proteins with significantly increased abundance in CL vs NC, we will use DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU], which you can think of as an extension of limma [@http://zotero.org/users/5634351/items/6KTXTWME] specifically for proteomics.
The analysis steps are taken from the
DEqMS vignette.
We first want to create an MArrayLM object, as per a standard limma analysis. We then add a $count column to the MArrayLM object and use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column, which describes the number of peptides per protein. This is in contrast to limma, which uses the $Amean column, which describes the mean CL:NC ratio.
First, we create the MArrayLM
object.
dat <- silac_protein %>%
  exprs()

# Performing the equivalent of a one-sample t-test, so the design matrix is
# just an intercept
design <- cbind(Intercept = rep(1, ncol(dat)))

fit <- lmFit(dat, design)
efit <- eBayes(fit)
Next, we define the $count column, which DEqMS will use. The DEqMS paper suggests that the best summarisation metric is the minimum value across the samples, so our count column is the minimum number of peptides per protein.
protein_ratio_long <- silac_protein %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols = -Master.Protein.Accessions, values_to = 'protein_ratio', names_to = 'sample')

pep_res <- readRDS('results/pep_res.rds')

# Obtain the peptide count per protein in each sample and determine the
# minimum count across samples
min_pep_count <- camprotR::count_features_per_protein(pep_res) %>%
  merge(protein_ratio_long, by = c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(protein_ratio)) %>% # only consider samples with a quantified ratio
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# add the min peptide count
efit$count <- min_pep_count$min_pep_count
And now we run spectraCounteBayes
from DEqMS
to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(efit))
Below, we inspect the peptide count vs variance relationship which DEqMS
is
using in the statistical test.
In this case the relationship between peptide count and variance is only really
apparent when the minimum number of peptides from which a protein-level ratio is
obtained is very low. Typically, one might remove quantification values where there is
just a single peptide for a protein. Here, we have kept them, and we can refer to
the count
column in the final results object if we want to check the minimum
number of peptides observed per sample.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in CL vs NC and plot a 'volcano' plot to visualise this.
Here, all proteins with a statistically significant change are increased in CL, so the plot looks more like a fire-hose than a volcano!
deqms_results <- outputResult(efit_deqms, coef_col = 1)

deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'CL vs NC Sig.') +
  labs(x = 'CL vs NC (Log2)', y = '-log10(p-value)')
Finally, we can explore and, if we wish, export our results. Importantly, the $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca. are from DEqMS.
head(deqms_results)
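If you did want to exclude proteins quantified by a single peptide in some sample, as discussed above, here is a minimal sketch using the count column (the filtered object name is hypothetical):

# Retain only proteins with at least two peptides in every sample
deqms_results_filtered <- deqms_results %>%
  filter(count > 1)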
We can now compare the results from the t-test and the moderated t-test (DEqMS). Below, we update the column names so it's easier to see which column comes from which test.
colnames(t.test.res.all) <- paste0('t.test.', colnames(t.test.res.all))
colnames(deqms_results) <- paste0('deqms.', colnames(deqms_results))
And then merge the two test results.
silac_compare_tests <- merge(deqms_results,
                             t.test.res.all,
                             by.x = 'row.names',
                             by.y = 't.test.protein')
Exercise
Compare the effect size estimates from the two methods with a scatter plot. Can you explain what you observe?
Hints:
- For the t-test, you want the 't.test.estimate' column
- For the moderated t-test, you want the 'deqms.logFC' column
Solution
ggplot(silac_compare_tests) +
  aes(t.test.estimate, deqms.logFC) +
  geom_point() +
  geom_abline(slope = 1) +
  theme_camprot(border = FALSE) +
  labs(x = 't-test mean ratio', y = 'DEqMS mean ratio')

# The ratios are the same. DEqMS (and limma) are moderating the test statistics,
# but not the estimated fold-change.
Solution end
We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with DEqMS than the standard t-test.
p <- ggplot(silac_compare_tests) +
  aes(log10(t.test.p.value), log10(deqms.P.Value)) +
  geom_point() +
  geom_abline(slope = 1, linetype = 2, colour = get_cat_palette(1), size = 1) +
  theme_camprot(border = FALSE) +
  labs(x = 'T-test log10(p-value)', y = 'DEqMS log10(p-value)')

print(p)
Finally, we can compare the number of proteins with a significant difference (using a 1% FDR threshold) according to each test. Using DEqMS more than doubles the number of null hypothesis rejections. Only 3 proteins are significant with the t-test but not the moderated t-test.
silac_compare_tests %>%
  group_by(t.test.padj < 0.01, deqms.sca.adj.pval < 0.01) %>%
  tally()
Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (rds) to read back into a later R notebook, or a flatfile (.tsv) to read with e.g. Excel.
# These lines are not run and are examples only

# Saving to R binary format. Can read back into memory with readRDS().
saveRDS(silac_compare_tests, 'filename_to_save_to.rds')

# Saving to tab-separated (human-readable) flatfile.
# Note: write.csv() always uses commas, so we use write.table() for a TSV
write.table(silac_compare_tests, 'filename_to_save_to.tsv', sep = '\t', row.names = FALSE)
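Alternatively, if the readr package is available, write_tsv handles the separator and row names for you (a sketch, assuming readr is installed):

readr::write_tsv(silac_compare_tests, 'filename_to_save_to.tsv')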
Extended Exercise (Optional)
Compare the number of significant differences (at 1% FDR) for each method, stratifying by the number of peptides per protein. An example of how to visualise this is below, though you do not need to replicate this visualisation precisely!
How do you interpret this?
Hints:
- For the number of peptides, you want the deqms.count column. You need to bin this into reasonable bins. See e.g. ?Hmisc::cut2 for this.
Solution
# Start by tallying as in the code chunk above
silac_compare_tests %>%
  group_by('min_peptides' = Hmisc::cut2(deqms.count, cuts = c(1:4, 10)),
           't-test' = t.test.padj < 0.01,
           'deqms' = deqms.sca.adj.pval < 0.01) %>%
  tally() %>%
  # make a single column describing yes/no significant with each method and
  # rename to make it more intuitive
  mutate(compare_tests = recode(interaction(`t-test`, deqms),
                                'FALSE.FALSE' = 'Neither',
                                'TRUE.TRUE' = 'Both',
                                'TRUE.FALSE' = 't-test',
                                'FALSE.TRUE' = 'DEqMS')) %>%
  mutate(compare_tests = factor(compare_tests, levels = c('t-test', 'DEqMS', 'Both', 'Neither'))) %>%
  ggplot(aes(min_peptides, n, fill = compare_tests)) +
  geom_bar(stat = 'identity', position = 'fill') +
  geom_text(position = position_fill(vjust = 0.5), aes(label = n)) +
  theme_camprot(border = FALSE, base_size = 15) +
  scale_fill_manual(values = c(get_cat_palette(3), 'grey'), name = '') +
  labs(x = 'Minimum peptides per sample', y = 'Fraction of proteins')

# Note that the 3 proteins which are only significant with the t-test have a
# minimum of 1 or 2 peptides per sample. These are likely to be poorly quantified
# and falsely identified as CL-enriched.
Solution end
sessionInfo()