knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are a number of statistical tests/R packages one can use to perform differential abundance testing for proteomics data. The list below is by no means complete.
t-test: If we assume that the quantification values are Gaussian distributed, a t-test may be appropriate. For LFQ, log-transformed abundances can be assumed to be Gaussian distributed. When we have one condition variable and we are comparing between two values of that variable in an LFQ experiment (e.g. samples are treatment or control), a two-sample t-test is appropriate.
ANOVA/linear model: Where a more complex experimental design is involved, an ANOVA or linear model can be used, on the same assumptions as the t-test.
limma [@http://zotero.org/users/5634351/items/6KTXTWME]: Proteomics experiments are typically lowly replicated (e.g. n << 10). Variance estimates are therefore inaccurate. limma is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU]: limma assumes there is a relationship between protein abundance and variance. This is usually the case, although with LFQ the relationship between variance and the number of peptides may be stronger. DEqMS extends limma by using the number of peptides, rather than the mean abundance, to share information between proteins.
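To make the link between the t-test and the linear model entries above concrete, here is a minimal sketch using simulated log2 abundances for a single protein. The values, group sizes and object names are made up for illustration and are not taken from the LFQ data used later.

```r
# Simulated log2 abundances for one protein (hypothetical values)
set.seed(1)
abundance <- c(rnorm(4, mean = 20, sd = 0.5),  # control
               rnorm(4, mean = 21, sd = 0.5))  # treatment
condition <- factor(rep(c("control", "treatment"), each = 4))

# Two-sample t-test assuming equal variances
tt <- t.test(abundance ~ condition, var.equal = TRUE)

# The same comparison expressed as a linear model
fit <- summary(lm(abundance ~ condition))

# Identical p-values. The t-statistics differ only in sign because
# t.test reports control - treatment and lm reports treatment - control
c(t_test = tt$p.value, lm = fit$coefficients["conditiontreatment", "Pr(>|t|)"])
```

The linear model form is the one that generalises to more complex designs, since further terms can be added to the formula.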
Here, we will perform statistical analyses on LFQ data.
These are examples only and the code herein is unlikely to be directly applicable to your own dataset.
Load the required libraries.
library(camprotR)
library(ggplot2)
library(MSnbase)
library(DEqMS)
library(limma)
library(dplyr)
library(tidyr)
library(broom)
library(biobroom)
Here, we will start with the LFQ data processed in Data processing and QC of LFQ data. Please see the previous notebook for details of the experimental design and aim and data processing.
First, we read in the protein-level ratios obtained in the above notebooks.
lfq_protein <- readRDS('./results/lfq_prot_robust.rds')
In brief, we wish to determine the proteins which are significantly depleted by RNase treatment.
We will use three approaches:
- paired t-test
- moderated paired t-test (limma)
- moderated paired t-test (DEqMS)
To perform a t-test for each protein, we want to extract the quantification values in a long 'tidy' format and then re-structure so we have one column each for RNase +/-. We can do this using the biobroom package.
We will also remove proteins which are not quantified in both conditions (RNase +/-) in at least 3 of the 4 replicates.
lfq_protein_tidy <- lfq_protein %>%
  biobroom::tidy.MSnSet() %>%
  separate(sample.id, into=c(NA, 'RNase', 'replicate')) %>%
  pivot_wider(names_from=RNase, values_from=value) %>%
  filter(is.finite(neg), is.finite(pos)) %>%
  group_by(protein) %>%
  filter(length(protein)>=3)
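The reshaping and filtering logic can be hard to follow without seeing the data, so here is a minimal sketch of the same idea on a toy table using base R only. The protein names and values are hypothetical, and base `reshape` stands in for `pivot_wider`.

```r
# Toy long-format table (hypothetical values): 2 proteins x 2 conditions x 4 replicates
long <- expand.grid(replicate = 1:4, RNase = c("neg", "pos"), protein = c("P1", "P2"),
                    stringsAsFactors = FALSE)
long$value <- seq_len(nrow(long)) / 2

# Make P2 unquantified in two of the RNase + replicates
long$value[long$protein == "P2" & long$RNase == "pos" & long$replicate %in% 1:2] <- NA

# One row per protein and replicate, one column per RNase condition
wide <- reshape(long, idvar = c("protein", "replicate"), timevar = "RNase",
                direction = "wide")

# Keep replicates quantified in both conditions
wide <- wide[is.finite(wide$value.neg) & is.finite(wide$value.pos), ]

# Keep proteins with at least 3 such replicates
n_reps <- table(wide$protein)
wide <- wide[wide$protein %in% names(n_reps)[n_reps >= 3], ]
unique(wide$protein)
```

Only P1 survives, since P2 has just two replicates with both conditions quantified.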
As an example of how to run a single t-test, let's subset to a single protein. First, we extract the quantification values for this single protein.
example_protein <- 'A5YKK6'

lfq_protein_tidy_example <- lfq_protein_tidy %>%
  filter(protein==example_protein)

print(lfq_protein_tidy_example)
Then we use t.test to perform the t-test. We are giving two vectors of values and the switch paired=TRUE so that a paired two-sample t-test is performed.
t.test.example <- t.test(
  lfq_protein_tidy_example$pos,
  lfq_protein_tidy_example$neg,
  alternative='two.sided',
  var.equal=FALSE,
  paired=TRUE)

print(t.test.example)
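It may help to see what the paired test is doing under the hood: it is a one-sample t-test on the per-replicate differences. A minimal sketch with made-up log2 abundances for one protein (not values from this dataset):

```r
# Hypothetical log2 abundances for one protein in 4 paired replicates
pos <- c(18.2, 18.9, 18.4, 18.6)
neg <- c(19.5, 19.8, 19.9, 19.6)

paired <- t.test(pos, neg, paired = TRUE)
one_sample <- t.test(pos - neg, mu = 0)

# Same t-statistic, degrees of freedom (n - 1 = 3) and p-value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```

This is why pairing matters here: replicate-to-replicate variation that affects both conditions equally cancels out in the differences.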
We can use tidy from the broom package to return the t-test results in a tidy tibble. The value of this will be seen in the next code chunk.
head(broom::tidy(t.test.example))
We can now apply a t-test to every protein using dplyr group_by and do, making use of tidy.
t.test.all <- lfq_protein_tidy %>%
  group_by(protein) %>%
  do(tidy(t.test(.$pos, .$neg, paired=TRUE, alternative='two.sided')))
Here are the results for the t-test for the example protein. As we can see, the 'estimate' column in t.test.all is the mean log2 ratio. The 'statistic' column is the t-statistic and the 'parameter' column is the degrees of freedom for the t-statistic. All the values are identical since we have performed the exact same test with both approaches.
print(t.test.example)

t.test.all %>%
  filter(protein==example_protein)
When you are performing a lot of statistical tests at the same time, it's recommended practice to plot the p-value distribution. If the assumptions of the test are valid, one expects a uniform distribution from 0-1 for those tests where the null hypothesis should not be rejected. Statistically significant tests will show as a peak of very low p-values. If there are very clear skews in the uniform distribution, or strange peaks other than in the smallest p-value bin, that may indicate the assumptions of the test are not valid, for some or all tests.
There is a clear peak for very low p-values (<0.05) and an approximately uniform distribution across the rest of the p-value range, which is what we want.
hist(t.test.all$p.value)
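To get a feel for what a well-behaved p-value histogram looks like, we can simulate one. This is a sketch only: the number of tests, the replicate number and the effect size are arbitrary choices and are not meant to mimic this dataset.

```r
set.seed(42)
n_tests <- 2000

# Null tests: both groups are drawn from the same distribution
p_null <- replicate(n_tests, t.test(rnorm(4), rnorm(4))$p.value)

# Tests with a real difference (hypothetical effect size of 3 standard deviations)
p_alt <- replicate(n_tests / 4, t.test(rnorm(4), rnorm(4, mean = 3))$p.value)

# Roughly flat histogram from the null tests, plus a peak near zero from the true effects
hist(c(p_null, p_alt), breaks = 20, xlab = "p-value", main = "Simulated p-values")
```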
Since we have performed multiple tests, we want to calculate an adjusted p-value to control the number of type I errors (false positives).
Here, we are using the Benjamini and Hochberg (1995) method to control the False Discovery Rate, i.e. the expected proportion of false positives among the rejected null hypotheses.
t.test.all$padj <- p.adjust(t.test.all$p.value, method='BH')
The number of proteins with a significant difference at an FDR of 1% is:
sum(t.test.all$padj<0.01)
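For readers curious about what the BH adjustment does, the calculation can be reproduced by hand. The five p-values below are hypothetical and chosen only to make the arithmetic easy to follow.

```r
# Hypothetical p-values from five tests
p <- c(0.001, 0.008, 0.039, 0.041, 0.6)
m <- length(p)

# BH: scale the i-th smallest p-value by m / i, then enforce monotonicity
# by taking the cumulative minimum from the largest p-value downwards
ord <- order(p, decreasing = TRUE)
ranks <- m:1
padj_manual <- pmin(1, cummin(p[ord] * m / ranks))[order(ord)]

# The manual calculation and p.adjust agree
padj_manual
p.adjust(p, method = "BH")
```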
Proteomics experiments are typically lowly replicated (e.g. n << 10). Variance estimates are therefore inaccurate. limma [@http://zotero.org/users/5634351/items/6KTXTWME] is an R package that extends the t-test/ANOVA/linear model testing framework to enable sharing of information across features (here, proteins) to update the variance estimates. This decreases false positives and increases statistical power.
We will first reconstruct an MSnSet from our filtered data since it's easier to work with limma using this standard proteomics object.
filtered_exprs <- lfq_protein_tidy %>%
  pivot_longer(cols=c(neg, pos), names_to='RNase') %>%
  mutate(sample=paste0('RNase_', RNase, '.', replicate)) %>%
  pivot_wider(names_from=sample, values_from=value, id_cols=protein) %>%
  tibble::column_to_rownames('protein') %>%
  as.matrix()

filtered_lfq_protein <- MSnSet(
  exprs=filtered_exprs,
  fData=fData(lfq_protein)[rownames(filtered_exprs),],
  pData=pData(lfq_protein)[colnames(filtered_exprs),])
Next, we extract the quantification matrix and create a design matrix. We then supply these to limma::lmFit to fit the linear model according to the design, which returns an MArrayLM object, and then use limma::eBayes to compute moderated test statistics.
exprs_for_limma <- exprs(filtered_lfq_protein)

# Performing the equivalent of a paired t-test:
# replicate is included in the model to account for the pairing
condition <- pData(filtered_lfq_protein)$Condition
replicate <- pData(filtered_lfq_protein)$Replicate

limma_design <- model.matrix(formula(~replicate+condition))

limma_fit <- lmFit(exprs_for_limma, limma_design)
limma_fit <- eBayes(limma_fit, trend=TRUE)
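It is worth inspecting the design matrix before fitting, since its columns define the coefficients that can be tested. The sketch below uses made-up sample annotations for a 4-replicate RNase +/- design. How Replicate is actually stored in the pData of the object above is not shown here, so both the factor and numeric cases are illustrated.

```r
# Hypothetical sample annotations mirroring a 4-replicate RNase +/- design
condition <- rep(c("RNase_neg", "RNase_pos"), each = 4)
replicate <- rep(1:4, times = 2)

# Replicate as a factor: one coefficient per replicate (block) plus the condition effect
design_factor <- model.matrix(~ factor(replicate) + condition)
colnames(design_factor)

# Replicate as a number: a single slope term is fitted instead
design_numeric <- model.matrix(~ replicate + condition)
colnames(design_numeric)
```

In both cases the last column is named `conditionRNase_pos`, but its position differs, so selecting a coefficient by name is safer than selecting it by column index.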
We can visualise the relationship between the average abundance and the variance using the limma::plotSA function.
limma::plotSA(limma_fit)
Discussion 1
How would you interpret the plot above?
Solution
# Surprisingly, there's no clear relationship between protein abundance and variance
Solution end
Despite the lack of a strong relationship between protein abundance and variance, we will continue with limma regardless, since it will still increase the effective degrees of freedom with which the protein-wise variances are estimated. In this case, the variances will be shrunk towards a similar value, regardless of the mean protein abundance.
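The shrinkage described above can be written down directly. In limma's empirical Bayes model, the moderated variance is a weighted average of the observed variance and a prior variance, weighted by their degrees of freedom. The numbers below are hypothetical; in a real analysis limma estimates the prior variance and prior degrees of freedom from all proteins.

```r
# Hypothetical values; limma estimates d0 and s0_sq from the data
s_sq  <- c(0.02, 0.10, 0.50)  # per-protein sample variances
d     <- 3                    # residual degrees of freedom per protein
s0_sq <- 0.10                 # prior variance
d0    <- 4                    # prior degrees of freedom

# Moderated variance: weighted average of the prior and observed variances
s_sq_post <- (d0 * s0_sq + d * s_sq) / (d0 + d)
s_sq_post

# The moderated t-statistic is compared to a t-distribution with d0 + d degrees of freedom
d0 + d
```

Small variances are pulled up and large variances are pulled down towards the prior, and the extra degrees of freedom are where the gain in power comes from.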
We can extract a results table like so:
# use colnames(limma_fit$coefficients) to identify the coefficient names
limma_results <- topTable(limma_fit, n=Inf, coef='conditionRNase_pos')
Below, we summarise the number of proteins with statistically different abundance in RNase +/- and plot a 'volcano' plot to visualise this.
table(limma_results$adj.P.Val<0.01)

limma_results %>%
  ggplot(aes(x = logFC, y = -log10(adj.P.Val), colour = adj.P.Val < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
  labs(x = 'RNase +/- (Log2)', y = '-log10(adjusted p-value)')
Discussion 2
- Given the experimental design, would you expect proteins to have significant changes in both directions?
- How would you interpret the fold changes identified with respect to the absolute protein abundances?
- Does your answer to question 2 affect your expectation in question 1?
Solution
# 1. We are starting from a single OOPS interface and adding +/- RNase, then recollecting the interface. RBPs should be depleted, but there is no way for any protein to be enriched by this process. Thus, we wouldn't expect proteins with a positive RNase +/- ratio!
# 2. The protein abundances are relative to the total amount of protein in each sample. Thus, the fold changes are relative to the overall fold-change in the amount of protein in each sample. For example, if there is a global loss of protein in RNase + samples, a positive RNase +/- ratio only represents an increase relative to the global loss, and could be a negative RNase +/- ratio in absolute terms!
# 3. We are seeing positive RNase +/- ratios because there is a global difference in the amount of protein and we are using a relative protein abundance quantification approach.
Solution end
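A small numerical example may make the point about relative quantification clearer. The amounts below are invented, and dividing by the sample total is a deliberate simplification of how relative abundances arise in practice.

```r
# Hypothetical absolute protein amounts (arbitrary units) at the interface
rnase_neg <- c(RBP1 = 100, RBP2 = 80, nonRBP = 20)
rnase_pos <- c(RBP1 = 10,  RBP2 = 8,  nonRBP = 18)

# Absolute log2 ratios: every protein is depleted to some degree
abs_ratio <- log2(rnase_pos / rnase_neg)
abs_ratio

# Relative quantification: express each protein as a fraction of the sample total
rel_neg <- rnase_neg / sum(rnase_neg)
rel_pos <- rnase_pos / sum(rnase_pos)

# Relative log2 ratios: the non-RBP now appears enriched
rel_ratio <- log2(rel_pos / rel_neg)
rel_ratio
```

The non-RBP is slightly depleted in absolute terms, yet it has a positive relative ratio because the RBPs have been lost to a much greater extent.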
We can now compare the results from the t-test and the moderated t-test (limma). Below, we update the column names so it's easier to see which column comes from which test and then merge the two test results.
lfq_compare_tests <- merge(
  setNames(limma_results, paste0('limma.', colnames(limma_results))),
  setNames(t.test.all, paste0('t.test.', colnames(t.test.all))),
  by.x='row.names',
  by.y='t.test.protein')
Exercise
Compare the effect size estimates from t-test vs the moderated t-tests. Can you explain what you observe?
Hints:
- For the t-test, you want the 't.test.estimate' column
- For the moderated t-test, you can use the 'limma.logFC' column
Solution
ggplot(lfq_compare_tests) +
  aes(t.test.estimate, limma.logFC) +
  geom_point() +
  geom_abline(slope=1) +
  theme_camprot(border=FALSE) +
  labs(x='t-test logFC', y='limma logFC')

# The logFC are the same! Remember that limma is not changing the underlying data,
# just moderating the test statistics.
Solution end
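The agreement in effect sizes can be checked on a single protein without limma. In the sketch below, a paired t-test is compared with an ordinary linear model that includes replicate as a blocking factor. The abundances are hypothetical, and treating replicate as a factor is an assumption made for this example.

```r
# Hypothetical log2 abundances for one protein across 4 paired replicates
pos <- c(18.2, 18.9, 18.4, 18.6)
neg <- c(19.5, 19.8, 19.9, 19.6)

dat <- data.frame(
  abundance = c(neg, pos),
  condition = factor(rep(c("RNase_neg", "RNase_pos"), each = 4)),
  replicate = factor(rep(1:4, times = 2)))

paired <- t.test(pos, neg, paired = TRUE)
coefs <- summary(lm(abundance ~ replicate + condition, data = dat))$coefficients

# The condition coefficient equals the mean paired difference (the logFC)
c(t_test = unname(paired$estimate), lm = coefs["conditionRNase_pos", "Estimate"])
```

Without moderation, the t-statistic and p-value from the two approaches also match. limma changes only the variance estimate that goes into the test statistic.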
We can also compare the p-values from the two tests. Note that the p-value is almost always lower for the moderated t-test with limma than the standard t-test.
p <- ggplot(lfq_compare_tests) +
  aes(log10(t.test.p.value), log10(limma.P.Value)) +
  geom_point() +
  geom_abline(slope=1, linetype=2, colour=get_cat_palette(1), size=1) +
  theme_camprot(border=FALSE) +
  labs(x='T-test log10(p-value)', y='limma log10(p-value)')

print(p)
Finally, we can compare the number of proteins with a significant difference (using a 1% FDR threshold) according to each test, by tallying the proteins called significant by the t-test, by limma, by both or by neither.
lfq_compare_tests %>%
  group_by(t.test.padj<0.01, limma.adj.P.Val<0.01) %>%
  tally()
limma assumes there is a relationship between protein abundance and variance. This is usually the case, although we have seen above that this isn't so with our data. For LFQ, the relationship between variance and the number of peptides may be stronger.
DEqMS [@http://zotero.org/users/5634351/items/RTM6NFVU] is an alternative to limma, which you can think of as an extension of limma [@http://zotero.org/users/5634351/items/6KTXTWME] specifically for proteomics, which uses the number of peptides rather than mean abundance to share information between proteins.
The analysis steps are taken from the DEqMS vignette.
We start from the MArrayLM object we created for the limma analysis and then simply add a $count column to it and use the spectraCounteBayes function to perform the Bayesian shrinkage using the count column, which describes the number of peptides per protein. This is in contrast to limma, which uses the $Amean column, which describes the mean protein abundance.
To define the $count column, we need to summarise the number of peptides per protein. In the DEqMS paper, they suggest that the best summarisation metric to use is the minimum value across the samples, so our count column is the minimum number of peptides per protein.
filtered_lfq_protein_long <- filtered_lfq_protein %>%
  exprs() %>%
  data.frame() %>%
  tibble::rownames_to_column('Master.Protein.Accessions') %>%
  pivot_longer(cols=-Master.Protein.Accessions, values_to='abundance', names_to='sample')

lfq_pep_res <- readRDS('results/lfq_pep_restricted.rds')

# Count the peptides per protein in each sample, then take the minimum across samples
min_pep_count <- camprotR::count_features_per_protein(lfq_pep_res) %>%
  merge(filtered_lfq_protein_long, by=c('Master.Protein.Accessions', 'sample')) %>%
  filter(is.finite(abundance)) %>% # We only want to consider samples with a ratio quantified
  group_by(Master.Protein.Accessions) %>%
  summarise(min_pep_count = min(n))

# Add the min peptide count, matching the protein order in the limma fit
limma_fit$count <- min_pep_count$min_pep_count[
  match(rownames(limma_fit$coefficients), min_pep_count$Master.Protein.Accessions)]
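The key points in the chunk above are taking the minimum count across samples and making sure the counts are in the same protein order as the fitted model. A minimal sketch on toy data, where the table, the protein names and `fit_proteins` are all hypothetical:

```r
# Hypothetical peptide counts per protein per sample
pep_counts <- data.frame(
  protein = rep(c("P2", "P1", "P3"), each = 2),
  sample  = rep(c("S1", "S2"), times = 3),
  n       = c(5, 3, 12, 9, 1, 2))

# Minimum peptide count across samples for each protein
min_count <- tapply(pep_counts$n, pep_counts$protein, min)

# Row order of a hypothetical fit object; index by name so the counts line up with it
fit_proteins <- c("P3", "P1", "P2")
count <- unname(min_count[fit_proteins])
count
```

Indexing by protein name rather than relying on row order guards against the summary table and the fit object being sorted differently.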
And now we run spectraCounteBayes from DEqMS to perform the statistical test.
# run DEqMS
efit_deqms <- suppressWarnings(spectraCounteBayes(limma_fit))
Below, we inspect the peptide count vs variance relationship which DEqMS is using in the statistical test.
In this case, the relationship between peptide count and variance is not clear at all. We press on regardless. As with the limma analysis, the variance will be shrunk towards a global mean rather than one informed by the number of peptides.
# Diagnostic plots
VarianceBoxplot(efit_deqms, n = 30, xlab = "Peptides")
Below, we summarise the number of proteins with statistically different abundance in RNase +/- and plot a 'volcano' plot to visualise this.
deqms_results <- outputResult(efit_deqms, coef_col=3)

table(deqms_results$sca.adj.pval<0.01)

deqms_results %>%
  ggplot(aes(x = logFC, y = -log10(sca.P.Value), colour = sca.adj.pval < 0.01)) +
  geom_point() +
  theme_camprot(border=FALSE, base_size=15) +
  scale_colour_manual(values = c('grey', get_cat_palette(2)[2]), name = 'RNase +/- Sig.') +
  labs(x = 'RNase +/- (Log2)', y = '-log10(p-value)')
We can compare the results of limma and DEqMS by considering the number of significant differences. Note that the limma results are also contained within the results from DEqMS. The $t, $P.Value and $adj.P.Val columns are from limma. The columns prefixed with sca are from DEqMS.
deqms_results %>%
  group_by(limma_sig=adj.P.Val<0.01, DEqMS_sig=sca.adj.pval<0.01) %>%
  tally()
We can compare the results of limma and DEqMS by considering the p-values. Note that here, they are very well correlated. This is because the methods failed to identify a strong trend between the mean abundance (limma) or number of peptides (DEqMS) and the variance. Thus, both shrunk the variance towards a global mean and similarly increased the effective degrees of freedom.
deqms_results %>%
  ggplot() +
  aes(P.Value, sca.P.Value) +
  geom_point() +
  geom_abline(slope=1) +
  theme_camprot(border=FALSE) +
  labs(x='limma p-value', y='DEqMS p-value') +
  scale_x_log10() +
  scale_y_log10()
Finally, at this point we can save any one of the data.frames containing the statistical test results, either to a compressed format (rds) to read back into a later R notebook, or a flat file (.tsv) to read with e.g. Excel.
# These lines are not run and are examples only
saveRDS(deqms_results, 'filename_to_save_to.rds')
write.table(deqms_results, 'filename_to_save_to.tsv', sep='\t', row.names=FALSE)
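As a runnable illustration of the two output formats, the sketch below writes a small made-up results table to temporary files and reads it back. The table contents and file locations are placeholders.

```r
# Hypothetical results table standing in for the real test results
res <- data.frame(protein = c("P1", "P2"),
                  logFC = c(-1.5, 0.2),
                  adj.P.Val = c(0.001, 0.8))

rds_file <- tempfile(fileext = ".rds")
tsv_file <- tempfile(fileext = ".tsv")

# rds keeps the R object exactly as it is; tsv is plain text
saveRDS(res, rds_file)
write.table(res, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read both back in
res_rds <- readRDS(rds_file)
res_tsv <- read.delim(tsv_file)
```

The rds route is the safer choice for passing results to a later notebook, since column types are preserved without any parsing.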
sessionInfo()