knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
First we load the package:
library(pair)
pair
pair
comes with two datasets, ups
and yeast
published in @berg2019evaluation (details can be found there).
The ups
data is a global proteomics dataset without reducing peptides to proteins with Chlamydomonas reinhardtii (chlamy) peptides at the same concentration in all samples spiked-in with four different concentrations of the Universal Proteomics Standard Set 1 (UPS1).
The yeast
data was generated by extracting reversibly oxidized cysteines in yeast or chlamy where the yeast extraction was added in two different concentrations and chlamy in the same in all samples.
The yeast data is therefore a lot noisier than the ups
.
The yeast data looks as follows:
yeast
where the identifier
is a column that uniquely maps each cysteine to the observations in ng50_*
or ng100_*
(which are the relative concentrations the yeast cysteines were spiked in with) and the integer at the end indicates different repeated experiments.
The ups data looks as follows:
ups
where the identifier
is as in the yeast dataset but uniquely maps peptides and fmol25_*
, fmol50_*
, and fmol100_*
are different relative concentrations of UPS1 with different technical replicates.
The first step is to normalize the data.
The function plot_norm_box
generates boxplots of the raw data after log$_{2}$ transformation and two different normalization methods. If we first look at the yeast data, then the plot can be generated as follows:
plot_norm_box(yeast, 'identifier')
The data is already has a stable median, but we still do normalization.
This can either be done tmm
as introduced in @robinson2010scaling or with psrn
as introduced in @anders2010differential.
To exemplify we use psrn
and tmm
as follows:
yeast_psrn <- psrn(yeast, 'identifier') yeast_tmm <- tmm(yeast)
Both psrn
and tmm
first normalizes the data and then log$_{2}$ transforms it unless the flag log=FALSE
is used.
For the ups
data we do the same thing:
plot_norm_box(ups, 'identifier')
we might notice that the x-axis labels are overlapping.
Since plot_norm_box
returns a ggplot
object, one can easily rotate the label by using the +
syntax and ggplot2::theme
as follows:
plot_norm_box(ups, 'identifier') + ggplot2::theme( axis.text.x = ggplot2::element_text(angle = 15) )
in this way, any theme of the plot can easily be modified using ggplot2
syntax.
One could also use the plot_target
feature of plot_norm_box
to select only a subset of samples to plot:
plot_norm_box(ups, 'identifier', plot_target = contains('fmol25'))
which supports the tidyselect
package operations for selecting columns.
Likewise, tmm
and psrn
have the flag target
which can be used to select only a subset of columns to normalize.
If left unspecified they will use all numerical columns in the normalization!
They also contain the flag load_info
which then makes the functions return a list with the normalized data, the estimated loading concentrations, and the estimated normalization factors used.
We then normalize the ups
the same way as the yeast
:
ups_psrn <- psrn(ups, 'identifier') ups_tmm <- tmm(ups)
From here on out, most functions in pair
will depend on a design matrix.
A design matrix can be created as follows:
yeast_design <- model.matrix( #Yeast has two conditions (1:2) and three replicates (each = 3) ~0+factor(rep(1:2, each = 3)) ) ups_design <- model.matrix( #ups has three conditions (1:3) and four replicates (each = 4) ~0+factor(rep(1:3, each = 4)) ) #The design matrices has to have the same name as the conditions: colnames(yeast_design) <- paste0('ng', c(50, 100)) colnames(ups_design) <- paste0('fmol', c(25, 50, 100))
The gamma regression and mean-variance scatter plots can then be produced by plot_gamma_regression
.
plot_gamma_regression
plots the trends used for precision weights (later fed to limma
's lmFit
similar to voom
; @law2014voom) on the left and gamma regression used to infer the standard deviation given the mean (used for imputation) on the right.
For yeast
:
plot_gamma_regression(yeast_psrn, yeast_design, 'identifier')
and for ups
:
plot_gamma_regression(ups_psrn, ups_design, 'identifier')
The two individual plots can be generated similarly by calling plot_gamma_weights
or plot_gamma_imputation
.
If one wants to inspect the estimated models that can be done similarly with fit_gamma_regressions
, fit_gamma_imputation
, or fit_gamma_weights
.
To perform multiple imputations and then run limma on each imputation can efficiently be performed by run_pipeline
.
run_pipeline
fits the needed gamma regressions and has a backend to multidplyr
for parallel processing.
This can drastically increase the speed when running 1000 or more imputations.
Here is an example of the yeast
data running 1000 imputations and using 5 parallel workers.
First, we need to generate a contrast matrix that indicates what comparisons to perform.
yeast_contrast <- limma::makeContrasts( contrasts = 'ng100-ng50', levels = yeast_design )
This performs the test if the mean of the cysteines in the ng100
conditions is different from the means in the ng50
condition.
Further, it also means that the LFC will be calculated as ng100 - ng50
.
We are now ready to run the imputation.
yeast_multi_imputation_results <- run_pipeline( data = yeast_psrn, design = yeast_design, contrast_matrix = yeast_contrast, imputations = 1000, workers = 5, id_col = 'identifier', plot_trend = FALSE # we have already looked at the mean-variance plots ) # otherwise, putting this flag to TRUE would generate those plots
yeast_multi_imputation_results
will now be a tibble
with each row containing one imputation and one decision from limma.
yeast_multi_imputation_results
If one wants to access one particular imputation or decision that can be done with $
and [[
as follows.
yeast_multi_imputation_results$imputed_data[[5]] yeast_multi_imputation_results$limma_results[[5]]
which shows the results of the fifth imputation.
Next to perform p-value correction and binomial testing for the imputations we use extract_results
.
yeast_reduced_results <- extract_results( data = yeast_psrn, results = yeast_multi_imputation_results, id_col = 'identifier', alpha = 0.05, abs_lfc = 1, pcor = 'fdr', null_hyp = 0.5 )
To perform a single imputation one can use the single_imputation
function.
yeast_one_imputation <- single_imputation(yeast_psrn, yeast_design, 'identifier')
Next, we fit the gamma regression for estimating the precision weights with fit_gamma_weights
.
yeast_gam_reg <- fit_gamma_weights(yeast_psrn, yeast_design, 'identifier')
We now run a single instance of limma
and calculate the LFC with run_limma_and_lfc
.
yeast_one_imputation_results <- run_limma_and_lfc( yeast_one_imputation, yeast_design, yeast_contrast, yeast_gam_reg, 'identifier' )
One might want to perform some p-value correction on this data.
This can be one with R
's function p.adjust
.
Further, since run_limma_and_lfc
returns the data in tidy format and if there are several comparisons, dplyr
's group_by
and mutate
can performe one correction per conditions as follows:
yeast_one_imputation_results <- yeast_one_imputation_results %>% dplyr::group_by(comparison) %>% dplyr::mutate( p_val = p.adjust(p_val, 'fdr') )
The final decision for the analysis can be plotted with plot_ma
.
plot_ma
returns a ggplot
object so it can easily be ascetically adjusted using ggplot
syntax.
plot_ma(yeast_reduced_results) + # Move legend to below the plot ggplot2::theme( legend.position = 'bottom' )
For the single imputation, we need to include a few more inputs to plot_ma
.
plot_ma( hits = yeast_one_imputation_results, data = yeast_one_imputation, id_col = 'identifier', alpha = .05, abs_lfc = 0.5 ) + # Set color-blind friendly colors ggplot2::scale_color_manual( values = c('grey45', 'darkblue', 'darkred') )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.