In PhilipBerg/PaiR: Imputation and Significance Analysis of Proteomics Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

First we load the package:

library(pair)

Datasets included in `pair`

pair comes with two datasets, ups and yeast published in @berg2019evaluation (details can be found there). The ups data is a global proteomics dataset without reducing peptides to proteins with Chlamydomonas reinhardtii (chlamy) peptides at the same concentration in all samples spiked-in with four different concentrations of the Universal Proteomics Standard Set 1 (UPS1). The yeast data was generated by extracting reversibly oxidized cysteines in yeast or chlamy where the yeast extraction was added in two different concentrations and chlamy in the same in all samples. The yeast data is therefore a lot noisier than the ups. The yeast data looks as follows:

yeast

where the identifier is a column that uniquely maps each cysteine to the observations in ng50_* or ng100_* (which are the relative concentrations the yeast cysteines were spiked in with) and the integer at the end indicates different repeated experiments. The ups data looks as follows:

ups

where the identifier is as in the yeast dataset but uniquely maps peptides and fmol25_*, fmol50_*, and fmol100_* are different relative concentrations of UPS1 with different technical replicates.

Data normalization

The first step is to normalize the data. The function plot_norm_box generates boxplots of the raw data after log$_{2}$ transformation and two different normalization methods. If we first look at the yeast data, then the plot can be generated as follows:

plot_norm_box(yeast, 'identifier')

The data is already has a stable median, but we still do normalization. This can either be done tmm as introduced in @robinson2010scaling or with psrn as introduced in @anders2010differential. To exemplify we use psrn and tmm as follows:

yeast_psrn <- psrn(yeast, 'identifier')
yeast_tmm <- tmm(yeast)

Both psrn and tmm first normalizes the data and then log$_{2}$ transforms it unless the flag log=FALSE is used. For the ups data we do the same thing:

plot_norm_box(ups, 'identifier')

we might notice that the x-axis labels are overlapping. Since plot_norm_box returns a ggplot object, one can easily rotate the label by using the + syntax and ggplot2::theme as follows:

plot_norm_box(ups, 'identifier') +
  ggplot2::theme(
    axis.text.x = ggplot2::element_text(angle = 15)
  )

in this way, any theme of the plot can easily be modified using ggplot2 syntax. One could also use the plot_target feature of plot_norm_box to select only a subset of samples to plot:

plot_norm_box(ups, 'identifier', plot_target = contains('fmol25'))

which supports the tidyselect package operations for selecting columns. Likewise, tmm and psrn have the flag target which can be used to select only a subset of columns to normalize. If left unspecified they will use all numerical columns in the normalization! They also contain the flag load_info which then makes the functions return a list with the normalized data, the estimated loading concentrations, and the estimated normalization factors used. We then normalize the ups the same way as the yeast:

ups_psrn <- psrn(ups, 'identifier')
ups_tmm <- tmm(ups)

Gamma regression for imputation and mean-variance normalization

From here on out, most functions in pair will depend on a design matrix. A design matrix can be created as follows:

yeast_design <- model.matrix(
  #Yeast has two conditions (1:2) and three replicates (each = 3)
  ~0+factor(rep(1:2, each = 3))
)
ups_design <- model.matrix(
  #ups has three conditions (1:3) and four replicates (each = 4)
  ~0+factor(rep(1:3, each = 4))
)
#The design matrices has to have the same name as the conditions:
colnames(yeast_design) <- paste0('ng', c(50, 100))
colnames(ups_design) <- paste0('fmol', c(25, 50, 100))

The gamma regression and mean-variance scatter plots can then be produced by plot_gamma_regression. plot_gamma_regression plots the trends used for precision weights (later fed to limma's lmFit similar to voom; @law2014voom) on the left and gamma regression used to infer the standard deviation given the mean (used for imputation) on the right. For yeast:

plot_gamma_regression(yeast_psrn, yeast_design, 'identifier')

and for ups:

plot_gamma_regression(ups_psrn, ups_design, 'identifier')

The two individual plots can be generated similarly by calling plot_gamma_weights or plot_gamma_imputation. If one wants to inspect the estimated models that can be done similarly with fit_gamma_regressions, fit_gamma_imputation, or fit_gamma_weights.

Data imputation and significance calling

Multiple imputation

To perform multiple imputations and then run limma on each imputation can efficiently be performed by run_pipeline. run_pipeline fits the needed gamma regressions and has a backend to multidplyr for parallel processing. This can drastically increase the speed when running 1000 or more imputations. Here is an example of the yeast data running 1000 imputations and using 5 parallel workers. First, we need to generate a contrast matrix that indicates what comparisons to perform.

yeast_contrast <- limma::makeContrasts(
  contrasts = 'ng100-ng50',
  levels = yeast_design
)

This performs the test if the mean of the cysteines in the ng100 conditions is different from the means in the ng50 condition. Further, it also means that the LFC will be calculated as ng100 - ng50. We are now ready to run the imputation.

yeast_multi_imputation_results <- run_pipeline(
  data = yeast_psrn,
  design = yeast_design,
  contrast_matrix = yeast_contrast,
  imputations = 1000,
  workers = 5,
  id_col = 'identifier',
  plot_trend = FALSE # we have already looked at the mean-variance plots
)                    # otherwise, putting this flag to TRUE would generate those plots

yeast_multi_imputation_results will now be a tibble with each row containing one imputation and one decision from limma.

yeast_multi_imputation_results

If one wants to access one particular imputation or decision that can be done with $ and [[ as follows.

yeast_multi_imputation_results$imputed_data[[5]]
yeast_multi_imputation_results$limma_results[[5]]

which shows the results of the fifth imputation. Next to perform p-value correction and binomial testing for the imputations we use extract_results.

yeast_reduced_results <- extract_results(
  data = yeast_psrn,
  results = yeast_multi_imputation_results,
  id_col = 'identifier',
  alpha = 0.05,
  abs_lfc = 1,
  pcor = 'fdr',
  null_hyp = 0.5
)

Single imputation

To perform a single imputation one can use the single_imputation function.

yeast_one_imputation <- single_imputation(yeast_psrn, yeast_design, 'identifier')

Next, we fit the gamma regression for estimating the precision weights with fit_gamma_weights.

yeast_gam_reg <- fit_gamma_weights(yeast_psrn, yeast_design, 'identifier')

We now run a single instance of limma and calculate the LFC with run_limma_and_lfc.

yeast_one_imputation_results <- run_limma_and_lfc(
  yeast_one_imputation,
  yeast_design, 
  yeast_contrast, 
  yeast_gam_reg, 
  'identifier'
)

One might want to perform some p-value correction on this data. This can be one with R's function p.adjust. Further, since run_limma_and_lfc returns the data in tidy format and if there are several comparisons, dplyr's group_by and mutate can performe one correction per conditions as follows:

yeast_one_imputation_results <- yeast_one_imputation_results %>% 
  dplyr::group_by(comparison) %>% 
  dplyr::mutate(
    p_val = p.adjust(p_val, 'fdr')
  )

Plotting the results

Multiple imputation

The final decision for the analysis can be plotted with plot_ma. plot_ma returns a ggplot object so it can easily be ascetically adjusted using ggplot syntax.

plot_ma(yeast_reduced_results) + # Move legend to below the plot
  ggplot2::theme(
    legend.position = 'bottom'
  )

Single imputation

For the single imputation, we need to include a few more inputs to plot_ma.

plot_ma(
  hits = yeast_one_imputation_results,
  data = yeast_one_imputation,
  id_col = 'identifier',
  alpha = .05,
  abs_lfc = 0.5
) + # Set color-blind friendly colors
  ggplot2::scale_color_manual(
    values = c('grey45', 'darkblue', 'darkred')
  )