In wolski/prolfqua: Proteomics Label Free Quantification

knitr::opts_chunk$set(
  echo = FALSE,
  message = FALSE,
  warning = FALSE,
  fig.width = 5.5,
  fig.height = 3.5,
  fig.align = "center",
  fig.pos = "H"
)

#params <- prr
data <- eval(params$data)

local_project_conf <- params$project_conf
local_config <- eval(params$configuration)
if (!is.null(local_config$table$is_intensity_transformed)) {
  local_config <- prolfqua::old2new(local_config)
}




if (is.null(local_project_conf)) {
  local_project_conf <- list()
}

if (is.null(params$plot_density)) {
  params$plot_density <- TRUE
}

if (is.null(params$plot_sd_vs_mean)) {
  params$plot_sd_vs_mean <- FALSE
}
old.theme <- ggplot2::theme_set(ggplot2::theme_classic())

\clearpage

Introduction

Workunit: r local_project_conf$workunit_Id
Project: r local_project_conf$project_Id
Order : r local_project_conf$order_Id

You were asked to hand in 4 QC samples, to asses the biological, biochemical, and technical variability of your experiments. We did run your samples through the same analysis pipeline, which will be applied in the main experiment. This document summarizes the r if(params$pep){"peptide"}else{"protein"} variability to asses the reproducibility of the biological samples and estimates the sample sizes needed for the main experiment.

Quality Control: Identifications

Here we summarize the number of r if(params$pep){"peptides"}else{"proteins"} measured in the QC experiment. Depending on the type of your sample (e.g., pull-down, supernatant, whole cell lysate) we observe some dozens up to a few thousands of proteins r if(params$pep){", and between a few hundred to up to some few tens of thousands of peptides"}else{""}. While the overall number of r if(params$pep){"proteins and peptides"}else{"proteins"} can highly vary depending of the type of experiment, it is crucial that the number of r if(params$pep){"proteins and peptides"}else{"proteins"} between your biological replicates is similar (reproducibility).

library(rlang)
library(prolfqua)
lfqdata <- prolfqua::LFQData$new(data, local_config)

x <- lfqdata$hierarchy_counts()

prolfqua::table_facade( 
  data.frame(NR = x),
  caption = paste0("Nr of ",
                   if(params$pep){"proteins and peptides"}
                   else
                   {"proteins"} ,
                   " detected in all samples."))

nrSamples <- nrow(lfqdata$factors())
width = max(8, nrSamples*0.2)

(ref:hierarchyCountsSampleBarplot) Number of quantified r if(params$pep){"peptides"}else{"proteins"} per sample.

summarizer <- lfqdata$get_Summariser()
summarizer$plot_hierarchy_counts_sample()

x3 <- lfqdata$summarize_hierarchy()

Quality Control: Quantification

Summary of missing data

Ideally, we identify each r if(params$pep){"peptide"}else{"protein"} in all of the samples. However, because of the limit of detection (LOD) low-intensity r if(params$pep){"peptides"}else{"proteins"} might not be observed in all samples. Ideally, the LOD should be the only source of missingness in biological replicates. The following figures help us to verify the reproducibility of the measurement at the level of missing data.

height <- max(7, nrSamples * 0.1)

(ref:missingFigIntensityHistorgram) Top - intensity distribution of r if(params$pep){"peptides"}else{"proteins"} with 0, 1 etc. missing values. B - number of r if(params$pep){"peptides"}else{"proteins"} with 0, 1, 2 etc. missing value.

plotter <- lfqdata$get_Plotter()
p <- plotter$missigness_histogram()
xx3 <- summarizer$plot_missingness_per_group()
xx4 <- summarizer$plot_missingness_per_group_cumsum()
gridExtra::grid.arrange(p, xx3, xx4, ncol = 1)

(ref:missingnessHeatmap) Heatmap of missing r if(params$pep){"peptide"}else{"protein"} quantifications clustered by sample, black - missing intensities, white - present.

pNAmap <- plotter$NA_heatmap()
width <- max(8, nrSamples * 0.15)

print(pNAmap)

show_text <- !lfqdata$is_transformed()

cat("## Variablity of the raw intensities\n\n")

cat("Without applying any intensity scaling and data preprocessing, the peptide intensities in all samples should be similar. To asses this we plotted the distribution of the peptide intensities in the samples (Figure \\@ref(fig:plotDistributions)) as well as the distribution of the coefficient of variation CV for all peptides in the samples (Figure \\@ref(fig:intensityDistribution)). Table \\@ref(tab:printTable) summarises the CV.")

stats <- lfqdata$get_Stats()
if (params$plot_density) {
  p1 <- stats$density() +
    ggplot2::labs(tag = "A") +
    ggplot2::xlim(0, 150) #+
    #ggplot2::theme(legend.position = "none")
  p2 <- stats$density_median() +
    ggplot2::labs(tag = "B") +
    ggplot2::xlim(0, 150) +
    ggplot2::theme(legend.position = "bottom")
  gridExtra::grid.arrange(p1,p2)
} else {
  p1 <- stats$violin() + ggplot2::labs(tag='A')
  p1
  p2 <- stats$violin_median() + ggplot2::labs(tag='B')
  gridExtra::grid.arrange(p1,p2)
}

if (params$plot_sd_vs_mean) {
  p0 <- stats$stdv_vs_mean() + labs(tag='A')
}

cv_quantiles_res <- stats$stats_quantiles(probs = c(0.5, 0.6, 0.7, 0.8, 0.9))$wide

prolfqua::table_facade(cv_quantiles_res,
                         caption = "Summary of the coefficient of variation (CV) at the 50th, 60th, 70th, 80th and 90th percentile.")

if(params$plot_density){
  p0 <- plotter$intensity_distribution_density() + 
    ggplot2::theme(legend.text = ggplot2::element_text(size=5))
  p0
} else{
  p1 <- plotter$intensity_distribution_violin()
  p1
}

Variability of transformed intensities

We $\log_2$ transformed and applied the prolfqua::robust_scale() transformation to the data. This transformation transforms and scales the data to reduce the variance (Figure \@ref(fig:plotTransformedIntensityDistributions)). Because of this, we can't report CV anymore but report standard deviations (sd). Figure \@ref(fig:sdviolinplots) shows the distribution of the r if(params$pep){"peptide"}else{"protein"} standard deviations while Figure \@ref(fig:sdecdf) shows the empirical cumulative distribution function (ecdf). Table \@ref(tab:printSDTable) summarises the sd. The heatmap in Figure \@ref(fig:correlationHeat) envisages the correlation between the QC samples.

if (show_text) {
  tr <- lfqdata$get_Transformer()
  tr <- tr$log2()
  tr <- tr$robscale()
  dataTransformed <- tr$lfq
} else{
  dataTransformed <- lfqdata
}

(ref:plotTransformedIntensityDistributions) r if(params$pep){"Peptide"}else{"Protein"} intensity distribution after transformation.

plotter <- dataTransformed$get_Plotter()

if (params$plot_density) {
  plotter$intensity_distribution_density() + 
    ggplot2::theme(legend.text = ggplot2::element_text(size = 5))
} else {
  plotter$intensity_distribution_violin()
}

(ref:correlationHeat) Heatmap of r if(params$pep){"peptide"}else{"protein"} intensity correlation between samples.

chmap <- plotter$heatmap_cor()
width <- max(8, nrSamples * 0.15)

print(chmap)

plotter$pairs_smooth(max = 10)

(ref:sdviolinplots) Visualization of r if(params$pep){"peptide"}else{"protein"} standard deviations. A) all. B) - for low (bottom 50) and high intensity (top 50).

st <- dataTransformed$get_Stats()

if (params$plot_density) {
  p1 <- st$density()
    ggplot2::labs(tag = "A") +
    ggplot2::theme(legend.position = "none")
  p2 <-
    st$density_median() +
    ggplot2::labs(tag = "B") +
    ggplot2::theme(legend.position = "bottom")
  gridExtra::grid.arrange(p1, p2)
} else {
  p1 <- st$violin() + 
    ggplot2::labs(tag = "A")
  p2 <- st$violin_median()  + 
    ggplot2::labs(tag = "B")
  gridExtra::grid.arrange(p1, p2)
}

(ref:sdecdf) Visualization of r if(params$pep){"peptide"}else{"protein"} standard deviations as empirical cumulative distribution function. A) all. B) - for low (bottom 50) and high intensity (top 50).

  p1 <- st$density(ggstat = "ecdf")
    ggplot2::labs(tag = "A") +
    ggplot2::theme(legend.position = "none")
  p2 <-
    st$density_median(ggstat = "ecdf") +
    ggplot2::theme(legend.position = "bottom")
  gridExtra::grid.arrange(p1, p2)

if (params$plot_sd_vs_mean) {
  st$stdv_vs_mean()
}

sd_quantile_res2 <- st$stats_quantiles(probs = c(0.5, 0.6, 0.7, 0.8, 0.9))$wide

prolfqua::table_facade(sd_quantile_res2,
  caption = "Summary of the distribution of standard deviations at the 50th, 60th, 70th, 80th and 90th percentile.")

(ref:overviewHeat) Sample and r if(params$pep){"peptide"}else{"protein"} Heatmap.

hm <- plotter$heatmap()
width = max(10, nrSamples * 0.15)

print(hm)

Sample Size Calculation

In the previous section, we estimated the r if(params$pep){"peptide"}else{"protein"} variance using the QC samples. Figure \@ref(fig:sdviolinplots) shows the distribution of the standard deviations. We are using this information, as well as some typical values for the size and the power of the test to estimate the required sample sizes for your main experiment.

An important factor in estimating the sample sizes is the smallest effect size (difference) you are interested in detecting between two conditions, e.g. a reference and a treatment. Smaller biologically significant effect sizes require more samples to obtain a statistically significant result. Typical $log_2$ fold change thresholds are $0.59, 1, 2$ which correspond to a fold change of $1.5, 2, 4$.

Table \@ref(tab:sampleSize) and Figure \@ref(fig:figSampleSize) summarizes how many samples are needed to detect a fold change of $0.5, 1, 2$ at a confidence level of $95\%$ and power of $80\%$, for $50, 60, 70, 80$ and $90\%$ percent of the measured r if(params$pep){"peptides"}else{"proteins"}.

(ref:figSampleSize) Graphical representation of the sample size needed to detect a log fold change greater than delta with a significance level of $0.05$ and power 0.8 when using a t-test to compare means, in $X\%$ of r if(params$pep){"peptides"}else{"proteins"} (x - axis).

sampleSize2 <- st$power_t_test_quantiles(probs = c(0.5,0.75), delta = c(0.59,1,2))
nudgeval <- max(sampleSize2$N) * 0.05

sampleSize2 <- sampleSize2 |> dplyr::mutate(delta = paste0("delta = ", delta))

#stats_res <- summarize_stats(dataTransformed$data, dataTransformed$config)
#xx <- summarize_stats_quantiles(stats_res, local_config,probs = c(0.5,0.75))
#lfq_power_t_test_quantiles_V2(xx$long,delta = c(0.59,1,2))

ggplot2::ggplot(sampleSize2, ggplot2::aes(x = probs, y = N)) +
  ggplot2::geom_bar(stat = "identity", color = "black", fill = "white") +
  ggplot2::geom_text(ggplot2::aes(label = N), nudge_y = nudgeval) + 
  ggplot2::facet_wrap(as.formula(paste0("~ ", "delta + " , paste0( lfqdata$config$table$factor_keys_depth(), collapse = " + " ))))

tmp <- sampleSize2 |> tidyr::pivot_wider(id_cols = c("probs","sdtrimmed", lfqdata$config$table$factor_keys_depth()), names_from =  delta, values_from = N  )

prolfqua::table_facade(tmp,
  caption = "Sample size needed to detect a difference log fold change greater than delta with a significance level of 0.05 and power 0.8 when using a t-test to compare means.")

The power of a test is $1-\beta$, where $\beta$ is the probability of a Type 2 error (failing to reject the null hypothesis when the alternative hypothesis is true). In other words, if you have a $20\%$ chance of failing to detect a real difference, then the power of your test is $80\%$.

The confidence level is equal to $1 - \alpha$, where $\alpha$ is the probability of making a Type 1 Error. That is, alpha represents the chance of a falsely rejecting $H_0$ and picking up a false-positive effect. Alpha is usually set at $5\%$ significance level, for a $95\%$ confidence level.

Fold change: Suppose you are comparing a treatment group to a placebo group, and you will be measuring some continuous response variable which, you hypothesize, will be affected by the treatment. We can consider the mean response in the treatment group, $\mu_1$, and the mean response in the placebo group, $\mu_2$. We can then define $\Delta = \mu_1 - \mu_2$ as the mean difference. The smaller the difference you want to detect, the larger the required sample size.

Appendix

prolfqua::table_facade(
  dataTransformed$factors(),
  caption = "Mapping of raw file names to sample names used throughout this report.")

caption <- paste("Number of quantified ",
                 if (params$pep) { "peptides and proteins" } else {"peptides" }, " per sample.")

st <- dataTransformed$get_Summariser()
prolfqua::table_facade(
  st$hierarchy_counts_sample()
  , caption = caption)