knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Purpose

This vignette demonstrates how the package can be used to create descriptive statistics or quality control plots for a dataset. First, we read a "proteinGroups.txt" file from the MaxQuant software, and next, we specify a prolfqua configuration. Afterward, the data is transformed and normalized, and plots describing the data are generated. Finally, we estimate how many samples are necessary to properly quantify a two-fold change for $90%$ of all the proteins with a power of $0.8$.

Loading data

Generate Simulated data. To learn how to create prolfqua LFQData objects see the vignette: creating configurations.

simdata <- prolfqua::sim_lfq_data_peptide_config()
lfqdata <- prolfqua::LFQData$new(simdata$data,simdata$config)
lfqdata$remove_small_intensities()

You can convert the data into a data frame in a wide format, where the intensities of each sample occupy their columns.

lfqdata$to_wide()$data[1:3,1:8]

Visualization of not normalized data

Next we show how the data is distributed before transformation and normalization.

lfqplotter <- lfqdata$get_Plotter()
lfqplotter$intensity_distribution_density()

Visualization of missing data

Often it also makes sense to look into the part of the data that is not quantified in all samples. Therefore several functions for the missingess are available. In the NA_heatmap function it is possible to see if some missing proteins are specific for one particular dilution or if they appear randomly in all samples.

nah <- lfqplotter$NA_heatmap()

Prints missing heatmap:

nah

Also in the next two figures we show how many missing values are found in the respective conditions and how many proteins do not have missing data and how complete the measurements are for each group.

lfqdata$get_Summariser()$plot_missingness_per_group()
lfqdata$get_Summariser()$plot_missingness_per_group_cumsum()

In the missigness_histogram we can check if there is a dependency between NAs (and the number of NAs for earch protein) with respect to the protein intensity.

lfqplotter$missigness_histogram()

Computing standard deviations, mean and CV.

Other important statistics that can be easily calculated from the object are coefficient of variation, means and standard deviations.

stats <- lfqdata$get_Stats()
prolfqua::table_facade( stats$stats_quantiles()$wide, paste0("quantile of ",stats$stat ))
stats$density_median()

In the next figure we show the dependency of the standard deviation with respect to the mean intensitiy of the proteins.

stdm_raw <- stats$stdv_vs_mean(size = 10000) +
  ggplot2::scale_x_log10() +
  ggplot2::scale_y_log10()
stdm_raw

Normalize protein intensities

Next, we want to normalize the data by first $\log_2$ transforming it and then z-scaling. The $log_2$ stabilizes the variance, while the z-scaling removes systematic differences from the samples.

lt <- lfqdata$get_Transformer()
transformed <- lt$log2()$robscale()$lfq
transformed$config$table$is_response_transformed

Again we want to look into the distribution of the intensities in our samples after normalization.

pl <- transformed$get_Plotter()
pl$intensity_distribution_density()

plots matrix scatter plot in the upper right matrix and some statistics in the lower left matrix.

pl$pairs_smooth()

plots protein heatmap:

p <- pl$heatmap() 
p 

We can also look at the correlation among the samples or look at a PCA plot.

hc <- pl$heatmap_cor()
hc
pl$pca()

prints the standard deviations:

stats <- transformed$get_Stats()
prolfqua::table_facade(stats$stats_quantiles()$wide, "Standard deviations")

plots density of the standard deviation

stats$density_median()

Check for heteroskedasticity. After transformation, the standard deviation should be independent of the mean intensity.

stdm_trans <- stats$stdv_vs_mean(size = 10000) + ggplot2::scale_x_log10() + ggplot2::scale_y_log10()
stdm_trans
#gridExtra::grid.arrange(stdm_raw, stdm_trans, nrow=1)

Estimate sample size

For the sample size estimation we will now focus on only one condition and the different bio-reps from the a group In this condition all samples have the same concentration and we want to check variability within this group. Like this we can estimate how many samples are necessary per group to quantify properly a two-fold change (delta of 1) for 90% of all the proteins with a power of 0.8 and significance level of 0.05.

First, we need to filter our data for group_ A only.

transformedA <- transformed$get_copy()
transformedA$data <- transformedA$data |> dplyr::filter(group_  == "A")
stats <- transformedA$get_Stats()
stats$density(ggstat = "ecdf")

The table indicates that for this kind of data - if a two-fold change for 90% of all the proteins should be accurately quantified with a power of 0.8 (at 0.05 significance level) one would need to have 10 samples in each group.

sampleSize <- stats$power_t_test_quantiles() |> 
  dplyr::filter(group_ != "All")
prolfqua::table_facade(sampleSize, "Sample sizes. delta - Effect size, N - samplesize")

The table summarises visually how many samples are needed for different expected differences (in log2-scale) and different portions of all proteins.

sampleSize |>
  ggplot2::ggplot(ggplot2::aes(x = factor(probs) , y = N)) +
  ggplot2::facet_wrap(~delta) +
  ggplot2::geom_bar(stat = "identity")

It is also possible to get the statistics for each protein in each group. The information includes group size, number of observations, standard deviation, variance, mean.

stats$stats() |> head()

You can also estimate the sample size needed to perform differential expression analysis for each protein, given the power, delta, and sig.level. The next figure shows the distribution of needed sample sizes for all proteins with a power of 0.8 at signifiance level 0.05 for a two fold change (delta = 1).

x <- stats$power_t_test(delta = 1,power = 0.8, sig.level = 0.05)
x <- x |> dplyr::filter(group_ == "A") |> dplyr::arrange(desc(N),protein_Id)
prolfqua::table_facade(x[1:7,], caption = "Sample size for each protein")
x |> ggplot2::ggplot(ggplot2::aes(x = N)) +
  ggplot2::geom_histogram() +
  ggplot2::facet_wrap(~delta) +
  ggplot2::xlim(0,100)

The prolfqua package is described in [@Wolski2022.06.07.494524].

Session Info

sessionInfo()

References



wolski/prolfqua documentation built on Oct. 31, 2024, 9:22 a.m.