print(params)
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE,
                      fig.path = params$output_figure)
library(dplyr)
library(stringr)
library(MassExpression)
library(plotly)

CompleteIntensityExperiment <- params$listInt$CompleteIntensityExperiment
IntensityExperiment <- params$listInt$IntensityExperiment
design <- colData(IntensityExperiment)

comparisonExperiments <-
    listComparisonExperiments(CompleteIntensityExperiment)

This QC report is designed to help scientists quickly assess several aspects of experiment quality. There are 5 categories in the QC report:

  1. Experiment Design and Summary of Results: overview of the experiment design used in the analysis and summary of differential expression results.
  2. Experiment Health: dimensionality reduction plots using all and differentially expressed (DE) proteins, the distribution of the coefficients of variation by condition, and sample correlations using all and DE proteins.
  3. Feature Completeness: the number of missing values by sample and by protein.
  4. Identifications: the number of non-missing measurements in each replicate.
  5. Normalisation and Imputation: the intensity distributions of raw, normalised (when requested) and imputed intensities.

\clearpage

1. Experiment Design and Summary of Results

design <- as_tibble(design)
design$SampleName <- make.names(design$SampleName)
design <- design %>% tidyr::unite(SampleNameInPlots, Condition, Replicate,
                                                      sep="_", remove=FALSE) %>%
  dplyr::select(-Replicate, everything())

design <- design %>% dplyr::select(SampleName, Condition,
                SampleNameInPlots, everything())


knitr::kable(design, row.names = FALSE)

\clearpage

The table below reports the number of proteins (Number of Proteins column) considered for each pairwise comparison analysis (proteins with more than 50% missing values across samples are removed) and the number of differentially expressed (DE) proteins detected in each comparison (Number DE Proteins column). A protein is defined as DE if its adjusted p-value is less than 0.05. No threshold is applied to the log ratio when defining a protein as DE. The overall total number of proteins included in the experiment is `r nrow(rowData(CompleteIntensityExperiment))`.

stats_one_comp <- function(se){
  stats <- as_tibble(rowData(se))

  total_proteins_in_experiment <- nrow(stats)
  total_proteins_de <- sum(stats$ADJ.PVAL < 0.05)
  list(total_proteins_in_experiment=total_proteins_in_experiment, total_proteins_de=total_proteins_de)
}

n_proteins <- sapply(seq_along(comparisonExperiments), function(exp) stats_one_comp(comparisonExperiments[[exp]]))
names_experiments <- names(comparisonExperiments)
colnames(n_proteins) <- names_experiments
n_proteins <- data.frame(t(n_proteins))
n_proteins$Comparison <- rownames(n_proteins)
rownames(n_proteins) <- NULL
n_proteins <- n_proteins %>% dplyr::rename(`Number of Proteins` = total_proteins_in_experiment,
                                           `Number DE Proteins` = total_proteins_de) %>%
  dplyr::select(Comparison, `Number of Proteins`, `Number DE Proteins`)

knitr::kable(n_proteins, row.names = FALSE, caption="Summary of differential expression results across comparisons.")

2. Experiment Health

A. Dimensionality reduction {.tabset .tabset-fade .tabset-pills}

Principal Component Analysis details

The Principal Component Analysis (PCA) plot is used to visualise differences between samples that are induced by their intensity profiles. PCA transforms high-dimensional data, such as thousands of measured protein or peptide intensities, into a reduced set of dimensions. The first two dimensions explain the greatest variability between the samples, and they are a useful visual tool to confirm known clustering of the samples or to identify potential problems in the data.

This section displays two PCA plots:

  • Using all intensities (imputed and normalised, when requested) used for the differential expression (DE) analysis.
  • Including only differentially expressed (DE) proteins. DE proteins are defined as those with an adjusted p-value < 0.05, where the p-value comes from the limma ANOVA test, which tests for differences across all categories of the condition of interest jointly. At least 5 DE proteins are required to produce this plot.

For a healthy experiment we expect:

  • Technical replicates to cluster tightly together.
  • Biological replicates to cluster more tightly than non-replicates.
  • Clustering by the condition of interest to be visible.

If unexpected clusters occur or replicates don't cluster together, this can be due to extra variability introduced by factors such as technical processing, other unexplored biological differences, sample swaps, etc. The interpretation of, and trust in, the differential expression results should take these considerations into account. If you think that the samples in the experiment show largely unexpected patterns, it is advisable to request support from an analyst.

A scree plot shows the amount of variance explained by each dimension extracted by PCA. A high degree of variance in the first few dimensions may suggest large differences between your samples.
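As a rough illustration of what the PCA and scree plots summarise, the sketch below runs PCA on a small simulated matrix of log2 intensities with base R's `prcomp` and computes the proportion of variance explained by each dimension. The matrix `log_intensities`, the sample names and the simulated condition effect are made up for this example; the report itself produces these plots with `plot_chosen_pca_experiment`, whose internals may differ.

```r
set.seed(1)

# Hypothetical data: 200 proteins (rows) x 6 samples (columns), log2 scale.
sample_names <- paste0(rep(c("A", "B"), each = 3), "_", 1:3)
log_intensities <- matrix(rnorm(200 * 6, mean = 20, sd = 1), nrow = 200,
                          dimnames = list(NULL, sample_names))

# Simulate a condition effect: shift the first 50 proteins in condition B.
log_intensities[1:50, 4:6] <- log_intensities[1:50, 4:6] + 3

# prcomp expects observations in rows, so the samples are moved to the rows.
pca <- prcomp(t(log_intensities), center = TRUE, scale. = FALSE)

# Proportion of variance explained by each dimension (the scree plot values).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Sample coordinates on the first two dimensions (the PCA plot values).
scores <- pca$x[, 1:2]
scores
```

In this simulated case the two conditions separate along the first dimension, which is the pattern described above for a healthy experiment.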

## PCA
p <- plot_chosen_pca_experiment(CompleteIntensityExperiment, format = params$format)
p[[1]]
p[[2]]
p <- plot_chosen_pca_experiment(CompleteIntensityExperiment,
                                format = params$format,
                                auto_select_features = "de")

if(is.null(p[[1]])){
  text <- "Not enough differentially expressed proteins to produce a PCA plot."
}

`r if(is.null(p[[1]])) print(text)`

B. Quantitative values CV distributions {.tabset .tabset-fade .tabset-pills}

Coefficient of Variation details

The Coefficient of Variation (CV), or Relative Standard Deviation, is calculated as the ratio of the standard deviation to the mean. It is used to measure the precision of a measurement, in this case protein/peptide intensity. The plot below shows the distribution of the CVs by experimental condition, where each CV is calculated by protein and by experimental condition. The CV is displayed as %CV, which is the standard deviation expressed as a percentage of the mean.

For a healthy experiment we expect:

  • The distribution of the CVs across conditions to be mostly overlapping, e.g. similar modes
  • The modes of the CVs not to be too high, ideally not above 50%

If the distributions show worryingly large %CV, this could affect the quality of the differential expression analysis.
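The %CV calculation can be sketched in a few lines of base R. The data frame `intensities` and its values are hypothetical, and the sketch assumes the CV is computed on raw-scale intensities; the report obtains its values from `plot_condition_cv_distribution`, which may handle missing values or scaling differently.

```r
# Hypothetical long-format data: one row per protein, condition and replicate.
intensities <- data.frame(
  ProteinId = rep(c("P1", "P2"), each = 6),
  Condition = rep(rep(c("Control", "Treated"), each = 3), times = 2),
  Intensity = c(100, 110, 90, 200, 260, 140,
                50, 50, 50, 10, 20, 30)
)

# CV = standard deviation / mean, ignoring missing values.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# One CV per protein and per condition, expressed as a percentage.
cv_by_protein <- aggregate(Intensity ~ ProteinId + Condition,
                           data = intensities, FUN = cv)
names(cv_by_protein)[3] <- "cv"
cv_by_protein$cv_percent <- 100 * cv_by_protein$cv

# Median %CV per condition, comparable to the summary table in this section.
median_cv <- aggregate(cv_percent ~ Condition, data = cv_by_protein, FUN = median)
median_cv
```

For example, protein P1 in the Control condition has mean 100 and standard deviation 10, so its %CV is 10%.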

p <- plot_condition_cv_distribution(IntensityExperiment)

if (params$format == "pdf"){
  p[[1]]
} else {
  ggplotly(p[[1]]) %>% plotly::config(displayModeBar = TRUE,
                        modeBarButtons = list(list('toImage')),
                        displaylogo = FALSE)
}
cv_data <- p[[2]]
cv_data <- as_tibble(cv_data) %>% group_by(Condition) %>%
  summarise(`Median CV %` = round(median(cv, na.rm = TRUE)*100))

knitr::kable(cv_data, row.names = FALSE)

C. Sample correlations

Correlation plots details

The correlation plot shows the Pearson correlation between the samples in the experiment. Hierarchical clustering is used to order the samples in the matrix. Placing highly correlated samples next to each other aids the visual inspection of similarity between samples.

This section displays two correlation plots:

  • Using all intensities (imputed and normalised, when requested) used for the DE analysis.
  • Including only DE proteins. DE proteins are defined as those with an adjusted p-value < 0.05, where the p-value comes from the limma ANOVA test, which tests for differences across all categories of the condition of interest jointly. At least 5 DE proteins are required to produce this plot.

For a healthy experiment we expect:

  • Technical replicates to have high correlations.
  • Biological replicates to have higher correlations than non-replicates.
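The idea behind the correlation plot can be sketched with base R on simulated data. The matrix `log_intensities` and the effect sizes are hypothetical, and using `1 - correlation` as the clustering distance is an assumption made for this example; `plot_samples_correlation_matrix` may use a different distance or linkage.

```r
set.seed(42)

# Hypothetical log2 intensities: 100 proteins x 6 samples in two conditions.
protein_level <- rnorm(100, mean = 20, sd = 2)
condition_effect <- rnorm(100, mean = 0, sd = 1.5)
sample_names <- paste0(rep(c("A", "B"), each = 3), "_", 1:3)

log_intensities <- sapply(sample_names, function(s) {
  shift <- if (startsWith(s, "B")) condition_effect else 0
  protein_level + shift + rnorm(100, sd = 0.3)
})

# Pearson correlation between samples, using pairwise complete observations.
cor_mat <- cor(log_intensities, method = "pearson", use = "pairwise.complete.obs")

# Order the samples by hierarchical clustering (assumed distance: 1 - correlation).
hc <- hclust(as.dist(1 - cor_mat), method = "complete")
cor_ordered <- cor_mat[hc$order, hc$order]
round(cor_ordered, 2)
```

In this simulation, replicates of the same condition end up adjacent in `cor_ordered` and are more highly correlated with each other than with samples from the other condition.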

plot_samples_correlation_matrix(CompleteIntensityExperiment)
p <- plot_samples_correlation_matrix(CompleteIntensityExperiment, onlyDEProteins = TRUE)
if(is.null(p)){
  text <- "Not enough differentially expressed proteins to produce a correlation plot."
}

`r if(is.null(p)) print(text)`

3. Feature completeness {.tabset .tabset-fade .tabset-pills}

Data completeness details

The number of missing values can be affected by the biological condition or by technical factors, and it can vary considerably between experiments.

For a healthy experiment we expect:

  • The distribution of available measurements by replicate to be similar across replicates, especially within the same biological conditions

There isn't a strict threshold to look for in terms of minimum % of available measurements. However, an unusually low value in one or a few replicates can be symptomatic of technical problems and should be taken into account when interpreting the final differential expression results.
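The quantities shown in this section can be sketched with base R. The matrix `raw_intensities` is hypothetical, and zeros are treated as missing values, as elsewhere in this report. The last step applies the filter mentioned in the summary of results (proteins with more than 50% missing values across samples are removed); the package's own implementation may differ in detail.

```r
# Hypothetical raw intensities (proteins x samples); zeros mark missing values.
raw_intensities <- matrix(
  c(10,  0, 12, 11,
     0,  0,  9,  8,
    15, 14,  0, 13,
    20, 21, 19, 22,
     0,  0,  0,  7),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("P", 1:5), c("A_1", "A_2", "B_1", "B_2"))
)
measured <- raw_intensities != 0

# By sample: percentage of proteins with an available measurement.
pct_measured_by_sample <- 100 * colMeans(measured)
pct_measured_by_sample

# By protein: percentage of samples in which the protein is missing.
pct_missing_by_protein <- 100 * rowMeans(!measured)
pct_missing_by_protein

# Keep proteins with at most 50% missing values across samples.
keep <- pct_missing_by_protein <= 50
rownames(raw_intensities)[keep]
```

Here sample A_2 has measurements for only 40% of the proteins, lower than the other samples, and protein P5 is dropped because it is missing in 75% of the samples.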

By sample

p <- plot_replicate_measured_values(IntensityExperiment, title = NULL)
if (params$format == "pdf"){
  p
} else {
  ggplotly(p, tooltip = c("y")) %>% plotly::config(displayModeBar = TRUE,
                                                  modeBarButtons = list(list('toImage')),
                                                  displaylogo = FALSE)
}

By protein

p <- plot_protein_missingness(IntensityExperiment, title = NULL)
if (params$format == "pdf"){
  p
} else {
  ggplotly(p, tooltip = c("y")) %>% plotly::config(displayModeBar = TRUE,
                                                  modeBarButtons = list(list('toImage')),
                                                  displaylogo = FALSE)
}

4. Identifications

Identifications Details

The number of identified proteins is the number of non-missing measurements in each replicate.

Low counts in a run may suggest a systematic flaw in the experiment that needs to be addressed prior to interpretation.

p <- plot_n_identified_proteins_by_replicate(IntensityExperiment)

if (params$format == "pdf"){
  p
} else {
  ggplotly(p, tooltip = c("y")) %>% plotly::config(displayModeBar = TRUE,
                                                  modeBarButtons = list(list('toImage')),
                                                  displaylogo = FALSE)
}

5. Normalisation and Imputation

Distributions of raw, normalised (when requested), and imputed intensities

It is useful to inspect and compare the distributions of the intensities to identify samples with highly unusual distributions.

The sections reported here show:

  • The boxplots of the log2 intensities before any normalisation or imputation is applied. Zero (missing) values are not included.
  • The boxplots of relative log expression (RLE) values before and after normalisation, when the latter is requested. The RLE values for a protein are obtained by centering intensities on the protein median, where the median is computed using only available intensities, i.e. non-zero values.
  • The distribution of the imputed and non-imputed intensities.

For more details on each plot, inspect each section.

A. Intensities distribution (before and after normalisation) {.tabset .tabset-fade .tabset-pills}

Raw intensities

Log2 raw intensities distributions

Missing values are not considered when creating the boxplot. Zero intensities are considered as missing values.

normalised <- metadata(CompleteIntensityExperiment)$NormalisationAppliedToAssay != "None"

# Boxplots of log2 raw intensities (zero, i.e. missing, values are excluded)
p_raw <- plot_log_measurement_boxplot(IntensityExperiment,
                                      format = "pdf",
                                      title = "log2 Raw Intensities")
p_raw

RLE (median centered protein intensities)

Relative Log Expression distributions

It is useful to inspect the distribution of the Relative Log Expression (RLE) values to identify samples with highly unusual distributions. The RLE values for a protein are obtained by centering intensities on the protein median, where the median is computed using only available intensities, i.e. non-zero values. The RLE is computed on the log-transformed data before and after applying normalisation, when requested.

For a healthy experiment we expect:

  • The RLE boxplots to have a similar median - centered around zero - across all samples
  • The RLE boxplots to have a similar width of the boxes across samples

If some samples show large deviations from the expected behaviour, it can be symptomatic of problems in the pre-processing of those samples.
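The RLE computation described above can be sketched with base R. The matrix `raw_intensities` is hypothetical; the report computes and plots these values with `plot_rle_boxplot`, whose implementation may differ in detail.

```r
# Hypothetical raw intensities (proteins x samples); zeros mark missing values.
raw_intensities <- matrix(
  c( 8, 16, 32,  0,
     4,  4,  8, 16,
    64,  0, 16, 64),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("P", 1:3), c("A_1", "A_2", "B_1", "B_2"))
)

# Log-transform and mark zeros as missing (log2(0) would give -Inf).
log_intensities <- log2(raw_intensities)
log_intensities[raw_intensities == 0] <- NA

# Median of each protein across samples, using only available intensities.
protein_medians <- apply(log_intensities, 1, median, na.rm = TRUE)

# RLE: subtract each protein's median from its log2 intensities.
rle_values <- sweep(log_intensities, 1, protein_medians, FUN = "-")
rle_values

# Per-sample medians of the RLE values (the centre line of each boxplot).
sample_medians <- apply(rle_values, 2, median, na.rm = TRUE)
sample_medians
```

With only three proteins the per-sample medians are noisy; with thousands of proteins they are expected to sit close to zero for well-behaved samples.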

normalised <- metadata(CompleteIntensityExperiment)$NormalisationAppliedToAssay != "None"

# Plot RLE of log2 raw intensities
p_raw <- plot_rle_boxplot(IntensityExperiment, CompleteIntensityExperiment,
                    includeImputed = FALSE,
                    plotRawRLE = TRUE,
                    title = "RLE of log2 Raw Intensities",
                    format = "pdf")
p_raw
if(normalised){
  # Plot RLE of normalised log2 intensities
  p_norm <- plot_rle_boxplot(IntensityExperiment, CompleteIntensityExperiment,
                      includeImputed = FALSE, plotRawRLE = FALSE,
                      title = "RLE of Normalised log2 Intensities",
                      format = "pdf")

  p_norm
}

B. Imputed vs actual intensities {.tabset .tabset-fade .tabset-pills}

Density distribution of imputed vs actual intensities

Initial intensities equal to zero are considered missing values and are imputed prior to the DE analysis. Imputation is performed using the MNAR ("Missing Not At Random") method as adopted in Perseus. Imputed values are randomly drawn from a normal distribution whose mean is the observed mean (the mean of the available intensities) shifted down by 1.8 times the observed standard deviation, and whose standard deviation is the observed standard deviation scaled by a factor of 0.3 (as in Perseus). The plots below show the distribution of imputed values (Imputed = TRUE) and actual values (Imputed = FALSE), all of which are then used for the downstream DE analyses.
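A minimal sketch of this down-shifted normal imputation is shown below. The function `impute_mnar` and the simulated vector are hypothetical and are not part of MassExpression. The sketch assumes the observed mean and standard deviation are computed on log2 intensities for a single sample; the text above does not state whether the package computes them per sample or across the whole experiment.

```r
set.seed(123)

# Hypothetical log2 intensities for one sample; NA marks missing (zero) values.
log_intensities <- c(rnorm(500, mean = 22, sd = 2), rep(NA_real_, 100))

# Hypothetical helper: draw missing values from a down-shifted, narrowed normal.
impute_mnar <- function(x, shift = 1.8, scale = 0.3) {
  observed_mean <- mean(x, na.rm = TRUE)
  observed_sd <- sd(x, na.rm = TRUE)
  missing <- is.na(x)
  x[missing] <- rnorm(sum(missing),
                      mean = observed_mean - shift * observed_sd,
                      sd = scale * observed_sd)
  list(values = x, imputed = missing)
}

result <- impute_mnar(log_intensities)

# Compare the mean and spread of actual (FALSE) and imputed (TRUE) values.
tapply(result$values, result$imputed, mean)
tapply(result$values, result$imputed, sd)
```

The imputed values form a narrow distribution at the low end of the observed intensities, which is the shape to expect in the density plots below.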

plot_imputed_vs_not(CompleteIntensityExperiment = CompleteIntensityExperiment,
                    format = params$format)


MassDynamics/MassExpression documentation built on May 7, 2023, 11:29 a.m.