# setting crop=NULL prevents PDF cropping, while it is desirable in many cases it produces problems we cannot currently resolve on unix systems
knitr::opts_chunk$set(echo = FALSE, dpi=300, fig.width=7, fig.height=3.5, fig.align="center", tidy.opts=list(width.cutoff=80), tidy=TRUE, crop = NULL) # , base.dir = output_dir, root.dir = output_dir

\newpage

Quality control

The quality control figures in this section enable you to investigate reproducibility and global clustering of samples by visualizing:

The first set of quality control figures describes individual samples, thereafter group-level quality metrics are described and finally sample clustering is used to highlight structure in the entire dataset.

if(exists("ggplot_cscore_histograms") && length(ggplot_cscore_histograms) > 0) {
  cat("## DIA confidence score distributions \n\n")
  cat("DIA data was used as input for this analysis. Below histograms visualize both target (green) and decoy (grey) cscores in each sample, indicating how confident the input software was in the identification of peptides from the spectral library in the raw DIA data. Samples are ordered by the number of precursors quantified at q-value confidence threshold 0.01. At this threshold, the respective cscore and number of peptides is shown. \n\n")

  for(p in ggplot_cscore_histograms) {
    suppressWarnings(print(p))
    cat('\n\n')
  }
}

number of peptides and proteins

These plots show the number of (target) peptides that are 'detected' per sample. For DDA, 'detected' implies the peptide has a MS/MS identification. Peptides quantified through match-between-runs (MBR) are quantified but not detected/identified. In case of DDA, we also show the number of peptides quantified through MBR. For DIA, we refer to a peptide as 'detected' if the confidence score (for identification) is <= 0.01.

Samples in this plot are sorted by their experimental group, and then ordered and by their name within each group. This data is also available in the output table 'samples.xlsx'.

# plot height scales with number of samples
barplot_ht = min(9, 1.5 + 0.2 * nrow(dataset$samples))

samples_metadata_counts_colors = samples_colors_long %>%
  filter(!grepl("(detected|all)_(peptides|proteins)", prop)) %>%
  left_join(dataset$samples %>% select(shortname, exclude, detected_peptides), by = "shortname") %>%
  droplevels() # important to drop unused factor levels after filtering !

samples_metadata_counts_colors$shortname = as.factor(samples_metadata_counts_colors$shortname)
jit = runif(seq_along(levels(samples_metadata_counts_colors$shortname)), -0.2, 0.2)
samples_metadata_counts_colors$sample_jitter = jit[as.numeric(samples_metadata_counts_colors$shortname)]

sm_scatterplot_ht = min(10, 1.5 + 0.2 * length(levels(samples_metadata_counts_colors$prop)))
p_dcbp = ggplot_sample_detect_counts_barplots(dataset$samples, samples_colors_long)
suppressWarnings(print(p_dcbp))

\newpage

color-coding sample metadata

The number of detected peptides in a sample, as compared to other samples within a dataset, can be used as a measure for sample quality. Color-coding individual samples for metadata that you provided as input (e.g. experiment batch, sample handling order, gel lanes, etc.) allows visual inspection as to whether these relate to the rate of successful peptide detection.

The figure below provides an overview of all sample metadata at a first glance. On each row all samples in the dataset are shown as a data point, each color-coded by the respective property shown on the y-axis (with minor vertical jitter for visual clarity). If any of these metadata coincide with a major effect on the number of detected peptides, this should become apparent by a clustering of samples by color-code. Hereafter, an additional set of figures will further expand this overview into detailed figures for each sample property.

Note that the visualization of sample metadata in this report depends on user-provided input; each column in the metadata input table (besides sample names) that contains more than 1 unique value is automatically used as a factor for color-coding all figures in this section. All information shown in these figures is also available in the output table 'samples.xlsx'.

p_dcvsmetadata_scatter = ggplot_sample_detect_vs_metadata_scatterplot(samples_metadata_counts_colors)
suppressWarnings(print(p_dcvsmetadata_scatter))

color-coding sample metadata, expanded

To further detail each sample property, each row in the above figure is now split into separate plots. Thus, a figure is generated for each property in the user-provided metadata (column in the samples table, its name shown in the plot title).

For categorical variables, a scatterplot shows on the y-axis all unique variables while the x-axis depicts the number of detected peptides. Colors are consistent with the above plot. exclude samples, if any, are depicted as squares. The median value is shown as a vertical line (thin line = median over all samples, wider line = median while discarding exclude samples). For continuous variables, a scatterplot without (left panel) and with Loess fit is shown (right panel, visualized as blue line if data was successfully fitted).

Note that samples flagged as exclude are user-provided in the sample metadata table. These are included in data visualizations but excluded from downstream statistical analysis (later part of the report).

For example: the first plot shows color-coding by the 'group' property, so each row represents a sample group. If samples in a particular group systematically yield fewer peptides than another group, a clear pattern will be visible.

p_dcvsmetadata_scatter_byprop = ggplot_sample_detect_vs_metadata_scatterplot_by_prop(samples_metadata_counts_colors)
for(prp in names(p_dcvsmetadata_scatter_byprop)) {
  subchunkify(suppressWarnings(print(p_dcvsmetadata_scatter_byprop[[prp]]$plot)), unique_chunk_id = paste0("detect_vs_metadata_scatterplot_", prp), fig_height = min(10, 1.5 + 0.2 * p_dcvsmetadata_scatter_byprop[[prp]]$n), fig_width = 7)
  cat('\n\n')
}

\newpage

data completeness

To visualize how many peptides are consistently identified in multiple samples, the first figure summarizes how common missing values are in the entire dataset. Optimally, most peptides are identified in 100% of samples and this curve slowly falls of. The following figure shows for each sample whether its peptides are also present in other samples in the dataset or whether these are unique to a (minor) subset of samples. You can use this mark of experimental consistency to compare datasets generated by similar protocols and mass-spec acquisition.

cumulative distribution

p_datacompleteness = ggplot_peptide_detect_frequency_distribution(dataset$peptides, dataset$samples, include_quant = !isdia, remove_exclude_samples = TRUE)
suppressWarnings(print(p_datacompleteness))

Samples flagged as 'exclude' (by user) are not taken into account in this figure. Exact values are shown for data points matching 90% and 50% of samples to convenience comparison between analyses (e.g. before/after configuring 'exclude' samples, or comparing between experiments of similar protocol).

\newpage

peptide detection frequency

Each identified peptide in a sample is classified and color-coded by the number of other samples where the same peptide is present. Visualization of the amount of peptides that overlap with other samples in the dataset, from peptides identified in most samples (red) to one-hit-wonders (blue), helps identify uncommon samples (more blue/green than other samples).

Optimally, the majority of peptides in each sample are red~orange with relatively few uniquely identified peptides (blue~green). Samples are sorted by the total amount of detected peptides.

p_detectfrequency = ggplot_peptide_detect_frequency(dataset$peptides, dataset$samples)
suppressWarnings(print(p_detectfrequency))

\newpage

abundance distributions

The figures in this subsection are used to identify unexpected mass-spec sensitivity or sample loading differences. Peptide data is shown as provided in input files, so peptide filtering nor intensity normalization has been applied yet (for proper QC, make sure the software that generated the input data did not apply normalization prior). If the dataset is DDA, match-between-runs (MBR) peptides are included in these distributions whereas for DIA only 'detected' peptides (based on confidence score threshold) are included.

p_intensity = plot_abundance_distributions(dataset$peptides, dataset$samples, isdia)
suppressWarnings(print(p_intensity$intensity_distributions_all))
cat('\n\n')
for(p in p_intensity$intensity_distributions_bygroup) {
  suppressWarnings(print(p))
  cat('\n\n')
}

\newpage

retention time

The figures in this section allow you to identify potential problems during HPLC elution, such as a temporarily blocking column, failing ionization spray or decreasing sensitivity over time. For each sample, all peptides that are also observed in a replicate (such that there is a point of reference available) are visualized.

retention time distributions

The density of the number of peptides eluting at each point in time. The figure below presents an overview of all samples that allows for the identification of outlier samples that follow distinct elution patterns. The following section shows details for each sample. Samples marked as 'exclude' in the provided sample metadata table are visualized as dashed lines.

p_rt = plot_retention_time_v2(dataset$peptides, dataset$samples, isdia)
suppressWarnings(print(p_rt$rt_distributions_all))
suppressWarnings(print(p_rt$rt_distributions_colour_groups))
suppressWarnings(print(p_rt$rt_distributions_collapse_groups))

retention time local effects

To investigate how each measurement differs from others, we visualize each sample as a 3 panel figure. First, the data is binned across the retention time dimension (x-axis). If a samples was marked as 'exclude' in the provided sample metadata, this is indicated in the plot title.

The top panel shows the number of peptides in the input data, e.g. as recognized by the software that generated input for this pipeline, over time (black line). For reference, the grey line shows the median amount over all samples (note; if this is the exact same in all samples, the grey line may not be visible as it falls behind the black line).

The middle panel indicates whether peptide retention times deviate from their median over all samples (blue line). The grey area depicts the 5% and 95% quantiles, respectively. The line width corresponds to the number of peptides eluting at that time (data from first panel). Analogously, the bottom panel shows the deviation in peptide abundance as compared to the median over all samples (red line).

for(p in p_rt$rt_by_sample) {
  suppressWarnings(print(p))
  cat('\n\n') 
}

\newpage

variation among replicates

The reproducibility of replicate measurements is expressed in three different analyses. First, the difference between peptide intensities in each sample are compared to the mean value among all replicates (foldchange distributions). Next, the Coefficient of Variation (CoV) is used as a metric for reproducibility to explore how much the CoV within a sample group can be improved by removing a single sample (eg; if CoV strongly improved after removing sample s, it could be regarded as an outlier). Finally, the CoV within each sample group is visualized as a boxplot and a violin plot, figures commonly seen in proteomics literature and useful for comparing across experiments (of similar protocol).

within-group foldchange distributions

The foldchange of all peptides in a sample is compared to their respective mean value over all samples in the group. This visualizes how strongly each sample deviates from other samples in the same group which helps identify outlier samples. The same data was used as detailed in the "retention time" section above.

For each sample group, two plots are shown: 1) a basic monochrome plot and 2) a variant that color-codes the top10 'worst' samples based on the standard deviation (sd) of their respective distributions. Note; per sample, the 0.5% quantiles of both top- and bottom-most outliers are disregarded in sd computation. The legend is sorted in column-first descending order (sample with highest sd in column 1 row 1, sample with second highest sd in column 1 row 2, etc.). If there are 10 or fewer samples, all are color-coded. On the pages after the plots, these sd values are shown as tables.

The 'input' panel is based on the peptide intensities as-is (i.e. the user-provided input data from upstream software), the 'normalized' panel shows the exact same samples and peptides after normalization (as specified by user). Samples marked as 'exclude' in the provided sample metadata table are visualized as dashed lines.

# returns a list (names are the sample groups), each contains 3 ggplot objects; mono, highlight, legend
l_fc_distr = suppressWarnings(plot_foldchange_distribution_among_replicates(dataset$peptides, dataset$samples))

\newpage

p_group_mean_foldchange = l_fc_distr$plotlist

i = 0 # counter for unique chunk IDs
for(plotlist in p_group_mean_foldchange) {
  # inject dynamic markdown code so we can specify a different height for each figure (i.e. same for both plots, less for the legend)
  subchunkify(suppressWarnings(print(plotlist$mono)), unique_chunk_id = paste0("p_group_mean_foldchange__mono_", i), fig_height = 3.5, fig_width = 7)
  subchunkify(suppressWarnings(print(plotlist$highlight)), unique_chunk_id = paste0("p_group_mean_foldchange__highlight_", i), fig_height = 3.5, fig_width = 7)
  subchunkify(suppressWarnings(print(plotlist$legend)), unique_chunk_id = paste0("p_group_mean_foldchange__legend_", i), fig_height = 2, fig_width = 7)
  cat('\n\n') 
  i = i + 1
}

# generate a table for each group. Only top10 worst samples per group, otherwise with large datasets it'll be 10 pages of tables...
if(length(l_fc_distr$tib_scores) > 0) {
  tib_group_mean_foldchange = l_fc_distr$tib_scores %>%
    mutate(sd = sprintf("%.3f", sd)) %>%
    select(Group = group, Sample = shortname, `Standard Deviation` = sd)

  cat('\n\\newpage \n')
  cat('**top10 outliers per sample group** \n\n ')
  for(g in unique(tib_group_mean_foldchange$Group)) { # note capital G
    rmarkdown_xtable_custom(tib_group_mean_foldchange %>% filter(Group == g) %>% tail(n=10L), caption = "sd() of within-group foldchange distributions")
    cat('\n\n') 
  }
}
isplot_cov_loo = FALSE
p_cov_loo = list()
if("intensity_qc_basic" %in% colnames(dataset$peptides) && any(!is.na(dataset$peptides$intensity_qc_basic))) {
  # generate figures
  p_cov_loo = suppressWarnings(ggplot_coefficient_of_variation__leave_one_out(dataset$peptides, dataset$samples, samples_colors))
  isplot_cov_loo = length(p_cov_loo) > 0 && "loo_bygroup" %in% names(p_cov_loo)
}
cat('\n\\newpage \n')
cat("### CoV, leave-one-out \n\n")
if(isplot_cov_loo) {
  # print documentation
  cat("The figures below describe the effect of removing a particular sample prior to within-group Coefficient of Variation (CoV) computation. The lower the CoV distribution is for a sample, the better reproducibility we get by excluding it. Only sample groups with at least 4 replicates can be used for this analysis, so 3 samples remain after leaving one out. Samples marked as 'exclude' in the provided sample metadata are included in these analyses (shown as dashed lines), and only peptides with at least 3 data points across replicate samples (after leave-one-out) are used for each CoV computation. \n\n")

  # plot
  for(p in p_cov_loo$loo_bygroup) {
    suppressWarnings(print(p))
    cat('\n\n') 
  }
} else {
  cat("_No CoV leave-one-out computations could be made, the dataset lacks sample groups with at least 4 replicate samples._ \n\n")
}
# bugfix: this code should go into a separate chunk from above chunk_loo_print_bygroup
# otherwise, for unknown reason, the newpage command does not work
if("loo_combined" %in% names(p_cov_loo)) {
  # plot height scales with number of groups
  loo_combined_ht = min(9, 1.5 + 0.3 * length(p_cov_loo$loo_bygroup))
  cat("\n\n\\newpage\n")
  subchunkify(suppressWarnings(print(p_cov_loo$loo_combined)), unique_chunk_id = "loo_combined", fig_height = loo_combined_ht, fig_width = 7)
  cat('\n\n') 
}
if(isplot_cov_loo && "tbl_loo_cov" %in% names(p_cov_loo)) {
  # cat('\n\\newpage \n')
  rmarkdown_xtable_custom(p_cov_loo$tbl_loo_cov, caption = "CoV leave-one-out") #, align="lllllp{3.5in}")
  # print(knitr::kable(p_cov_loo$tbl_loo_cov, format = "latex"))
  cat('\n _Leave-one-out impact on within-group CoV (%)_ \n\n')

  ## dev note; could not refrain more fancy formatted tables from floating in large datasets, causing the table to end up at the bottom of the PDF...
  # !! escape the % character as the caption is directly translated to latex by xtable (and otherwise we inject a comment character...). so we'd use "\\%"
  # print(xtable::xtable(p_cov_loo$tbl_loo_cov, caption = "Leave-one-out impact on within-group CoV (percentage)", scalebox = 0.8, floating = FALSE), include.rownames = FALSE, comment=FALSE)
  # print(kableExtra::kable_styling(knitr::kable(p_cov_loo$tbl_loo_cov, caption = "Leave-one-out impact on within-group CoV (percentage)", format = "latex", booktabs = TRUE), latex_options = c("striped", "hold_position")))
  # knitr::kable(output, "latex", booktabs = TRUE, longtable = TRUE, caption = "Leave-one-out impact on within-group CoV (percentage)", scalebox = 0.8) %>% kableExtra::kable_styling(latex_options = c("hold_position", "repeat_header"))
  # cat('\n\n')

  ## dev note; as figure doesn't scale too well for large datasets, even with {r, echo=F, message=F, warning=F, results="asis", fig.height=barplot_ht}
  # print( gridExtra::grid.table(p_cov_loo$tbl_loo_cov, rows = NULL) )
  # print( ggpubr::ggtexttable(p_cov_loo$tbl_loo_cov, rows = NULL, cols = colnames(p_cov_loo$tbl_loo_cov), theme = ggpubr::ttheme("classic")) )
  # cat('\n _Leave-one-out impact on within-group CoV (%)_ \n\n')
}
isplot_cov = FALSE
p_cov = list()
if(any(!is.na(dataset$peptides$intensity_by_group))) {
  p_cov = suppressWarnings(ggplot_coefficient_of_variation(dataset$peptides, dataset$samples, samples_colors))
  isplot_cov = length(p_cov) > 0 && "boxplot" %in% names(p_cov)
}
cat('\n\\newpage \n')
cat("### Coefficient of Variation \n\n")
if(isplot_cov) {
  # print documentation
  cat("The Coefficient of Variation (CoV) is a quality metric for the reproducibility of replicate measurements, here visualized using box- and violin-plots. \n\n")
  cat("Only samples that are NOT marked 'exclude' in the provided sample metadata and are in a sample group among at least 3 replicates are used for these figures. ")
  cat('The user-specified filtering rules (eg; filter_min_detect, filter_min_peptide_per_prot, etc.) were applied within each sample group independently and remaining peptides were subsequently normalized using the "', paste(norm_algorithm, collapse = "&"), '" algorithm (an important parameter to keep in mind when comparing across datasets/analyses). ')
  cat('Only peptides with at least 3 data points across replicate samples are used for each CoV computation. \n\n')

  cat("**Peptide-level CoV:** \n\n")
  suppressWarnings(print(p_cov$boxplot))
  cat("\n\n")
  suppressWarnings(print(p_cov$violin))
} else {
  cat("_No CoV computations could be made, the dataset lacks sample groups with at least 3 replicate samples._ \n\n")
}
isplot_cov_protein = FALSE
p_cov = list()
if(isplot_cov) {
  x = rollup_pep2prot(dataset$peptides %>% select(peptide_id, protein_id, sample_id, intensity=intensity_by_group), rollup_algorithm = rollup_algorithm)
  p_cov_protein = suppressWarnings(ggplot_coefficient_of_variation(x %>% rename(peptide_id=protein_id, intensity_by_group=intensity), dataset$samples, samples_colors))
  rm(x)
  isplot_cov_protein = length(p_cov_protein) > 0 && "boxplot" %in% names(p_cov_protein)
}
if(isplot_cov_protein) {
  cat('\n\\newpage \n')
  cat("**Protein-level CoV:** \n\n(analogous to peptide CoV's, but with additional rollup to protein abundances using '", rollup_algorithm ,"' algorithm) \n\n", sep = "")
  suppressWarnings(print(p_cov_protein$boxplot))
  cat("\n\n")
  suppressWarnings(print(p_cov_protein$violin))
}
if(length(p_varexplained) > 0) {
  cat("\n\\newpage \n## variance explained \n\n")
  cat('To investigate sources of biological and technical variation in the dataset, this analysis applies the "variancePartition" R package (Hoffman GE, Schadt EE, 2016, PMID:27884101) which uses a linear mixed model to quantify variation in abundance attributable to (user provided) sample metadata. \n \n')
  cat('Although this method was originally designed for gene expression data, it may also provide insights in proteomic data (we here apply this method to the protein-level data matrix). ')
  cat('If additional sample metadata was provided, such as experiment batch, gel, etc., this analysis can be used to assess their relative impact on the dataset. Subsequentially, this can be used to figure out which sources of (technical) variation should be prioritized when optimizing experiment protocols (i.e. this analysis will reveal worst offenders). \n \n')
  cat('The properties in the plot are sorted by median variance explained (large to small), with residuals shown last. \n \n')

  suppressWarnings(print(p_varexplained$p_ve_violin))
  cat("\n\n")
  rmarkdown_xtable_custom(p_varexplained$tbl_ve)
  cat("\n\n")
}

\newpage

PCA

A visualization of the first three PCA dimensions illustrates sample clustering. The goal of these figures is to detect global effects from a quality control perspective, such as samples from the same experiment batch clustering together, not to be sensitive to a minor subset of differentially abundant proteins (for which specialized statistical models can be applied downstream).

If additional sample metadata was provided, such as experiment batch, sample-prep dates, gel, etc., multiple PCA figures will be generated with respective color-codings. Users are encouraged to provide relevant experiment information as sample metadata and use these figures to search for unexpected batch effects.

The pcaMethods R package is used here to perform the Probabilistic PCA (PPCA). The set of peptides used for this analysis consists of those peptides that pass your filter criteria in every sample group. If any samples are marked as 'exclude' in the provided sample metadata, an additional PCA plot is generated with these samples included (depicting the 'exclude' samples as square symbols).

Rationale behind data filter \ As mentioned above, the aim of the PCA figures is to identify global effects. To achieve this, we compute sample distances on the subset of peptides identified in each group which prevents rarely detected peptides/proteins from having a disproportionate effect on sample clustering. This pertains not only to 'randomly detected contaminant proteins' but also to proteins with abundance levels near the detection limit, which may be detected in only a subset of samples (eg; some measurements will be more successful/sensitive than others).

Figure legends \ The first 3 principle components compared visually (1 vs 2, 1 vs 3, 2 vs 3) on the rows. Left- and right-side panels on each row represent the same figure without and with sample labels. The principle components are shown on the axis labels together with their respective percentage of variance explained. Samples marked as 'exclude' in the provided sample metadata, if any, are visualized as square shapes.

p_pca__excl_outliers = p_pca__incl_outliers = list()
p_pca__excl_outliers__npep = p_pca__incl_outliers__npep = 0
if(any(dataset$samples$exclude) && "intensity_all_group_withexclude" %in% colnames(dataset$peptides) && any(!is.na(dataset$peptides$intensity_all_group_withexclude))) {
  tibw_withexclude = dataset$peptides %>% select(key_peptide, sample_id, intensity_all_group_withexclude) %>% filter(!is.na(intensity_all_group_withexclude)) %>%
    pivot_wider(id_cols = key_peptide, names_from = sample_id, values_from = intensity_all_group_withexclude)

  # note that we select only the color-coding for sample groups for the PCA that includes outliers
  if(ncol(tibw_withexclude) > 2) {
    p_pca__incl_outliers = suppressWarnings(plot_sample_pca(as_matrix_except_first_column(tibw_withexclude), dataset$samples, samples_colors %>% select(sample_id, shortname, group), sample_label_property = pca_sample_labels))
    p_pca__incl_outliers__npep = nrow(tibw_withexclude)
  }
}


if("intensity_all_group" %in% colnames(dataset$peptides) && any(!is.na(dataset$peptides$intensity_all_group))) {
  tibw_noexclude = dataset$peptides %>% select(key_peptide, sample_id, intensity_all_group) %>% filter(!is.na(intensity_all_group)) %>%
    pivot_wider(id_cols = key_peptide, names_from = sample_id, values_from = intensity_all_group)

  if(ncol(tibw_noexclude) > 2) {
    p_pca__excl_outliers = suppressWarnings(plot_sample_pca(as_matrix_except_first_column(tibw_noexclude), dataset$samples, samples_colors, sample_label_property = pca_sample_labels))
    p_pca__excl_outliers__npep = nrow(tibw_noexclude)
  }
}
if(length(p_pca__incl_outliers) > 0) {
  cat('\n\\newpage \n')
  cat(sprintf("\n**PCA of all samples, including those flagged as \'exclude\', using %d peptides** \\ \n", p_pca__incl_outliers__npep))

  for(p in p_pca__incl_outliers) {
    suppressWarnings(print(p))
    cat('\n\n')
  }
}


if(length(p_pca__excl_outliers) > 0) {
  cat('\n\\newpage \n')
  cat(sprintf("\n**PCA only on samples not flagged as \'exclude\', using %d peptides** \\ \n", p_pca__excl_outliers__npep))

  for(p in p_pca__excl_outliers) {
    suppressWarnings(print(p))
    cat('\n\n')
  }
}
if(length(l_contrast) > 0) {
  cat("\n\\newpage \n# Differential abundance analysis \n\n")

  cat('**goal: maximize reliable features for quantification** \n \n')
  cat('In a pairwise analysis of two groups of samples, only peptides with N data-points in both groups are used for quantitative analysis (where N = defined by user settings). For example; if peptide *p* is consistently quantified in sample groups A and B but not in C/D/E, it can be used when comparing group A *versus* group B but should not be used in any other group comparisons. This approach is particularly suited to maximize the number of peptides used for statistical analysis in experimental designs with many sample groups. In MS-DAP this is referred to as "by contrast" filtering. \n\n')
  cat('A common alternative strategy is a global filtering approach where peptides are selected based on their properties in the overall dataset (eg; present in x% of samples or x% of replicates in all groups) and subsequentially the resulting data matrix is used for all downstream statistical analyses. In the example above where peptide *p* is present in a subset of sample groups, *p* would either be left out (not present in majority of samples in entire dataset) or erronously used when applying t-statistics to groups B and C (since *p* is not present in group C, it may differentially detected but there are no features available for quantitative analysis) \n\n')

  cat('**contrasts and foldchanges** \n \n')
  cat('Note that a MS-DAP contrast for "A vs B" returns foldchanges for B/A. For example, for the contrast "control vs disease" a positive log2 foldchange implies protein abundances are higher in the "disease" sample group. \n\n')

  has_msqrob = "msqrob" %in% dataset$de_proteins$dea_algorithm
  has_msempire = "msempire" %in% dataset$de_proteins$dea_algorithm

  for(contr_index in seq_along(l_contrast)) {
    contr = names(l_contrast)[contr_index]

    # cat('\n## ', contr, '\n\n')
    cat('\n## ', unlist(lapply(strsplit(gsub(" *#.*", "", sub("^contrast:\\s*", "", contr)), " vs ", fixed = T), function(x) paste(stringr::str_trunc(x, 35, "right"), collapse = " vs "))), '\n\n') # analogous to formatting contrast shortnames for table tib_report_stats_summary @ report main function


    col_contr_intensity = get_column_intensity(dataset$peptides, contr)
    cat("*", sprintf("**user setting:** using '%s' peptide filtering approach", names(col_contr_intensity)), " \n\n")
    if(length(dataset$contrasts) >= contr_index && 
       dataset$contrasts[[contr_index]]$label == contr &&
       length(dataset$contrasts[[contr_index]]$colname_additional_variables) > 0) {
      cat("*", sprintf("**user setting:** additional regression variables: %s", paste(dataset$contrasts[[contr_index]]$colname_additional_variables, collapse = ", ")), " \n\n")
    }

    tib_log = dataset$peptides %>% 
      select(peptide_id, protein_id, intensity = !!as.character(col_contr_intensity)) %>%
      filter(!is.na(intensity)) %>% summarise(n_precursor=n_distinct(peptide_id), n_prot=n_distinct(protein_id))
    cat("*", sprintf("%d peptides in %d proteins remain in the current contrast after peptide filters and are used for the statistical analysis in this section", tib_log$n_precursor[1], tib_log$n_prot[1]), " \n\n")

    contr_line1 = dataset$de_proteins %>% filter(contrast == contr) %>% head(n=1)
    cat("* qvalue threshold:", contr_line1 %>% pull(signif_threshold_qvalue), " \n\n")
    cat("* log2 foldchange threshold:", contr_line1 %>% pull(signif_threshold_log2fc), " \n\n")

    cat('\n### volcano \n\n')
    cat('The plot title shows the statistical model and contrast (sample groups in the comparison). Left- and right-side figure panels on each row represent the same figure without and with labels for the 25 proteins with lowest p-value. \n\n')
    cat('Bottom figure panels have limited x- and y-axis. For datasets with a small number of strong outliers in p-value or fold-change, which may have a profound effect on the plot scales, this allows inspection of the remainder of the volcano plot without disproportionate influence by \'extreme\' values. \n\n')
    cat('Labels for proteins that are more than 12 characters long are truncated for visual clarity (indicated by trailing ...). For protein identifiers that are ambiguous, e.g. a protein-group with assigned genes "gene1a;gene1b", only the first label/ID is shown for visual clarity (indicated by trailing *). \n\n')

    i=0
    for(p in l_contrast[[contr_index]]$p_volcano_contrast) {
      subchunkify(suppressWarnings(print(p)), unique_chunk_id = paste0("volcano_", contr_index, "_", i), fig_height = 8, fig_width = 7)
      cat('\n\n') 
      i = i + 1
    }

    cat('\n\\newpage \n')
    cat('\n### foldchange distribution \n\n') 
    cat('Distributions of estimated foldchanges produced by the statistical models. If the mode is far from 0, consider alternative normalization strategies. Do note the scale on the x-axis, for some experiments the foldchanges are very low which in turn may exaggerate this figure. \n\n') 

    if(has_msqrob) {
      cat("*note; the MSqRob model tends to assign zero (log)foldchange for proteins with minor difference between conditions where the model is very sure the null hypothesis cannot be rejected (shrinkage by the ridge regression model). As a result, many foldchanges will be zero and the density plot for MSqRob may look like a spike instead of the expected Gaussian shape observed in other models* \n\n") 
    }

    if(length(l_contrast[[contr_index]]$p_foldchange_density) > 0) {
      subchunkify(suppressWarnings(print(l_contrast[[contr_index]]$p_foldchange_density)), unique_chunk_id = paste0("foldchange_density_", contr_index), fig_height = 7, fig_width = 7)
    }

    cat('\n\\newpage \n')
    cat('\n### p-value distribution \n\n') 
    cat('Histogram of p-values computed by differential expression analysis algorithms, as-is, for quality-control inspection. The horizontal line indicates the expected counts assuming a uniform distribution (total number of p-values divided by number of histogram bins)\n\n')
    cat('See further: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6164648/ \n\n')
    cat('See further: http://varianceexplained.org/statistics/interpreting-pvalue-histogram/ \n\n')

    if(has_msqrob || has_msempire) {
      cat("*note; the MSqRob and MS-EmpiRe models often yield p-value distributions that show a large peak at p-value 1, these are typically proteins with estimated log foldchanges at/near zero where these models are very sure the null hypothesis cannot be rejected* \n\n") 
    }

    if(length(l_contrast[[contr_index]]$p_pvalue_hist) > 0) {
      subchunkify(suppressWarnings(print(l_contrast[[contr_index]]$p_pvalue_hist)), unique_chunk_id = paste0("pvalue_hist_", contr_index), fig_height = 7, fig_width = 7)
    }


    ## differential detect
    if(exists("dd_plots") && contr %in% names(dd_plots) && length(dd_plots[[contr]]) > 0) {
      cat('\n\\newpage \n')
      cat('\n### differential detect \n\n')
      cat('Some proteins may not have peptides with sufficient data points over samples to be used for differential expression 
analysis (DEA), but do show a strong difference in the number of detected peptides between sample groups. 
In some proteomics experimental designs, for example a wildtype-knockout study, those are interesting proteins. 
For this purpose, a basic metric for differential testing based on observed peptide counts is provided in MS-DAP as 
a situational tool. Importantly, this approach is less robust than DEA and the criteria to find a reliable set of 
differentially expressed proteins (by differential testing) might differ between real-world datasets. \n\n

As general guidelines for differential detection, the recommended default setting is to filter for proteins that 
were observed with at least 2 peptides in at least 3 replicates (or 50% of replicates, whichever number is greater). 
Use the plots of differential detection z-score histograms to observe the overall distribution and start your data 
exploration at proteins with the strongest z-scores to find desired z-score cutoffs (typically an absolute z-score 
of 5 or higher, but this is not set in stone for all datasets). This works best for DDA experiments, for DIA only the most 
extreme values are informative in many cases (e.g. proteins exclusively identified in condition A & found in nearly 
all replicates of condition A). \n\n')
      cat('Below figure shows the distribution of these scores with thresholds at 6 std. Both the z-scores and the counts these are based upon are available in the statistical result Excel table. \n\n')

      # !! index by name. Order may be different from l_contrasts !!
      subchunkify(suppressWarnings(print(dd_plots[[contr]])), unique_chunk_id = paste0("dd_hist_", contr_index), fig_height = ifelse(isdia, 3, 5.5), fig_width = 5)
    }



    cat('\n\\newpage \n')
  }
}
if(exists("tib_report_stats_summary") && is.data.frame(tib_report_stats_summary) && nrow(tib_report_stats_summary) > 0) {
  cat("\n\\newpage \n# Summary of differential testing \n\n")
  cat('Differential expression analysis (DEA): summary of significant proteins. The "signif" column indicates the number of significant hits at user-provided thresholds for adjusted p-value and foldchange. Columns "q<0.01" and "q<0.05" show the number of significant hits at 1% and 5% FDR cutoffs, respectively. The top10 proteingroups with strongest p-value per contrast and DEA algorithm is shown in the last column (note that these may include non-significant proteins, i.e. even if there are no significant hits the top10 is still shown). \n\n')

  rmarkdown_xtable_custom(tib_report_stats_summary, caption = "Significant hits", align="lp{2in}lllllp{3in}", scalebox = 0.9)
  # print(xtable::xtable(tib_report_stats_summary, caption = "Significant hits", scalebox = 0.8), floating = FALSE, include.rownames = FALSE, comment=FALSE)
  #knitr::kable(tib_report_stats_summary, caption = "Significant hits", format="latex", booktabs=TRUE) %>% kable_styling(latex_options="scale_down")
  cat('\n\n')


  ## differential detect
  cat('\\bigskip \n\n')
  cat('\\bigskip \n\n')
  cat("Differential detection: summary of proteins with extreme differences in observed peptides. A simple metric to complement results from DEA, which is the main result, especially for proteins that lack data to perform DEA. \n\nDifferential detection scores based only on 'detected' peptides; \n\n")
  if(exists("tib_report_diffdetects_summary") && is.data.frame(tib_report_diffdetects_summary) && nrow(tib_report_diffdetects_summary) > 0) {
    rmarkdown_xtable_custom(tib_report_diffdetects_summary, caption = "Differential Detection @ only 'detected' peptides", align="lp{2in}llp{3.5in}", scalebox = 0.9)
    # print(kableExtra::column_spec(knitr::kable(tib_report_diffdetects_summary, format="latex", row.names = FALSE), 4, width = "4in"))
    # print(xtable::xtable(tib_report_diffdetects_summary, caption = "Differential Detection, candidate proteins", scalebox = 0.8, floating = FALSE), include.rownames = FALSE, comment=FALSE)
    cat('\n\n')
  }
  if(exists("tib_report_diffdetects_summary_quant") && is.data.frame(tib_report_diffdetects_summary_quant) && nrow(tib_report_diffdetects_summary_quant) > 0) {
    cat("Differential detection scores based only on all quantified peptides; \n\n")
    rmarkdown_xtable_custom(tib_report_diffdetects_summary_quant, caption = "Differential Detection @ all quantified peptides", align="lp{2in}llp{3.5in}", scalebox = 0.9)
    # print(kableExtra::column_spec(knitr::kable(tib_report_diffdetects_summary_quant, format="latex", row.names = FALSE), 4, width = "4in"))
    # print(xtable::xtable(tib_report_diffdetects_summary_quant, caption = "Differential Detection, candidate proteins", scalebox = 0.8, floating = FALSE), include.rownames = FALSE, comment=FALSE)
    cat('\n\n')
  }

}

\definecolor{textColorRed}{RGB}{204,37,22} \definecolor{textColorBlue}{RGB}{4,54,140} \definecolor{textColorSuccess}{RGB}{1,128,35} \definecolor{textColorGray}{RGB}{85,85,85}

if(exists("log_") && length(log_) > 0) {
  cat('\n\\newpage \n')
  cat('# log \n')
  cat('\\scriptsize \n')

  # if the log is a list assume the first element in each is the log message (eg; like msdap::logger.default(), where each entry contains at least a message and a type)
  if(is.list(log_)) {
    # iterate log entries
    for(i in seq_along(log_)) {
      l = unlist(log_[[i]]) # unlist is not strictly needed, but robust against whatever custom loggers might be used
      if(length(l) == 0 || l[1] == "") { # guard against empty entries
        next
      }
      msg = l[1]

      if(length(l) > 1) {
        sts = tolower(l[2])
        clr = "black"
        if(sts %in% c("warning", "error")) clr = "textColorRed"
        if(sts == "info") clr = "textColorBlue"
        if(sts == "progress") clr = "textColorGray"
        if(sts == "success") clr = "textColorSuccess"
        # color-coded status  +  message
        cat(sprintf("\\textcolor{%s}{%s}", clr, paste0("[", sts, "] ")), gsub("\\s*\n", " \\ \n", msg), " \n\n", sep="")
      } else {
        # message only
        cat(gsub("\\s*\n", " \\ \n", msg), " \n\n", sep="")
      }
      # cat("[", s[2], "] ", gsub("\\s*\n", " \\ \n", s[1]), " \n\n", sep="") # v1: no color-coding
    }
  } else {
    # log_ is not a list, assume array of messages (for compatability with a logger other than msdap::logger.default() that simply stores an array of strings)
    for(s in log_) {
      cat(gsub("\\s*\n", " \n\n", paste(stringr::str_wrap(s, width = 100), collapse=" \\ \n"), fixed=T), " \n\n\n")
    }
  }

  cat('\n\\normalsize \n')
}

\newpage

R command history

This shows the history commands from your R script that starts this pipeline, thereby automatically documenting the parameters/settings used. All lines of executed code since (last) importing data using this R package are shown.

Using this feature

Do not use RStudio's source option to execute our pipeline since it will only write source(...yourscript.R) to the session history, and consequentially that is all you see in this 'code log'. Instead, select all lines in your script (control + A) and then "run" the selected code (either click the run button in RStudio, or use control + enter). All lines shown in this section are the same as shown in the RStudio 'History' pane (a tab on the top-right of its UI).

\scriptsize

#, you can reset these through the RStudio UI before making your 'final' analysis to generate a clean 'code log' (for advanced users, you can start your script with `rstudioapi::executeCommand("clearHistory")` to clear history)

# update: we now first subset the R command history from the last "import_dataset..." on, then format the R code. So prior R code/commands that are 'unparsable' will not affect the pretty-printing of the subset of code that we want to display here
# try to format code (throws error if there are lines that yield syntax errors)
formatted_code = tryCatch(format_r_code(subset_relevant_code_snippet_for_report(history_as_string)), error = function(e) {""})
cat('```r \n')
# on code format error, just produce the r history as-is
if(length(formatted_code) == 1 && formatted_code == "") {
  cat(paste(history_as_string, collapse=" \n"), " \n")
} else {
  cat(paste(formatted_code, collapse=" \n"), " \n")
}
cat('``` \n')

if(length(formatted_code) == 1 && formatted_code == "") {
  cat('\n\n *Could not pretty-print your R code. Perhaps there are syntax errors in your R history? If so, either clear your R history or restart RStudio to amend.* \n')
}
if(grepl("^\\s*source\\s*\\(", tail(history_as_string, 1))) {
  cat('\n\n **Instead of using "source" to execute your script in RStudio,\nfollow above instructions to enable inclusion of your code/parameters in this report** \n')
}

\normalsize

\newpage

R session info

The computer system and versioning of all R packages used to run this analysis are shown below to facilitate, in combination with the previous section, reproducibility.

tmp = devtools::session_info()
tib_platform = tibble(setting = names(tmp$platform), value = unlist(tmp$platform))
tib_packages = as_tibble(tmp$packages)

rmarkdown_xtable_custom(tib_platform)
cat('\n _System_ \n\n')
cat('\\bigskip \n\n')
cat('\\bigskip \n\n')

rmarkdown_xtable_custom(tib_packages %>% filter(!is_base & attached) %>% select(package, loadedversion, source))
cat('\n _Attached packages_ \n\n')
cat('\\bigskip \n\n')
cat('\\bigskip \n\n')

rmarkdown_xtable_custom(tib_packages %>% filter(!is_base & !attached) %>% select(package, loadedversion, source))
cat('\n _Packages that are not attached_ \n\n')

# knitr::kable(tib_platform, caption = "System")
# knitr::kable(tib_packages %>% filter(!is_base & attached) %>% select(package, loadedversion, source), caption = "Attached packages")
# knitr::kable(tib_packages %>% filter(!is_base & !attached) %>% select(package, loadedversion, source), caption = "Packages that are not attached")


ftwkoopmans/msdap documentation built on March 5, 2025, 12:15 a.m.