analysis_quickstart: Quickstart for analyses in this pipeline

Description

All-in-one function that covers the vast majority of use cases for analyzing a dataset imported into MS-DAP (assuming you have already loaded peptide data, sample metadata and FASTA files using the MS-DAP import functions).

Usage

analysis_quickstart(
  dataset,
  filter_min_detect = 0,
  filter_fraction_detect = 0,
  filter_min_quant = 0,
  filter_fraction_quant = 0,
  filter_min_peptide_per_prot = 1,
  filter_topn_peptides = 0,
  filter_by_contrast = FALSE,
  norm_algorithm = c("vsn", "modebetween_protein"),
  rollup_algorithm = "maxlfq",
  dea_algorithm = c("deqms", "msqrob", "msempire"),
  dea_qvalue_threshold = 0.01,
  dea_log2foldchange_threshold = 0,
  diffdetect_min_peptides_observed = 2,
  diffdetect_min_samples_observed = 3,
  diffdetect_min_fraction_observed = 0.5,
  pca_sample_labels = "auto",
  var_explained_sample_metadata = NULL,
  multiprocessing_maxcores = NA,
  output_abundance_tables = TRUE,
  output_qc_report = TRUE,
  output_dir,
  output_within_timestamped_subdirectory = TRUE,
  dump_all_data = FALSE
)

Arguments

dataset

a valid dataset object generated upstream by an MS-DAP import function. For instance, import_dataset_skyline() or import_dataset_maxquant_evidencetxt()

filter_min_detect

in order for a peptide to 'pass' in a sample group, in how many replicates must it be detected?

filter_fraction_detect

in order for a peptide to 'pass' in a sample group, in what fraction of replicates must it be detected?

filter_min_quant

in order for a peptide to 'pass' in a sample group, in how many replicates must it be quantified?

filter_fraction_quant

in order for a peptide to 'pass' in a sample group, in what fraction of replicates must it be quantified?

filter_min_peptide_per_prot

in order for a peptide to 'pass' in a sample group, how many peptides should be available for its protein after the detect filters? The default is 1, but 2 can be a good choice in some situations (e.g. to avoid relying on proteins with just 1 quantified peptide)

filter_topn_peptides

maximum number of peptides to maintain for each protein (from the subset that passes above filters, peptides are ranked by the number of samples where detected and their variation between replicates).

filter_by_contrast

should the above filters be applied to all sample groups, or only those tested within each contrast? Enabling this optimizes available data in each contrast, but increases the complexity somewhat as different subsets of peptides are used in each contrast and normalization is applied separately.

norm_algorithm

normalization algorithm(s), or provide an empty string to skip normalization. Refer to normalization_algorithms() function documentation for available options and a brief description of each. Provide an array of options to run each algorithm consecutively, for instance; c("vsn", "modebetween_protein") to first apply vsn normalization and then correct between-group ratios such that the protein-level log2-foldchange mode is zero

rollup_algorithm

strategy for combining peptides to proteins, as used in DEA algorithms that first combine peptides to proteins and then apply statistics, like eBayes and DEqMS. Options: maxlfq, tukey_median, sum. See the documentation of rollup_pep2prot() for details

dea_algorithm

algorithm for differential expression analysis (provide an array of strings to run multiple, in parallel). Refer to dea_algorithms() function documentation for available options and a brief description of each. To use a custom DEA function, provide the respective R function name as a string (see GitHub documentation on custom DEA functions for more details)

dea_qvalue_threshold

threshold for significance of adjusted p-values in figures and output tables. Output tables will also include all q-values as-is

dea_log2foldchange_threshold

threshold for significance of log2 foldchanges. Set to zero to disregard, or to a positive value to apply a cutoff to absolute log2 foldchanges. MS-DAP can also perform a bootstrap analysis to infer a reasonable threshold; set this parameter to NA to enable it

diffdetect_min_peptides_observed

for differential detection only; minimum number of peptides that a protein must be detected with in either group (within at least diffdetect_min_samples_observed) in order to be included in the differential detection z-score results. Set to NA to disable differential detection

diffdetect_min_samples_observed

for differential detection only; minimum number of samples where a protein should be observed at least once by any of its peptides (in either group) when comparing a contrast of group A vs B. Set to NA to disable differential detection

diffdetect_min_fraction_observed

for differential detection only; analogous to diffdetect_min_samples_observed, but here you can specify the fraction of samples where a protein needs to be detected in either group (within the respective contrast). default; 0.5 (50% of samples)

pca_sample_labels

whether to use sample names or a numeric ID as labels in the PCA plot. options: "auto" (let code decide, default), "shortname" (use sample shortnames), "index" (auto-generated numeric ID), "index_asis" (same as index option and specifically disable label overlap reduction)

var_explained_sample_metadata

optionally, enable variance-explained analysis. This is slow, even for small datasets, and even more so as the number of sample metadata columns grows (so to save time in routine analyses, this is disabled by default). Set to NULL to disable (default), NA to automatically infer the column names to use from dataset@samples, or provide an array of column names from dataset@samples (e.g. c("group","batch","sex"))

multiprocessing_maxcores

optionally, integer parameter to set the maximum number of cores to use when running MSqRob/MSqRobSum DEA algorithms. If other DEA methods are used, this setting doesn't do anything. Set to NA (default) to automatically select all available CPU cores minus 1. For systems with many CPU cores that run into errors related to "socketConnection" or "PSOCK", try limiting this to a lower number (e.g. 8)

output_abundance_tables

whether to write peptide- and protein-level data matrices to file. options: FALSE, TRUE

output_qc_report

whether to create the Quality Control report. options: FALSE, TRUE. Highly recommended to set to TRUE (default). Set to FALSE to skip the report PDF (e.g. to only do differential expression analysis and skip the time-consuming report creation)

output_dir

output directory where all output files should be stored. If the provided file path is not an existing directory, it will be created. Optionally, disable the creation of any output files (QC report, DEA table, etc.) by setting this parameter to NA (also overrides the 'dump_all_data' parameter)

output_within_timestamped_subdirectory

optionally, automatically create a subdirectory (within output_dir) that has the current date&time as name and store results there. options: FALSE, TRUE

dump_all_data

if you're interested in performing custom bioinformatic analyses and want to use any of the data generated by this tool, you can dump all intermediate files to disk. Has performance impact so don't enable by default. options: FALSE, TRUE

Filtering

Peptide filter criteria are applied to replicate samples within a sample group. Parameters: filter_min_detect, filter_fraction_detect, filter_min_quant, filter_fraction_quant. You only have to provide the filters you want to activate (but specify at least 1); filters/settings you do not specify don't do anything by default.

Recommended settings. For DDA: detected (MS/MS identification) in at least 1~2 replicates and quantified in at least ~75% of replicates. For DIA: detected (confidence score < threshold) in at least ~75% of replicates, because DIA typically yields an abundance value in each sample regardless of the identification confidence score. If there are only 3 replicates, we recommend filtering such that there are at least 3 datapoints to work with in differential expression analysis.

Taken together, recommended settings for a DDA dataset with 3~8 replicates in each sample group look like this;

filter_min_detect = 1 (or zero to fully rely on MBR), filter_fraction_detect = 0.25 (or zero to fully rely on MBR), filter_min_quant = 3, filter_fraction_quant = 0.75

Analogous for DIA;

filter_min_detect = 3, filter_fraction_detect = 0.75
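How a count filter and a fraction filter interact within one sample group can be illustrated with a small base R sketch. This is not MS-DAP code: the helper peptide_passes_group() is hypothetical, and it assumes that the two kinds of filter are combined by taking the stricter of the two thresholds, which may differ from the actual implementation.

```r
# Sketch only: hypothetical helper, not part of the MS-DAP API.
# Assumption: a minimum-count filter and a fraction filter are combined by
# taking the stricter (larger) of the two thresholds within a sample group.
peptide_passes_group <- function(n_detect, n_quant, n_replicates,
                                 min_detect = 0, fraction_detect = 0,
                                 min_quant = 0, fraction_quant = 0) {
  need_detect <- max(min_detect, ceiling(fraction_detect * n_replicates))
  need_quant  <- max(min_quant,  ceiling(fraction_quant  * n_replicates))
  n_detect >= need_detect & n_quant >= need_quant
}

# DDA settings recommended above, applied to a group of 6 replicates:
# need_detect = max(1, ceiling(1.5)) = 2, need_quant = max(3, ceiling(4.5)) = 5
dda <- list(min_detect = 1, fraction_detect = 0.25,
            min_quant = 3, fraction_quant = 0.75)

# Peptide detected in 2 and quantified in 5 of 6 replicates
pass_a <- do.call(peptide_passes_group,
                  c(list(n_detect = 2, n_quant = 5, n_replicates = 6), dda))
# Peptide detected in 2 but quantified in only 4 of 6 replicates
pass_b <- do.call(peptide_passes_group,
                  c(list(n_detect = 2, n_quant = 4, n_replicates = 6), dda))
```

Under this assumption the first peptide passes and the second fails, because 75% of 6 replicates rounds up to 5 required quantifications.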

Filter within contrast vs using all groups

Two distinct approaches to selecting peptides can be used for differential expression analysis: 1) 'within contrast' and 2) 'apply filter to all sample groups'.

  1. Within contrast (filter_by_contrast = TRUE): determine within each contrast (e.g. group A vs group B) which peptides can be used by applying the above peptide filter criteria, then apply normalization to this data subset. Advantageous in datasets with many groups; this maximizes the number of peptides used in each contrast (e.g. let peptide p be observed in groups A and B but not in C; we'd want to use it in A vs B, not in A vs C). As a disadvantage, this complicates interpretation since the exact data used differs between contrasts (slightly different peptides and normalization in each contrast).

  2. All sample groups (filter_by_contrast = FALSE): apply the above filter criteria to each sample group (i.e. a peptide must pass these filter rules in every sample group) and then apply normalization. The resulting data matrix is then used for all downstream statistics. Advantage: simple and robust. Disadvantage: potentially misses (group-specific) peptides/data-points that fail the filter criteria in just 1 group, particularly in large datasets with 4+ groups.

Note; if there are just 2 sample groups (eg; WT vs KO), this point is moot as both approaches are the same
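The difference between the two approaches can be sketched in base R with a toy pass/fail matrix. This is an illustration only, not MS-DAP code; the matrix and the helper peptides_for_contrast() are hypothetical, and normalization is left out.

```r
# pass[peptide, group]: TRUE if the peptide passed the within-group filters
pass <- matrix(
  c(TRUE, TRUE, FALSE,   # peptide p: passes in groups A and B, fails in C
    TRUE, TRUE, TRUE),   # peptide q: passes in all groups
  nrow = 2, byrow = TRUE,
  dimnames = list(c("p", "q"), c("A", "B", "C"))
)

# Apply the filter to all sample groups: a peptide must pass in every group
keep_all_groups <- rownames(pass)[rowSums(pass) == ncol(pass)]

# Filter within contrast: a peptide must pass only in the two groups compared
peptides_for_contrast <- function(pass, group_1, group_2) {
  rownames(pass)[pass[, group_1] & pass[, group_2]]
}
keep_a_vs_b <- peptides_for_contrast(pass, "A", "B")
keep_a_vs_c <- peptides_for_contrast(pass, "A", "C")
```

Peptide p is retained for A vs B only when filtering within contrast; when the filter is applied to all groups it is dropped everywhere because it fails in group C.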

Normalization

Normalization algorithms are applied to the peptide-level data matrix. options: "" (empty string disables normalization), "vsn", "loess", "rlr", "msempire", "vwmb", "modebetween", "modebetween_protein" (this balances foldchanges between sample groups. Highly recommended, see MS-DAP manuscript). Refer to normalization_algorithms() function documentation for available options and a brief description of each.

You can combine normalizations by providing an array of options; the algorithms are then applied consecutively, in the order given.

For instance, norm_algorithm = c("vsn", "modebetween_protein") applies the vsn algorithm (quite strong normalization that reduces variation) and then balances between-group protein-level foldchanges with modebetween_protein normalization.

In benchmarks, c("vwmb", "modebetween_protein") and c("vsn", "modebetween_protein") were the best-performing strategies; see the MS-DAP manuscript.
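The idea of balancing between-group foldchanges so that their mode is zero can be sketched in base R on simulated data. This is not the MS-DAP implementation of "modebetween_protein": the simulated protein-level values, the helper density_mode() and the use of a kernel density peak as the mode estimate are all assumptions made for illustration.

```r
set.seed(1)
# Simulated mean log2 protein abundances per group; group B carries a
# systematic offset of +0.4 on top of random noise
group_a <- rnorm(500, mean = 20, sd = 2)
group_b <- group_a + 0.4 + rnorm(500, sd = 0.2)

# Estimate the mode of a numeric vector as the peak of a kernel density
density_mode <- function(x) {
  d <- stats::density(x)
  d$x[which.max(d$y)]
}

# Mode of the between-group log2 foldchanges before correction
log2fc <- group_b - group_a
shift <- density_mode(log2fc)

# Shift group B so that the foldchange mode becomes (approximately) zero
group_b_balanced <- group_b - shift
mode_after <- density_mode(group_b_balanced - group_a)
```

Here shift recovers roughly the simulated offset, and after subtracting it the foldchange distribution peaks close to zero.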

Differential Expression Analysis

Statistical models for differential expression analysis

MSqRob is recommended for most cases; a peptide-level model that is highly sensitive and quite robust. Reference: https://github.com/statOmics/MSqRob

MS-EmpiRe is a peptide-level model that works especially well for DDA data. Reference: https://github.com/zimmerlab/MS-EmpiRe

eBayes is robust but conservative, using the limma package to apply moderated t-tests on protein-level abundances. Reference: https://doi.org/10.18129/B9.bioc.limma

options: ebayes, deqms, msempire, msqrob, msqrobsum. Refer to dea_algorithms() function documentation for available options and a brief description of each.

You can apply multiple DEA models in parallel by supplying an array of options. The output of each model is visualized in the PDF report and its data is included in the output Excel report, e.g. dea_algorithm = c("ebayes", "deqms", "msempire", "msqrob")
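As a sketch of how the settings discussed above might be collected for a DDA dataset, the list below uses only argument names and values that appear on this page. The output path is a hypothetical placeholder, and the final call is left commented out because it requires the msdap package and a dataset object created by one of the import functions.

```r
# Settings for a DDA dataset, using argument names from the Usage section
quickstart_args <- list(
  filter_min_detect = 1,
  filter_fraction_detect = 0.25,
  filter_min_quant = 3,
  filter_fraction_quant = 0.75,
  filter_by_contrast = FALSE,
  norm_algorithm = c("vsn", "modebetween_protein"),
  dea_algorithm = c("ebayes", "deqms", "msempire", "msqrob"),
  dea_qvalue_threshold = 0.01,
  dea_log2foldchange_threshold = 0,
  # hypothetical output location, replace with your own directory
  output_dir = file.path(tempdir(), "msdap_results")
)

# Requires the msdap package and a `dataset` from an MS-DAP import function:
# do.call(msdap::analysis_quickstart,
#         c(list(dataset = dataset), quickstart_args))
```

Keeping the settings in a named list makes it easy to store them alongside the results or to reuse them across datasets.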

See Also

dea_algorithms() and normalization_algorithms() for available algorithms and documentation.


ftwkoopmans/msdap documentation built on March 5, 2025, 12:15 a.m.