knitr::opts_chunk$set( collapse = TRUE, fig.path = "images/custnorm-", comment = "#>" )
devtools::load_all() # ! change to some temporary working directory on your computer ! setwd("C:/temp/")
This vignette demonstrates how you can plug custom functions into the MS-DAP pipeline.
note: in the code blocks shown below (grey areas), the lines that start with #>
are the respective output that would be printed to the console if you run these code snippets on your computer
load the Skyline output of the LFQbench study (PMID:27701404 ~ this file is bundled with the MS-DAP package, you don't have to download anything).
extract the respective group/condition of each sample by matching a regular expression against the filenames. This is just to demonstrate a more advanced utility function that may come in handy for bioinformatic analyses. In typical MS-DAP workflows, the sample metadata would be loaded from file (a csv or Excel table), as demonstrated in the user guide.
finally, we define the contrast: group A versus group B.
note; the sample-to-condition assignments were taken from process_hye_samples.R
@ https://www.ebi.ac.uk/pride/archive/projects/PXD002952/files
library(msdap) f <- system.file("extdata", "Skyline_HYE124_TTOF5600_64var_it2.tsv.gz", package = "msdap") dataset = import_dataset_skyline(f, confidence_threshold = 0.01, return_decoys = F, acquisition_mode = "dia") dataset = sample_metadata_custom(dataset, group_regex_array = c(A = "007|009|011", B = "008|010|012") ) dataset = setup_contrasts(dataset, contrast_list = list(c("A", "B"))) print(dataset$samples %>% select(sample_id, group))
We here implement a simple normalization approach to illustrate the technique for providing MS-DAP with a custom normalization function. The code below implements quantile normalization.
Note that in MS-DAP, normalization is always applied to peptide-level. If you want to normalize at protein-level instead, we suggest to perform peptide-to-protein rollup within the normalization function, apply normalization and then backport those scaling factors to the peptide-level matrix.
The requirements to make your custom normalization function work are:
1) The function must have a unique name that does not overlap with common R functions (bad name: 'mean'). So pick unique names like 'mynorm_quantile' or 'my_norm_v1'
2) It should at least have a parameter named x_as_log2
. Furthermore, make sure to add a ...
parameter as a 'catch all' for named parameters passed by the MS-DAP pipeline that you won't make use of, and to be robust against future parameters that might be added.
List of parameters that are currently passed to custom normalization functions, add these to your function to work with the respective data:
x_as_log2
= log2 transformed peptide intensity data matrix, where each row is a peptide and each column a samplegroup_by_cols
= array of characters that describe the sample grouping (which columns belong to the same sample group)group_by_rows
= array of characters that describe the peptide grouping (which rows belong to the same protein)rollup_algorithm
= user's preferred peptide-to-protein rollup algorithm3) return data has to be of the exact same format as x_as_log2 (but with normalized abundance values)
my_norm = function(x_as_log2, ...) { cat("Example custom normalization implementation, my_norm(), scaling all samples by some quantile\n") quantile_for_normalization = 0.95 # value at quantile x for each column/sample scale_per_sample = matrixStats::colQuantiles(x_as_log2, probs = quantile_for_normalization, na.rm=T) # instead of scaling samples such that quantile x is zero, adjust by mean shift so output values are of the same order as input scale_per_sample = scale_per_sample - mean(scale_per_sample) # apply scaling to each column (transpose, scale 'rows', then transpose again) return(t(t(x_as_log2) - scale_per_sample)) }
now that we have implemented a normalization function, let's run a very basic data matrix to test if it works;
mock_data = cbind(1:6, 2:7, 0:5) print(mock_data) print(my_norm(mock_data))
In the last section of this document, we'll use this new normalization function in MS-DAP
Below we have implemented a simplified DEA function that applies a t.test()
on protein abundances.
1) The function must have a unique name that does not overlap with common R functions (bad name: 'mean'). So pick unique names like 'mydea_ttest' or 'my_ttest_v1'
2) Analogous to the custom normalization function presented above, multiple named parameters will be passed to your custom dea function. You can choose which to work with, but make sure to add a ...
parameter as a 'catch all' for named parameters passed by the MS-DAP pipeline that you won't make use of, and to be robust against future parameters that might be added.
List of parameters that are currently passed to custom dea functions, add these to your function to work with the respective data:
peptides
= long-format tibble of peptide data. For simple use-cases, you might first consider the peptide- or protein-level ExpressionSet (these hold simple data matrices to operate on, see example code block below)samples
= wide-format tibble with sample metadata, take note of the columns 'sample_id' and 'condition' so you can classify samples into conditions while A/B testingeset_peptides
= ExpressionSet for peptides. Holds a peptide*sample log2 intensity matrix as well as row- and column-level sample dataeset_proteins
= ExpressionSet for proteins, analogous to eset_peptides
but peptide-to-protein rollup was already performed.input_intensities_are_log2
= boolean that tells you whether the intensities are log2 transformed (they are by default, but could change in future, so implement a check like below)random_variables
= array of column names in the samples tableThe return data must be a tibble with these columns;
protein_id
= character column with the protein identifiers provided in the input data (peptides tibble or metadata from peptide/protein ExpressionSet). These must be unique valuespvalue
= numeric column with p-valuesqvalue
= numeric column with p-values adjusted for multiple testing correction (e.g. after applying FDR to your p-values)foldchange.log2
= numeric column with log2 foldchangesdea_algorithm
= name of your algorithm. must be a non-empty character string uniquely indicating the name/label of your method. DO NOT use names already used by other methods in this pipeline, like 'ebayes'The example code, together with code comments, should provide a code skeleton for the implementation of your custom DEA algorithm.
# note that we only add named parameters for data we're using in this example implementation (i.e. additional parameters like `random_variables` and `peptides` are sent to `...`) my_dea_stats = function(eset_proteins, input_intensities_are_log2, ...) { ### 1) from provided protein-level data; extract the protein intensity matrix, to which condition each sample belongs and find the columns matching groups 1 and 2 x = Biobase::exprs(eset_proteins) # transform to log2 if input data is non-log if (!input_intensities_are_log2) { x = log2(x) } # set non-finite values to NA x[!is.finite(x)] = NA # extract groups x_groups = pData(eset_proteins)$condition x_groups_unique = unique(x_groups) # assume there are 2 groups stopifnot(length(x_groups_unique) == 2) cols_grp1 = x_groups == x_groups_unique[1] cols_grp2 = x_groups == x_groups_unique[2] ### 2) calculate p-values for each protein pval = rep(NA, nrow(x)) for(i in 1:nrow(x)) { i_tt = t.test(x[i,cols_grp1], x[i,cols_grp2], alternative = "two.sided", paired = FALSE, var.equal = FALSE) pval[i] = i_tt$p.value } # log2 fold-change # in MS-DAP, we compute B/A foldchanges (as noted in msdap::setup_contrasts() ) log2FC = rowMeans(x[,cols_grp2,drop=F], na.rm=T) - rowMeans(x[,cols_grp1,drop=F], na.rm=T) ### 3) create a result tibble that contains all columns required for downstream compatability with this pipeline; protein_id, pvalue, qvalue, foldchange.log2, dea_algorithm result = tibble(protein_id = rownames(x), pvalue = pval, qvalue = p.adjust(pval, method = "fdr"), foldchange.log2 = log2FC, # the name of your algorithm. must be a non-empty character string uniquely indicating the name/label of your method # do NOT use names already used by other methods in this pipeline, like 'ebayes' ! dea_algorithm = "my_dea_stats") cat("Example custom DEA implementation, my_dea_stats(), yielded", sum(is.finite(result$qvalue) & result$qvalue<=0.01), "hits at qvalue<=0.01\n") return(result) }
To use a custom function, we just write the name of the R function as a parameter. Do not confuse this with the 'dea_algorithm' name given in the result tibble of your dea function, that is merely the label used in output plots and tables. We remember the LFQbench dataset is a comparison of 2 conditions with 3 replicated measurements each, and apply feature selection rules accordingly.
dataset = analysis_quickstart( dataset, filter_min_detect = 2, # only peptides identified in at least 2 samples per sample group filter_min_quant = 3, # only peptides quantified in at least 3 samples per sample group norm_algorithm = "my_norm", # custom algorithm ! dea_algorithm = c("ebayes", "my_dea_stats"), # we use eBayes for reference, and also apply our custom dea algorithm independently ! dea_qvalue_threshold = 0.01, # adjusted p-value to cutoff 'significant' hits @ dataset$de_proteins column 'signif' (but note we don't use this output column in code below) dea_log2foldchange_threshold = 0, # foldchange threshold for proteins to be 'significant' @ dataset$de_proteins column 'signif' (but note we don't use this output column in code below) diffdetect_min_samples_observed = NA, # in this example, we disable the `differential detection` qualitative analysis output_dir = NA # in this example, we disable the generation of all output files )
Print the number of significant proteins
print(dataset$de_proteins %>% group_by(dea_algorithm) %>% summarise(`1% FDR` = sum(qvalue <= 0.01), `1% FDR AND foldchange threshold` = sum(qvalue <= 0.01 & signif)))
A simple Venn diagram of proteins at 1% FDR for each method shows our custom function identifies a subset of the results from limma's ebayes function, so we verified that the example implementation doesn't yield random stuff.
gplots::venn(list(ebayes=dataset$de_proteins %>% filter(dea_algorithm=="ebayes" & qvalue <= 0.01) %>% pull(protein_id), my_dea_stats=dataset$de_proteins %>% filter(dea_algorithm=="my_dea_stats" & qvalue <= 0.01) %>% pull(protein_id)))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.