Getting started

The MetaboDiff R package requires R version 4.0.2 or higher.

Installing dependencies

CRAN occasionally fails to compile the WGCNA package for Mac OS X. Hence, it is recommended to install the package before installing MetaboDiff.

install.packages("WGCNA")

If asked, install the package from source. Alternatively you might need to download the package from the developer and install the package locally:

install.packages(path_to_file, repos = NULL, type="source")

If you encounter problems installing WGCNA, please refer to the developer page. Please note that MetaboDiff can only be installed if WGCNA was successfully installed.

Installing MetaboDiff

MetaboDiff is available for all operating systems and can be installed via Github

library("devtools")
install_github("andreasmock/MetaboDiff")

and once installed loaded by

library(MetaboDiff)

Part I: Data processing

Input data

MetaboDiff requires three objects as input:

Example data for the tutorial is derived from a study by Priolo and colleagues in which the authors used the service of Metabolon® to compare the tissue metabolome of 61 prostate cancers with 25 normal prostate specimens[^1].

Please make sure that the colnames (samples) of assay correspond to the rownames (patients) of colData and the rownames of assay to the rownames of rowData.

assay[1:5,1:5]
head(colData)
head(rowData)

The function create_mae merges all objects into a so called MultiAssayExperiment object[^2] to simplify all downstream analysis.

(met <- create_mae(assay,rowData,colData))

Metabolite annotation

Metabolite annotation can be retrieved from the Small Molecular Pathway Database (SMPDB) if HMDB, KEGG or ChEBI ids are part of the rowData object [^3].

# Please specify the accoding column in rowData. In the example data set, the KEGG id was in column 6 and the HMBD id in column 7. A ChEBI id was not available.
met <- get_SMPDBanno(met,
                        column_kegg_id=6,
                        column_hmdb_id=7,
                        column_chebi_id=NA)

Imputation of missing values

In contrast to other high-throughput technologies, missing values are common in quantitative metabolomic datasets.

The function na_heatmap visualizes the missing metabolite measurements across the samples. The name of the column in colData for grouping and the label colors for the two groups need to be specified.

na_heatmap(met,
           group_factor="tumor_normal",
           label_colors=c("darkseagreen","dodgerblue"))

The example data supports the need for data imputation (figure \@ref(fig:heat)). Imputation is performed by k-nearest neighbor imputation, which could be shown to minimize the effects on the normality and variance of the data as long as the number of missing data does not exceed 40%[^4].

The function knn_impute adds the slot "impute" to the MultiAssayExperiment object that contains the imputed relative metabolite measurements for all metabolites with raw measurements in more than 60% of cases. We recommend a cutoff of 40% (i.e. 0.4). However the cutoff might be changed according to the discretion of the user.

(met = knn_impute(met,cutoff=0.4))

As apparent form the summary description of the object met 69 metabolites are excluded in the slot imputed due to missing measurements in more than 40% of samples.

Outlier heatmap

Before we normalize the data, we want to exclude putative outliers in the study set. To this end, the function outlier_heatmap is provided. The sample annotation shows the number of missing metabolites per sample as a proxy of the impact of imputation on clustering. To identify outliers, the dendrogram also displays the results of a k-means clustering. In the examplary data we set 2 clusters (k=2; figure \@ref(fig:outlier)).

outlier_heatmap(met,
                group_factor="tumor_normal",
                label_colors=c("darkseagreen","dodgerblue"),
                k=2)

The imputed data of the example study set displays a cluster of 5 samples (cluster 1) with in average lower relative metabolite measurements. Due to the lack of batch information, this cannot be investigated further at this time. To demonstrate how a cluster can be removed, we apply the function remove_cluster to remove cluster 1:

(met <- remove_cluster(met,cluster=1))

As displayed in the summary of the met object, the 5 samples of cluster 1 were successfully removed from the slots "raw" and "imputed".

Normalization

Variance stabilizing normalization (vsn) is used to ensure that the variance remains nearly constant over the measured spectrum[^5].

(met <- normalize_met(met))

At this point the data processing is completed with the MultiAssayExperiment object containing 4 slots:

Quality control of normalization

quality_plot(met,
             group_factor="tumor_normal",
             label_colors=c("darkseagreen","dodgerblue"))

The quality control plots shows the distribution of raw and normalized metabolic measurements for every sample in the study set. As aimed, the distribution of measurements is comparable across the study set for the normalized and imputed data (figure \@ref(fig:quality)D).

Part II: Data analysis

Part II of MetaboDiff comprises a set of functions for comparative data analysis that are purpose-built for the preprocessed MultiAssayExperiment object. All statistics obtained in this part will be saved in the metadata slot of the object. Certain plotting functions will only be available once the corresponding statistics has been run, e.g. you will only be able to plot the volcano plot after executing the function for differential analysis (diff_test). Please note that the data analysis has been developed for metabolomic data sets of more than 100 metabolites. Hence, some functionality might not be available with smaller data sets < 50 metabolites.

Biological questions

The following table summarizes the biological questions that can be answered by MetaboDiff functions.

Question | Statistical Methodology ----------------------------------------- | ---------------------------------- Are there outliers in my data set? | outlier heatmap including k-means clustering Are there metabolome-wide changes between samples? | unsupervised analyses (PCA, tSNE) Which individual metabolites show differential abundance between groups? | hypothesis testing with correction for multiple testing Which metabolic subpathways (i.e. modules) differ between groups? | metabolic correlation network analysis - module significance (MS) plot Which metabolite is most closely related to the individual subpathway (i.e. modules)? | metabolic correlation network analysis - module of interest (MOI) plot

Unsupervised analysis

To explore the metabolomic profiles between samples in an unsupervised fashion, MetaboDiff offers PCA as a linear and tSNE as a non-linear dimension reduction technique.

source("http://peterhaschke.com/Code/multiplot.R")
multiplot(
pca_plot(met,
         group_factor="tumor_normal",
             label_colors=c("darkseagreen","dodgerblue")),
tsne_plot(met,
          group_factor="tumor_normal",
             label_colors=c("darkseagreen","dodgerblue")),
cols=2)

A plot of the first two principal components shows no metabolome-wide difference between the grouping of interest. The first two dimensions in the tSNE plot do not reveal a distinct difference in the metabolomic profiles between the normal and tumor samples.

Hypothesis testing

Differential analysis for individual metabolites is performed with the function diff_test. If two groups are compared the function applies the Student`s T-Test. If there are more than two groups an analysis of variance (ANOVA) is applied. The p-values are corrected for multiple testing by the Benjamini-Hochberg procedure. Hypothesis testing can be simultaneously performed for multiple groups. In our example data, we perform the hypothesis testing for the sample grouping "tumor_normal", as well as for the randomly generated grouping "random_gender".

met = diff_test(met,
                group_factors = c("tumor_normal","random_gender"))

The results of the hypothesis testing is saved in the metadata slot of the MultiAssayExperiment object.

str(metadata(met), max.level=2)

The name of the corresponding slot comprises the name of the group_factor (e.g. tumor_normal), as well as the factor levels of the grouping (N_vs_T). Here, the first factor is always the reference for the difference in means (dm) calculation. A positive difference in means in the grouping T_vs_N is hence a higher abundance in the normal (N) group.

The columns of the result dataframe are the unadjusted p-value (pval), the adjusted p-value by Benjaminni-Hochberg procedure (adj_pval) and the difference in means between groups.

The comparative analysis can be visualized by means of a volcano plot (figure \@ref(fig:volcano1)). The option p_adjust enables to choose if the p-value cutoff counts for unadjusted or adjusted p-values. The argument dm_cutoff sets the cutoff for the absolute difference in means.

par(mfrow=c(1,2))
volcano_plot(met, 
             group_factor="tumor_normal",
             label_colors=c("darkseagreen","dodgerblue"),
             dm_cutoff=0.5,
             p_adjust = FALSE)
volcano_plot(met, 
             group_factor="tumor_normal",
             label_colors=c("darkseagreen","dodgerblue"),
             dm_cutoff=0.5,
             p_adjust = TRUE)

As a sanity check, we also display the volcano plot for the random grouping "random_gender" for which we do not expect the same number of significant metabolites.

par(mfrow=c(1,2))
volcano_plot(met, 
             group_factor="random_gender",
             label_colors=c("brown","orange"),
             p_adjust = FALSE)
volcano_plot(met, 
             group_factor="random_gender",
             label_colors=c("brown","orange"),
             p_adjust = TRUE)

As expected, no metabolite was significantly different between the randomly assigned grouping male vs. female after multiple testing with a cutoff of p-value<0.05 (figure \@ref(fig:volcano2)).

Metabolic correlation network analysis

Implementation

To derive meaningful subpathways that are enriched between groups, MetaboDiff generates a metabolic correlation network. Within MetaboDiff, metabolic correlation network analysis is performed by a set of functions:

For more details about the implementation and interpretation, please refer to the package vignette "Metabolic correlation network analysis".

met <- met %>%
    diss_matrix %>%
    identify_modules(min_module_size=5) %>%
    name_modules(pathway_annotation="SUB_PATHWAY") %>%
    calculate_MS(group_factors=c("tumor_normal","random_gender"))

Association of sample traits with correlation modules

The following plots illustrate the association of the sample trait of interest (tumor vs. normal) with the metabolic correlation modules derived form the data. The option p_adjust enables to choose if the p-value cutoff counts for unadjusted or adjusted p-values.

MS_plot(met,
        group_factor="tumor_normal",
        p_value_cutoff=0.05,
        p_adjust=FALSE)
MS_plot(met,
        group_factor="random_gender",
        p_value_cutoff=0.05,
        p_adjust = FALSE)

In line with the volcano plots, the random grouping did not result in a significant association (figure \@ref(fig:MSplot3)).

Exploration of individual metabolites within correlation module

The relationship of individual metabolites within a correlation module can be explored by means of a module of interest (MOI) plot, where the so called module membership is plotted over the p-value. In short, the module membership tells us, how closely related a metabolite is to the corresponding module and to other metabolites.

MOI_plot(met,
         group_factor="tumor_normal",
         MOI = 2,
         label_colors=c("darkseagreen","dodgerblue"),
         p_adjust = FALSE) + xlim(c(-1,8))

For more information regarding the correlation network methodology, see the vignette Metabolic_correlation_netork_analysis. A full case study applying metabolic correlation network analysis is described in the vignette Case_study.

Session information {.unnumbered}

sessionInfo()

[^1]: Priolo, C., Pyne, S., Rose, J., Regan, E. R., Zadra, G., Photopoulos, C., et al. (2014). AKT1 and MYC Induce Distinctive Metabolic Fingerprints in Human Prostate Cancer. Cancer Research, 74(24), 7198–7204. http://doi.org/10.1158/0008-5472.CAN-14-1490

[^2]: Sig M (2017). MultiAssayExperiment: Software for the integration of multi-omics experiments in Bioconductor. R package version 1.2.1

[^3]: Jewison T, Su Y, Disfany FM, et al. SMPDB 2.0: Big Improvements to the Small Molecule Pathway Database. Nucleic Acids Res. 2014 Jan;42(Database issue):D478-84.

[^4]: Armitage, E. G., Godzien, J., Alonso-Herranz, V., L pez-Gonz lvez, N., & Barbas, C. (2015). Missing value imputation strategies for metabolomics data. Electrophoresis, 36(24), 3050–3060. http://doi.org/10.1002/elps.201500352

[^5]: Huber, W., Heydebreck, von, A., Sültmann, H., Poustka, A., & Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics, 18 Suppl 1, S96–104.

[^6]: Ignatiadis, N., Klaus, B., Zaugg, J. B., & Huber, W. (2016). Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13(7), 577–580. http://doi.org/10.1038/nmeth.3885

[^7]: Langfelder, P., & Horvath, S. (2008). WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics, 9, 559–559. http://doi.org/10.1186/1471-2105-9-559

[^8]: Zheng, C.-H., Yuan, L., Sha, W., & Sun, Z.-L. (2014). Gene differential coexpression analysis based on biweight correlation and maximum clique. BMC Bioinformatics, 15 Suppl 15(Suppl 15), S3. http://doi.org/10.1186/1471-2105-15-S15-S3

[^9]: Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), Article17. http://doi.org/10.2202/1544-6115.1128

[^10]: Langfelder, P., Zhang, B., & Horvath, S. (2008). Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics, 24(5), 719–720. http://doi.org/10.1093/bioinformatics/btm563.

[^11]: Horvath, S., & Dong, J. (2008). Geometric Interpretation of Gene Coexpression Network Analysis. PLoS Computational Biology (PLOSCB) 4(8), 4(8), e1000117–e1000117. http://doi.org/10.1371/journal.pcbi.1000117



andreasmock/MetaboDiff documentation built on Oct. 31, 2020, 8:18 a.m.