Introduction to MoDentify {#intro}

Phenotype associations in large-scale, heterogeneous metabolomics data sets can be expected to be substantially complex, spanning functional modules. Functional modules are commonly defined as groups of correlating entities that are functionally coordinated, coregulated, or generally driven by a common biological process [@mitra2013]. Moreover, phenotypes will associate with metabolic modules at different scales, ranging from global associations spanning entire pathways or even sets of pathways, to localized associations with only few metabolites [@Do2017]. MoDentify is an algorithm for the identification of modules at different layers of resolution (both at the fine-grained metabolite level and the more global pathway-level).

Given a network, a phenotype variable, a scoring function, and a seed (starting) node, a greedy search algorithm identifies an optimal module by score maximization. This module is determined by extending candidate modules along its network edges, until no further score improvement can be achieved (see Module identification).

Network inference {#networkinference}

MoDentify searches for phenotype associated modules based on a correlation network. Depending on the chosen resolution level, networks are either inferred using metabolite-metabolite correlations or pathway-pathway correlations. To this end, Gaussian graphical models (GGMs) are estimated using the GeneNet R package. GGMs are based on partial correlations, which represent associations between two variables corrected for all remaining variables in multivariate Gaussian distributions [@Krumsiek2011]. Required covariates such as age, gender, or body mass index (BMI) can be included into the model. Nodes correspond to the variables of the data set (metabolites, or pathways), and edges between nodes are considered if both Pearson correlations and partial correlations are statistically significant with a chosen significance threshold (e.g. at $\alpha$=0.05) after multiple testing correction.

Pathway representation {#pathway}

The pathway network is inferred by calculating a representative variable for each pathway, thereby creating a new data set with pathway representation values for each observations. MoDentify provides two approaches for pathway representation:

1) eigenmetabolite approach: For each pathway a principal component analysis (PCA) or Singular Value Decomposition (SVD) is performed after scaling all variables to a mean of 0 and a variance of 1. The first principal component - also termed eigenmetabolite - is used as a representative value for the entire set of variables in the pathway [@Langfelder2007]. These eigenmetabolites are then subjected to the network inference procedure.

2) average approach: All variables are scaled such that the mean is 0 and the variance is 1. Subsequently, the pathway representative is calculated as the average of all variable values in the given pathway (average z-score).

If pathways can be assumed to be homogeneous (by sharing common biological or biochemical properties), the eigenmetabolite approach can be used. If homogeneity can not be guaranteed, the average approach might be a better choice. In general, it is helpful to explore the explained variances of the eigenmetabolites (See Explained variances). This new data set finally serves as input for the estimation of the pathway GGM.

Module representation and scoring {#module}

Based on the network, a greedy approach is used to search for modules by score maximization. The score of a candidate module $M$ is obtained from the multivariable linear regression model:

$$R_M = \beta_{M,0} + \beta_{M,1} \times P + \sum_{i=1}^{|C|} (\beta_{M,i+1}\times c_i) + \epsilon_M$$

where $R_M$ is the module representative value, $\beta_{M,0}$ is the intercept, $\beta_{M,i}$ is the regression coefficient for the respective independent variable, $P$ is the phenotype of interest, $C$ is the set of covariates $c_i$, and $\epsilon_M$ is a normally distributed error term. The module score is then defined as the negative logarithmized p-value of the coefficient $\beta_{M,1}$, which represents the magnitude of phenotype association. Notably, the score of a single component equals its negative logarithmized p-value from a univariate analysis.

$R_M$, the module representative can either be defined by the eigenmetabolite approach, or by the average approach as described in section Pathway representation. If modules at pathway level should be identified and $M$ consists of multiple pathways, then $R_M$ is calculated using all metabolites in all pathways independent of pathway assignment. Notably, for module representatives we recommend to use the average approach since modules can be highly heterogeneous.

Module identification {#moduleidentification}

Given the scoring function and a seed node, a greedy search procedure is performed to identify an optimal module. The algorithm starts with a seed node as candidate module. The score of the candidate module that contains only the seed node is obtained from the univariate differential analysis. In each iteration, the neighborhood of the candidate module is identified. Each neighbor (candidate) is added to the candidate module and the score of the extended candidate module is calculated with the scoring function as defined in section Module representation and scoring. The candidate node leading to the highest score improvement is added to the candidate module if the module score is higher than the score of each of its single components. The algorithm terminates if no further score improvements can be made, and the optimal module is returned. In an optional consolidation step, overlapping optimal modules, e.g. from different seed nodes, can be combined into one module, which is re-scored by the scoring function. To assess the significance of the modules, a conservative multiple testing correction procedure was used by correcting for the total number of nodes in the underlying network.

knitr::include_graphics(system.file("figures", "MoDentify_scheme.png", package = "MoDentify"))

\newpage

Prerequisites and installation {#prerequisites}

For visualization of the identified modules in the underlying network, the following software is required:

  1. Bioconductor: RCy3
  2. Cytoscape (>= 3.7.0) available at https://cytoscape.org/download.html
  3. The CyREST (>= 3.8.0) plugin for Cytoscape, which is automatically installed with Cytoscape >= 3.3.0

Load MoDentify with:

library(MoDentify)

\newpage

Example data set {#data}

In the following, we show an application of MoDentify on preprocessed metabolomics data for blood, urine, and saliva samples from the Qatar Metabolomics Study on Diabetes (QMDiab), a type 2 diabetes case-control cohort published in several previous publications [@donetwork2015, @mookkanamori2013, @suhreconnecting2017, @yousri2015]. The study was conducted in 2012 at the Dermatology Department of Hamad Medical Corporation and the Weill Cornell Medical College in Doha, Qatar. Untargeted metabolomics measurements (by Metabolon, Inc.) were available for 190 diabetes patients and 184 non-diabetics of Arab and Asian ethnicities aged 17-81 years. After preprocessing and combining data for all three fluids, 1524 (501 plasma, 734 urine, and 289 saliva) metabolites measured for 310 sindividuals were available.

Raw and preprocessed metabolomics data for blood, urine, and saliva, metabolite pathway annotations, and phenotype information on age, gender, body mass index, and type 2 diabetes status are available at the figshare repository via the following link https://figshare.com/s/9145577a1af99f4de480. The preprocessed data, stored in the three data.tables qmdiab.data, qmdiab.annos, and qmdiab.phenos, is part of MoDentify and will be available after installation.

qmdiab.data is a data.frame with metabolites in columns and observations in rows. Each entry corresponds to the preprocessed relative ion counts for the respective metabolite in the respective sample measured by Metabolon, Inc. All variable labels contain the prefixes "P::", "U::", or "S::" indicating metabolites measured in plasma, urine, and saliva, respectively.

# QMDiab metabolomics data
data("qmdiab.data")
knitr::kable(qmdiab.data[1:4,1:3])

qmdiab.annos is a data.frame with metabolites as rows and annotations as columns. We used a priori pathway annotations from Metabolon, Inc. which assigns each metabolite to one "sub-pathway" representing metabolic pathways and biochemical subclasses (e.g. Branched-chain amino acids), and to one "super-pathway" representing the general chemical or functional class (e.g. Amino acids).

# QMDiab annotations
data("qmdiab.annos")
knitr::kable(qmdiab.annos[1:4,2:5])

Finally, qmdiab.phenos is a data.frame containing information about age, gender, body mass index, and type 2 diabetes status for each study participant.

# QMDiab phenotypes
data("qmdiab.phenos")
knitr::kable(qmdiab.phenos[1:3,])

Alternatively, the three data.tables can be generated from the excel file QMDiab_metabolomics_Preprocessed.xlsx stored at the Dryad Data Repository:

source(system.file("scripts", "getQmdiabData.R", package = "MoDentify"))
qmdiab <- getQmdiabData("QMDiab_metabolomics_Preprocessed.xlsx")

\newpage

Module identification at metabolite level {#metlevel}

Network inference

met.graph <- generateNetwork(data = qmdiab.data, 
                            covars = qmdiab.phenos, 
                            annotations = 
                              qmdiab.annos[,c("name","Super.pathway","Sub.pathway")],
                            correlation.type = "partial",
                            alpha = 0.05, 
                            correction.method = "bonferroni")

The inferred network is an igraph object with 1524 nodes and 1945 edges.

summary(met.graph)

Alternatively, a pre-existing network can be used for module identification. Note that all nodes in the network must be available as variables in the input data.

net.DT <- fread(input = system.file("extdata", "qmdiab.network.txt", 
                                    package = "MoDentify"), header = TRUE)

annotations <- fread(input = system.file("extdata", "qmdiab.annotations.txt",
                                         package = "MoDentify"), header = TRUE)

met.graph <- graph_from_data_frame(d = net.DT,
                                   directed = FALSE,
                                   vertices = annotations)

Edges between variable groups

In metabolomics, it has frequently been observed that metabolites from the same pathways tend to correlate with each other, while there are only few correlations across pathways [@donetwork2015, @Krumsiek2011, @mitra2013]. This trend can be explored in the inferred network by investigating the number of edges within and between pathways:

ggEdgesBetweenAttributes(met.graph, "Super.pathway", rm.unknown = TRUE)

Module identification

modules.summary <- identifyModules(graph = met.graph,
                                  data = qmdiab.data,
                                  annotations = qmdiab.annos,
                                  covars = qmdiab.phenos[,1:3],
                                  phenotype = qmdiab.phenos$T2D,
                                  alpha = 0.05,
                                  correction.method ="bonferroni",
                                  representative.method = "average")

Modules can be exported to a file using:

  writeModules(modules.summary, "QMDiab_modules.txt")

Network and module visualization

#- remember to open Cytoscape and activate the CytoscapeRPC plugin
drawModules(graph = met.graph,
             title = "QMDiab_modules",
             summary = modules.summary,
             save.image = FALSE,
             modules.to.draw = NULL)
knitr::include_graphics(system.file("figures", "MoDentify_modules.png", package = "MoDentify"))

\newpage

Module identification at pathway level {#pathwaylevel}

Network inference

pathway.graph <- generatePathwaysNetwork(data = qmdiab.data,
                                         annotations = qmdiab.annos,
                                         level = "Sub.pathway",
                                         correlation.type = "pearson",
                                         alpha = 0.05,
                                         correction.method = "bonferroni",
                                         rm.unknown = TRUE,
                                         representative.method = "eigenmetabolite")

\newpage

Explained variances of eigenmetabolites {#expvar}

Eigenmetabolites are calculated as the first principal components of a PCA. The explained variances can be plotted with:

# plot all pathways
ggExplainedVariance(pathway.graph)

\newpage

# plot the first 50 pathways
ggExplainedVariance(pathway.graph, 1:50)

Module identification

pathway.modules <- identifyModules(graph = pathway.graph$network,
                                            data = qmdiab.data,
                                            covars = qmdiab.phenos[,1:3],
                                            phenotype = qmdiab.phenos$T2D,
                                            level = pathway.graph$level,
                                            annotations = qmdiab.annos,
                                            alpha = 0.05,
                                            correction.method ="bonferroni",
                                            representative.method = "average")

Network and module visualization

drawModules(graph = pathway.graph$network,
             title = "QMDiab_modules_pw",
             nodes = pathway.modules$nodes,
             modules = pathway.modules$modules,
             save.image = FALSE,
             modules.to.draw = NULL)
knitr::include_graphics(system.file("figures", "MoDentify_pathway_modules.png", package = "MoDentify"))

\newpage

Citation

To cite MoDentify, please cite the following publication:

Do KD, Rasp D, Kastenmueller G, Suhre K, Krumsiek J. MoDentify: phenotype-driven module identification in metabolomics networks at different resolutions. Bioinformatics, 2018 https://doi.org/10.1093/bioinformatics/bty650

Session info

Here is the output of sessionInfo() on the system on which this document was compiled running pandoc r rmarkdown::pandoc_version():

sessionInfo()

\newpage {.unnumbered}



krumsiek/MoDentify documentation built on March 25, 2021, 8:32 a.m.