enrich-funs: Functions to map and enrich a list of metabolites
In FELLA: Interpretation and enrichment for metabolomics data


Function defineCompounds creates a FELLA.USER object from a list of compounds and a FELLA.DATA object.

Functions runHypergeom, runDiffusion and runPagerank perform an enrichment on a FELLA.USER with the mapped input metabolites (through defineCompounds) and a FELLA.DATA object. They are based on the hypergeometric test, the heat diffusion model and the PageRank algorithm, respectively.

Function enrich is a wrapper that calls, in order: loadKEGGdata (optional), defineCompounds and one or more of runHypergeom, runDiffusion and runPagerank.

defineCompounds(compounds = NULL, compoundsBackground = NULL,
    data = NULL)

runHypergeom(object = NULL, data = NULL, p.adjust = "fdr")

runDiffusion(object = NULL, data = NULL, approx = "normality",
    t.df = 10, niter = 1000)

runPagerank(object = NULL, data = NULL, approx = "normality",
    dampingFactor = 0.85, t.df = 10, niter = 1000)

enrich(compounds = NULL, compoundsBackground = NULL,
    methods = listMethods(), loadMatrix = "none", approx = "normality",
    t.df = 10, niter = 1000, databaseDir = NULL, internalDir = TRUE,
    data = NULL, ...)

`compounds`	Character vector containing the KEGG IDs of the compounds considered as affected
`compoundsBackground`	Character vector containing the KEGG IDs of the compounds that belong to the background. Can be `NULL` for the default background (all compounds)
`data`	FELLA.DATA object
`object`	FELLA.USER object
`p.adjust`	Character passed to the `p.adjust` method
`approx`	Character: "simulation" for Monte Carlo, "normality", "gamma" or "t" for parametric approaches
`t.df`	Numeric value; number of degrees of freedom of the t distribution if the approximation `approx = "t"` is used
`niter`	Number of iterations (permutations) for Monte Carlo ("simulation"), must be a numeric value between 1e2 and 1e5
`dampingFactor`	Numeric value between 0 and 1 (both exclusive), damping factor `d` for PageRank (`page.rank`)
`methods`	Character vector, containing some of: `"hypergeom"`, `"diffusion"`, `"pagerank"`
`loadMatrix`	Character vector to choose if heavy matrices should be loaded. Can contain: `"diffusion"`, `"pagerank"`
`databaseDir`	Character, path to load the `FELLA.DATA` object if it is not already passed through the argument `data`
`internalDir`	Logical, is the directory located in the package directory?
`...`	Further arguments for the enrichment function(s) `runDiffusion`, `runPagerank`

Function defineCompounds maps the specified list of KEGG compounds [Kanehisa, 2017], usually from an experimental metabolomics study, to the graph contained in the FELLA.DATA object. Importantly, the names must be KEGG ids, so other formats (common names, HMDB ids, etc.) must be mapped to KEGG first, for example through the "Compound ID Conversion" tool in MetaboAnalyst [Xia, 2015]. The user can also define a personalised background as a list of KEGG compound ids, which should be more extensive than the list of input metabolites. Once the compounds are mapped, the enrichment can be performed through runHypergeom, runDiffusion and runPagerank.
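The mapping step can be pictured with plain set operations. The following is a minimal base R sketch, not FELLA's implementation: the vector graph.compounds is a made-up stand-in for the compound nodes stored in a FELLA.DATA object, and the variable names are hypothetical.

```r
## Hypothetical compound nodes of a knowledge graph; in FELLA these would
## come from the FELLA.DATA object, here they are made up for illustration
graph.compounds <- c("C00022", "C00024", "C00031", "C00036", "C00042")

## Input list with two identifiers that are absent from the graph
input <- c("C00022", "C00036", "I_dont_map", "me_neither")

## Identifiers found in the graph are kept, the rest are set aside
mapped <- intersect(input, graph.compounds)
excluded <- setdiff(input, graph.compounds)

## An empty mapping leaves nothing to enrich, so stop early
if (length(mapped) == 0) stop("None of the compounds map to the graph")

mapped
excluded
```

In FELLA itself, the accessors getInput and getExcluded play the role of mapped and excluded.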

Function runHypergeom performs an over-representation analysis through the hypergeometric test [Fisher, 1935] on a FELLA.USER object with mapped metabolites and a FELLA.DATA object. If a custom background was specified, it will be used. This approach is included for completeness and is not the main purpose behind the FELLA package. Importantly, runHypergeom is not a hypergeometric test using the original KEGG pathways. Instead, a compound "belongs" to a "pathway" if it can reach the original pathway in the upwards-directed KEGG graph. This is a way to evaluate enrichment including indirect connections to a pathway, e.g. through an enzymatic family. New "pathways" are expected to be larger than the original pathways in this analysis and therefore the results can differ from the standard over-representation analysis.
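For reference, a generic over-representation test can be written in a few lines of base R. This sketch only illustrates the hypergeometric test and the "fdr" correction; the counts are invented, and the way FELLA builds its extended "pathways" is not reproduced here.

```r
## Hypothetical counts: 100 background compounds, 10 of them in the input
n.background <- 100
n.input <- 10
## Hypothetical pathway sizes and their overlap with the input
pathway.size <- c(pathA = 20, pathB = 35, pathC = 8)
overlap <- c(pathA = 6, pathB = 4, pathC = 0)

## P(X >= overlap) when drawing n.input compounds without replacement
p.value <- phyper(
    q = overlap - 1,
    m = pathway.size,
    n = n.background - pathway.size,
    k = n.input,
    lower.tail = FALSE)

## Multiple testing correction, as controlled by the p.adjust argument
p.fdr <- p.adjust(p.value, method = "fdr")

## Same test for pathA through a one-sided Fisher exact test
fisher.pathA <- fisher.test(
    matrix(c(6, 14, 4, 76), nrow = 2),
    alternative = "greater")$p.value
```

The one-sided Fisher exact test on the corresponding 2x2 table gives the same p-value as phyper for pathA.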

Function runDiffusion performs the diffusion-based enrichment on a FELLA.USER object with mapped metabolites and a FELLA.DATA object [Picart-Armada, 2017]. If a custom background was specified, it will be used. The idea behind the heat diffusion is the usage of the finite difference formulation of the heat equation to propagate labels from the metabolites to the rest of the graph.

Following the notation in [Picart-Armada, 2017], the temperatures (diffusion scores) are computed as:

T = -KI^(-1)*G

G is an indicator vector of the input metabolites (1 if input metabolite, 0 otherwise). KI is the matrix defined by -KI = L + B, where L is the unnormalised graph Laplacian and B is the diagonal matrix with B[i,i] = 1 if node i is a pathway and B[i,i] = 0 otherwise.
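A small worked example may help to read these definitions. The sketch below uses a hypothetical four-node undirected graph (two compounds, one reaction, one pathway) and base R only; it is not FELLA's code, which works on the full KEGG graph.

```r
## Hypothetical toy graph: two compounds, one reaction, one pathway
nodes <- c("c1", "c2", "r1", "p1")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
A["c1", "r1"] <- A["r1", "c1"] <- 1
A["c2", "r1"] <- A["r1", "c2"] <- 1
A["r1", "p1"] <- A["p1", "r1"] <- 1

## Unnormalised graph Laplacian L = D - A
L <- diag(rowSums(A)) - A
## B flags the pathway nodes, the only ones allowed to leak heat
B <- diag(c(0, 0, 0, 1))
## G flags the input metabolites; here only c1
G <- c(c1 = 1, c2 = 0, r1 = 0, p1 = 0)

## T = -KI^(-1) * G with -KI = L + B, solved without forming the inverse
temperature <- solve(L + B, G)
temperature
```

The temperatures are 3, 2, 2 and 1 for c1, c2, r1 and p1: the input compound is the warmest node, and the heat leaving through the pathway (B[4,4] times its temperature, i.e. 1) equals the heat injected by G.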

Equivalently, with the notation in the HotNet approach [Vandin, 2011], the stationary temperature is named fs:

fs = Lgamma^(-1)*bs

bs is the indicator vector G from above. Lgamma, on the other hand, is found as Lgamma = L + gamma*I, where L is the unnormalised graph Laplacian, gamma is the first order leaking rate and I is the identity matrix. In our formulation, only the pathway nodes are allowed to leak, therefore I is switched to B. The parameter gamma is set to gamma = 1.

The input metabolites are forced to stay warm, propagating flow to all the nodes in the network. However, only pathway nodes are allowed to evacuate this flow, so that its directionality is bottom-up. Further details on the setup of the diffusion process can be found in the supplementary file S2 from [Picart-Armada, 2017].

Finally, the warmest nodes in the graph are reported as the relevant sub-network. This will probably include some input metabolites and also reactions, enzymes, modules and pathways. Other metabolites can be suggested as well.

Function runPagerank performs the random walk-based enrichment on a FELLA.USER object with mapped metabolites and a FELLA.DATA object. If a custom background was specified, it will be used. PageRank was originally conceived as a scoring system for websites [Page, 1999]. Intuitively, PageRank favours nodes that (1) have a large number of nodes pointing at them, and (2) whose pointing nodes also have high scores. Classical PageRank is formulated in terms of a random walker - the PageRank of a given node is the stationary probability of the walker visiting it.

The walker chooses, in each step, whether to continue the random walk with probability dampingFactor or to restart it with probability 1 - dampingFactor. In the original publication, dampingFactor = 0.85, which is the value used in FELLA by default. If the walker continues, an edge is picked from the outgoing edges of the current node with a probability proportional to its weight. If the walker restarts, a node is uniformly picked from the whole graph. The "personalised PageRank" variant allows a user-defined distribution as the source of new random walks. The R package igraph contains such a variant in its page.rank function [Csardi, 2006].

As described in the supplement S3 from [Picart-Armada, 2017], the PageRank PR can be computed as a column vector by imposing a stationary state in the probability. With a damping factor d and the user-defined distribution p as a column vector:

PR = d*M*PR + (1 - d)*p

M is the matrix whose element M[i,j] is the probability of transitioning from j to i. If node j has outgoing edges, their probability is proportional to their weight - all weights must be positive. If node j has no outgoing edges, the probability is uniform over all the nodes, i.e. M[i,j] = 1/nrow(M) for every i. Note that all the columns of M sum to exactly 1. This leads to an expression to compute PageRank:

PR = (1 - d)*(I - d*M)^(-1)*p

The idea behind the method "pagerank" is closely related to "diffusion". Relevant metabolites are the sources of new random walks and nodes are scored through their PageRank. Specifically, p is set to a uniform probability on the input metabolites. More details on the setup can be found in the supplementary file S3 from [Picart-Armada, 2017].
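The stationary equation PR = d*M*PR + (1 - d)*p can be solved as the linear system (I - d*M)*PR = (1 - d)*p. The following base R sketch does so on a hypothetical four-node directed graph; the graph, node names and input choice are made up, and FELLA itself relies on igraph rather than on this code.

```r
d <- 0.85

## Hypothetical directed graph: A[i, j] = 1 encodes an edge from j to i
nodes <- c("c1", "c2", "r1", "p1")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
A["r1", "c1"] <- 1
A["r1", "c2"] <- 1
A["p1", "r1"] <- 1

## Column-normalise: M[i, j] is the probability of moving from j to i
out.weight <- colSums(A)
M <- sweep(A, 2, ifelse(out.weight > 0, out.weight, 1), "/")
## Nodes without outgoing edges (here p1) jump uniformly to any node
M[, out.weight == 0] <- 1 / nrow(M)

## Personalised restart: uniform over the input metabolites c1 and c2
p <- c(c1 = 0.5, c2 = 0.5, r1 = 0, p1 = 0)

## Solve (I - d*M) PR = (1 - d) p
PR <- solve(diag(nrow(M)) - d * M, (1 - d) * p)
PR
```

Because every column of M sums to 1 and p sums to 1, the resulting scores also sum to 1.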

There is an important detail for "diffusion" and "pagerank": the scores are statistically normalised. Omitting this normalisation leads to a systematic bias, especially in pathway nodes, as described in [Picart-Armada, 2017].

Therefore, in both cases, scores undergo a normalisation through permutation analysis. The score of a node i is compared to its null distribution under input permutation, leading to its p-score. As described in [Picart-Armada, 2017], two alternatives are offered: a parametric and deterministic approach and a non-parametric, stochastic one.

Stochastic Monte Carlo trials ("simulation") imply randomly permuting the input niter times and counting, for each node i, how many trials led to an equally or more extreme value than the original score. An empirical p-value is returned [North, 2002].
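The permutation scheme can be sketched in base R on a hypothetical toy graph with diffusion scores. The (r + 1)/(n + 1) estimator below is the one recommended in [North, 2002]; that FELLA uses exactly this expression is an assumption of the sketch, as are the graph and all variable names.

```r
set.seed(1)

## Hypothetical graph: six compounds, two reactions, two pathways
compounds <- paste0("c", 1:6)
nodes <- c(compounds, "r1", "r2", "p1", "p2")
A <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
edges <- rbind(
    c("c1", "r1"), c("c2", "r1"), c("c3", "r1"),
    c("c4", "r2"), c("c5", "r2"), c("c6", "r2"),
    c("r1", "p1"), c("r2", "p2"))
A[edges] <- 1
A[edges[, 2:1]] <- 1

## Diffusion scores are linear in the input: T = (L + B)^(-1) * G
L <- diag(rowSums(A)) - A
B <- diag(as.numeric(grepl("^p", nodes)))
R <- solve(L + B)
dimnames(R) <- list(nodes, nodes)
score <- function(input) rowSums(R[, input, drop = FALSE])

## Observed scores for the actual input
input <- c("c1", "c2")
observed <- score(input)

## Null scores: random inputs of the same size drawn from the compounds
niter <- 1000
null <- replicate(niter, score(sample(compounds, length(input))))

## Trials with an equally or more extreme score, per node
n.extreme <- rowSums(null >= observed)
p.empirical <- (n.extreme + 1) / (niter + 1)
p.empirical
```

Adding 1 to numerator and denominator keeps the empirical p-value strictly positive, whatever the number of iterations.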

On the other hand, the parametric scores (approx = "normality") give a z-score for such permutation analysis. The expected value and variance of such null distributions are known quantities, see supplementary file S4 from [Picart-Armada, 2017]. To work in the same range [0,1], z-scores are transformed using the routine pnorm. The user can also choose the Student's t using approx = "t" and choosing a number of degrees of freedom through t.df. This uses the function pt instead. Alternatively, a gamma distribution can be used by setting approx = "gamma". The theoretical mean (E) and variance (V) are used to define the shape (E^2/V) and scale (V/E) of the gamma distribution, and pgamma to map to [0,1].
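The three parametric options can be compared on a single node. In this sketch the observed score, null mean E and null variance V are invented numbers, and taking the upper tail (so that warmer-than-expected nodes get small values) is an assumption about the direction of the p-scores rather than a statement about FELLA's internals.

```r
## Hypothetical observed score, null mean and null variance of one node
score <- 2.4
E <- 1.5
V <- 0.25

## z-score of the observed value under the permutation null
z <- (score - E) / sqrt(V)

## approx = "normality": map the z-score to [0, 1] with pnorm
p.norm <- pnorm(z, lower.tail = FALSE)

## approx = "t": same z-score, Student's t with t.df degrees of freedom
t.df <- 10
p.t <- pt(z, df = t.df, lower.tail = FALSE)

## approx = "gamma": moment matching gives shape = E^2/V and scale = V/E
shape <- E^2 / V
scale <- V / E
p.gamma <- pgamma(score, shape = shape, scale = scale, lower.tail = FALSE)

c(normality = p.norm, t = p.t, gamma = p.gamma)
```

The gamma parameters reproduce the null moments by construction, since shape*scale = E and shape*scale^2 = V.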

Any sub-network prioritised by "diffusion" and "pagerank" is selected by applying a threshold on the p-scores.
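As a minimal illustration, with made-up p-scores and a hypothetical threshold of 0.05, and assuming that smaller p-scores indicate more relevant nodes:

```r
## Hypothetical p-scores for five nodes of different categories
p.score <- c(
    C00022 = 0.01, R00200 = 0.03, M00001 = 0.20,
    hsa00010 = 0.04, C00031 = 0.60)
threshold <- 0.05

## Nodes below the threshold form the reported sub-network
selected <- names(p.score)[p.score < threshold]
selected
```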

Finally, the function enrich is a wrapper to perform the enrichment analysis. If no FELLA.DATA object is supplied, it loads it, maps the affected compounds and performs the desired enrichment(s) with a single call. In this case, it returns a list with the loaded FELLA.DATA object and the results in a FELLA.USER object. Conversely, the user can supply the FELLA.DATA object and the wrapper will map the metabolites and run the desired enrichment method(s). In that case, only the FELLA.USER object is returned.

defineCompounds returns the FELLA.USER object with the mapped metabolites, ready to be enriched.

runHypergeom returns a FELLA.USER object updated with the hypergeometric test results.

runDiffusion returns a FELLA.USER object updated with the diffusion enrichment results.

runPagerank returns a FELLA.USER object updated with the PageRank enrichment results.

enrich returns a FELLA.USER object updated with the desired enrichment results if the FELLA.DATA was supplied. Otherwise, it returns a list with the freshly loaded FELLA.DATA object and the corresponding enrichment in the FELLA.USER object.

Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic acids research, 45(D1), D353-D361.

Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0 - making metabolomics more meaningful. Nucleic acids research, 43(W1), W251-W257.

Fisher, R. A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98(1), 39-82.

Picart-Armada, S., Fernandez-Albert, F., Vinaixa, M., Rodriguez, M. A., Aivio, S., Stracker, T. H., Yanes, O., & Perera-Lluna, A. (2017). Null diffusion-based enrichment for metabolomics data. PLOS ONE, 12(12), e0189012.

Vandin, F., Upfal, E., & Raphael, B. J. (2011). Algorithms for detecting significantly mutated pathways in cancer. Journal of Computational Biology, 18(3), 507-522.

Page, L., Brin, S., Motwani, R., & Winograd, T. (1999). The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1-9.

North, B. V., Curtis, D., & Sham, P. C. (2002). A note on the calculation of empirical P values from Monte Carlo procedures. American journal of human genetics, 71(2), 439.

## Load the internal database. 
## This one is a toy example!
## Do not use as a regular database
data(FELLA.sample)
## Load a list of compounds to enrich
data(input.sample)

######################
## Example, step by step

## First, map the compounds
obj <- defineCompounds(
    compounds = c(input.sample, "I_dont_map", "me_neither"),
    data = FELLA.sample)
obj
## See the mapped and unmapped compounds
getInput(obj)
getExcluded(obj)
## Compounds are already mapped 
## We can enrich using any method now

## If no compounds are mapped an error is thrown. Example:
## Not run: 
data(FELLA.sample)
obj <- defineCompounds(
    compounds = c("C00049", "C00050"),
    data = FELLA.sample)
## End(Not run)

## Enrich using hypergeometric test
obj <- runHypergeom(
    object = obj,
    data = FELLA.sample)
obj

## Enrich using diffusion
## Note how the results are added;  
## the hypergeometric results are not overwritten
obj <- runDiffusion(
    object = obj,
    approx = "normality",
    data = FELLA.sample)
obj

## Enrich using PageRank
## Again, this does not overwrite other methods 
obj <- runPagerank(
    object = obj,
    approx = "simulation",
    data = FELLA.sample)
obj

######################
## Example using the "enrich" wrapper

## Only diffusion
obj.wrap <- enrich(
    compounds = input.sample,
    methods = "diffusion",
    data = FELLA.sample)
obj.wrap

## All the methods
obj.wrap <- enrich(
    compounds = input.sample,
    methods = FELLA::listMethods(),
    data = FELLA.sample)
obj.wrap