metaboGSE: Gene Set Enrichment Analysis via Integration of Metabolic Networks and RNA-Seq Data

knitr::opts_chunk$set(fig.width=12, fig.height=4, fig.path='./', fig.align='center', echo = TRUE)
knitr::opts_knit$set(root.dir = normalizePath('..'))

1  Introduction

The metaboGSE package is designed for the integration of transcriptomic data and genome-scale metabolic networks (GSMN) by constructing condition-specific series metabolic sub-networks by means of RNA-seq data and providing a gene set enrichment analysis with the aid of such sub-network series [@tran].

2  Installation

metaboGSE depends on the sybil package and was evaluated with GLPK and COIN-OR Clp solvers via the glpkAPI and clpAPI packages, respectively, available in CRAN^[]. The solvers and their R interface API package are not automatically installed along with metaboGSE. The simplex method from GLPK and inibarrier method from Clp were investigated and yielded identical results. glpkAPI simplex was observed to be somehow faster. Other solvers and methods implemented for sybil should also work, however have not been tested. Eventually it should be mentioned that metaboGSE is very demanding for the solvers and may reveal sporadic computational instability, possibly due to the solver compilation and execution environment.

3  Usage

3.1  Sybil settings


3.2  Data preparation

3.2.1  Metabolic networks

A metabolic network can be imported from tabular or SBML inputs (see sybil's manual for more details). Two Yarrowia lipolytica models in normoxic and hypoxic environments are provided in the iMK735 dataset [@kavscek] and presented in the MetaNetX/MNXref namespace [@moretti]. They are identical apart from the bounds of exchange oxygen flux.


3.2.2  RNA-seq data

Normalized (or raw) RNA-seq counts should be provided as a matrix with gene per row and library per column. The exprMaguire [@maguire] dataset contains two matrices representing log2 voom-normalized count (expr) and RPKM (pkmExpr) per library.

str(exprMaguire$expr, vec.len=3)

3.2.3  Gene set annotation

Here we apply metaboGSE for Gene Ontology enrichment analysis. A mapping between genes and GO terms should be provided. The yarli2GO dataset contains such a mapping in a list format.


3.2.4  Pre-built data

The tutorial below shows a simple example to be executed in a reasonable computing time. The datasets of pre-built series of metabolic sub-networks and gene set enrichment [@tran] are also provided.


3.3  Analysis

3.3.1  Model rescuing

Both models of iMK735 grow, nevertheless we should still apply the rescue procedure to produce the rescue reactions for all metabolites in growth and maintenance (if any) reactions (see [@tran]). In the version 1.2.1, we remove the option to rescue only substrates in these reactions, which helps to reduce the bug of infinite loop in LP solving. For instance, we here perform the rescue process while targeting a growth of 20% of the initial objective value for the hypoxic model.

target.ratio <- 0.2
hmodel <- iMK735$hypoxia$model
hobj <- iMK735$hypoxia$obj
hmodel.rescue <- rescue(model = hmodel,
                        target = c(target.ratio*hobj),
                        react = c(which(obj_coef(hmodel) == 1)))

In the hmodel.rescue object, the rescued field represents the rescued model $\mathcal{M}''$, which is the same as the initial model $\mathcal{M}$ since the latter grows, as presented in Fig 1. The rescue field presents the rescue model $\mathcal{M}'$, which contains rescue reactions on every metabolite in growth and maintenance reactions, and the coef field indicates the coefficients of those rescue reactions used for optimization in the rescue process. RECO_PUSH_MNXC3 denotes the compartment of artificial metabolites, e.g. X$'$. The rescue procecure can also be performed on maintenance reactions, which are restricted to non-empty fixed fluxes.


Schema of GSMN rescue process. $\mathcal{M}$, original GSMN with growth reaction X + Y ---> Z + Biomass. $\mathcal{M}'$, expanded GSMN with the full set of rescue ($r_x$) and help ($h_x$) reactions for every metabolite x in the biomass reaction. $\mathcal{M}''$, example of a minimal rescued GSMN in the particular case where only metabolite Y needs to be rescued. [@tran]{width=35%}

3.3.2  Model cleaning

We set the TOLERANCE to 1e-8, which indicates that values less than 1e-8 are considered as 0, to deal with numerical imprecision.


The blocked reactions from the rescue models are determined by a flux variability analysis via the fluxVar function from sybil. Those reactions as well as related genes and metabolites are then removed from the models. mc.cores can be set appropriately to perform parallel computation of fluxVar. A high mc.cores is recommended as fluxVar is time consuming.

## not run
mc.cores <- 10
fva <- multiDel(model=hmodel.rescue$rescue,
reacs.blo <- names(which(setNames(unlist(lapply(fva, blReact)), 
hmodel.clean <- rmReact(hmodel.rescue$rescue, reacs.blo, rm_met=T)

hmodel.clean, considered as the comprehensive model for hypoxic conditions, can be loaded from the iMK735 dataset. It is noted that the TOLERANCE set in SYBIL_SETTINGS(“TOLERANCE”) determined the threshold for a reaction to be blocked. It should be adequately selected so that multiple (if any) input models, for example those in hypoxic and normoxic environment, contain the same gene set after cleaning by removing blocked reaction and propagating.

hmodel.clean <- iMK735$hypoxia$comp

Now we convert the growth objective of the comprehensive model to a weighted objective function on rescue reactions with the determined coefficients. Hereafter, the goal is to minimize this function.

hmodel.weight <- changeObjFunc(hmodel.clean, react=rownames(hmodel.rescue$coef), 

The obtained objective of 0 above indicates that there is no need to rescue the hmodel.clean since it grows.

3.3.4  Weighting scheme

We now compute weights for rescue reactions to account for the importance and dependency of metabolites to rescue (see Weighting scheme for model fitness in Tran et al. (2018))

mc.cores <- 1
rescue.weight <- (weightReacts(hmodel.weight, mc.cores=mc.cores, gene.num=1))$weight
str(rescue.weight, vec.len=2)

3.3.4  GO annotation

We compute the set of preliminary GO terms in biological process category using topGO with fisher statistic and weight01 algorithm. The whole GO annotation and gene universe are used. The aim of the following R script is to preliminarily filter the set of GO terms of interest. The resulting 135 GO terms are filtered by p-value < 0.1 and contain at least 3 genes and at most 50 genes from the model.

if (!requireNamespace("topGO", quietly = TRUE)) {
    print("Please install topGO: BiocManager::install('topGO')")
} else {
    GO2geneID <- inverseList(yarli2GO)
    str(head(GO2geneID), vec.len=3) <- names(yarli2GO)
    gene.list <- factor(as.integer( %in% sybil::allGenes(hmodel.clean)))
    names(gene.list) <-
    GOdata <- new("topGOdata",
                  ontology = "BP",
                  nodeSize = 5,
                  allGenes = gene.list,
                  annot    = annFUN.gene2GO,
                  gene2GO  = yarli2GO
    result <- runTest(GOdata, statistic="fisher", algorithm="weight01")
    table  <- GenTable(GOdata,
                       weight   = result,
                       orderBy  = "weight",
                       numChar  = 10000,
                       topNodes = result@geneData[4]
    table$weight <- as.numeric(sub("<", "", table$weight))
    table <- table[!$weight), ]
    MINSIG <- 3
    MAXSIG <- 50
    WCUTOFF <- 0.1
    GO.interest <- table[table$Significant >= MINSIG & table$Significant <= MAXSIG & 
                         table$weight < WCUTOFF, ]$GO.ID
    GO2geneID.interest.proteome <- genesInTerm(GOdata, GO.interest)
    GO2geneID.interest <- lapply(GO2geneID.interest.proteome, function(git) {
        intersect(sybil::allGenes(hmodel.clean), git)
    str(head(GO2geneID.interest), vec.len=3)

GO.interest contains other GO terms than those in GO2geneID, as topGO allows propagating in the gene ontology.

3.3.5  Expression-based gene removal

step indicates the difference of gene numbers to remove between consecutive sub-model constructions, then determines numbers of genes to remove in the simulation. Here we set step = 50 and draw.num = 4 to reduce the computing time in this tutorial, i.e. the 0, 50, 100, etc. first genes in certain ranking will be successively removed from the comprehensive model, and 4 random removals will be performed. The series of metabolic sub-networks is constructed for the hypoxic $upc2\Delta$ (UH) condition with various gene rankings as below.

cond <- "UH"
step <- 50
draw.num <- 4
reps.i <- grep(cond, colnames(exprMaguire$expr), value=T)
ranks <- lapply(reps.i, function(ri) {
        # ranks1. voom-normalized expression
        expr = exprMaguire$expr[, ri, drop=T],
        # ranks2. pkm normalized expression
        pkmExpr  = exprMaguire$pkmExpr[, ri, drop=T], 
        # ranks3. relative expression power 1
        relExpr1 = relativeExpr(exprMaguire$expr, power=1)[, ri, drop=T], 
        # ranks4. relative expression power 2
        relExpr2 = relativeExpr(exprMaguire$expr, power=2)[, ri, drop=T],  
        # ranks5. relative expression power 3
        relExpr3 = relativeExpr(exprMaguire$expr, power=3)[, ri, drop=T], 
        # ranks6. reverse expression (the worst)
        revExpr  = 1/(1 + exprMaguire$expr[, ri, drop=T]),                 
        # ranks7. z-score expression
        zExpr    = zscoreExpr(exprMaguire$expr)[, ri, drop=T]              
names(ranks) <- reps.i
fitnessUH <- fitness(model         = hmodel.weight,
                     ranks         = ranks,
                     rescue.weight = rescue.weight,
                     step          = step,
                     draw.num      = draw.num,
                     mc.cores      = mc.cores)
submnetsUH <- submnet(model        = hmodel.weight,
                      fn           = fitnessUH,
                = "expr",
                      gene.sets    = list("GO:0006696"=
                      mc.cores     = mc.cores)

It may happen that several genes have identical values for their expression in a sample, which triggers a random ranking among these genes, and thus introduces an unexpected difference among samples. To overcome such a randomness, a negligible value could be added to the expression of each gene in each sample using the overall expression in all the samples.

## not run
pseudo.rank <- base::rank(rowSums(exprMaguire$expr), 
exprMaguire$expr <- exprMaguire$expr + pseudo.rank

ranks can be set to the only expression you want to use, e.g. expr, and draw.num set to 0 to skip the ranking evaluation step. It is observed that the optimizeProb function from sybil may stay in an infinite loop with some system configuration. A timeout limit should be given in the fitness function of metaboGSE. It helps to stop the occasional infinite loop in optimizeProb. Such a limit is set by default to 12 seconds, which is sufficient for the model iMK735 from Y. lipolytica and iMM1415 from mouse. A higher limit may be required for a larger model.

knitr::kable(submnetsUH$fitness.random, digits=3)
knitr::kable(submnetsUH$fitness.ranks$UH1, digits=3)

The yarliSubmnets dataset contains the series of sub-networks built with step = 1 and draw.num = 50, indicating the gene-by-gene removal.


The sub-network construction can be visualized via the simulateSubmnet function, which produces a plot for each condition. Figures 2 and 3 show the fitness of submodels obtained by removing genes following different rankings for the hypoxic $upc2\Delta$ condition.

## not run    

submnetsUH with step = 50 and draw.num = 4{width=50%}

yarliSubmnets with step = 1 and draw.num = 50{width=50%}

3.3.6  GO term enrichment

We evaluate the significance of given gene sets with the metaboGSE function and randomization tests. nrand = 1000 is used in the test for the significance of the gene sets against random sets in each individual condition. nperm = 1000 is used in the test for the significance of difference between conditions.

## not run
GSE <- metaboGSE(yarliSubmnets, method="perm", nperm=1000, nrand=1000, 
                 mc.cores=mc.cores, prefix="/tmp/summary")

This step is time consuming. The yarliGSE dataset can be loaded instead.

GSE <- yarliGSE
str(GSE[["GO:0006696"]], vec.len=2)
GS.sig.all <-, function(gsm) {
      p.Cond=if (is.null(gsm$res$p.Cond)) NA else min(gsm$res$p.Cond),
  })), stringsAsFactors=F)
GS.sig.all$FDR  <- p.adjust(as.numeric(GS.sig.all$p.Val), method="BH")
GS.sig.all <- GS.sig.all[!$FDR), ]
GS.sig <- GS.sig.all[as.numeric(GS.sig.all$FDR) < 0.05, , drop=F]
head(GS.sig[order(as.numeric(GS.sig$Statistic), decreasing=T), ])


Try the metaboGSE package in your browser

Any scripts or data that you put into this service are public.

metaboGSE documentation built on Oct. 23, 2020, 8:14 p.m.