knitr::opts_chunk$set(fig.width=12, fig.height=4, fig.path='./', fig.align='center', echo = TRUE) knitr::opts_knit$set(root.dir = normalizePath('..'))
The metaboGSE package is designed for the integration of transcriptomic data and genome-scale metabolic networks (GSMN) by constructing condition-specific series metabolic sub-networks by means of RNA-seq data and providing a gene set enrichment analysis with the aid of such sub-network series [@tran].
metaboGSE depends on the sybil package and was evaluated with GLPK and COIN-OR Clp solvers via the glpkAPI and clpAPI packages, respectively, available in CRAN^[https://cran.r-project.org
]. The solvers and their R interface API package are not automatically installed along with metaboGSE. The simplex method from GLPK and inibarrier method from Clp were investigated and yielded identical results. glpkAPI simplex was observed to be somehow faster. Other solvers and methods implemented for sybil should also work, however have not been tested. Eventually it should be mentioned that metaboGSE is very demanding for the solvers and may reveal sporadic computational instability, possibly due to the solver compilation and execution environment.
library(metaboGSE) SYBIL_SETTINGS("SOLVER", "glpkAPI") SYBIL_SETTINGS("METHOD", "simplex") SYBIL_SETTINGS("OPT_DIRECTION", "max")
A metabolic network can be imported from tabular or SBML inputs (see sybil's manual for more details). Two Yarrowia lipolytica models in normoxic and hypoxic environments are provided in the iMK735
dataset [@kavscek] and presented in the MetaNetX/MNXref namespace [@moretti]. They are identical apart from the bounds of exchange oxygen flux.
data(iMK735)
iMK735[1]
Normalized (or raw) RNA-seq counts should be provided as a matrix with gene per row and library per column. The exprMaguire
[@maguire] dataset contains two matrices representing log2 voom-normalized count (expr
) and RPKM (pkmExpr
) per library.
data(exprMaguire) names(exprMaguire) str(exprMaguire$expr, vec.len=3)
Here we apply metaboGSE
for Gene Ontology enrichment analysis. A mapping between genes and GO terms should be provided. The yarli2GO
dataset contains such a mapping in a list
format.
data(yarli2GO) length(yarli2GO) str(head(yarli2GO))
The tutorial below shows a simple example to be executed in a reasonable computing time. The datasets of pre-built series of metabolic sub-networks and gene set enrichment [@tran] are also provided.
data(yarliSubmnets) names(yarliSubmnets) data(yarliGSE) length(yarliGSE)
Both models of iMK735 grow, nevertheless we should still apply the rescue procedure to produce the rescue reactions for all metabolites in growth and maintenance (if any) reactions (see [@tran]). In the version 1.2.1, we remove the option to rescue only substrates in these reactions, which helps to reduce the bug of infinite loop in LP solving. For instance, we here perform the rescue process while targeting a growth of 20% of the initial objective value for the hypoxic model.
target.ratio <- 0.2 hmodel <- iMK735$hypoxia$model hobj <- iMK735$hypoxia$obj hmodel.rescue <- rescue(model = hmodel, target = c(target.ratio*hobj), react = c(which(obj_coef(hmodel) == 1)))
In the hmodel.rescue
object, the rescued
field represents the rescued model $\mathcal{M}''$, which is the same as the initial model $\mathcal{M}$ since the latter grows, as presented in Fig 1. The rescue
field presents the rescue model $\mathcal{M}'$, which contains rescue reactions on every metabolite in growth and maintenance reactions, and the coef
field indicates the coefficients of those rescue reactions used for optimization in the rescue process. RECO_PUSH_MNXC3 denotes the compartment of artificial metabolites, e.g. X$'$. The rescue procecure can also be performed on maintenance reactions, which are restricted to non-empty fixed fluxes.
hmodel.rescue$rescue head(hmodel.rescue$coef)
{width=35%}
We set the TOLERANCE to 1e-8, which indicates that values less than 1e-8 are considered as 0, to deal with numerical imprecision.
SYBIL_SETTINGS("TOLERANCE", 1e-08)
The blocked reactions from the rescue models are determined by a flux variability analysis via the fluxVar
function from sybil. Those reactions as well as related genes and metabolites are then removed from the models. mc.cores
can be set appropriately to perform parallel computation of fluxVar
. A high mc.cores
is recommended as fluxVar
is time consuming.
## not run mc.cores <- 10 fva <- multiDel(model=hmodel.rescue$rescue, nProc=mc.cores, todo="fluxVar", fixObjVal=F, del1=react_id(hmodel.rescue$rescue)) reacs.blo <- names(which(setNames(unlist(lapply(fva, blReact)), react_id(hmodel.rescue$rescue)))) hmodel.clean <- rmReact(hmodel.rescue$rescue, reacs.blo, rm_met=T) ##
hmodel.clean
, considered as the comprehensive model for hypoxic conditions, can be loaded from the iMK735
dataset. It is noted that the TOLERANCE set in SYBIL_SETTINGS(“TOLERANCE”) determined the threshold for a reaction to be blocked. It should be adequately selected so that multiple (if any) input models, for example those in hypoxic and normoxic environment, contain the same gene set after cleaning by removing blocked reaction and propagating.
hmodel.clean <- iMK735$hypoxia$comp hmodel.clean
Now we convert the growth objective of the comprehensive model to a weighted objective function on rescue reactions with the determined coefficients. Hereafter, the goal is to minimize this function.
SYBIL_SETTINGS("OPT_DIRECTION", "min") hmodel.weight <- changeObjFunc(hmodel.clean, react=rownames(hmodel.rescue$coef), obj_coef=hmodel.rescue$coef) hmodel.weight optimizeProb(hmodel.weight)
The obtained objective of 0 above indicates that there is no need to rescue the hmodel.clean
since it grows.
We now compute weights for rescue reactions to account for the importance and dependency of metabolites to rescue (see Weighting scheme for model fitness in Tran et al. (2018))
mc.cores <- 1 rescue.weight <- (weightReacts(hmodel.weight, mc.cores=mc.cores, gene.num=1))$weight str(rescue.weight, vec.len=2)
We compute the set of preliminary GO terms in biological process category using topGO
with fisher statistic and weight01 algorithm. The whole GO annotation and gene universe are used. The aim of the following R script is to preliminarily filter the set of GO terms of interest. The resulting 135 GO terms are filtered by p-value < 0.1 and contain at least 3 genes and at most 50 genes from the model.
if (!requireNamespace("topGO", quietly = TRUE)) { print("Please install topGO: BiocManager::install('topGO')") } else { require(topGO) GO2geneID <- inverseList(yarli2GO) length(GO2geneID) str(head(GO2geneID), vec.len=3) gene.name <- names(yarli2GO) gene.list <- factor(as.integer(gene.name %in% sybil::allGenes(hmodel.clean))) names(gene.list) <- gene.name GOdata <- new("topGOdata", ontology = "BP", nodeSize = 5, allGenes = gene.list, annot = annFUN.gene2GO, gene2GO = yarli2GO ) result <- runTest(GOdata, statistic="fisher", algorithm="weight01") table <- GenTable(GOdata, weight = result, orderBy = "weight", numChar = 10000, topNodes = result@geneData[4] ) table$weight <- as.numeric(sub("<", "", table$weight)) table <- table[!is.na(table$weight), ] MINSIG <- 3 MAXSIG <- 50 WCUTOFF <- 0.1 GO.interest <- table[table$Significant >= MINSIG & table$Significant <= MAXSIG & table$weight < WCUTOFF, ]$GO.ID GO2geneID.interest.proteome <- genesInTerm(GOdata, GO.interest) GO2geneID.interest <- lapply(GO2geneID.interest.proteome, function(git) { intersect(sybil::allGenes(hmodel.clean), git) }) length(GO.interest) str(head(GO2geneID.interest), vec.len=3) }
GO.interest contains other GO terms than those in GO2geneID, as topGO
allows propagating in the gene ontology.
step
indicates the difference of gene numbers to remove between consecutive sub-model constructions, then determines numbers of genes to remove in the simulation. Here we set step = 50
and draw.num = 4
to reduce the computing time in this tutorial, i.e. the 0, 50, 100, etc. first genes in certain ranking will be successively removed from the comprehensive model, and 4 random removals will be performed. The series of metabolic sub-networks is constructed for the hypoxic $upc2\Delta$ (UH) condition with various gene rankings as below.
cond <- "UH" step <- 50 draw.num <- 4 reps.i <- grep(cond, colnames(exprMaguire$expr), value=T) ranks <- lapply(reps.i, function(ri) { data.frame( # ranks1. voom-normalized expression expr = exprMaguire$expr[, ri, drop=T], # ranks2. pkm normalized expression pkmExpr = exprMaguire$pkmExpr[, ri, drop=T], # ranks3. relative expression power 1 relExpr1 = relativeExpr(exprMaguire$expr, power=1)[, ri, drop=T], # ranks4. relative expression power 2 relExpr2 = relativeExpr(exprMaguire$expr, power=2)[, ri, drop=T], # ranks5. relative expression power 3 relExpr3 = relativeExpr(exprMaguire$expr, power=3)[, ri, drop=T], # ranks6. reverse expression (the worst) revExpr = 1/(1 + exprMaguire$expr[, ri, drop=T]), # ranks7. z-score expression zExpr = zscoreExpr(exprMaguire$expr)[, ri, drop=T] ) }) names(ranks) <- reps.i fitnessUH <- fitness(model = hmodel.weight, ranks = ranks, rescue.weight = rescue.weight, step = step, draw.num = draw.num, mc.cores = mc.cores) submnetsUH <- submnet(model = hmodel.weight, fn = fitnessUH, rank.best = "expr", gene.sets = list("GO:0006696"= c("euk:Q6C8C2_YARLI","euk:ERG6_YARLI","euk:Q6C231_YARLI", "euk:Q6CFB6_YARLI","euk:Q6C6W3_YARLI","euk:Q6C8J1_YARLI", "euk:Q6CB38_YARLI","euk:Q6CEF6_YARLI","euk:ERG27_YARLI", "euk:Q6CGM4_YARLI","euk:FDFT_YARLI","euk:F2Z6C9_YARLI", "euk:Q6C5R8_YARLI","euk:Q6C704_YARLI","euk:Q6CDK2_YARLI", "euk:Q6CFP4_YARLI","euk:Q6BZW0_YARLI","euk:Q6C2X2_YARLI", "euk:Q6C6U3_YARLI")), mc.cores = mc.cores)
It may happen that several genes have identical values for their expression in a sample, which triggers a random ranking among these genes, and thus introduces an unexpected difference among samples. To overcome such a randomness, a negligible value could be added to the expression of each gene in each sample using the overall expression in all the samples.
## not run pseudo.rank <- base::rank(rowSums(exprMaguire$expr), ties.method='first')/nrow(exprMaguire$expr)*1e-6 exprMaguire$expr <- exprMaguire$expr + pseudo.rank ##
ranks
can be set to the only expression you want to use, e.g. expr
, and draw.num
set to 0 to skip the ranking evaluation step. It is observed that the optimizeProb
function from sybil may stay in an infinite loop with some system configuration. A timeout limit should be given in the fitness
function of metaboGSE. It helps to stop the occasional infinite loop in optimizeProb
. Such a limit is set by default to 12 seconds, which is sufficient for the model iMK735 from Y. lipolytica and iMM1415 from mouse. A higher limit may be required for a larger model.
submnetsUH$condition knitr::kable(submnetsUH$gene.del) knitr::kable(submnetsUH$fitness.random, digits=3) knitr::kable(submnetsUH$fitness.ranks$UH1, digits=3)
The yarliSubmnets
dataset contains the series of sub-networks built with step = 1
and draw.num = 50
, indicating the gene-by-gene removal.
data(yarliSubmnets) str(yarliSubmnets$UH$gene.del) dim(yarliSubmnets$UH$fitness.random) str(yarliSubmnets$UH$fitness.ranks)
The sub-network construction can be visualized via the simulateSubmnet
function, which produces a plot for each condition. Figures 2 and 3 show the fitness of submodels obtained by removing genes following different rankings for the hypoxic $upc2\Delta$ condition.
## not run simulateSubmnet(sgd=submnetsUH) ##
{width=50%}
{width=50%}
We evaluate the significance of given gene sets with the metaboGSE
function and randomization tests. nrand = 1000
is used in the test for the significance of the gene sets against random sets in each individual condition. nperm = 1000
is used in the test for the significance of difference between conditions.
## not run GSE <- metaboGSE(yarliSubmnets, method="perm", nperm=1000, nrand=1000, mc.cores=mc.cores, prefix="/tmp/summary") ##
This step is time consuming. The yarliGSE
dataset can be loaded instead.
data(yarliGSE) GSE <- yarliGSE str(GSE[["GO:0006696"]], vec.len=2) GSE[["GO:0006696"]]$res$p.Val
GS.sig.all <- as.data.frame(t(sapply(GSE, function(gsm) { c(GS.ID=gsm$res$GS.ID, Description=gsm$res$Description, Statistic=gsm$res$statistic, p.Cond=if (is.null(gsm$res$p.Cond)) NA else min(gsm$res$p.Cond), p.Val=gsm$res$p.Val) })), stringsAsFactors=F) GS.sig.all$FDR <- p.adjust(as.numeric(GS.sig.all$p.Val), method="BH") GS.sig.all <- GS.sig.all[!is.na(GS.sig.all$FDR), ] dim(GS.sig.all)
GS.sig <- GS.sig.all[as.numeric(GS.sig.all$FDR) < 0.05, , drop=F] head(GS.sig[order(as.numeric(GS.sig$Statistic), decreasing=T), ])
REFERENCES
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.