NetGen: Network-based Generative Model

NetGen is a network-based generative model for functional enrichment analysis. We first load the required packages

library(CopTea)
options(scipen=0)

In this example, we use a small subset of GO annotation database, which contains 300 biological processes categories.

load("GO_BP_300.RData")

The annotation data is stored in a matrix with dimensions:

dim(annotation)

which indicates there are total 6447 genes and 300 GO terms in this annotation dataset.

Next, we load the protein-protein interaction (PPI) dataset:

load("PPI.RData")   

The PPI network is given by its adjacent matrix as follows:

dim(adj_matrix)

We simulate the gene list of interest:

load("active_gene.RData")

The list consists of 84 active genes which are derived from the true categories as follows.

True_Categories <- c("GO:0019614", "GO:1903249", "GO:2000506", "GO:0015985", "GO:0071962")

Mode1: fixed parameter strategy

In the first part of this example, we try to identify the most enriched categories using a given parameter setting.

Enriched_Categories <- netgen(annotation, adj_matrix, active_gene,
                              p1 = 0.8, p2 = 0.1, q = 0.001, alpha = 5, trace=TRUE)

The most enriched categories identified by NetGen are

Enriched_Categories

and the false negative categories are

setdiff(True_Categories, Enriched_Categories[,1])

Mode2: mixed parameter strategy

Instead of using a fixed parameter setting, we can run NetGen with several different parameter settings, and then select the result of highest enrichment significance.

p1 <- c(0.5, 0.8)
p2 <- c(0.1, 0.3)
q  <- 0.001
Enriched_Categories <- netgen(annotation, adj_matrix, active_gene,
                              p1, p2, q, alpha = 3, trace=FALSE)

The combined p-values of mixed parameter strategy are

Enriched_Categories$Term_combined_pvalue

And the most enriched categories and its corresponding parameter combination are

Enriched_Categories$mix_result[which.min(Enriched_Categories$Term_combined_pvalue)]

CEA: Combination-based Enrichment Analysis model

CEA is a novel combination-based method for gene set functional enrichment analysis. It is based on a multi-objective optimization framework, and the adapted IMPROVED GREEDY algorithm was used to approximatively solve the problem.

We first load the required packages

library(CopTea)
options(scipen=0)

In this example, we use the same GO annotation database and the active gene list.

load("GO_BP_300.RData")
load("active_gene.RData")

The list consists of 84 active genes which are derived from the true categories as follows:

True_Categories <- c("GO:0019614", "GO:1903249", "GO:2000506", "GO:0015985", "GO:0071962")

Note that, not all the active genes are annotated in the annotation matrix.

sum(active_gene %in% rownames(annotation))

We use the CEA function to identify the most enriched catgories.

Enrich_result <- CEA(annotation, active_gene, d = 0, times = 5, trace = TRUE)

The result contains the following components:

names(Enrich_result)

For example, we select the most enriched 5 category sets as the final outputs. These are:

Enrich_result$category[1:5]

The related Fisher's exact test p-values and coverages are:

Enrich_result$p.values[1:5]
Enrich_result$coverage[1:5]

The false negative categories in the first category set are

setdiff(True_Categories, Enrich_result$category[[1]])

We can obtain a more enriched result by setting a larger tolerance parameter d as:

Enrich_result <- CEA(annotation, active_gene, d = 1, times = 500, trace = FALSE)

The Fisher's exact test p-value of the most enriched category is:

Enrich_result$p.values[1]


wulingyun/CopTea documentation built on Dec. 4, 2019, 2:59 p.m.