Automatic Functional Enrichment on Signature Genes
In cola: A Framework for Consensus Partitioning

library(cola)

If the matrix rows correspond to genes (e.g. a gene expression matrix, or methylation array data where CpG sites can be annotated to the transcription start sites of genes), cola can perform functional enrichment on the signature genes with the functional_enrichment() function, which relies on the clusterProfiler, DOSE or ReactomePA packages.

We first demonstrate the use of functional_enrichment() with the TCGA GBM dataset. In the following example code, TCGA_GBM_subgroup.rds is generated by the code demonstrated here. Instead of rerunning that analysis, we download the result file that has already been generated.

download.file("https://jokergoo.github.io/cola_examples/TCGA_GBM/TCGA_GBM_subgroup.rds", 
    destfile = "TCGA_GBM_subgroup.rds", quiet = TRUE)
rl = readRDS("TCGA_GBM_subgroup.rds")
file.remove("TCGA_GBM_subgroup.rds")

We select the result from a single method, ATC:skmeans:

library(cola)
res = rl["ATC:skmeans"]
res

We check what the signature genes look like under the 4-group classification:

set.seed(123)
get_signatures(res, k = 4)

Rows are split into four groups with different expression patterns among samples. The functional enrichment will be applied to genes in each row-cluster.

Before applying functional enrichment, it is important to check the gene ID type used in the input matrix. The helper function rownames() directly returns the row names of the matrix stored in res.

head(rownames(res))

The gene IDs are gene symbols. For all enrichment analyses provided by clusterProfiler, DOSE or ReactomePA, the core ID type is Entrez ID, so we need to convert from gene symbols to Entrez IDs.

To make this easy, cola automatically tests the gene ID type and recognizes three of them: Ensembl ID, RefSeq ID and gene symbol, which cover most analyses. If your gene ID type is one of these three, simply run functional_enrichment() on res and specify only the number of subgroups.

lt = functional_enrichment(res, k = 4)

# Reuse a cached result when available; otherwise run the enrichment and cache it.
if(file.exists("lt_functional_enrichment_TCGA_GBM.rds")) {
    lt = readRDS("lt_functional_enrichment_TCGA_GBM.rds")
} else {
    lt = functional_enrichment(res, k = 4)
    saveRDS(lt, "lt_functional_enrichment_TCGA_GBM.rds", compress = "xz")
}
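The automatic ID type detection described above can be pictured with a small sketch. The helper guess_gene_id_type() below is hypothetical and is not cola's implementation; the regular expressions and the 50% threshold are assumptions chosen for illustration. It only shows the general idea of classifying IDs by their format.

```r
# Hypothetical helper, NOT cola's internal code: guess the gene ID type of a
# character vector from simple format patterns.
guess_gene_id_type = function(ids, min_fraction = 0.5) {
    stopifnot(is.character(ids), length(ids) > 0)
    patterns = c(
        ENSEMBL = "^ENS[A-Z]*G[0-9]+(\\.[0-9]+)?$",  # e.g. ENSG00000141510
        REFSEQ  = "^[NX][MR]_[0-9]+(\\.[0-9]+)?$"    # e.g. NM_000546
    )
    # Fraction of IDs that match each pattern
    frac = sapply(patterns, function(p) mean(grepl(p, ids)))
    if(max(frac) >= min_fraction) {
        names(frac)[which.max(frac)]
    } else {
        "SYMBOL"  # assumption: everything else is treated as a gene symbol
    }
}

type_ensembl = guess_gene_id_type(c("ENSG00000141510", "ENSG00000012048.5"))
type_refseq  = guess_gene_id_type(c("NM_000546", "NM_007294.4"))
type_symbol  = guess_gene_id_type(c("TP53", "BRCA1", "EGFR"))
```

A format-based guess like this cannot tell a gene symbol from an arbitrary probe ID, which is one reason an explicit mapping is needed for other ID types.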

By default, functional_enrichment() runs enrichment on the Gene Ontology biological process (BP) ontology. The ontology argument can be set as follows:

BP/MF/CC: the org_db argument should be set to the corresponding database, such as "org.Hs.eg.db".
KEGG: the organism argument should be set to the corresponding species abbreviation, such as "hsa".
DO: only works for human.
MSigDB: only works for human. The path of the gmt file should be specified with the gmt_file argument. Only use gmt files in which genes are annotated with Entrez IDs.
Reactome: the organism argument should be set to the corresponding species, such as "human".

ontology can be set as a vector of multiple ontologies.

The value of lt is a list of data frames, one for each combination of ontology and k-means row group. Because k-means clustering of the rows was already applied in the earlier get_signatures() call, its result is stored in the res object and functional_enrichment() reuses that grouping directly.

names(lt)
head(lt[[1]])
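Because the result is a plain list of data frames, it can be post-processed with ordinary list tools. The sketch below uses a made-up stand-in, lt_toy, instead of the real lt. The list names, term IDs and values are invented, and the p.adjust column is assumed to be present, following the column naming that clusterProfiler uses for adjusted p-values.

```r
# Toy stand-in for `lt`; names, term IDs and p-values are made up.
lt_toy = list(
    group_A = data.frame(
        ID = c("term_1", "term_2", "term_3"),
        p.adjust = c(0.001, 0.20, 0.03),
        stringsAsFactors = FALSE),
    group_B = data.frame(
        ID = c("term_1", "term_4"),
        p.adjust = c(0.60, 0.01),
        stringsAsFactors = FALSE)
)

# Keep the significant terms in each data frame, most significant first
sig = lapply(lt_toy, function(df) {
    df = df[df$p.adjust < 0.05, , drop = FALSE]
    df[order(df$p.adjust), , drop = FALSE]
})

# Number of significant terms per list element
n_sig = sapply(sig, nrow)
n_sig
```

The 0.05 cutoff is only an example; any threshold can be substituted.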

If the gene ID type is not Ensembl ID, RefSeq ID or gene symbol, you need to provide a named vector that maps your IDs to Entrez IDs.

In the following example, we demonstrate how to set the ID mapping properly, using the Golub leukemia dataset. The result file has already been generated and is integrated into the cola package.

data(golub_cola)

To simplify, we only take the result from one method:

res = golub_cola["ATC:skmeans"]
head(rownames(res))
set.seed(123)
get_signatures(res, k = 3)

The Golub leukemia dataset is a microarray dataset in which the gene IDs are probe IDs. Fortunately, Bioconductor already provides an annotation package (hu6800.db) that maps probe IDs to Entrez IDs.

library(hu6800.db)
x = hu6800ENTREZID
mapped_probes = mappedkeys(x)
id_mapping = unlist(as.list(x[mapped_probes]))
head(id_mapping)

Proportion of probe IDs that can be mapped:

sum(!is.na(id_mapping[rownames(res)]))/nrow(res)

As you can see, the format of id_mapping is simple: the names of the vector are the probe IDs and the values are the Entrez IDs. We can pass this vector directly to the id_mapping argument.

lt = functional_enrichment(res, k = 3, id_mapping = id_mapping)
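To make the expected shape of id_mapping concrete without loading an annotation package, here is a minimal sketch with invented probe IDs and invented Entrez IDs. The names id_mapping_toy and row_ids are illustrative only. It mirrors the mapped-proportion check shown earlier.

```r
# Made-up probe IDs (names) and Entrez IDs (values), for illustration only
id_mapping_toy = c(probe_1 = "1001", probe_2 = "1002", probe_4 = "1004")

# Stand-in for rownames(res); probe_3 has no mapping
row_ids = c("probe_1", "probe_2", "probe_3", "probe_4")

# Indexing by name returns NA for probes that are missing from the mapping
entrez = id_mapping_toy[row_ids]

# Proportion of rows that can be mapped to an Entrez ID
mapped_fraction = mean(!is.na(entrez))
mapped_fraction

# Entrez IDs that would be available for the enrichment
entrez_unique = unique(unname(entrez[!is.na(entrez)]))
```

Any source of annotation works as long as the final object is a named character vector of this form.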

functional_enrichment() can also be applied to two other classes of objects:

A ConsensusPartitionList object, which is generated by the run_all_partition_methods() function. The result is a list (one element per method) of lists (one element per ontology) of data frames.
A plain vector of gene IDs. The result is a data frame.
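The nested result for a ConsensusPartitionList object can be summarized with two levels of sapply(). The object below is a toy stand-in with invented content; the method names are used only as labels, and the inner names are simplified to bare ontology names, which may differ from the names that functional_enrichment() actually returns.

```r
# Toy stand-in for the list (methods) of lists (ontologies) of data frames
nested = list(
    "ATC:skmeans" = list(
        BP = data.frame(ID = c("term_1", "term_2"), stringsAsFactors = FALSE),
        MF = data.frame(ID = "term_3", stringsAsFactors = FALSE)),
    "SD:hclust" = list(
        BP = data.frame(ID = character(0), stringsAsFactors = FALSE),
        MF = data.frame(ID = c("term_4", "term_5", "term_6"), stringsAsFactors = FALSE))
)

# Rows are ontologies, columns are methods, cells are numbers of enriched terms
n_terms = sapply(nested, function(per_method) sapply(per_method, nrow))
n_terms
```

A summary matrix like this gives a quick view of which methods yield signatures with any enriched terms at all.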