knitr::opts_chunk$set(comment="", fig.align="center", fig.width=8.75, cache=FALSE) library(tidyverse) devtools::load_all(".")
When dealing with hundreds of genesets, it's often useful to understand the relationships between them. This allows researchers to summarize many enriched pathways as more general biological processes. To do this, we rely on curated relationships defined between them. For example, Reactome conveniently defines their genesets in a hiearchy of pathways. This data can be formatted into a relational genesets object called rgsets
.
genesets <- hyperdb_rgsets("REACTOME", version="70.0")
Relational genesets have three data atrributes including gsets, nodes, and edges. The genesets
attribute includes the geneset information for the leaf nodes of the hiearchy, the nodes
attribute describes all nodes in the hierarchy, including internal nodes, and the edges
attribute describes the edges in the hiearchy.
print(genesets)
Passing relational genesets works natively with hypeR()
.
data(limma) signature <- limma %>% dplyr::arrange(desc(t)) %>% magrittr::use_series(symbol) hyp_obj <- hypeR(signature, genesets, test="kstest", fdr=0.01)
One can visualize the top enriched genesets using hyp_hmap()
which will return a hierarchy map. Each node represents a geneset, where the shade of the gold border indicates the normalized significance of enrichment. Hover over the leaf nodes to view the raw value. Double click internal nodes to cluster their first degree connections. Edges represent a directed relationship between genesets in the hiearchy. Note: This function only works when the hyp
object was initialized with an rgsets
object.
vn <- hyp_hmap(hyp_obj, top=30) vn
Save to an html file.
vn <- visNetwork::visOptions(vn, width="1280px", height="720px") visNetwork::visSave(vn, "hyp_obj.html”)
One of the downsides of the above interactive object is it's difficult to render high quality figures in specific formats. You could take a screenshot after interacting with the figure, however that might not be high enough resolution for a publication. To help with this, you can alternatively return an igraph
object pre-formatted with enrichment data.
ig <- hyp_hmap(hyp_obj, top=30, graph=TRUE) print(ig)
Colors, weights, and normalized sizing properties from the interactive visualization are copied over to the vertex properties of the igraph
object.
head(igraph::as_data_frame(ig, what="vertices"))[,c("name", "type", "fdr", "size", "color")]
Because genesets labels are typically long, it's static visualizations are difficult. You'll need to make custom adjustments depending on your needs.
# Consistent layout set.seed(1) # Overriding existing edge/vertex properties plot(ig, vertex.label=substr(ifelse(V(ig)$type == "leaf", V(ig)$label, ""), 1, 32), vertex.size=5)
set.seed(1) tiff("hyp_obj.tiff", units="in", width=8, height=8, res=300) plot(ig) dev.off()
You can also directly copy enrichment data from a hyp
object to an igraph
object.
ig <- hyp_to_graph(hyp_obj) print(ig) df.v <- ig %>% igraph::as_data_frame(what="vertices") %>% dplyr::arrange(fdr) head(df.v)
Notice that internal nodes that don't have associated genesets (just generalized labels) will not have enrichment data associated with them.
tail(df.v)
This is an experimental feature and not officially apart of the package.
Genesets are often redundant, hard to parse, and lack relational structure. Thus it can be useful to infer which genesets are similar before analyzing enrichment results to help group multiple enriched genesets or pathways into more generalizable processes. Here we show a data-driven method using Hierarchical Set for turning flat genesets into rgsets
objects.
The clustering, in contrast to more traditional approaches to hierarchical clustering, does not rely on a derived distance measure between the sets, but instead works directly with the set data itself using set algebra. This has two results: First, it is much easier to reason about the result in a set context, and second, you are not forced to provide a distance if none exists (sets are completely independent). The clustering is based on a generalization of Jaccard similarity, called Set Family Homogeneity, that, in its simplest form, is defined as the size of the intersection of the sets in the family divided by the size of the union of the sets in the family. The clustering progresses by iteratively merging the set families that shows the highest set family homogeneity and is terminated once all remaining set family pairs have a homogeneity of 0. Note that this means that the clustering does not necessarily end with the all sets in one overall cluster, but possibly split into several hierarchies - this is intentional.
What if we want to use the KEGG genesets from MSigDB - but we want to organize them into a hierarchy?
gsets <- msigdb_gsets("Homo sapiens", "C2", "CP:KEGG", clean=TRUE) genesets <- gsets$genesets length(genesets)
We need some additional packages to help with this...
suppressPackageStartupMessages(library(hierarchicalSets)) suppressPackageStartupMessages(library(qdapTools))
mat <- genesets %>% qdapTools::mtabulate() %>% as.matrix() %>% t() %>% hierarchicalSets::format_sets() hierarchy <- hierarchicalSets::create_hierarchy(mat, intersectLimit=1) print(hierarchy)
plot(hierarchy, type='intersectStack', showHierarchy=TRUE, label=FALSE)
There is clearly a lot of overlap between these genesets...
plot(hierarchy, type='outlyingElements', quantiles=0.75, alpha=0.5, label=FALSE)
suppressPackageStartupMessages(library(dendextend)) suppressPackageStartupMessages(library(stringi))
A helper function for recursively reading and extracting trees from the clustering structure.
find.trees <- function(d) { subtrees <- dendextend::partition_leaves(d) leaves <- subtrees[[1]] find.paths <- function(leaf) { which(sapply(subtrees, function(x) leaf %in% x)) } paths <- lapply(leaves, find.paths) edges <- data.frame(from=c(), to=c()) if (length(d) > 1) { for (path in paths) { for (i in seq(1, length(path)-1)) { edges <- rbind(edges, data.frame(from=path[i], to=path[i+1])) } } edges <- dplyr::distinct(edges) edges$from <- paste0("N", edges$from) edges$to <- paste0("N", edges$to) } names(subtrees) <- paste0("N", seq(1:length(subtrees))) nodes <- data.frame(id=names(subtrees)) rownames(nodes) <- nodes$id nodes$label <- "" leaves <- sapply(subtrees, function(x) length(x) == 1) nodes$label[leaves] <- sapply(subtrees[leaves], function(x) x[[1]]) nodes$id <- NULL # Internal nodes will not have labels, so we can generate unique hash identifiers ids <- stringi::stri_rand_strings(nrow(nodes), 32) names(ids) <- rownames(nodes) rownames(nodes) <- ids[rownames(nodes)] edges$from <- ids[edges$from] edges$to <- ids[edges$to] return(list("edges"=edges, "nodes"=nodes)) }
trees <- lapply(hierarchy$clusters, find.trees) length(trees)
Now we can merge the independent trees into a single graph...
nodes.all <- lapply(trees, function(x) x$nodes) nodes <- do.call(rbind, nodes.all) head(nodes)
edges.all <- lapply(trees, function(x) x$edges) edges <- data.frame(do.call(rbind, edges.all)) head(edges)
Create the relational genesets object...
rgsets_obj <- rgsets$new(genesets, nodes, edges, name="MSIGDB_KEGG", version=msigdb_version()) rgsets_obj
hyp <- hypeR(signature, rgsets_obj, test="kstest") hyp_dots(hyp, fdr=0.1, top=25)
hyp_hmap(hyp, fdr=0.1)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.