Downloading Genesets

knitr::opts_chunk$set(comment="", cache=FALSE, fig.align="center")
devtools::load_all(".")
library(tidyverse)
library(qusage)
library(igraph)

Object Oriented Genesets

Genesets are simply a named list of character vectors which can be directly passed to hyper(). Alternatively, one can pass a gsets object, which can retain the name and version of the genesets one uses. This versioning will be included when exporting results or generating reports, which will ensure your results are reproducible.

genesets <- list("GSET1" = c("GENE1", "GENE2", "GENE3"),
                 "GSET2" = c("GENE4", "GENE6"),
                 "GSET3" = c("GENE7", "GENE8", "GENE9"))

Creating a gsets object is easy...

genesets <- gsets$new(genesets, name="Example Genesets", version="v1.0")
print(genesets)

And can be passed directly to hyper()...

hypeR(signature, genesets)

To aid in workflow efficiency, hypeR enables users to download genesets, wrapped as gsets objects, from multiple data sources.

Downloading msigdb Genesets

Most researchers will find the genesets hosted by msigdb are adequate to perform geneset enrichment analysis. There are various types of genesets available across multiple species.

msigdb_info()

Here we download the Hallmarks genesets...

HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H")
print(HALLMARK)

We can also clean them up by removing the first leading common substring...

HALLMARK <- msigdb_gsets(species="Homo sapiens", category="H", clean=TRUE)
print(HALLMARK)

This can be passed directly to hypeR()...

hypeR(signature, genesets=HALLMARK)

Other commonly used genesets include Biocarta, Kegg, and Reactome...

BIOCARTA <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:BIOCARTA")
KEGG     <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:KEGG")
REACTOME <- msigdb_gsets(species="Homo sapiens", category="C2", subcategory="CP:REACTOME")

Downloading enrichr Genesets

If msigdb genesets are not sufficient, we have also provided another set of functions for downloading and loading other publicly available genesets. This is facilitated by interfacing with the publicly available libraries hosted by enrichr.

available <- enrichr_available()
reactable(available)
ATLAS <- enrichr_gsets("Human_Gene_Atlas")
print(ATLAS)

Note: These libraries do not have a systematic versioning scheme, however the date downloaded will be recorded.

Additionally download other species if you aren't working with human or mouse genes!

yeast <- enrichr_gsets("GO_Biological_Process_2018", db="YeastEnrichr")
worm <- enrichr_gsets("GO_Biological_Process_2018", db="WormEnrichr")
fish <- enrichr_gsets("GO_Biological_Process_2018", db="FishEnrichr")
fly <- enrichr_gsets("GO_Biological_Process_2018", db="FlyEnrichr")

Relational Genesets

When dealing with hundreds of genesets, it's often useful to understand the relationships between them. This allows researchers to summarize many enriched pathways as more general biological processes. To do this, we rely on curated relationships defined between them. For example, Reactome conveniently defines their genesets in a hiearchy of pathways. This data can be formatted into a relational genesets object called rgsets.

We currently curate some relational genesets for use with hypeR and plan to add more continuously.

hyperdb_info()

Downloading relational genesets is easy...

genesets <- hyperdb_rgsets("REACTOME", "70.0")

And can be passed directly to hyper()...

hypeR(signature, genesets)

Creating Relational Genesets

We try to provide relational genesets for popular databases that include hierarchical information. For users who want to create their own, we provide this example.

Raw data for gsets, nodes, and edges can be directly downloaded.

genesets.url <- "https://reactome.org/download/current/ReactomePathways.gmt.zip"
nodes.url <- "https://reactome.org/download/current/ReactomePathways.txt"
edges.url <- "https://reactome.org/download/current/ReactomePathwaysRelation.txt"

Loading Data

# Genesets
genesets.tmp <- tempfile(fileext=".gmt.zip")
download.file(genesets.url, destfile = genesets.tmp, mode = "wb")
genesets.raw <- genesets.tmp %>%
                unzip() %>%
                read.gmt() %>%
                lapply(function(x) {
                    toupper(x[x != "Reactome Pathway"])
                })
# Nodes
nodes.raw <- nodes.url %>%
             read.delim(sep="\t", 
                        header=FALSE, 
                        fill=TRUE, 
                        col.names=c("id", "label", "species"), 
                        stringsAsFactors=FALSE)
# Edges
edges.raw <- edges.url %>%
             read.delim(sep="\t", 
                        header=FALSE, 
                        fill=TRUE, 
                        col.names=c("from", "to"),
                        stringsAsFactors=FALSE)

Organizing a Hierarchy

# Species-specific nodes
nodes <- nodes.raw %>%
         dplyr::filter( label %in% names(genesets.raw) ) %>%
         dplyr::filter( species == "Homo sapiens" ) %>%
         dplyr::filter(! duplicated(id) ) %>%
         magrittr::set_rownames( .$id ) %>%
         { .[, "label", drop=FALSE] }

# Species-specific edges
edges <- edges.raw %>%
         dplyr::filter( from %in% rownames(nodes) ) %>%
         dplyr::filter( to %in% rownames(nodes) )

# Leaf genesets
genesets <- nodes %>%
            rownames() %>%
            .[! . %in% edges$from] %>%
            sapply( function(x) nodes[x, "label"] ) %>%
            genesets.raw[.]
nodes

A single-column data frame of labels where the rownames are unique identifiers. Leaf node labels should have an associated geneset, while internal nodes do not have to. The only genesets tested, will be those in the list of genesets.

head(nodes)
edges

A dataframe with two columns of identifiers, indicating directed edges between nodes in the hierarchy.

head(edges)
genesets

A list of character vectors, named by the geneset labels. Typically, genesets will be at the leaves of the hierarchy, while not required.

head(names(genesets))

The rgsets Object

genesets <- rgsets$new(genesets, nodes, edges, name="REACTOME", version="v70.0")
print(genesets)


Try the hypeR package in your browser

Any scripts or data that you put into this service are public.

hypeR documentation built on Nov. 8, 2020, 8:19 p.m.