In Bioconductor/BiocSet: Representing Different Biological Sets

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)

Introduction

BiocSet is a package that represents element sets in a tibble format with the BiocSet class. Element sets are read in and converted into a tibble format. From here, typical dplyr operations can be performed on the element set. BiocSet also provides functionality for mapping different ID types and providing reference urls for elements and sets.

Installation

Install the most recent version from Bioconductor:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocSet")

The development version is also available for install from GitHub:

BiocManager::install("Kayla-Morrell/BiocSet")

Then load BiocSet:

library(BiocSet)

BiocSet

Input and Output

BiocSet can create a BiocSet object using two different input methods. The first is to input named character vectors of element sets. BiocSet returns three tibbles, es_element which contains the elements, es_set which contains the sets and es_elementset which contains elements and sets together.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl

The second method of creating a BiocSet object would be to read in a .gmt file. Using import(), a path to a downloaded .gmt file is read in and a BiocSet object is returned. The example below uses a hallmark element set downloaded from GSEA, which is also included with this package. This BiocSet includes a source column within the es_elementset tibble for reference as to where the element set came from.

gmtFile <- system.file(package = "BiocSet",
                        "extdata",
                        "hallmark.gene.symbol.gmt")
tbl2 <- import(gmtFile)
tbl2

export() allows for a BiocSet object to be exported into a temporary file with the extention .gmt.

fl <- tempfile(fileext = ".gmt")
gmt <- export(tbl2, fl)
gmt

We also support importing files to create an OBOSet object which extends the BiocSet class. The default only displays the most basic amount of columns needed to perform certain tasks, but the argument extract_tag = everything allows for additional columns to be imported as well. The following workflow will demonstrate the use of both export and import to create a smaller subset of the large obo file that is accessable from GeneOntology.

download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo")

foo <- import("obo_file.obo", extract_tag = "everything")

small_tst <- es_element(foo)[1,] %>%
    unnest("ancestors") %>%
    select("element", "ancestors") %>%
    unlist() %>%
    unique()

small_oboset <- foo %>% filter_elementset(element %in% small_tst)

fl <- tempfile(fileext = ".obo")
export(small_oboset, fl)

Then you can import this smaller obo file which we utilize here, in tests, and example code.

oboFile <- system.file(package = "BiocSet",
                        "extdata",
                        "sample_go.obo")
obo <- import(oboFile)
obo

Implemented functions

A feature available to BiocSet is the ability to activate different tibbles to perform certain functions on. When a BiocSet is created, the tibble es_elementset is automatically activated and all functions will be performed on this tibble. BiocSet adopts the use of many common dplyr functions such as filter(), select(), mutate(), summarise(), and arrange(). With each of the functions the user is able to pick a different tibble to activate and work on by using 'verb_tibble'. After the function is executed than the 'active' tibble is returned back to the tibble that was active before the function call. Some examples are shown below of how these functions work.

tbl <- BiocSet(set1 = letters, set2 = LETTERS)
tbl
tbl %>% filter_element(element == "a" | element == "A")
tbl %>% mutate_set(pval = rnorm(1:2))
tbl %>% arrange_elementset(desc(element))

Set operations

BiocSet also allows for common set operations to be performed on the BiocSet object. union() and intersection() are the two set operations available in BiocSet. We demonstate how a user can find the union between two BiocSet objects or within a single BiocSet object. Intersection is used in the same way.

# union of two BiocSet objects
es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)])
es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)])
union(es1, es2)

# union within a single BiocSet object
es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)])
union_single(es3)

Case study I

Next, we demonstrate the a couple uses of BiocSet with an experiment dataset airway from the package airway. This data is from an RNA-Seq experiment on airway smooth muscle (ASM) cell lines.

The first step is to load the library and the necessary data.

library(airway)
data("airway")
se <- airway

The function go_sets() discovers the keys from the org object and uses AnnotationDbi::select to create a mapping to GO ids. go_sets() also allows the user to indicate which evidence type or ontology type they would like when selecting the GO ids. The default is using all evidence types and all ontology types. We represent these identifieres as a BiocSet object. Using the go_sets function we are able to map the Ensembl ids and GO ids from the genome wide annotation for Human data in the org.Hs.eg.db package. The Ensembl ids are treated as elements while the GO ids are treated as sets.

library(org.Hs.eg.db)
go <- go_sets(org.Hs.eg.db, "ENSEMBL")
go

# an example of subsetting by evidence type
go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))

Some users may not be interested in reporting the non-descriptive elements. We demonstrate subsetting the airway data to include non-zero assays and then filter out the non-descriptive elements.

se1 = se[rowSums(assay(se)) != 0,]
go %>% filter_element(element %in% rownames(se1))

It may also be of interest to users to know how many elements are in each set. Using the count function we are able to calculate the elements per set.

go %>% group_by(set) %>% dplyr::count()

It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.

drop <- es_activate(go, elementset) %>% group_by(set) %>%
    dplyr::count() %>% filter(n == 0) %>% pull(set)
go %>% filter_set(!(set %in% drop))

To simplify mapping we created a couple map functions. map_unique() is used when there is known 1:1 mapping, This takes four arguements, a BiocSet object, an AnnotationDbi object, the id to map from, and the id to map to. map_multiple() needs a fifth argument to indicate how the function should treat an element when there is multiple mapping. Both functions utilize mapIds from AnnotationDbi and return a BiocSet object. In the example below we show how to use map_unique to map go's ids from Ensembl to gene symbols.

go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")

Another functionality of BiocSet is the ability to add information to the tibbles. Using the GO.db library we are able to map definitions to the GO ids. From there we can add the mapping to the tibble using map_add() and the mutate function.

library(GO.db)
map <- map_add_set(go, GO.db, "GOID", "DEFINITION")
go %>% mutate_set(definition = map)

The library KEGGREST is a client interface to the KEGG REST server. KEGG contains pathway maps that represent interaction, reaction and relation networks for various biological processes and diseases. BiocSet has a function that utilizes KEGGREST to develop a BiocSet object that contains the elements for every pathway map in KEGG.

Due to limiations of the KEGGREST package, kegg_sets can take some time to run depending on the amount of pathways for the species of interest. Therefore we demonstrate using BiocFileCache to make the data available to the user.

library(BiocFileCache)
rname <- "kegg_hsa"
exists <- NROW(bfcquery(query=rname, field="rname")) != 0L
if (!exists)
{
    kegg <- kegg_sets("hsa")
    fl <- bfcnew(rname = rname, ext = ".gmt")
    export(kegg_sets("hsa"), fl)
}
kegg <- import(bfcrpath(rname=rname))

Within the kegg_sets() function we remove pathways that do not contain any elements. We then mutate the element tibble using the map_add function to contain both Ensembl and Entrez ids.

map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL")
kegg <- kegg %>% mutate_element(ensembl = map)

Since we are working with ASM data we thought we would subset the airway data to contain only the elements in the asthma pathway. This filter is performed on the KEGG id, which for asthma is "hsa05310".

asthma <- kegg %>% filter_set(set == "hsa05310")

se <- se[rownames(se) %in% es_element(asthma)$ensembl,]

se

The filtering can also be done for multiple pathways.

pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970")
multipaths <- kegg %>% filter_set(set %in% pathways)

multipaths

Case study II

This example will start out the same way that Case Study I started, by loading in the airway dataset, but we will also do some reformating of the data. The end goal is to be able to perform a Gene Set Enrichment Analysis on the data and return a BiocSet object of the gene sets.

data("airway")
airway$dex <- relevel(airway$dex, "untrt")

Similar to other analyses we perform a differential expression analysis on the airway data using the library DESeq2 and then store the results into a tibble.

library(DESeq2)
library(tibble)
des <- DESeqDataSet(airway, design = ~ cell + dex)
des <- DESeq(des)
res <- results(des)

tbl <- res %>% 
    as.data.frame() %>%
    as_tibble(rownames = "ENSEMBL")

Since we want to use limma::goana() to perform the GSEA we will need to have ENTREZ identifiers in the data, as well as filter out some NA information. Later on this will be considered our 'element' tibble.

tbl <- tbl %>% 
    mutate(
        ENTREZID = mapIds(
            org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL"
        ) %>% unname()
    )

tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID))
tbl

Now that the data is ready for GSEA we can go ahead and use goana() and make the results into a tibble. This tibble will be considered our 'set' tibble.

library(limma)
go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>%
    as.data.frame() %>%
    as_tibble(rownames = "GOALL")
go_ids

The last thing we need to do is create a tibble that we will consider our 'elementset' tibble. This tibble will be a mapping of all the elements and sets.

foo <- AnnotationDbi::select(
    org.Hs.eg.db,
    tbl$ENTREZID,
    "GOALL",
    "ENTREZID") %>% as_tibble()
foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct()
foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL)
foo

The function BiocSet_from_elementset() allows for users to create a BiocSet object from tibbles. This function is helpful when there is metadata contained in the tibble that should be in the BiocSet object. For this function to work properly, the columns that are being joined on need to be named correctly. For instance, in order to use this function on the tibbles we created we need to change the column in the 'element' tibble to 'element', the column in the 'set' tibble to 'set' and the same will be for the 'elementset' tibble. We demonstrate this below and then create the BiocSet object with the simple function.

foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL)
tbl <- tbl %>% dplyr::rename(element = ENTREZID)
go_ids <- go_ids %>% dplyr::rename(set = GOALL)
es <- BiocSet_from_elementset(foo, tbl, go_ids)
es

For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame.

tibble_from_element(es)

head(data.frame_from_elementset(es))

Information look up

A final feature to BiocSet is the ability to add reference information about all of the elements/sets. A user could utilize the function url_ref() to add information to the BiocSet object. If a user has a question about a specific id then they can follow the reference url to get more informtion. Below is an example of using url_ref() to add reference urls to the go data set we worked with above.