knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
BiocSet
is a package that represents element sets in a tibble format with the
BiocSet
class. Element sets are read in and converted into a tibble format.
From here, typical dplyr
operations can be performed on the element set.
BiocSet
also provides functionality for mapping different ID types and
providing reference urls for elements and sets.
Install the most recent version from Bioconductor:
if(!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("BiocSet")
The development version is also available for install from GitHub:
BiocManager::install("Kayla-Morrell/BiocSet")
Then load BiocSet
:
library(BiocSet)
BiocSet
can create a BiocSet
object using two different input methods. The
first is to input named character vectors of element sets. BiocSet
returns
three tibbles, es_element
which contains the elements, es_set
which contains
the sets and es_elementset
which contains elements and sets together.
tbl <- BiocSet(set1 = letters, set2 = LETTERS) tbl
The second method of creating a BiocSet
object would be to read in a .gmt
file. Using import()
, a path to a downloaded .gmt
file is read in and a
BiocSet
object is returned. The example below uses a hallmark element set
downloaded from GSEA, which is also included with this package. This
BiocSet
includes a source
column within the es_elementset
tibble for
reference as to where the element set came from.
gmtFile <- system.file(package = "BiocSet", "extdata", "hallmark.gene.symbol.gmt") tbl2 <- import(gmtFile) tbl2
export()
allows for a BiocSet
object to be exported into a temporary file
with the extention .gmt
.
fl <- tempfile(fileext = ".gmt") gmt <- export(tbl2, fl) gmt
We also support importing files to create an OBOSet
object which extends
the BiocSet
class. The default only displays the most basic amount of columns
needed to perform certain tasks, but the argument extract_tag = everything
allows for additional columns to be imported as well. The following workflow
will demonstrate the use of both export and import to create a smaller subset
of the large obo file that is accessable from GeneOntology.
download.file("http://current.geneontology.org/ontology/go.obo", "obo_file.obo") foo <- import("obo_file.obo", extract_tag = "everything") small_tst <- es_element(foo)[1,] %>% unnest("ancestors") %>% select("element", "ancestors") %>% unlist() %>% unique() small_oboset <- foo %>% filter_elementset(element %in% small_tst) fl <- tempfile(fileext = ".obo") export(small_oboset, fl)
Then you can import this smaller obo file which we utilize here, in tests, and example code.
oboFile <- system.file(package = "BiocSet", "extdata", "sample_go.obo") obo <- import(oboFile) obo
A feature available to BiocSet
is the ability to activate different
tibbles to perform certain functions on. When a BiocSet
is created, the tibble
es_elementset
is automatically activated and all functions will be performed
on this tibble. BiocSet
adopts the use of many common dplyr
functions such
as filter()
, select()
, mutate()
, summarise()
, and arrange()
. With
each of the functions the user is able to pick a different tibble to activate
and work on by using 'verb_tibble'. After the function is executed than the
'active' tibble is returned back to the tibble that was active before the
function call. Some examples are shown below of how these functions work.
tbl <- BiocSet(set1 = letters, set2 = LETTERS) tbl tbl %>% filter_element(element == "a" | element == "A") tbl %>% mutate_set(pval = rnorm(1:2)) tbl %>% arrange_elementset(desc(element))
BiocSet
also allows for common set operations to be performed on the BiocSet
object. union()
and intersection()
are the two set operations available in
BiocSet
. We demonstate how a user can find the union between two BiocSet
objects or within a single BiocSet
object. Intersection is used in the same
way.
# union of two BiocSet objects es1 <- BiocSet(set1 = letters[c(1:3)], set2 = LETTERS[c(1:3)]) es2 <- BiocSet(set1 = letters[c(2:4)], set2 = LETTERS[c(2:4)]) union(es1, es2) # union within a single BiocSet object es3 <- BiocSet(set1 = letters[c(1:10)], set2 = letters[c(4:20)]) union_single(es3)
Next, we demonstrate the a couple uses of BiocSet
with an experiment dataset
airway
from the package airway
. This data is from an RNA-Seq experiment on
airway smooth muscle (ASM) cell lines.
The first step is to load the library and the necessary data.
library(airway) data("airway") se <- airway
The function go_sets()
discovers the keys from the org object and uses
AnnotationDbi::select
to create a mapping to GO ids. go_sets()
also allows
the user to indicate which evidence type or ontology type they would like when
selecting the GO ids. The default is using all evidence types and all ontology
types. We represent these identifieres as a BiocSet
object. Using the
go_sets
function we are able to map the Ensembl ids and GO ids from the genome
wide annotation for Human data in the org.Hs.eg.db
package. The Ensembl ids
are treated as elements while the GO ids are treated as sets.
library(org.Hs.eg.db) go <- go_sets(org.Hs.eg.db, "ENSEMBL") go # an example of subsetting by evidence type go_sets(org.Hs.eg.db, "ENSEMBL", evidence = c("IPI", "TAS"))
Some users may not be interested in reporting the non-descriptive elements. We
demonstrate subsetting the airway
data to include non-zero assays and then
filter out the non-descriptive elements.
se1 = se[rowSums(assay(se)) != 0,] go %>% filter_element(element %in% rownames(se1))
It may also be of interest to users to know how many elements are in each set.
Using the count
function we are able to calculate the elements per set.
go %>% group_by(set) %>% dplyr::count()
It may also be helpful to remove sets that are empty. Since we have shown how to calculate the number of elements per set, we know that this data set does not contain any empty sets. We decide to demonstrate regardless for those users that may need this functionality.
drop <- es_activate(go, elementset) %>% group_by(set) %>% dplyr::count() %>% filter(n == 0) %>% pull(set) go %>% filter_set(!(set %in% drop))
To simplify mapping we created a couple map
functions. map_unique()
is used
when there is known 1:1 mapping, This takes four arguements, a BiocSet
object, an AnnotationDbi
object, the id to map from, and the id to map to.
map_multiple()
needs a fifth argument to indicate how the function should
treat an element when there is multiple mapping. Both functions utilize
mapIds
from AnnotationDbi
and return a BiocSet
object. In the example
below we show how to use map_unique
to map go
's ids from Ensembl to gene
symbols.
go %>% map_unique(org.Hs.eg.db, "ENSEMBL", "SYMBOL")
Another functionality of BiocSet
is the ability to add information to the
tibbles. Using the GO.db
library we are able to map definitions to the GO
ids. From there we can add the mapping to the tibble using map_add()
and the
mutate function.
library(GO.db) map <- map_add_set(go, GO.db, "GOID", "DEFINITION") go %>% mutate_set(definition = map)
The library KEGGREST
is a client interface to the KEGG REST server.
KEGG contains pathway maps that represent interaction, reaction and relation
networks for various biological processes and diseases. BiocSet
has a
function that utilizes KEGGREST
to develop a BiocSet
object that contains
the elements for every pathway map in KEGG.
Due to limiations of the KEGGREST package, kegg_sets
can take some time to
run depending on the amount of pathways for the species of interest. Therefore
we demonstrate using BiocFileCache
to make the data available to the user.
library(BiocFileCache) rname <- "kegg_hsa" exists <- NROW(bfcquery(query=rname, field="rname")) != 0L if (!exists) { kegg <- kegg_sets("hsa") fl <- bfcnew(rname = rname, ext = ".gmt") export(kegg_sets("hsa"), fl) } kegg <- import(bfcrpath(rname=rname))
Within the kegg_sets()
function we remove pathways that do not contain any
elements. We then mutate the element tibble using the map_add
function to
contain both Ensembl and Entrez ids.
map <- map_add_element(kegg, org.Hs.eg.db, "ENTREZID", "ENSEMBL") kegg <- kegg %>% mutate_element(ensembl = map)
Since we are working with ASM data we thought we would subset the airway
data
to contain only the elements in the asthma pathway. This filter is performed
on the KEGG id, which for asthma is "hsa05310".
asthma <- kegg %>% filter_set(set == "hsa05310") se <- se[rownames(se) %in% es_element(asthma)$ensembl,] se
The filtering can also be done for multiple pathways.
pathways <- c("hsa05310", "hsa04110", "hsa05224", "hsa04970") multipaths <- kegg %>% filter_set(set %in% pathways) multipaths
This example will start out the same way that Case Study I started, by loading
in the airway
dataset, but we will also do some reformating of the data. The
end goal is to be able to perform a Gene Set Enrichment Analysis on the data and
return a BiocSet object of the gene sets.
data("airway") airway$dex <- relevel(airway$dex, "untrt")
Similar to other analyses we perform a differential expression analysis on the
airway
data using the library DESeq2
and then store the results into a
tibble.
library(DESeq2) library(tibble) des <- DESeqDataSet(airway, design = ~ cell + dex) des <- DESeq(des) res <- results(des) tbl <- res %>% as.data.frame() %>% as_tibble(rownames = "ENSEMBL")
Since we want to use limma::goana()
to perform the GSEA we will need to have
ENTREZ identifiers in the data, as well as filter out some NA
information.
Later on this will be considered our 'element' tibble.
tbl <- tbl %>% mutate( ENTREZID = mapIds( org.Hs.eg.db, ENSEMBL, "ENTREZID", "ENSEMBL" ) %>% unname() ) tbl <- tbl %>% filter(!is.na(padj), !is.na(ENTREZID)) tbl
Now that the data is ready for GSEA we can go ahead and use goana()
and make
the results into a tibble. This tibble will be considered our 'set' tibble.
library(limma) go_ids <- goana(tbl$ENTREZID[tbl$padj < 0.05], tbl$ENTREZID, "Hs") %>% as.data.frame() %>% as_tibble(rownames = "GOALL") go_ids
The last thing we need to do is create a tibble that we will consider our 'elementset' tibble. This tibble will be a mapping of all the elements and sets.
foo <- AnnotationDbi::select( org.Hs.eg.db, tbl$ENTREZID, "GOALL", "ENTREZID") %>% as_tibble() foo <- foo %>% dplyr::select(-EVIDENCEALL) %>% distinct() foo <- foo %>% filter(ONTOLOGYALL == "BP") %>% dplyr::select(-ONTOLOGYALL) foo
The function BiocSet_from_elementset()
allows for users to create a BiocSet
object from tibbles. This function is helpful when there is metadata contained
in the tibble that should be in the BiocSet object. For this function to work
properly, the columns that are being joined on need to be named correctly. For
instance, in order to use this function on the tibbles we created we need to
change the column in the 'element' tibble to 'element', the column in the 'set'
tibble to 'set' and the same will be for the 'elementset' tibble. We demonstrate
this below and then create the BiocSet object with the simple function.
foo <- foo %>% dplyr::rename(element = ENTREZID, set = GOALL) tbl <- tbl %>% dplyr::rename(element = ENTREZID) go_ids <- go_ids %>% dplyr::rename(set = GOALL) es <- BiocSet_from_elementset(foo, tbl, go_ids) es
For those users that may need to put this metadata filled BiocSet object back into an object similar to GRanges or SummarizedExperiment, we have created functions that allow for the BiocSet object to be created into a tibble or data.frame.
tibble_from_element(es) head(data.frame_from_elementset(es))
A final feature to BiocSet
is the ability to add reference information about
all of the elements/sets. A user could utilize the function url_ref()
to add
information to the BiocSet
object. If a user has a question about a specific
id then they can follow the reference url to get more informtion. Below is an
example of using url_ref()
to add reference urls to the go
data set we
worked with above.
url_ref(go)
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.