library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE)
The original data come from the Broad Institute's Molecular Signatures Database (MSigDB)^[http://www.broad.mit.edu/gsea/msigdb/index.jsp], redistributed by WEHI as separate R data files containing named lists of gene sets.^[http://bioinf.wehi.edu.au/software/MSigDB/] The following description applies to the R-formatted data:
The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. Our work at the WEHI predominately uses mouse models of human disease. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.
Procedure:
1. The current MSigDB v5.2 xml file was downloaded.
2. Human Entrez Gene IDs were mapped to Mouse Entrez Gene IDs, using the HGNC Comparison of Orthology Predictions (HCOP) (downloaded 11 October 2016).
3. Each collection was converted to a list in R, and written to an RData file using save().
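The mapping and list-conversion steps above can be sketched in base R. This is only an illustration under stated assumptions, not the WEHI script: the Entrez IDs, the set names, and the column names `human_entrez` and `mouse_entrez` are all made up, and the real HCOP table is laid out differently.

```r
# Toy gene sets keyed by human Entrez ID (all IDs are invented for illustration)
human_sets <- data.frame(
  geneset = c("SET_A", "SET_A", "SET_B"),
  human_entrez = c(101L, 102L, 102L),
  stringsAsFactors = FALSE
)

# Hypothetical ortholog table: human gene 102 maps to two mouse genes (one-to-many)
orthologs <- data.frame(
  human_entrez = c(101L, 102L, 102L),
  mouse_entrez = c(9001L, 9002L, 9003L),
  stringsAsFactors = FALSE
)

# Join every human ID to its mouse ortholog(s), keep set and mouse ID,
# and drop any duplicated set/gene pairs
joined <- merge(human_sets, orthologs, by = "human_entrez")
mouse_sets <- unique(joined[, c("geneset", "mouse_entrez")])

# Convert to a named list of mouse Entrez IDs, one element per gene set
mouse_list <- split(mouse_sets$mouse_entrez, mouse_sets$geneset)
str(mouse_list)
```

Because one human gene can have several mouse orthologs, the joined table can have more rows than the human table it started from.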
See the script in data-raw/ for how the data frames (tibbles) were created.
There are three data frames (tibbles) in this package. The msigdf.human data frame has columns for the MSigDB collection (c1-c7 and hallmark), the gene set, and the Entrez ID (collection, geneset, and entrez); each row is a single Entrez gene ID. The msigdf.mouse data frame has the same structure for mouse orthologs. The msigdf.urls data frame links the name of each gene set to its URL on the Broad's website.
The data sets in this package have several million rows. The package imports the tibble package so they're displayed nicely.
library(tidyverse)
library(msigdf)
Take a look:
msigdf.human %>% head()
msigdf.mouse %>% head()
msigdf.urls %>% as.data.frame() %>% head()
Just get the entries for the KEGG non-homologous end joining pathway:
msigdf.human %>% filter(geneset=="KEGG_NON_HOMOLOGOUS_END_JOINING")
Some software, e.g., GAGE, might require gene sets to be a named list of Entrez IDs, where the name of each element in the list is the name of the pathway. This is how the data was originally structured, and we can return to it by collapsing each gene set's Entrez IDs into a list column with summarize() and converting the result to a named list with tibble::deframe(). Here, let's use only the hallmark sets, and after converting the data into this named list format, get just the first few pathways, and in each of those, display just the first few Entrez IDs.
msigdf.human %>%
  filter(collection=="hallmark") %>%
  select(geneset, entrez) %>%
  group_by(geneset) %>%
  summarize(entrez=list(entrez)) %>%
  deframe() %>%
  head() %>%
  map(head)
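The same long-to-list conversion can also be sketched in base R with split(), which may be handy when the tidyverse is not loaded. The table below is a made-up stand-in that only mimics the geneset and entrez columns; the set names and IDs are not real MSigDB entries.

```r
# Toy long-format table with one row per gene-set/gene pair (IDs are invented)
toy <- data.frame(
  geneset = c("HALLMARK_X", "HALLMARK_X", "HALLMARK_Y"),
  entrez = c(11L, 12L, 13L),
  stringsAsFactors = FALSE
)

# split() groups the entrez vector by gene set and returns a named list,
# with one element per gene set
gene_list <- split(toy$entrez, toy$geneset)
str(gene_list)
```

With the real data, the equivalent call would presumably be applied to the filtered hallmark rows, e.g. after filter(collection=="hallmark").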
For demonstration purposes, create a single object containing both human and mouse data:
msigdf <- bind_rows(
  msigdf.human %>% mutate(org="human"),
  msigdf.mouse %>% mutate(org="mouse")
)
head(msigdf)
tail(msigdf)
The number of gene sets in each collection is the same for each organism:
msigdf %>%
  group_by(org, collection) %>%
  summarize(ngenesets=n_distinct(geneset)) %>%
  spread(org, ngenesets)
But the number of mouse genes in each collection is much greater, due to the one-to-many ortholog mapping.
msigdf %>%
  count(org, collection) %>%
  spread(org, n)
Look at the first few gene sets just in the 50-geneset hallmark collection. In each gene set, the number of mouse genes is greater than the number of human genes.
msigdf %>%
  count(org, collection, geneset) %>%
  filter(collection=="hallmark") %>%
  spread(org, n)
Get the URL for the hallmark set with the fewest genes (Notch signaling). Optionally, pipe (%>%) the result to browseURL() to open it in your browser.
msigdf.human %>%
  filter(collection=="hallmark") %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(msigdf.urls, by="geneset") %>%
  pull(url)
Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):
msigdf.human %>%
  filter(collection=="c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>%
  arrange(desc(n))
sessionInfo()