library(knitr)
opts_chunk$set(message=FALSE, warning=FALSE, eval=TRUE, echo=TRUE)

Data sources

Original data from the Broad Institute's Molecular Signature Database (MSigDB)^[http://www.broad.mit.edu/gsea/msigdb/index.jsp], redistributed as separate R data files containing named lists of gene sets, available from WEHI.^[http://bioinf.wehi.edu.au/software/MSigDB/] The following description applies to the R-formatted data:


The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. Our work at the WEHI predominately uses mouse models of human disease. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.

Prodecure:

1. The current MSigDB v5.2 xml file was downloaded. 2. Human Entrez Gene IDs were mapped to Mouse Entrez Gene IDs, using the HGNC Comparison of Orthology Predictions (HCOP) (downloaded 11 Octtober 2016). 3. Each collection was converted to a list in R, and written to a RData file using save().


See the script in data-raw/ to see how the data frames (tibbles) were created.

Example usage

There are three data frames (tibbles) this package. The msigdf.human data frame has columns for each MSigDB collection (c1-7 and hallmark), each gene set, and Entrez ID, where each row is a single Entrez gene ID. The msigdf.mouse data frame has the same structure for mouse orthologs. The msigdf.urls data frame links the name of the gene set to the URL on the Broad's website.

The data sets in this package have several million rows. The package imports the tibble package so they're displayed nicely.

library(tidyverse)
library(msigdf)

Take a look:

msigdf.human %>% head()
msigdf.mouse %>% head()
msigdf.urls %>% as.data.frame() %>% head()

Just get the entries for the KEGG non-homologous end joining pathway:

msigdf.human %>% 
  filter(geneset=="KEGG_NON_HOMOLOGOUS_END_JOINING")

Some software, e.g., GAGE might require gene sets to be a named list of Entrez IDs, where the name of each element in the list is the name of the pathway. This is how the data was originally structured, and we can return to it with plyr::dlply(). Here, let's use only the hallmark sets, and after we dlply the data into this named list format, get just the first few pathways, and in each of those, just display the first few entrez IDs.

msigdf.human %>% 
  filter(collection=="hallmark") %>% 
  select(geneset, entrez) %>% 
  group_by(geneset) %>% 
  summarize(entrez=list(entrez)) %>% 
  deframe() %>% 
  head() %>% 
  map(head)

Further exploration

For demonstration purposes, create a single object containing both human and mouse data:

msigdf <- bind_rows(
  msigdf.human %>% mutate(org="human"),
  msigdf.mouse %>% mutate(org="mouse")
)
head(msigdf)
tail(msigdf)

The number of gene sets in each collection is the same for each organism:

msigdf %>%
  group_by(org, collection) %>%
  summarize(ngenesets=n_distinct(geneset)) %>%
  spread(org, ngenesets)

But the number of mouse genes in each collection is much greater, due to the one-to-many ortholog mapping.

msigdf %>%
  count(org, collection) %>%
  spread(org, n)

Look at the first few gene sets just in the 50-geneset hallmark collection. In each gene set, the number of mouse genes is greater than the number of human genes.

msigdf %>%
  count(org, collection, geneset) %>%
  filter(collection=="hallmark") %>%
  spread(org, n)

Get the URL for the hallmark set with the fewest number of genes (Notch signaling). Optionally, %>% this to browseURL to open it up in your browser.

msigdf.human %>%
  filter(collection=="hallmark") %>%
  count(geneset) %>%
  arrange((n)) %>%
  head(1) %>%
  inner_join(msigdf.urls, by="geneset") %>%
  pull(url)

Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):

msigdf.human %>%
  filter(collection=="c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>% 
  arrange(desc(n))

Session info {.unnumbered}

sessionInfo()


stephenturner/msigdf documentation built on May 30, 2019, 3:35 p.m.