knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)

The Molecular Signatures Database (MSigDB)

The molecular signatures database (MSigDB) is one of the largest collections of molecular signatures or gene expression signatures. A variety of gene expression signatures are hosted on this database including experimentally derived signatures and signatures representing pathways and ontologies from other curated databases. This rich collection of gene expression signatures (>25,000) can facilitate a wide variety of signature-based analyses, the most popular being gene set enrichment analyses. These signatures can be used to perform enrichment analysis in a DE experiment using tools such as GSEA, fry (from limma) and camera (from limma). Alternatively, they can be used to perform single-sample gene-set analysis of individual transcriptomic profiles using approaches such as singscore, ssGSEA and GSVA.

This package provides the gene sets in the MSigDB in the form of GeneSet objects. This data structure is specifically designed to store information about gene sets, including their member genes and metadata. Other packages, such as msigdbr and EGSEAdata provide these gene sets too, however, they do so by storing them as lists or tibbles. These structures are not specific to gene sets therefore do not allow storage of important metadata associated with each gene set, for example, their short and long descriptions. Additionally, the lack of structure allows creation of invalid gene sets. Accessory functions implemented in the GSEABase package provide a neat interface to interact with GeneSet objects.

This package can be installed using the code below:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("msigdb")

Download data from the msigdb R package

This ExperimentHub package processes the latest version of the MSigDB database into R objects that can be queried using the GSEABase R/Bioconductor package. The entire database is stored in a GeneSetCollection object which in turn stores each signature as a GeneSet object. All empty gene expression signatures (i.e. no genes formed the signature) have been dropped. Data in this package can be downloaded using the ExperimentHub interface as shown below.

To download the data, we first need to get a list of the data available in the msigdb package and determine the unique identifiers for each data. The query() function assists in getting this list.

library(msigdb)
library(ExperimentHub)
library(GSEABase)
eh = ExperimentHub()
query(eh , 'msigdb')

Data can then be downloaded using the unique identifier.

eh[['EH5421']]

Data can also be downloaded using the custom accessor `msigdb::getMsigdb()`:

#use the custom accessor to select a specific version of MSigDB
msigdb.hs = getMsigdb(org = 'hs', id = 'SYM', version = '7.4')
msigdb.hs

Downloading and integrating KEGG gene sets

KEGG gene sets cannot be integrated within this ExperimentHub package due to licensing limitations. However, users can download, process and integrate the data directly from the MSigDB when needed. This can be done using the code that follows.

msigdb.hs = appendKEGG(msigdb.hs)
msigdb.hs

Accessing the GeneSet and GeneSetCollection objects

A GeneSetCollection object is effectively a list therefore all list processing functions such as length and lapply can be used to process its constituents

length(msigdb.hs)

Each signature is stored in a GeneSet object and can be processed using functions in the GSEABase R/Bioconductor package.

gs = msigdb.hs[[1000]]
gs
#get genes in the signature
geneIds(gs)
#get collection type
collectionType(gs)
#get MSigDB category
bcCategory(collectionType(gs))
#get MSigDB subcategory
bcSubCategory(collectionType(gs))
#get description
description(gs)
#get details
details(gs)

We can also summarise some of these values across the entire database. Description of these codes can be found at the MSigDB website (https://www.gsea-msigdb.org/gsea/msigdb).

#calculate the number of signatures in each category
table(sapply(lapply(msigdb.hs, collectionType), bcCategory))
#calculate the number of signatures in each subcategory
table(sapply(lapply(msigdb.hs, collectionType), bcSubCategory))
#plot the distribution of sizes
hist(sapply(lapply(msigdb.hs, geneIds), length),
     main = 'MSigDB signature size distribution',
     xlab = 'Signature size')

Subset collections from the MSigDB

Most gene set analysis is performed within specific collections rather than across the entire database. This package comes with functions to subset specific collections. The list of all collections and sub-collections present within a GeneSetCollection object can be listed using the functions below:

listCollections(msigdb.hs)
listSubCollections(msigdb.hs)

Specific collections can be retrieved using the code below:

#retrieeve the hallmarks gene sets
subsetCollection(msigdb.hs, 'h')
#retrieve the biological processes category of gene ontology
subsetCollection(msigdb.hs, 'c5', 'GO:BP')

Preparing collections for limma::fry

Any gene-set collection can be easily transformed for usage with limma::fry by first transforming it into a list of gene IDs and following that with a transformation to indices as shown below.

library(limma)

#create expression data
allg = unique(unlist(geneIds(msigdb.hs)))
emat = matrix(0, nrow = length(allg), ncol = 6)
rownames(emat) = allg
colnames(emat) = paste0('sample', 1:6)
head(emat)
#retrieve collections
hallmarks = subsetCollection(msigdb.hs, 'h')
msigdb_ids = geneIds(hallmarks)

#convert gene sets into a list of gene indices
fry_indices = ids2indices(msigdb_ids, rownames(emat))
fry_indices[1:2]

Accessing the mouse MSigDB

The mouse MSigDB has been created in collaboration with Gordon K. Smyth and Alex Garnham from WEHI. The code they use to generate the mouse MSigDB has been used in this package. Detailed description of the steps conducted to convert human gene expression signatures to mouse can be found at http://bioinf.wehi.edu.au/MSigDB. Mouse homologs for human genes were obtained using the HCOP database (as of 18/03/2021).

All the above functions apply to the mouse MSigDB and can be used to interact with the collection.

msigdb.mm = getMsigdb(org = 'mm', id = 'SYM', version = '7.4')
msigdb.mm

Session information

sessionInfo()


DavisLaboratory/msigdb documentation built on July 18, 2023, 7:54 p.m.