Introduction to msigdbr
In msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# increase the screen width
options(width = 90)
# reduce the minimum number of characters for the tibble column titles (default: 15)
options(pillar.min_title_chars = 10)
# increase the maximum number of rows printed (default: 20)
options(tibble.print_max = 25)

Overview

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

in an R-friendly tidy/long format with one gene per row
for multiple frequently studied model organisms, such as mouse, rat, pig, zebrafish, fly, and yeast, in addition to the original human genes
as gene symbols as well as NCBI Entrez and Ensembl IDs
that can be installed and loaded as a package without requiring additional external files

Please be aware that the homologs were computationally predicted for distinct genes. The full pathways may not be well conserved across species.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

All gene sets in the database can be retrieved without specifying a collection/category.

all_gene_sets = msigdbr(species = "Mus musculus")
head(all_gene_sets)

There is a helper function to show the available species. Either scientific or common names are acceptable.

msigdbr_species()

You can retrieve data for a specific collection, such as the hallmark gene sets.

h_gene_sets = msigdbr(species = "mouse", category = "H")
head(h_gene_sets)

Retrieve mouse C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets = msigdbr(species = "mouse", category = "C2", subcategory = "CGP")
head(cgp_gene_sets)

There is a helper function to show the available collections.

msigdbr_collections()

The msigdbr() function output is a data frame and can be manipulated using more standard methods.

all_gene_sets %>%
  dplyr::filter(gs_cat == "H") %>%
  head()

Pathway enrichment analysis

The msigdbr output can be used with various popular pathway analysis packages.

Use the gene sets data frame for clusterProfiler with genes as Entrez Gene IDs.

msigdbr_t2g = msigdbr_df %>% dplyr::distinct(gs_name, entrez_gene) %>% as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g = msigdbr_df %>% dplyr::distinct(gs_name, gene_symbol) %>% as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for fgsea.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)

Use the gene sets data frame for GSVA.

msigdbr_list = split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)

Potential questions or concerns

Which version of MSigDB was used?

This package was generated with MSigDB v7.5.1 (released January 2022). The MSigDB version is used as the base of the msigdsbr package version. You can check the installed version with packageVersion("msigdbr").

Can I download the gene sets directly from MSigDB instead of using this package?

Yes. You can then import the GMT files (with getGmt() from the GSEABase package, for example). The GMTs only include the human genes, even for gene sets generated from mouse experiments. If you are working with non-human data, you then have to convert the MSigDB genes to your organism or your genes to human.

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most genes, but not all.

Can I convert human genes to any organism myself instead of using this package?

Yes. A popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren't there already other similar tools?

There are a few other resources that and provide some of the functionality and served as an inspiration for this package. Ge Lab Gene Set Files has GMT files for many species. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF is based on the WEHI resource, but is converted to a more tidyverse-friendly data frame. These are updated at varying frequencies and may not use the latest version of MSigDB.

What if I have other questions?

You can submit feedback and report bugs on GitHub.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), and also the source for the gene set.

Gene homologs are provided by HUGO Gene Nomenclature Committee at the European Bioinformatics Institute which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.

For information on how to cite cite an R package such as msigdbr, you can execute citation("msigdbr").