knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# reduce the minimum number of characters for the tibble column titles (default: 15)
options(pillar.min_title_chars = 10)
# increase the maximum number of rows printed (default: 20)
options(tibble.print_max = 25)

Overview

Pathway analysis is a common task in genomics research and there are many available R-based software tools. Depending on the tool, it may be necessary to import the pathways, translate genes to the appropriate species, convert between symbols and IDs, and format the resulting object.

The msigdbr R package provides Molecular Signatures Database (MSigDB) gene sets typically used with the Gene Set Enrichment Analysis (GSEA) software:

Please be aware that the orthologs were computationally predicted at the gene level. The full pathways may not be well conserved across species.

Installation

The package can be installed from CRAN.

install.packages("msigdbr")

Usage

Load package.

library(msigdbr)

All gene sets in the database can be retrieved by specifying a species of interest.

all_gene_sets <- msigdbr(species = "Mus musculus")
head(all_gene_sets)

You can retrieve data just for a specific collection, such as the hallmark gene sets.

h_gene_sets <- msigdbr(species = "mouse", collection = "H")
head(h_gene_sets)

You can specify a sub-collection, such as C2 (curated) CGP (chemical and genetic perturbations) gene sets.

cgp_gene_sets <- msigdbr(species = "mouse", collection = "C2", subcollection = "CGP")
head(cgp_gene_sets)

If you require more precise filtering, the msigdbr() function output is a data frame that can be manipulated using standard methods. For example, you can subset to a specific collection using dplyr.

all_gene_sets |>
  dplyr::filter(gs_collection == "H") |>
  head()

Helper functions

There are msigdbr_species() and msigdbr_collections() helper functions to assist with setting the msigdbr() parameters.

You can check the available species with msigdbr_species(). Both scientific and common names are acceptable for the msigdbr() function.

msigdbr_species()

You can check the available collections with msigdbr_collections().

msigdbr_collections()

Pathway enrichment analysis

The msigdbr output can be used with various pathway analysis packages.

Use the gene sets data frame for clusterProfiler with genes as NCBI Entrez Gene IDs.

msigdbr_t2g <- msigdbr_df |>
  dplyr::distinct(gs_name, ncbi_gene) |>
  as.data.frame()
enricher(gene = gene_ids_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for clusterProfiler with genes as gene symbols.

msigdbr_t2g <- msigdbr_df |>
  dplyr::distinct(gs_name, gene_symbol) |>
  as.data.frame()
enricher(gene = gene_symbols_vector, TERM2GENE = msigdbr_t2g, ...)

Use the gene sets data frame for fgsea.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
fgsea(pathways = msigdbr_list, ...)

Use the gene sets data frame for GSVA.

msigdbr_list <- split(x = msigdbr_df$gene_symbol, f = msigdbr_df$gs_name)
gsva(gset.idx.list = msigdbr_list, ...)

Potential questions and concerns

Which version of MSigDB was used?

The MSigDB version is stored in the db_version column of the returned data frame. You can check the version used with unique(msigdbr_df$db_version).

Why use this package when I can download the gene sets directly from MSigDB?

This package makes it more convenient to work with MSigDB gene sets in R. You can download the GMT files and import them (with getGmt() from the GSEABase package, for example). You then have to format the output to be compatible with downstream tools. If you are working with non-human data, you then have to convert the MSigDB genes to your organism.

Can I convert between human and mouse genes just by adjusting gene capitalization?

That will work for most, but not all, genes.

Can I convert human genes to any organism myself instead of using this package?

One popular method is using the biomaRt package. You may still end up with dozens of homologs for some genes, so additional cleanup may be helpful.

Aren't there already other similar tools?

There are a few resources that provide some of the msigdbr functionality and served as an inspiration for this package. WEHI provides MSigDB gene sets in R format for human and mouse. MSigDF and a more recent ToledoEM/msigdf fork provide a tidyverse-friendly data frame. These are updated at varying frequencies and may not be based on the latest version of MSigDB. Since 2022, the GSEA/MSigDB team provides collections that are natively mouse and don't require orthology conversion.

What if I have other questions?

You can submit feedback and report bugs on GitHub.

Details

The Molecular Signatures Database (MSigDB) is a collection of gene sets originally created for use with the Gene Set Enrichment Analysis (GSEA) software. To cite use of the underlying MSigDB data, reference Subramanian, Tamayo, et al. (2005, PNAS) and one or more of the following as appropriate: Liberzon, et al. (2011, Bioinformatics), Liberzon, et al. (2015, Cell Systems), Castanza, et al. (2023, Nature Methods) and also the source for the gene set.

Gene homologs are provided by HUGO Gene Nomenclature Committee at the European Bioinformatics Institute which integrates the orthology assertions predicted for human genes by eggNOG, Ensembl Compara, HGNC, HomoloGene, Inparanoid, NCBI Gene Orthology, OMA, OrthoDB, OrthoMCL, Panther, PhylomeDB, TreeFam and ZFIN. For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.

For information on how to cite cite an R package such as msigdbr, you can execute citation("msigdbr").



igordot/msigdbr documentation built on March 5, 2025, 8:42 p.m.