In maialab/hgnc: Download and Import the HUGO Gene Nomenclature Committee ('HGNC') Data Set into R

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

hgnc

The HUGO Gene Nomenclature Committee (HGNC) is a committee of the Human Genome Organisation (HUGO) that sets the standards for human gene nomenclature.

The HGNC approves a unique and meaningful name for every known human gene, based on a group of experts. In addition, the HGNC also provides the mapping between gene symbols to gene entries in other popular databases or resources: the HGNC complete gene set.

The goal of {hgnc} is to easily download and import the latest HGNC complete gene data set into R.

This data set provides a useful mapping of HGNC symbols to gene entries in other popular databases or resouces, such as, the Entrez gene identifier or the UCSC gene identifier, among many others. Check the documentation of the function import_hgnc_dataset() for a description of the several fields available.

Installation

Install {hgnc} from CRAN:

install.packages("hgnc")

You can install the development version of {hgnc} like so:

# install.packages("remotes")
remotes::install_github("maialab/hgnc")

Usage

Basic usage

To import the latest HGNC gene data set in tabular format directly into memory as a tibble do as follows:

library(hgnc)

# Date of HGNC last update
last_update()

# Direct URL to the latest archive in TSV format
(url <- latest_archive_url())

# Import the data set in tidy tabular format
# NB: Multiple-value columns are kept as list-columns
hgnc_dataset <- import_hgnc_dataset(url)

dplyr::glimpse(hgnc_dataset)

The original data set does not contain the column hgnc_id2, which is added as a convenience by {hgnc}; this is because although the HGNC identifiers should formally contain the prefix "HGNC:", it is often found elsewhere that they are stripped of this prefix, so the column hgnc_id2 is also provided whose values only contain the integer part.

hgnc_dataset %>%
  dplyr::select(c('hgnc_id', 'hgnc_id2'))

Locus groups

The HGNC defines a group name (locus_group) for a set of related locus types. Here's how you can quickly check how many gene entries there are per locus group.

hgnc_dataset %>%
  dplyr::count(locus_group, sort = TRUE)

locus_type provides a finer classification:

hgnc_dataset %>%
  dplyr::group_by(locus_group) %>%
  dplyr::count(locus_type, sort = TRUE) %>%
  dplyr::arrange(locus_group) %>%
  print(n = Inf)

Downloading to disk

If you prefer to download the data set as a file to disk first, you can use download_archive(). Then, you can use import_hgnc_dataset() to import the downloaded file into R.

Downloading other archives

Besides the latest archive, the HUGO Gene Nomenclature Committee (HGNC) website also provides monthly and quarterly updates. Use list_archives() to list the currently available for download archives. The column url contains the direct download link that you can pass to import_hgnc_dataset() to import the data into R.

list_archives()

Motivation

You could go to www.genenames.org and download the files yourself. So why the need for this R package?

{hgnc} really is just a convenience package. The main advantage is that the function import_hgnc_dataset() reads in the data in tabular format with all the columns with the appropriate type (so you don't have to specify it yourself). As an extra step, those variables that contain multiple values are encoded as list-columns.

Remember that list-columns can be expanded with tidyr::unnest(). E.g., alias_symbol is a list-column containing multiple alternative aliases to the standard symbol:

hgnc_dataset %>%
  dplyr::filter(symbol == 'TP53') %>%
  dplyr::select(c('symbol', 'alias_symbol'))

hgnc_dataset %>%
  dplyr::filter(symbol == 'TP53') %>%
  dplyr::select(c('symbol', 'alias_symbol')) %>%
  tidyr::unnest(cols = 'alias_symbol')

In addition, we also provide the function filter_by_keyword() that allows filtering the data set based on a keyword or regular expression. By default this function will look into all columns that contain gene symbols or names (symbol, name, alias_symbol, alias_name, prev_symbol and prev_name). It works automatically with list-columns too.

Look for entries in the data set that contain the keyword "TP53":

hgnc_dataset %>%
  filter_by_keyword('TP53') %>%
  dplyr::select(1:4)

Restrict the search to the symbol column:

hgnc_dataset %>%
  filter_by_keyword('TP53', cols = 'symbol') %>%
  dplyr::select(1:4)

Search for the whole word "TP53" exactly by taking advantage of regular expressions:

hgnc_dataset %>%
  filter_by_keyword('^TP53$', cols = 'symbol') %>%
  dplyr::select(1:4)

Citing the HGNC

To cite HGNC nomenclature resources use:

Tweedie S, Braschi B, Gray KA, Jones TEM, Seal RL, Yates B, Bruford EA. Genenames.org: the HGNC and VGNC resources in 2021. Nucleic Acids Res. 49, D939--D946 (2021). doi: 10.1093/nar/gkaa980

To cite data within the database use the following format:

HGNC Database, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom www.genenames.org.

Please include the month and year you retrieved the data cited.

maialab/hgnc documentation built on July 2, 2023, 11:41 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com