knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
The HUGO Gene Nomenclature Committee (HGNC) is a committee of the Human Genome Organisation (HUGO) that sets the standards for human gene nomenclature.
The HGNC approves a unique and meaningful name for every known human gene, based on a group of experts. In addition, the HGNC also provides the mapping between gene symbols to gene entries in other popular databases or resources: the HGNC complete gene set.
The goal of {hgnc}
is to easily download and import the latest HGNC complete
gene data set into R.
This data set provides a useful mapping of HGNC symbols to gene entries in other
popular databases or resouces, such as, the Entrez gene identifier or the UCSC
gene identifier, among many others. Check the documentation of the function
import_hgnc_dataset()
for a description of the several fields available.
Install {hgnc}
from CRAN:
install.packages("hgnc")
You can install the development version of {hgnc}
like so:
# install.packages("remotes") remotes::install_github("maialab/hgnc")
To import the latest HGNC gene data set in tabular format directly into memory as a tibble do as follows:
library(hgnc) # Date of HGNC last update last_update() # Direct URL to the latest archive in TSV format (url <- latest_archive_url()) # Import the data set in tidy tabular format # NB: Multiple-value columns are kept as list-columns hgnc_dataset <- import_hgnc_dataset(url) dplyr::glimpse(hgnc_dataset)
The original data set does not contain the column hgnc_id2
, which is added as
a convenience by {hgnc}
; this is because although the HGNC identifiers should
formally contain the prefix "HGNC:"
, it is often found elsewhere that they are
stripped of this prefix, so the column hgnc_id2
is also provided whose values
only contain the integer part.
hgnc_dataset %>% dplyr::select(c('hgnc_id', 'hgnc_id2'))
The HGNC defines a group name (locus_group
) for a set of related locus types.
Here's how you can quickly check how many gene entries there are per locus
group.
hgnc_dataset %>% dplyr::count(locus_group, sort = TRUE)
locus_type
provides a finer classification:
hgnc_dataset %>% dplyr::group_by(locus_group) %>% dplyr::count(locus_type, sort = TRUE) %>% dplyr::arrange(locus_group) %>% print(n = Inf)
If you prefer to download the data set as a file to disk first, you can use
download_archive()
. Then, you can use import_hgnc_dataset()
to import
the downloaded file into R.
Besides the latest archive, the HUGO Gene Nomenclature Committee
(HGNC) website also provides monthly and quarterly
updates. Use list_archives()
to list the currently available for download
archives. The column url
contains the direct download link that you can pass
to import_hgnc_dataset()
to import the data into R.
list_archives()
You could go to www.genenames.org and download the files yourself. So why the need for this R package?
{hgnc}
really is just a convenience package. The main advantage is that the
function import_hgnc_dataset()
reads in the data in tabular format with
all the columns with the appropriate type (so you don't have to specify it
yourself). As an extra step, those variables that contain multiple values are
encoded as list-columns.
Remember that list-columns can be expanded with tidyr::unnest()
. E.g.,
alias_symbol
is a list-column containing multiple alternative aliases to the
standard symbol
:
hgnc_dataset %>% dplyr::filter(symbol == 'TP53') %>% dplyr::select(c('symbol', 'alias_symbol')) hgnc_dataset %>% dplyr::filter(symbol == 'TP53') %>% dplyr::select(c('symbol', 'alias_symbol')) %>% tidyr::unnest(cols = 'alias_symbol')
In addition, we also provide the function filter_by_keyword()
that allows
filtering the data set based on a keyword or regular expression. By default this
function will look into all columns that contain gene symbols or names
(symbol
, name
, alias_symbol
, alias_name
, prev_symbol
and prev_name
).
It works automatically with list-columns too.
Look for entries in the data set that contain the keyword "TP53"
:
hgnc_dataset %>% filter_by_keyword('TP53') %>% dplyr::select(1:4)
Restrict the search to the symbol
column:
hgnc_dataset %>% filter_by_keyword('TP53', cols = 'symbol') %>% dplyr::select(1:4)
Search for the whole word "TP53"
exactly by taking advantage of regular
expressions:
hgnc_dataset %>% filter_by_keyword('^TP53$', cols = 'symbol') %>% dplyr::select(1:4)
To cite HGNC nomenclature resources use:
To cite data within the database use the following format:
Please include the month and year you retrieved the data cited.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.