likely_symbol: Retrieve Symbol Aliases and Previous symbols to determine a...

View source: R/likely_symbol.R

likely_symbol    R Documentation

Retrieve Symbol Aliases and Previous symbols to determine a likely current symbol

Description

likely_symbol() downloads the latest version of the HGNC gene symbol database as a text file and queries it to obtain symbol aliases, previous symbols and all symbols currently in use. Optionally assuming the input ID to be an Alias, a Symbol or a Previous Symbol, it performs multiple queries and compares the results of all possible combinations to determine a likely current Symbol. The downloaded HGNC table is cached for the duration of the R session to avoid repeated downloads.

Usage

likely_symbol(
  syms,
  alias_sym = TRUE,
  prev_sym = TRUE,
  orgnsm = "human",
  hgnc = NULL,
  hgnc_url = NULL,
  output = c("likely", "symbols", "all"),
  index_threshold = 10L,
  refresh = FALSE,
  verbose = TRUE
)

Arguments

syms

(character). Vector of Gene Symbols to be tested.

alias_sym

(logical). Should the input be assumed to be an Alias? Defaults to TRUE.

prev_sym

(logical). Should the input be assumed to be a Previous Symbol? Defaults to TRUE.

orgnsm

(character). The organism for which the Symbols are tested. Defaults to "human".

hgnc

(data.frame). An optional data frame with the needed HGNC annotations. (Needs to match the format available at hgnc_url!) When supplied, bypasses both the cache and any download.

hgnc_url

(character). URL from which to download the HGNC annotation dataset. If NULL (the default), "https://storage.googleapis.com/public-download-files/hgnc/tsv/tsv/hgnc_complete_set.txt" is used.

output

(character). One of "likely", "symbols" and "all". Determines the scope of the output data frame. Defaults to "likely" which will return the input Symbol and the determined likely Symbol.

index_threshold

(integer). Number of unique input symbols at or above which inverted indices are pre-built for alias and previous symbol lookups, giving a substantial speedup for large inputs. Below this threshold the original row-scan is used, which is faster for very small inputs (e.g. a single symbol lookup) where the index-building overhead would dominate. Defaults to 10L.

refresh

(logical). Should the cached HGNC table be discarded and re-downloaded? Defaults to FALSE. Use TRUE to force a fresh download within the same R session, e.g. after a known HGNC update.

verbose

(logical). Should messages be written to the console? Defaults to TRUE.

Details

The HGNC table is downloaded once per R session and cached in a package-level environment. Subsequent calls reuse the cached table without any network access. If the cached table is more than 3 days old, a warning message is emitted recommending a refresh, since the HGNC database is updated monthly. To force a fresh download within the same session, use refresh = TRUE; alternatively, start a new R session.
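
The caching behaviour described above can be illustrated with a minimal sketch. This is not the package's actual code: the names .cache, get_table(), max_age_days and fake_loader() are hypothetical, and the use of warning() for the age notice is an assumption. A stand-in loader replaces the real download so the sketch runs offline.

```r
# Hypothetical session cache held in an environment (illustrative names only).
.cache <- new.env(parent = emptyenv())

get_table <- function(loader, refresh = FALSE, max_age_days = 3,
                      now = Sys.time()) {
  if (refresh || is.null(.cache$table)) {
    # Only this branch would touch the network in a real implementation.
    .cache$table <- loader()
    .cache$fetched <- now
  } else {
    age_days <- as.numeric(difftime(now, .cache$fetched, units = "days"))
    if (age_days > max_age_days) {
      # Assumption: the age notice is raised with warning().
      warning("Cached table is ", round(age_days, 1),
              " days old; consider refresh = TRUE.")
    }
  }
  .cache$table
}

# Stand-in loader: counts its calls instead of downloading anything.
n_loads <- 0L
fake_loader <- function() {
  n_loads <<- n_loads + 1L
  data.frame(symbol = "GENEA", stringsAsFactors = FALSE)  # made-up symbol
}

t1 <- get_table(fake_loader)                  # first call: loads the table
t2 <- get_table(fake_loader)                  # second call: served from cache
t3 <- get_table(fake_loader, refresh = TRUE)  # forced reload
```

In this sketch the loader runs on the first call and on the refresh = TRUE call only; the second call returns the stored table unchanged.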

When the number of unique input symbols is at or above index_threshold, inverted indices (hash tables) are pre-built from the HGNC table so that each per-symbol lookup is O(1) rather than O(nrow(hgnc)), giving roughly a 50-100x speedup for batch inputs. For small inputs the original row-scan is retained to avoid the index-building overhead.
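
The difference between the two lookup strategies can be sketched on a toy table. The code below is an illustration under stated assumptions, not the package's implementation: the symbols are made up, the column names symbol and prev_symbol and the "|" separator are assumed to mirror the HGNC file, and scan_lookup(), build_index(), index_lookup() and lookup_all() are hypothetical helpers.

```r
# Toy table; symbols are invented and the layout is an assumption.
toy <- data.frame(
  symbol      = c("GENEA", "GENEB", "GENEC"),
  prev_symbol = c("OLDA1|OLDA2", "", "OLDC1"),
  stringsAsFactors = FALSE
)

# Row-scan: every query splits and inspects every row of the table.
scan_lookup <- function(query, tbl) {
  hit <- vapply(strsplit(tbl$prev_symbol, "|", fixed = TRUE),
                function(p) query %in% p, logical(1))
  tbl$symbol[hit]
}

# Inverted index: one pass over the table fills a hashed environment
# that maps each previous symbol to the current symbol(s) carrying it.
build_index <- function(tbl) {
  idx <- new.env(hash = TRUE, parent = emptyenv())
  prev <- strsplit(tbl$prev_symbol, "|", fixed = TRUE)
  for (i in seq_along(prev)) {
    for (p in prev[[i]]) {
      idx[[p]] <- c(idx[[p]], tbl$symbol[i])
    }
  }
  idx
}

# A single hashed lookup; unknown keys give an empty character vector.
index_lookup <- function(query, idx) {
  hit <- idx[[query]]
  if (is.null(hit)) character(0) else hit
}

# Choose a strategy from the number of unique queries.
lookup_all <- function(queries, tbl, index_threshold = 10L) {
  queries <- unique(queries)
  if (length(queries) >= index_threshold) {
    idx <- build_index(tbl)
    res <- lapply(queries, index_lookup, idx = idx)
  } else {
    res <- lapply(queries, scan_lookup, tbl = tbl)
  }
  stats::setNames(res, queries)
}

lookup_all(c("OLDA2", "OLDC1"), toy)                        # row-scan path
lookup_all(c("OLDA2", "OLDC1"), toy, index_threshold = 1L)  # index path
```

Both paths return the same result; the index costs one full pass over the table up front, which only pays off once enough queries share it.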

Value

A data.frame with the following columns depending on the output setting.

output="likely":

'likely_symbol'
'input_symbol'

output="symbols":

'current_symbols'
'likely_symbol'
'input_symbol'
'all_symbols'

output="all":

'orig_input'
'organism'
'current_symbols'
'likely_symbol'
'input_symbol'
'all_symbols'

Note

Only fully implemented for Human for now.

Examples

## Not run: 
# Single symbol lookup (uses row-scan, no index overhead)
likely_symbol("CCBL1")

# Second call reuses cached HGNC table (no download)
likely_symbol("KAAT1")

# Force a fresh download within the same session
likely_symbol("CCBL1", refresh = TRUE)

# Batch lookup (builds index for speed)
likely_symbol(c("ABCC4", "ACPP", "KIAA1524"))

# Supply a pre-loaded table to bypass cache and download entirely
likely_symbol(c("ABCC4", "ACPP"), hgnc = my_hgnc_table)

## End(Not run)

convertid documentation built on April 1, 2026, 5:06 p.m.