freshenGenes: Freshen gene annotations using Bioconductor annotation data
In jmw86069/genejam: Gene Jam

freshenGenes

R Documentation

Freshen gene annotations using Bioconductor annotation data

Description

Freshen gene annotations using Bioconductor annotation data

Usage

freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

Arguments

`x`	character vector or `data.frame` with one or more columns containing gene symbols.
`ann_lib`	character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.
`try_list`	character vector indicating one or more names of annotations to use for the input gene symbols in `x`. The annotation should typically return the Entrez gene ID, usually given by `'2EG'` at the end of the name. For example `SYMBOL2EG` will be used with ann_lib `"org.Hs.eg.db"` to produce annotation name `"org.Hs.egSYMBOL2EG"`. Note that when the `'2EG'` form of annotation does not exist (or another suitable suffix defined in argument `"revmap_suffix"` in `get_anno_db()`), it will be derived using `AnnotationDbi::revmap()`. For example if `"org.Hs.egALIAS"` is requested, but only `"org.Hs.egALIAS2EG"` is available, then `AnnotationDbi::revmap(org.Hs.egALIAS2EG)` is used to create the equivalent of `"org.Hs.egALIAS"`.
`final`	character vector to use for the final conversion step. When `final` is `NULL` no conversion is performed. When `final` contains multiple values, each value is returned in the output. For example, `final=c("SYMBOL","GENENAME")` will return a column `"SYMBOL"` and a column `"GENENAME"`.
`split`	character value used to separate delimited values in `x` by the function `base::strsplit()`. The default will split values separated by comma `,`, semicolon `;`, or forward slash `/`, and will trim whitespace before and after these delimiters.
`sep`	character value used to concatenate multiple entries in the same field. The default `sep=","` will comma-delimit multiple entries in the same field.
`handle_multiple`	character value indicating how to handle multiple values: `"first_hit"` will query each column of `x` until it finds the first possible returning match, and will ignore all subsequent possible matches for that row in `x`. For example, if one row in `x` contains multiple values, only the first match will be used. `"first_try"` will return the first match from `try_list` for all columns in `x` that contain a match. For example, if one row in `x` contains two values, the first match from `try_list` using one or both columns in `x` will be maintained. Subsequent entries in `try_list` will not be attempted for rows that already have a match. `"all"` will return all possible matches for all entries in `x` using all items in `try_list`.
`empty_rule`	character value indicating how to handle entries which did not have a match, and are therefore empty: `"original"` will use the original entry as the output field; `"empty"` will leave the entry blank.
`include_source`	logical indicating whether to include a column that shows the colname and source matched. For example, if column `"original_gene"` matched `"SYMBOL2EG"` in `"org.Hs.eg.db"` there will be a column `"found_source"` with value `"original_gene.org.Hs.egSYMBOL2EG"`.
`protect_inline_sep`	logical indicating whether to protect inline characters in `sep`, to prevent them from being used to split single values into multiple values. For example, `"GENENAME"` returns the full gene name, which often contains comma `","` characters. These commas do not delimit separate values, so they should not be used to split a string like `"H4 clustered histone 10, pseudogene"` into two strings `"H4 clustered histone 10"` and `"pseudogene"`.
`intermediate`	`character` string with colname in `x` that contains intermediate values. These values are expected from output of the first step in the workflow, for example `"SYMBOL2EG"` returns Entrez gene values, so if the input `x` already contains some of these values in a column, assign that colname to `intermediate`.
`ignore.case`	`logical` indicating whether to use case-insensitive matching. The default `ignore.case=FALSE` uses `mget()`, which requires that upper- and lowercase characters match exactly. When `ignore.case=TRUE` this function calls `genejam::imget()`.
`verbose`	logical indicating whether to print verbose output.
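
The effect of the default `split` and `sep` values can be previewed with base R alone. The following is a minimal sketch, not the package's internal code: it applies the default `split` pattern with `base::strsplit()` and re-joins the pieces with the default `sep`. The input vector is a made-up illustration.

```r
# Default `split` pattern, copied from the Usage section
split_pattern <- "[ ]*[,/;]+[ ]*"

# Illustrative input: a single value and two delimited entries
x <- c("APOE", "CCN2,CTGF", "NM_003166 ; U08032 / TP53")

# Split each entry into individual values; whitespace around
# the delimiters is consumed by the pattern itself
x_split <- strsplit(x, split_pattern)
print(x_split)

# Re-join the values of each entry using the default sep=","
x_joined <- vapply(x_split, paste, character(1), collapse = ",")
print(x_joined)
```

Each entry is split into its component values before lookup, and values that share a field are concatenated again with `sep` on output.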

Details

This function takes a vector or data.frame of gene symbols, and uses Bioconductor annotation methods to find the most current official gene symbol.

The annotation process runs in two basic steps:

1. Convert the input gene to Entrez gene ID.
2. Convert Entrez gene ID to official gene symbol.

Step 1. Convert to Entrez gene ID

The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:

"org.Hs.egSYMBOL2EG" – almost always 1-to-1 match
"org.Hs.egACCNUM2EG" – mostly a 1-to-1 match
"org.Hs.egALIAS2EG" – sometimes a 1-to-1 match, sometimes 1-to-many

When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.
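
The ordered lookup described above can be illustrated with a small base R sketch. It is not the implementation used by `freshenGenes()`: the lookup tables are toy named lists, the IDs are made-up placeholders rather than real Entrez gene IDs, and the helper `lookup_first()` is hypothetical. It shows how annotation names are formed from `ann_lib` and `try_list`, and how the first annotation that returns a match is kept.

```r
# Annotation names are formed from the ann_lib prefix and each try_list entry
ann_lib <- "org.Hs.eg.db"
try_list <- c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG")
ann_names <- paste0(sub("[.]db$", "", ann_lib), try_list)
print(ann_names)

# Toy stand-ins for the annotation maps, in try_list order.
# The IDs are made up for illustration, not real Entrez gene IDs.
toy_maps <- list(
   SYMBOL2EG = list(APOE = "101", CCN2 = "102"),
   ACCNUM2EG = list(U08032 = "103"),
   ALIAS2EG = list(APOE = "101", CCN2 = "102", CTGF = "102",
      AD2 = c("101", "104"))
)

# Hypothetical helper: try each map in order, keep the first that matches
lookup_first <- function(value, maps) {
   for (map_name in names(maps)) {
      hit <- maps[[map_name]][[value]]
      if (!is.null(hit)) {
         return(list(source = map_name, entrez = hit))
      }
   }
   list(source = NA_character_, entrez = character(0))
}

genes <- c("APOE", "CTGF", "U08032", "AD2", "UNKNOWN")
results <- lapply(genes, lookup_first, maps = toy_maps)
sources <- vapply(results, function(r) r$source, character(1))
entrez <- vapply(results, function(r) paste(r$entrez, collapse = ","),
   character(1))
print(data.frame(input = genes, source = sources, intermediate = entrez))
```

In this toy data the alias `"AD2"` maps to two IDs, and both are retained in the comma-delimited `intermediate` value, mirroring the 1-to-many case noted for alias lookups.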

Step 2. Use Entrez gene ID to return official annotation

The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".

The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:

"org.Hs.egSYMBOL" – official Entrez gene symbol
"org.Hs.egALIAS" – set of recognized aliases for an Entrez gene.
"org.Hs.egGENENAME" – official Entrez long gene name

For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.

Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.
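
The second step can be sketched in the same spirit. This is an illustrative base R example under stated assumptions, not the package's code: `toy_final` holds toy maps keyed by made-up placeholder IDs, and `convert_final()` is a hypothetical helper. It produces one output column per entry in `final`, starting from a column of intermediate IDs.

```r
# Toy maps from placeholder IDs to annotations, one per `final` entry.
# The IDs are made up for illustration, not real Entrez gene IDs.
toy_final <- list(
   SYMBOL = c("101" = "APOE", "102" = "CCN2"),
   GENENAME = c("101" = "apolipoprotein E",
      "102" = "cellular communication network factor 2")
)

# Intermediate IDs, including a multi-ID entry and an unmatched (empty) entry
intermediate <- c("101", "102", "101,102", "")

# Hypothetical helper: convert each comma-delimited ID field using one map
convert_final <- function(ids, map, sep = ",") {
   vapply(strsplit(ids, ",", fixed = TRUE), function(id) {
      hits <- map[id]
      paste(hits[!is.na(hits)], collapse = sep)
   }, character(1))
}

# Build one output column for each item in `final`
out <- data.frame(intermediate = intermediate, stringsAsFactors = FALSE)
for (final_name in names(toy_final)) {
   out[[final_name]] <- convert_final(intermediate, toy_final[[final_name]])
}
print(out)
```

Long gene names can themselves contain commas, which is the situation the `protect_inline_sep` argument is meant to handle; the toy names here avoid that case.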

Case-insensitive search

For case-insensitive search, which is particularly useful in non-human organisms because their gene symbols often use mixed case, use the argument ignore.case=TRUE. In our benchmark tests it adds roughly 0.1 seconds per annotation, regardless of the number of input entries. This appears to be the time it takes to spool the list of annotation keys stored in the SQLite database, and may therefore depend on the size of the annotation file.
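
The difference between exact and case-insensitive matching can be seen in a small base R sketch. This is only an illustration of the idea, not how `genejam::imget()` is implemented: the keys are a toy vector and the IDs are made-up placeholders. Note that the case-insensitive variant needs the full list of keys, which is consistent with the fixed per-annotation cost described above.

```r
# Toy annotation keys in mixed case, with made-up placeholder IDs
toy_keys <- c("Apoe", "Ccn2", "Trp53")
toy_ids <- c("201", "202", "203")

query <- c("APOE", "ccn2", "Trp53", "Gapdh")

# Case-sensitive lookup: only identical strings match
exact_idx <- match(query, toy_keys)

# Case-insensitive lookup: compare after converting both sides to one case
icase_idx <- match(toupper(query), toupper(toy_keys))

print(data.frame(
   query = query,
   exact = toy_ids[exact_idx],
   ignore_case = toy_ids[icase_idx]
))
```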

Value

data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".

Examples

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   cat("\nBasic usage\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally show the annotation source matched
   cat("\nOptionally show the annotation source matched\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Show comma-delimited genes
   cat("\nInput genes are comma-delimited\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally include more than SYMBOL in the output
   cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
   print(freshenGenes(c("APOE", "HIST1H1C"),
      final=c("SYMBOL", "ALIAS", "GENENAME")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## More advanced, match affymetrix probesets
   if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
      cat("\nAdvanced example including Affymetrix probesets.\n");
      print(freshenGenes(c("227047_x_at","APOE","HIST1H1D","NM_003166,U08032"),
         include_source=TRUE,
         try_list=c("hgu133plus2ENTREZID","REFSEQ2EG","SYMBOL2EG","ACCNUM2EG","ALIAS2EG"),
         final=c("SYMBOL","GENENAME")))
   }
}

jmw86069/genejam documentation built on July 4, 2025, 3:58 a.m.