freshenGenes: Freshen gene annotations using Bioconductor annotation data

freshenGenes R Documentation

Description

Freshen gene annotations using Bioconductor annotation data

Usage

freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)

Arguments

x

character vector or data.frame with one or more columns containing gene symbols.

ann_lib

character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature.

try_list

character vector indicating one or more names of annotations to use for the input gene symbols in x. Each annotation should typically return the Entrez gene ID, usually indicated by '2EG' at the end of the name. For example, SYMBOL2EG is combined with ann_lib "org.Hs.eg.db" to produce the annotation name "org.Hs.egSYMBOL2EG". When a requested annotation does not exist but its reverse mapping does, the requested annotation is derived using AnnotationDbi::revmap(). The two names differ by the '2EG' suffix (or another suitable suffix defined in argument "revmap_suffix" in get_anno_db()). For example, if "org.Hs.egALIAS" is requested but only "org.Hs.egALIAS2EG" is available, then AnnotationDbi::revmap(org.Hs.egALIAS2EG) is used to create the equivalent of "org.Hs.egALIAS".
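To illustrate what reversing a mapping means, here is a minimal base-R sketch on a toy named list. It is not the AnnotationDbi::revmap() implementation, which operates on Bimap objects; the variable names and the three toy entries are assumptions made for illustration.

```r
# Toy forward mapping: alias -> Entrez gene ID (illustrative values only).
alias2eg <- list(CTGF = "1490", CCN2 = "1490", APOE = "348")

# Reverse it: Entrez gene ID -> all aliases that point to it.
# rep() repeats each alias once per ID it maps to, and split()
# groups those aliases by ID.
eg2alias <- split(
   rep(names(alias2eg), lengths(alias2eg)),
   unlist(alias2eg, use.names = FALSE))

print(eg2alias[["1490"]])
```

One ID can map back to several aliases, which is why the reversed mapping returns a vector for each key.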

final

character vector to use for the final conversion step. When final is NULL no conversion is performed. When final contains multiple values, each value is returned in the output. For example, final=c("SYMBOL","GENENAME") will return a column "SYMBOL" and a column "GENENAME".

split

character value used to separate delimited values in x by the function base::strsplit(). The default will split values separated by comma , semicolon ; or forward slash /, and will trim whitespace before and after these delimiters.
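As a quick check of the default pattern, the sketch below applies it with base::strsplit() to a few made-up input strings, then joins the pieces again with a comma as sep would. The input vector is an assumption for illustration only.

```r
# Default split pattern from the Usage section.
split_pattern <- "[ ]*[,/;]+[ ]*"

# Made-up input using each supported delimiter.
x <- c("APOE", "CCN2, CTGF", "APOE / CCN2;CTGF")

# Split each entry into its individual values.
parts <- strsplit(x, split_pattern)

# Re-join the values of each entry using a comma.
rejoined <- vapply(parts, paste, character(1), collapse = ",")
print(rejoined)
```

Whitespace around each delimiter is consumed by the pattern, so the split values need no further trimming.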

sep

character value used to concatenate multiple entries in the same field. The default sep="," will comma-delimit multiple entries in the same field.

handle_multiple

character value indicating how to handle multiple values:

  • "first_hit" will query each column of x until it finds the first possible returning match, and will ignore all subsequent possible matches for that row in x. For example, if one row in x contains multiple values, only the first match will be used.

  • "first_try" will return the first match from try_list for all columns in x that contain a match. For example, if one row in x contains two values, the first match from try_list using one or both columns in x will be maintained. Subsequent entries in try_list will not be attempted for rows that already have a match.

  • "all" will return all possible matches for all entries in x using all items in try_list.

empty_rule

character value indicating how to handle entries which did not have a match, and are therefore empty: "original" will use the original entry as the output field; "empty" will leave the entry blank.

include_source

logical indicating whether to include a column that shows the colname and source matched. For example, if column "original_gene" matched "SYMBOL2EG" in "org.Hs.eg.db" there will be a column "found_source" with value "original_gene.org.Hs.egSYMBOL2EG".

protect_inline_sep

logical indicating whether to protect inline characters in sep, to prevent them from being used to split single values into multiple values. For example, "GENENAME" returns the full gene name, which often contains comma "," characters. These commas do not separate multiple separate values, so they should not be used to split a string like "H4 clustered histone 10, pseudogene" into two strings "H4 clustered histone 10" and "pseudogene".
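The problem can be reproduced in base R. The sketch below shows one possible way to protect inline separators, using a temporary placeholder. It is an assumption made for illustration, not necessarily how freshenGenes() implements protect_inline_sep, and the placeholder string is hypothetical.

```r
sep <- ","
gene_names <- c("H4 clustered histone 10, pseudogene", "apolipoprotein E")

# Naive approach: join with sep, then split on sep.
# The inline comma produces three pieces instead of two.
naive <- strsplit(paste(gene_names, collapse = sep), sep, fixed = TRUE)[[1]]

# Protected approach: swap inline commas for a placeholder before
# joining, split on sep, then restore the commas.
placeholder <- "!:!"  # hypothetical string assumed absent from the data
protected <- gsub(sep, placeholder, gene_names, fixed = TRUE)
joined <- paste(protected, collapse = sep)
pieces <- strsplit(joined, sep, fixed = TRUE)[[1]]
restored <- gsub(placeholder, sep, pieces, fixed = TRUE)

print(length(naive))
print(restored)
```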

intermediate

character string with colname in x that contains intermediate values. These values are expected from output of the first step in the workflow, for example "SYMBOL2EG" returns Entrez gene values, so if the input x already contains some of these values in a column, assign that colname to intermediate.

ignore.case

logical indicating whether to use case-insensitive matching. The default ignore.case=FALSE uses mget(), which requires uppercase and lowercase characters to match exactly. When ignore.case=TRUE this function calls genejam::imget().

verbose

logical indicating whether to print verbose output.

Details

This function takes a vector or data.frame of gene symbols, and uses Bioconductor annotation methods to find the most current official gene symbol.

The annotation process runs in two basic steps:

  1. Convert the input gene to Entrez gene ID.

  2. Convert Entrez gene ID to official gene symbol.

Step 1. Convert to Entrez gene ID

The first step uses an ordered list of annotations, with the assumption that the first match is usually the best, and most specific. By default, the order is:

  • "org.Hs.egSYMBOL2EG" – almost always 1-to-1 match

  • "org.Hs.egACCNUM2EG" – mostly a 1-to-1 match

  • "org.Hs.egALIAS2EG" – sometimes a 1-to-1 match, sometimes 1-to-many

When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.

Step 2. Use Entrez gene ID to return official annotation

The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".

The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:

  • "org.Hs.egSYMBOL" – official Entrez gene symbol

  • "org.Hs.egALIAS" – set of recognized aliases for an Entrez gene.

  • "org.Hs.egGENENAME" – official Entrez long gene name

For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.

Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.
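The two-step workflow can be sketched with toy lookup tables in base R. This is a simplified illustration, not the freshenGenes() implementation: the tables, the helper lookup_first(), and the input "NOTAGENE" are hypothetical, and it handles only one match per input.

```r
# Toy step-1 tables, in try order: input name -> Entrez gene ID.
try_tables <- list(
   SYMBOL2EG = list(APOE = "348", CCN2 = "1490"),
   ALIAS2EG = list(CTGF = "1490", APOE = "348"))

# Toy step-2 table: Entrez gene ID -> official symbol.
final_table <- list(`348` = "APOE", `1490` = "CCN2")

# Hypothetical helper: return the first table that matches the gene.
lookup_first <- function(gene, tables) {
   for (tbl_name in names(tables)) {
      hit <- tables[[tbl_name]][[gene]]
      if (!is.null(hit)) {
         return(c(intermediate = hit, source = tbl_name))
      }
   }
   c(intermediate = NA_character_, source = NA_character_)
}

genes <- c("APOE", "CTGF", "NOTAGENE")

# Step 1: convert each input to an Entrez gene ID, recording the source.
step1 <- t(vapply(genes, lookup_first,
   c(intermediate = "", source = ""), tables = try_tables))

result <- data.frame(
   input = genes,
   intermediate = unname(step1[, "intermediate"]),
   source = unname(step1[, "source"]),
   stringsAsFactors = FALSE)

# Step 2: convert the Entrez gene ID to a symbol, leaving misses empty.
result$SYMBOL <- vapply(result$intermediate, function(id) {
   if (is.na(id)) "" else final_table[[id]]
}, character(1), USE.NAMES = FALSE)

print(result)
```

The source column plays the role of the audit trail: "CTGF" is not found in the toy SYMBOL2EG table, so it is resolved through ALIAS2EG.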

Case-insensitive search

For case-insensitive search, which is particularly useful in non-human organisms whose gene symbols are often mixed-case, use the argument ignore.case=TRUE. In our benchmark tests this added roughly 0.1 seconds per annotation, regardless of the number of input entries. This appears to be the time required to spool the list of annotation keys stored in the SQLite database, and may therefore depend upon the size of the annotation file.
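The fixed cost described above is consistent with an approach that retrieves every key once and compares in lowercase. The sketch below illustrates that idea with a base R environment. It is an assumption for illustration, not the genejam::imget() implementation; the helper imget_sketch() and the placeholder IDs are hypothetical.

```r
# Toy annotation keyed by mixed-case symbols, with placeholder values.
anno <- list2env(list(Apoe = "id_1", Actb = "id_2"))

# Exact matching with mget() misses "APOE" because the case differs.
exact <- mget("APOE", envir = anno, ifnotfound = NA)

# Hypothetical helper: list all keys once, then match in lowercase.
imget_sketch <- function(x, envir) {
   keys <- ls(envir)
   idx <- match(tolower(x), tolower(keys))
   found <- !is.na(idx)
   out <- setNames(vector("list", length(x)), x)
   out[found] <- mget(keys[idx[found]], envir = envir)
   out[!found] <- list(NA_character_)
   out
}

res <- imget_sketch(c("APOE", "actb", "Gapdh"), anno)
print(res)
```

Listing the keys is done once per annotation, whatever the number of queries, which matches the observation that the added time does not grow with the number of input entries.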

Value

data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".

See Also

Other genejam: freshenGenes2(), freshenGenes3(), get_anno_db(), is_empty()

Examples

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   cat("\nBasic usage\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally show the annotation source matched
   cat("\nOptionally show the annotation source matched\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF"), include_source=TRUE));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Show comma-delimited genes
   cat("\nInput genes are comma-delimited\n");
   print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## Optionally include more than SYMBOL in the output
   cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n");
   print(freshenGenes(c("APOE", "HIST1H1C"),
      final=c("SYMBOL", "ALIAS", "GENENAME")));
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
   ## More advanced, match affymetrix probesets
   if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
      cat("\nAdvanced example including Affymetrix probesets.\n");
      print(freshenGenes(c("227047_x_at","APOE","HIST1H1D","NM_003166,U08032"),
         include_source=TRUE,
         try_list=c("hgu133plus2ENTREZID","REFSEQ2EG","SYMBOL2EG","ACCNUM2EG","ALIAS2EG"),
         final=c("SYMBOL","GENENAME")))
   }
}


jmw86069/genejam documentation built on Sept. 19, 2022, 1:53 p.m.