freshenGenes | R Documentation |
Freshen gene annotations using Bioconductor annotation data
freshenGenes(
  x,
  ann_lib = c("", "org.Hs.eg.db"),
  try_list = c("SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
  final = c("SYMBOL"),
  split = "[ ]*[,/;]+[ ]*",
  sep = ",",
  handle_multiple = c("first_try", "first_hit", "all", "best_each"),
  empty_rule = c("empty", "original", "na"),
  include_source = FALSE,
  protect_inline_sep = TRUE,
  intermediate = "intermediate",
  ignore.case = FALSE,
  verbose = FALSE,
  ...
)
x |
character vector or |
ann_lib |
character vector indicating the name or names of the Bioconductor annotation library to use when looking up gene nomenclature. |
try_list |
character vector indicating one or more names of
annotations to use for the input gene symbols in |
final |
character vector to use for the final conversion
step. When |
split |
character value used to separate delimited values in |
sep |
character value used to concatenate multiple entries in
the same field. The default |
handle_multiple |
character value indicating how to handle multiple
values: |
empty_rule |
character value indicating how to handle entries which
did not have a match, and are therefore empty: |
include_source |
logical indicating whether to include a column
that shows the colname and source matched. For example, if column
|
protect_inline_sep |
logical indicating whether to
protect inline characters in |
intermediate |
|
ignore.case |
|
verbose |
logical indicating whether to print verbose output. |
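The default split and sep values shown in the usage can be tried in base R without any annotation package. The sketch below is only an illustration of how that regular expression behaves with strsplit(); the input vector is hypothetical, and it is an assumption that freshenGenes() applies the pattern in a comparable way.

```r
# Default `split` pattern copied from the usage above.
split_pattern <- "[ ]*[,/;]+[ ]*"

# Hypothetical input mixing comma, slash, and semicolon delimiters.
x <- c("APOE", "CCN2, CTGF", "NM_003166/U08032", "TP53 ; MDM2")

# strsplit() returns one character vector per input entry;
# spaces around each delimiter are consumed by the pattern.
parts <- strsplit(x, split_pattern)
print(parts)

# Re-join each entry with the default `sep` value.
rejoined <- vapply(parts, paste, character(1), collapse = ",")
print(rejoined)
```

Because the pattern absorbs surrounding spaces, entries such as "CCN2, CTGF" and "CCN2,CTGF" are split into the same pieces.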
This function takes a vector or data.frame of gene symbols and uses Bioconductor annotation methods to find the most current official gene symbol.
The annotation process runs in two basic steps:
Convert the input gene to Entrez gene ID.
Convert Entrez gene ID to official gene symbol.
The first step uses an ordered list of annotations, on the assumption that the first match is usually the best and most specific. By default, the order is:
"org.Hs.egSYMBOL2EG" – almost always a 1-to-1 match
"org.Hs.egACCNUM2EG" – mostly a 1-to-1 match
"org.Hs.egALIAS2EG" – sometimes a 1-to-1 match, sometimes 1-to-many
When multiple Entrez gene ID values are matched, they are all retained. See argument handle_multiple for custom options.
The second step converts the Entrez gene ID (or multiple IDs) to the official gene symbol, by default using "org.Hs.egSYMBOL".
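The two steps can be sketched with toy lookup tables in base R. This is a simplified illustration and not the package's implementation: the tables, the placeholder IDs, and the helper names toy_to_entrez() and toy_freshen() are all hypothetical, and the sketch keeps only the first source that matches rather than modelling the handle_multiple options.

```r
# Toy lookup tables standing in for SYMBOL2EG, ALIAS2EG, and SYMBOL.
# The IDs below are placeholders, not real Entrez gene IDs.
toy_try_list <- list(
  SYMBOL2EG = c(GENEA = "101", GENEB = "102"),
  ALIAS2EG  = c(OLDB = "102", GENEA = "999")
)
toy_final <- c("101" = "GENEA", "102" = "GENEB", "999" = "GENEZ")

# Step 1: walk the ordered sources and keep the first one that matches.
toy_to_entrez <- function(gene, try_list) {
  for (src in names(try_list)) {
    hit <- try_list[[src]][gene]
    if (!is.na(hit)) {
      return(c(id = unname(hit), source = src))
    }
  }
  c(id = NA_character_, source = NA_character_)
}

# Step 2: convert the intermediate ID to the current symbol.
toy_freshen <- function(genes, try_list, final) {
  step1 <- t(vapply(genes, toy_to_entrez, character(2), try_list = try_list))
  ids <- unname(step1[, "id"])
  data.frame(
    input = genes,
    intermediate = ids,
    source = unname(step1[, "source"]),
    SYMBOL = unname(final[ids]),
    stringsAsFactors = FALSE
  )
}

result <- toy_freshen(c("GENEA", "OLDB", "UNKNOWN"), toy_try_list, toy_final)
print(result)
```

In the toy data, "GENEA" also appears in the alias table with a different ID, but the symbol table is tried first, so its match wins. This mirrors the reasoning behind putting the most specific annotation first.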
The second step may optionally include multiple annotation types, each of which will be returned. Some common examples:
"org.Hs.egSYMBOL" – official Entrez gene symbol
"org.Hs.egALIAS" – set of recognized aliases for an Entrez gene
"org.Hs.egGENENAME" – official Entrez long gene name
For each step, the annotation matched can be returned, as an audit trail to see which annotation was available for each input entry.
Note that if the input data already contains Entrez gene ID values, you can define that colname with argument intermediate.
For case-insensitive search, which is particularly useful in non-human organisms because their gene symbols often use mixed case, use the argument ignore.case=TRUE. In our benchmark tests it appears to add roughly 0.1 seconds per annotation, regardless of the number of input entries. This appears to be the time it takes to spool the list of annotation keys stored in the SQLite database, and may therefore depend on the size of the annotation file.
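The fixed per-annotation cost is easier to see in a small base R sketch of case-insensitive matching. The keys and input below are hypothetical, and folding both sides with toupper() is an assumed approach used only for illustration; the package may implement ignore.case differently.

```r
# Hypothetical annotation keys in mixed case, as in many non-human organisms.
annotation_keys <- c("Apoe", "Ccn2", "Trp53")

# Hypothetical input with inconsistent capitalization.
x <- c("APOE", "ccn2", "Trp53", "Xyz1")

# Case-sensitive lookup only finds exact matches.
exact <- annotation_keys[match(x, annotation_keys)]

# Case-insensitive lookup needs the full set of keys up front so both sides
# can be normalized before matching; that retrieval is a fixed cost per
# annotation, independent of how many input entries are looked up.
folded <- annotation_keys[match(toupper(x), toupper(annotation_keys))]

print(data.frame(input = x, exact = exact, ignore_case = folded))
```

The matched key is returned in its stored capitalization, so the input "APOE" resolves to the key "Apoe" in this toy example.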
data.frame with one or more columns indicating the input data, then a column "intermediate" containing the Entrez gene ID that was matched, then one column for each item in final, by default "SYMBOL".
Other genejam: freshenGenes2(), freshenGenes3(), get_anno_db(), is_empty()
if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
  cat("\nBasic usage\n")
  print(freshenGenes(c("APOE", "CCN2", "CTGF")))
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
  ## Optionally show the annotation source matched
  cat("\nOptionally show the annotation source matched\n")
  print(freshenGenes(c("APOE", "CCN2", "CTGF"),
    include_source=TRUE))
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
  ## Show comma-delimited genes
  cat("\nInput genes are comma-delimited\n")
  print(freshenGenes(c("APOE", "CCN2", "CTGF", "CCN2,CTGF")))
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
  ## Optionally include more than SYMBOL in the output
  cat("\nCustom output to include SYMBOL, ALIAS, GENENAME\n")
  print(freshenGenes(c("APOE", "HIST1H1C"),
    final=c("SYMBOL", "ALIAS", "GENENAME")))
}

if (suppressPackageStartupMessages(require(org.Hs.eg.db))) {
  ## More advanced, match affymetrix probesets
  if (suppressPackageStartupMessages(require(hgu133plus2.db))) {
    cat("\nAdvanced example including Affymetrix probesets.\n")
    print(freshenGenes(c("227047_x_at", "APOE", "HIST1H1D", "NM_003166,U08032"),
      include_source=TRUE,
      try_list=c("hgu133plus2ENTREZID", "REFSEQ2EG", "SYMBOL2EG", "ACCNUM2EG", "ALIAS2EG"),
      final=c("SYMBOL", "GENENAME")))
  }
}