```r
knitr::opts_chunk$set(
   collapse=TRUE,
   warning=FALSE,
   message=FALSE,
   comment="#>",
   fig.path="man/figures/README-"
)
```
Genejam is intended to freshen gene annotations to the current official standard. It is particularly useful when comparing genes in two datasets, for example when those datasets may not use the same gene symbols to represent equivalent genes.

## Installation

Install using the `remotes` package:

`remotes::install_github("jmw86069/genejam")`

Note: It is recommended not to use the `devtools` package to install GitHub packages, because `devtools` includes everything needed to develop R packages, which is far more than is required to install one.

## freshenGenes()

The simplest use is to supply a set of gene symbols:

```r
genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"))
```
For slightly more detail, you can edit the argument `final` to include things like `"SYMBOL"` (default), `"GENENAME"`, `"ALIAS"`, `"ACCNUM"`, and more.

```r
genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"),
   final=c("SYMBOL", "GENENAME", "ALIAS"))
```
I frequently find myself wanting the gene symbol and the long gene name, so I created a simple function `freshenGenes2()` that uses default argument `final=c("SYMBOL", "GENENAME")`:

```r
genejam::freshenGenes2(c("APOE", "APOA", "HIST1H1C"))
```
The other common use case is to include other gene aliases, with the function `freshenGenes3()`:

```r
genejam::freshenGenes3(c("APOE", "APOA", "HIST1H1C"))
```
What if you already have Entrez gene ID values and want the associated annotations? The function `freshenGenes()` runs two steps:

1. Convert the input to `intermediate` (which is the Entrez gene ID).
2. Convert `intermediate` to the output defined by `final`, for example `final=c("SYMBOL")` would create a column `"SYMBOL"`.

In this example, the Entrez gene ID values are in a column `"ENTREZID"`, so we will use argument `intermediate="ENTREZID"`.
In this case, you already have `"intermediate"`, so you invoke the function with a `data.frame` with values in a column named `"intermediate"`, and set `try_list=NULL`.

```r
df <- data.frame(ENTREZID=c("348", "4018", "3006", "100"))
genejam::freshenGenes2(x=df, intermediate="ENTREZID")
```
Similarly, you can provide input with a mixture of gene symbols and Entrez gene ID values. Shown below is mixed input.
```r
idf <- data.frame(Gene=c("MINA", "", "GABRR3", "GABRR3", ""),
   ENTREZID=c("", "84864", "", "200959", "200959"))
idf
```
You only need to specify `intermediate="ENTREZID"` as before.

```r
genejam::freshenGenes2(x=idf, intermediate="ENTREZID")
```
Notice the values in `"ENTREZID"` are updated based upon the first step resolution of `"Gene"` values to `"ENTREZID"`. The `"SYMBOL"` and `"GENENAME"` columns are populated using values in `"ENTREZID"`.
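The two-step behavior on mixed input can be pictured with a small base R sketch. This is not how genejam is implemented: the lookup tables `alias2id` and `id2symbol`, and every value in them, are made-up placeholders that stand in for real annotation data.

```r
# Hypothetical lookup tables with placeholder values (not real annotation data).
alias2id <- c(OLD_A="1001", GENE_B="1002")        # gene name or alias -> gene ID
id2symbol <- c("1001"="GENE_A", "1002"="GENE_B")  # gene ID -> current symbol

# Mixed input: some rows have a gene name, some an ID, some both.
idf <- data.frame(
   Gene=c("OLD_A", "", "GENE_B", "GENE_B", ""),
   ENTREZID=c("", "1001", "", "1002", "1002"),
   stringsAsFactors=FALSE)

# Step 1: fill only the empty ID entries, using the gene name.
need_id <- idf$ENTREZID == ""
idf$ENTREZID[need_id] <- unname(alias2id[idf$Gene[need_id]])

# Step 2: derive the final column from the now-complete ID column.
idf$SYMBOL <- unname(id2symbol[idf$ENTREZID])
idf
```

Rows that already carried an ID are left alone in step 1, which mirrors the idea that only missing entries are filled.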
The official gene nomenclature is updated multiple times per year, which means one Entrez gene ID may have a different official gene symbol before and after the update. When comparing data from two experiments, it is important to use the same gene nomenclature. Otherwise, there will be differences in results only because the names of some genes are different.
Most microarray platforms provide gene annotations, which are updated much less frequently than the official gene nomenclature. For example, Affymetrix array "Clariom D Human" was last updated between 2016 and 2018 (this document was written in 2021). In order to compare microarray results to those from literature, biological pathways, or other experiments, the gene nomenclature needs to be updated to the most current version.
In rare cases an official gene symbol is "moved" from one Entrez gene ID to another, usually when the original Entrez gene ID is deleted. In these cases, the most reasonable link between an experimental assay and the targeted gene is the gene symbol. An alternative is to use a sequence accession number used to design the assay.
As a result, a "best possible" gene annotation strategy is used.
Sometimes an assay measures two genes. The steps in `genejam` are designed to maintain multi-gene associations where necessary. If one gene symbol alias is associated with two (or more) genes, then all those genes will become associated. Note this only happens if a higher priority association was not already found.
Ultimately the workflow is what I and others have been doing all along, to assemble the best available gene annotations for a given dataset, while also leaving behind the fewest possible un-annotated entries. When one source of annotation fails, try another on missing entries; and so on.
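The "try another source on missing entries" idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the `sources` list, its names, and all values are hypothetical placeholders, and the loop is not taken from the genejam code.

```r
# Hypothetical annotation sources, tried in priority order (placeholder values).
sources <- list(
   symbol=c(GENE_A="1001"),
   alias=c(OLD_A="1001", OLD_B="1002"),
   accession=c(ACC_C="1003"))

x <- c("GENE_A", "OLD_B", "ACC_C", "UNKNOWN")
ids <- rep(NA_character_, length(x))

for (src in sources) {
   todo <- is.na(ids)                  # only entries that are still missing
   if (!any(todo)) break
   ids[todo] <- unname(src[x[todo]])   # names absent from src return NA
}
data.frame(input=x, id=ids)
```

Because each pass touches only entries that are still `NA`, a match from a higher priority source is never overwritten by a later one.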
The steps used in `genejam` are designed for speed, to the extent that providing 100,000 rows should return results within a few seconds at most. Only unique values are queried, and only missing entries are updated. When multiple values are combined by a delimiter, a highly optimized method is used.
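Querying only unique values is a general pattern that is easy to show in base R. The sketch below assumes a hypothetical function `lookup_id()` with placeholder values standing in for a slower annotation query. It illustrates the pattern and is not the code used inside genejam.

```r
# Hypothetical lookup standing in for a slower annotation query.
lookup_id <- function(genes) {
   toy <- c(GENE_A="1001", GENE_B="1002")   # placeholder values
   unname(toy[genes])
}

x <- rep(c("GENE_A", "GENE_B", "GENE_A"), times=1000)

ux <- unique(x)             # query each distinct value once
uid <- lookup_id(ux)
ids <- uid[match(x, ux)]    # expand the results back to the full input
length(ux)
```

Here 3,000 input values require only two lookups, and `match()` restores the original order and length.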
Lastly, these operations also use optimized mixed-alphanumeric sorting even in the context of a `list`, so things like `"chr2"` will appear before `"chr10"` in sort order. Incidentally, a sort step is necessary so you can compare whether two entries are associated with the same genes. If one is stored as `"APOE,APOE4"` and the other as `"APOE4,APOE"`, this comparison fails.
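To make the sorting point concrete, here is a minimal base R sketch. The helper names `natural_sort()` and `canonical()` are hypothetical, and this simple version only handles a text prefix followed by an optional number. It is a stand-in for the idea, not the optimized method described above.

```r
# Sort values that have a text prefix and an optional numeric suffix.
natural_sort <- function(x) {
   prefix <- sub("[0-9]+$", "", x)
   number <- suppressWarnings(as.numeric(sub("^.*[^0-9]", "", x)))
   # na.last=FALSE places entries without a number first, e.g. "APOE" before "APOE4"
   x[order(prefix, number, na.last=FALSE)]
}

# Put each delimited entry into a canonical order so entries can be compared.
canonical <- function(s, sep=",") {
   vapply(strsplit(s, sep, fixed=TRUE),
      function(v) paste(natural_sort(v), collapse=sep),
      character(1))
}

natural_sort(c("chr10", "chr2", "chr1"))            # "chr1" "chr2" "chr10"
"APOE,APOE4" == "APOE4,APOE"                        # FALSE
canonical("APOE,APOE4") == canonical("APOE4,APOE")  # TRUE
```

Once both entries are in the same order, a plain string comparison is enough to tell whether they refer to the same set of genes.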
All that to say, I use these functions a lot so I need them to be reliable and fast. It takes a few seconds just to load the associated SQLite gene annotation data, and the process used by `freshenGenes()` is usually substantially faster than that step alone.
A full online function reference is available via the pkgdown documentation:
Full genejam command reference: https://jmw86069.github.io/genejam