In jmw86069/genejam: Gene Jam

```r knitr::opts_chunk$set( collapse=TRUE, warning=FALSE, message=FALSE, comment="#>", fig.path="man/figures/README-" );

Genejam is intended to freshen gene annotations to the
current official standard. It is particularly useful when
comparing genes in two datasets, for example when those datasets
may not be using the same gene symbols to represent equivalent
genes.

## Installation

Install using the `remotes` package:

`remotes::install_github("jmw86069/genejam")`

Note: It is recommended not to use the `devtools` package to
install Github packages, mostly because the `devtools` package
has many more components than are required for installation.
Instead, `devtools` includes all components needed to develop
R packages, beyond the scope of installing one such R package.

## freshenGenes()

The simplest use is to supply a set of gene symbols:

```r
genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"))

For slightly more detail, you can edit the argument final to include things like "SYMBOL" (default), "GENENAME", "ALIAS", "ACCNUM", and more.

genejam::freshenGenes(c("APOE", "APOA", "HIST1H1C"),
   final=c("SYMBOL", 
      "GENENAME",
      "ALIAS"))

Convenience function freshenGenes2()

I frequently find myself wanting gene symbol, and the long gene name, so I created a simple function freshenGenes2() that uses default argument final=c("SYMBOL", "GENENAME"):

genejam::freshenGenes2(c("APOE", "APOA", "HIST1H1C"))

Convenience function freshenGenes3()

The other common use case is to include other gene aliases, with the function freshenGenes3():

genejam::freshenGenes3(c("APOE", "APOA", "HIST1H1C"))

Slightly more advanced

What if you already have Entrez gene ID, and want associated annotations? The function freshenGenes() runs two steps:

convert input to intermediate (which is the Entrez gene ID)
convert intermediate to the output defined by final, for example final=c("SYMBOL") would create a column "SYMBOL".

In this example, the Entrez gene ID values are in a column "ENTREZID", so we will use argument intermediate="ENTREZID". In this case, you already have "intermediate", so you invoke the function with a data.frame with values in a column named "intermediate", and set try_list=NULL.

df <- data.frame(ENTREZID=c("348", "4018", "3006", "100"));
genejam::freshenGenes2(x=df, intermediate="ENTREZID")

Similarly, you can provide input with a mixture of gene symbols and Entrez gene ID values. Shown below is mixed input.

idf <- data.frame(Gene=c("MINA", "", "GABRR3", "GABRR3", ""),
   ENTREZID=c("", "84864", "", "200959", "200959"))
idf

You only need to specify intermediate="ENTREZID" as before.

genejam::freshenGenes2(x=idf, intermediate="ENTREZID")

Notice the values in "ENTREZID" are updated based upon the first step resolution of "Gene" values to "ENTREZID". The "SYMBOL" and "GENENAME" columns are populated using values in "ENTREZID".

When is genejam useful?

The official gene nomenclature is updated multiple times per year, which means one Entrez gene ID may have a different official gene symbol before and after the update. When comparing data from two experiments, it is important to use the same gene nomenclature. Otherwise, there will be differences in results only because the names of some genes are different.

Most microarray platforms provide gene annotations, which are updated much less frequently than the official genes. For example, Affymetrix array "Clariom D Human" was last updated between 2016 and 2018 (this document was written in 2021.) In order to compare microarray results to those from literature, biological pathways, or other experiments, the gene nomenclature needs to be updated to the most current version.

Rationale for the workflow

In rare cases an official gene symbol is "moved" from one Entrez gene ID to another, usually when the original Entrez gene ID is deleted. In these cases, the most reasonable link between an experimental asay and the targeted gene is the gene symbol. An alternative is to use a sequence accession number used to design the assay.

As a result, a "best possible" gene annotation strategy is used.

When Entrez gene ID is available, use it.
When Entrez official gene symbol is available, use that to determine the Entrez gene ID.
When a sequence accession used for assay design is available, use that to determine the associated Entrez gene ID value or values.
When a gene symbol alias is available, use that to determine the Entrez gene ID value or values associated with this alias.

Sometimes an assay measures two genes. The steps in genejam are designed to maintain multi-gene associations where necessary. If one gene symbol alias is associated with two (or more) genes, then all those genes will become associated. Note this only happens if a higher priority association was not already found.

Ultimately the workflow is what I and others have been doing all along, to assemble the best available gene annotations for a given dataset, while also leaving behind the fewest possible un-annotated entries. When one source of annotation fails, try another on missing entries; and so on.

Optimizations

The steps used in genejam are designed for speed, to the extent that providing 100,000 rows should return results within a few seconds at most. Only unique values are queried, and only missing entries are updated. When multiple values are combined by a delimiter, a highly optimized method is used.

Lastly, these operations also use optimized mixed-alphanumeric sorting even in the context of a list, so things like "chr2" will appear before "chr10" in sort order. Incidentally, a sort step is necessary so you can compare whether two entries are associated with the same genes. If one is stored as "APOE,APOE4" and the other as "APOE4,APOE", this comparison fails.

All that to say, I use these functions a lot so I need them to be reliable and fast. It takes a few seconds just to load the associated SQLite gene annotation data, and the process used by freshenGenes() is usually substantially faster than that step alone.