This document describes plans for enhancements to the multienrichjam R package.
The genejam R package provides freshenGenes() to update gene symbols to the most current nomenclature per gene. The current recommendation is to run genejam::freshenGenes() before running pathway enrichment. Install from GitHub:
devtools::install_github("jmw86069/genejam")
freshenGenes() is most useful when combining pathway enrichment results from different source data, for example two different platforms, or two different studies.
There are multiple widely used strategies that attempt to maintain stable references to genes, such as using an authoritative gene identifier like NCBI Entrez Gene ID, or EnsEMBL Gene ID. In these cases, the official gene symbol can be obtained from resources like HGNC (HUGO Gene Nomenclature Committee).
No solution is perfect, but the current rationale for using gene symbols in multienrichjam is:
NCBI Entrez Gene ID is an integer numeric value, and sometimes multiple numbers refer to the same official gene symbol. An integer value is risky as a primary identifier, because it is not individually recognizable as a gene identifier. For example, the number 348 is not itself recognized as the identifier for gene "APOE", and in fact the number 348 could mean anything by itself. By contrast, "APOE" is more recognizable as a gene symbol.
Sometimes multiple EnsEMBL Gene ID values refer to the same official gene symbol. Note: EnsEMBL Gene ID values are individually unique, using the format "ENSG00000000", which is not used anywhere except by EnsEMBL Genes. However, few people recognize "ENSG00000130203" as the gene "APOE".
The recommendation is to run freshenGenes() on each source dataset before running pathway enrichment. The major benefit is that the source data would therefore already be consistent across comparisons.
Some gene symbols will be split or duplicated after conversion.
A "split" occurs when one gene symbol refers to two actual genes, for example input gene HSPA1 may ultimately refer to both HSPA1A and HSPA1B.
A "duplication" occurs when two gene symbols refer to the same gene, for example HIST1H2BC and HIST1H2BD both refer to the current official gene symbol H2BC5.
When a split or duplication occurs, it must be recorded.
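One way to record splits and duplications is sketched below in base R. The gene_map table and its column names (input, updated, is_split, is_duplicated) are hypothetical and are not part of genejam or multienrichjam; the symbols are taken from the examples above, and the table only illustrates how the two cases could be flagged.

```r
# Hypothetical mapping table: one row per (input symbol, updated symbol) pair.
# The symbols follow the examples in the text; the table itself is illustrative.
gene_map <- data.frame(
  input = c("HSPA1", "HSPA1", "HIST1H2BC", "HIST1H2BD", "APOE"),
  updated = c("HSPA1A", "HSPA1B", "H2BC5", "H2BC5", "APOE"),
  stringsAsFactors = FALSE
)

# Count how many updated symbols each input maps to, and vice versa.
n_updated_per_input <- table(gene_map$input)
n_input_per_updated <- table(gene_map$updated)

# A split: one input symbol maps to more than one updated symbol.
gene_map$is_split <- as.vector(n_updated_per_input[gene_map$input]) > 1

# A duplication: more than one input symbol maps to the same updated symbol.
gene_map$is_duplicated <- as.vector(n_input_per_updated[gene_map$updated]) > 1

print(gene_map)
```

Keeping both flags on the mapping table means each row can later be traced in either direction, from input symbol to updated symbol or back.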
When a gene split or duplication occurs, it changes the number of genes listed in the enrichment result, so that number may no longer match the number used in the enrichment calculation.
This effect is another reason to support converting gene symbols before running gene set enrichment.
However, the enrichment tool should only count each unique gene once, so the correct methodology is to accept the outcome of the enrichment tool within the caveats and limitations of using that tool. The purpose of modifying the gene symbol afterward is to facilitate comparison to the source data, and comparison across enrichment results. The underlying statistical enrichment test should remain unchanged.
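The sketch below illustrates this separation in base R, under the assumption that enrichment results are held in a data.frame with a comma-delimited gene column. The table, its column names, the P-values, and the symbol_lookup list are all hypothetical. The updated symbols and counts are written to new columns, while the P-value and the gene count reported by the enrichment tool are left untouched.

```r
# Hypothetical enrichment table; column names and values are illustrative only.
enrich_df <- data.frame(
  pathway = c("Pathway A", "Pathway B"),
  pvalue = c(0.001, 0.02),
  gene_count = c(2L, 2L),
  genes = c("HSPA1,APOE", "HIST1H2BC,HIST1H2BD"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: each input symbol maps to one or more updated symbols.
# This sketch assumes every input symbol is present in the lookup.
symbol_lookup <- list(
  HSPA1 = c("HSPA1A", "HSPA1B"),
  APOE = "APOE",
  HIST1H2BC = "H2BC5",
  HIST1H2BD = "H2BC5"
)

# Convert one comma-delimited gene string, keeping each updated symbol once.
update_gene_string <- function(gene_string, lookup) {
  input_genes <- strsplit(gene_string, ",", fixed = TRUE)[[1]]
  updated_genes <- unique(unlist(lookup[input_genes], use.names = FALSE))
  paste(updated_genes, collapse = ",")
}

# Store updated symbols and counts alongside the original columns.
enrich_df$genes_updated <- vapply(
  enrich_df$genes,
  update_gene_string,
  character(1),
  lookup = symbol_lookup,
  USE.NAMES = FALSE
)
enrich_df$gene_count_updated <- lengths(
  strsplit(enrich_df$genes_updated, ",", fixed = TRUE)
)

print(enrich_df)
```

In this sketch the split increases the displayed gene count for the first pathway and the duplication decreases it for the second, while pvalue and gene_count still reflect what the enrichment tool calculated.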
Each enrichment table may have its own distinct original annotation. Therefore, each enrichment must record the gene update process, in order to trace back to the source gene symbol.
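A minimal sketch of such a per-enrichment record follows, assuming one mapping table is stored per enrichment result in a named list. The names gene_update_log, study_1, study_2, and trace_source_symbols() are hypothetical and only illustrate the trace-back idea; they are not existing multienrichjam functions or objects.

```r
# Hypothetical record: one mapping table per enrichment result.
gene_update_log <- list(
  study_1 = data.frame(
    input = c("HSPA1", "HSPA1"),
    updated = c("HSPA1A", "HSPA1B"),
    stringsAsFactors = FALSE
  ),
  study_2 = data.frame(
    input = c("HIST1H2BC", "HIST1H2BD"),
    updated = c("H2BC5", "H2BC5"),
    stringsAsFactors = FALSE
  )
)

# Return the source symbols behind an updated symbol in one enrichment.
trace_source_symbols <- function(updated_symbol, enrichment_name, log) {
  record <- log[[enrichment_name]]
  record$input[record$updated %in% updated_symbol]
}

# The same updated symbol can trace back differently in each enrichment.
from_study_2 <- trace_source_symbols("H2BC5", "study_2", gene_update_log)
from_study_1 <- trace_source_symbols("H2BC5", "study_1", gene_update_log)
print(from_study_2)
print(from_study_1)
```

Keeping the record separate for each enrichment avoids assuming that two source datasets used the same original annotation.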