Installation

To install the development version of inferorg, run

devtools::install_github("camlab-bioml/inferorg")

Inferring organism and gene ID format

Load the package:

library(inferorg)

Basic usage

To infer the organism and gene ID format, call the inferorg function. For example, for the human MHC-I genes:

human_mhc_genes <- c("HLA-A", "HLA-B", "HLA-C")
inferorg(human_mhc_genes)

This returns a list with four entries:

For a full list of supported organisms and formats, see supported ID formats and organisms.

Sometimes identifiers match multiple organisms, such as Tap1 and Tap2 both matching mouse and rat. In this case the confidence score is lower, but the recommended organism is first in the preferred order given by supported ID formats and organisms (i.e. human before mouse before fruit fly):

inferorg(c("Tap1", "Tap2"))

How confidence scores are computed

The confidence scores for each organism and format as follows:

  1. For every organism and format, the number of overlapping IDs in the provided list with those in the database (provided via the annotables package) is computed.
  2. The entry with the highest value is set as the inferred organism and format.
  3. To compute the organism confidence, the entry with the highest value is divided by the sum over all organisms. For $K$ organisms, if the overlap is the same with every organism then the confidence score is $\frac{1}{K}$. If the IDs only overlap with the most likely organism, the confidence score is $1$.
  4. The same procedure is repeated for the format.
  5. In both cases, the final score is multiplied by the proportion of input IDs that overlap with the best guess category. In other words, the confidence score is high only if (i) the IDs map uniquely to that organism/format and (ii) the majority of the provided IDs match.

Automatic conversion

To convert automatically between formats we use the autoconvert function. Under-the-hood, this calls the inferorg function to work out the gene ID format and organism, before converting to the desired format (for that organism).

For example, if we wish to convert the genes of the human MHC-I complex to ensembl IDs, we can call:

human_mhc_genes <- c("HLA-A", "HLA-B", "HLA-C")
autoconvert(human_mhc_genes, to = 'ensgene')

Similarly, we can convert the genes Tap1 and Tap2 in mouse to their entrez IDs:

mouse_tap_genes <- c("Tap1", "Tap2")
autoconvert(mouse_tap_genes, to = "entrez")

and we can convert these back to

autoconvert(c(21354, 21355), to = "symbol")

Note that if the gene ID format and/or organism can't be confidently inferred or any of the genes provided can't be confidently mapped, an NA is returned:

autoconvert(c("fake", "gene"))

But be careful! Sometimes they will match, which is especially an issue for very small input genesets:

autoconvert(c("made", "up", "gene"), to='ensgene')

Supported ID formats and organisms {#supported}

The following organisms are supported:

and the following gene ID formats:

Technical

print(sessionInfo())


camlab-bioml/inferorg documentation built on Feb. 14, 2022, 12:44 a.m.