unify_gene_ids: Unify gene IDs from BioMart and AnnotationDbi lookups


unify_gene_ids    R Documentation

Unify gene IDs from BioMart and AnnotationDbi lookups

Description

Takes a data frame with Ensembl gene IDs (and optionally gene symbols) and returns a deduplicated data frame with unified HGNC symbols, using a priority-based reconciliation of BioMart and AnnotationDbi results.

Usage

unify_gene_ids(
  genes,
  ensg_col = "ensembl_gene_id",
  symbol_col = NULL,
  host = "https://www.ensembl.org",
  biomart_fallback = c("https://uswest.ensembl.org", "https://asia.ensembl.org",
    "https://useast.ensembl.org"),
  keep_intermediates = FALSE,
  verbose = FALSE
)

Arguments

genes

Either a data frame containing at least an Ensembl gene ID column, or a character vector of Ensembl gene IDs.

ensg_col

Name of the column containing Ensembl gene IDs. Default: "ensembl_gene_id".

symbol_col

Name of the column containing gene symbols, or NULL if absent (ENSG-only mode). Default: NULL.

host

BioMart host URL. Default: "https://www.ensembl.org".

biomart_fallback

Character vector of fallback BioMart host URLs to try if the primary host fails. Set to NULL to disable fallback.

keep_intermediates

Logical; if TRUE, the intermediate lookup columns hgnc_symbol_2 and ensg_2 are retained in the output. Useful for debugging. Default: FALSE.

verbose

Logical; if TRUE, print progress and summary messages. Default: FALSE.

Details

Requires the Bioconductor packages org.Hs.eg.db and AnnotationDbi. These are not hard dependencies; they are checked at runtime, and an informative error is raised if either is missing.
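
A runtime check of this kind can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper name check_optional_pkgs() and the wording of the error message are assumptions.

```r
# Hypothetical helper (not part of the package API): stop with an
# informative message if any of the named packages is not installed.
check_optional_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!installed]
  if (length(absent) > 0) {
    stop(
      "Missing required package(s): ", paste(absent, collapse = ", "),
      ". Install with BiocManager::install(c(",
      paste0('"', absent, '"', collapse = ", "), ")).",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Intended use at the top of a function body:
# check_optional_pkgs(c("org.Hs.eg.db", "AnnotationDbi"))
```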

Deduplication passes

The function performs two sequential deduplication passes via the internal dedup_gene_ids() function:

  1. Deduplicate by gene_name (if available) or ensembl_gene_id, resolving multiple ENSG IDs mapping to the same gene name.

  2. Deduplicate by hgnc_symbol, resolving cases where multiple gene names resolve to the same symbol.
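
The two passes can be pictured as the same group-and-pick step run twice with different keys. The sketch below is illustrative only: dedup_by() is a hypothetical stand-in for the internal dedup_gene_ids(), the toy IDs are made up, and the default picker simply takes the first row of each group rather than applying the priority rules.

```r
# Hypothetical stand-in for the internal deduplication step:
# split rows by a key column and keep one row per group.
# Note: split() silently drops rows whose key is NA; a real
# implementation would need to handle those separately.
dedup_by <- function(df, key, pick = function(x) x[1, , drop = FALSE]) {
  groups <- split(df, df[[key]], drop = TRUE)
  out <- do.call(rbind, lapply(groups, pick))
  rownames(out) <- NULL
  out
}

# Toy input with made-up IDs: two ENSG IDs share gene name G1,
# and both gene names resolve to the same symbol S1.
toy <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  gene_name       = c("G1", "G1", "G2"),
  hgnc_symbol     = c("S1", "S1", "S1"),
  stringsAsFactors = FALSE
)

pass1 <- dedup_by(toy, "gene_name")      # one row per gene_name
pass2 <- dedup_by(pass1, "hgnc_symbol")  # one row per hgnc_symbol
```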

Symbol assignment priority

The guiding principle is that AnnotationDbi confirmation outranks BioMart ordering. AnnotationDbi (org.Hs.eg.db) reflects a stable, versioned annotation database, whereas BioMart returns the current Ensembl release, which may be ahead of the annotations used to build real-world count matrices. Preferring AnnotationDbi-confirmed IDs therefore maximises compatibility with count matrices from sequencing providers whose pipelines are not frequently updated.

Within each group of rows sharing a gene_name, the following priority order is applied until a single row is selected:

  1. Pre-filter: If any row has hgnc_symbol_2 == gene_name (AnnotationDbi confirms the symbol), rows where hgnc_symbol_2 is NA are discarded first. This ensures that an AnnotationDbi-confirmed row is never passed over in favour of an unconfirmed one merely because the latter happens to have hgnc_symbol == gene_name from BioMart.

  2. BioMart symbol match: Rows where hgnc_symbol == gene_name (and is not a raw ENSG placeholder).

  3. AnnotationDbi symbol match: Rows where hgnc_symbol_2 == gene_name (and is not a raw ENSG placeholder).

  4. Both sources agree: Rows where hgnc_symbol == hgnc_symbol_2, indicating cross-source confirmation.

  5. AnnotationDbi ENSG confirmation: Rows whose ensembl_gene_id matches the first entry in the "///"-separated ensg_2 list returned by AnnotationDbi. Note that ensg_2 list ordering is not considered a reliable preference signal on its own; this filter is intentionally placed after the source-agreement filters.

  6. Drop ENSG placeholders: Rows where hgnc_symbol is still a raw ENSG ID are deprioritised.

  7. Last resort: When all disambiguation fields (hgnc_symbol_2, ensg_2) are NA across the entire group, the first row is taken. When rows are otherwise identical in all metadata, the newer ENSG ID (as returned by BioMart) is preferred as the more current annotation.
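
The priority order above can be sketched as a chain of filters, each applied only if it leaves at least one row. This is a simplified illustration under stated assumptions, not the package's implementation: narrow(), is_ensg() and pick_row() are hypothetical names, the placeholder pattern "^ENSG[0-9]+" is assumed, the IDs in the toy group are made up, and the "newer ENSG ID" tie-break from the last step is not modelled.

```r
# Hypothetical helper: keep the rows selected by `keep`, unless that
# would leave nothing, in which case the group is returned unchanged.
narrow <- function(x, keep) {
  keep <- !is.na(keep) & keep
  if (any(keep)) x[keep, , drop = FALSE] else x
}

# Assumed pattern for a raw ENSG placeholder.
is_ensg <- function(s) !is.na(s) & grepl("^ENSG[0-9]+", s)

# Sketch: select one row from a group of rows sharing a gene_name.
pick_row <- function(x) {
  # 1. Pre-filter: if AnnotationDbi confirms the symbol in any row,
  #    discard rows where hgnc_symbol_2 is NA.
  confirmed <- !is.na(x$hgnc_symbol_2) & x$hgnc_symbol_2 == x$gene_name
  if (any(confirmed)) x <- x[!is.na(x$hgnc_symbol_2), , drop = FALSE]
  # 2. BioMart symbol match.
  x <- narrow(x, x$hgnc_symbol == x$gene_name & !is_ensg(x$hgnc_symbol))
  # 3. AnnotationDbi symbol match.
  x <- narrow(x, x$hgnc_symbol_2 == x$gene_name & !is_ensg(x$hgnc_symbol_2))
  # 4. Both sources agree.
  x <- narrow(x, x$hgnc_symbol == x$hgnc_symbol_2)
  # 5. ENSG confirmation: ensembl_gene_id equals the first entry of ensg_2.
  first_ensg <- trimws(sub("///.*$", "", x$ensg_2))
  x <- narrow(x, x$ensembl_gene_id == first_ensg)
  # 6. Deprioritise rows whose hgnc_symbol is still an ENSG placeholder.
  x <- narrow(x, !is_ensg(x$hgnc_symbol))
  # 7. Last resort: first remaining row.
  x[1, , drop = FALSE]
}

# Toy group (made-up IDs): row 1 matches only via BioMart, row 2 is
# confirmed by AnnotationDbi. The pre-filter removes row 1 first.
grp <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002"),
  gene_name       = c("GENE1", "GENE1"),
  hgnc_symbol     = c("GENE1", "ENSG00000000002"),
  hgnc_symbol_2   = c(NA, "GENE1"),
  ensg_2          = c(NA, "ENSG00000000002"),
  stringsAsFactors = FALSE
)
picked <- pick_row(grp)
```

In this toy group the AnnotationDbi-confirmed row is kept even though the other row has a BioMart symbol match, which is the situation the pre-filter is described as guarding against.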

The second pass (by hgnc_symbol) applies the same principle but additionally prefers rows whose hgnc_symbol matches gene_name, and uses AnnotationDbi ENSG confirmation as a tiebreaker before falling back to the first row of the group (x[1, ]).

ENSG placeholder resolution

After the filter chain, any remaining rows where hgnc_symbol is a raw ENSG placeholder are fixed: if hgnc_symbol_2 is available it is used; otherwise gene_name is used (or ensembl_gene_id in ENSG-only mode). This allows rows with ENSG placeholders from BioMart to be correctly resolved in the second pass via their hgnc_symbol_2 value.
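
A minimal sketch of this resolution step follows. The function name resolve_placeholders(), its ensg_only flag, the placeholder pattern, and the toy IDs are assumptions made for illustration; the package may implement the step differently.

```r
# Hypothetical sketch: replace raw ENSG placeholders in hgnc_symbol with
# hgnc_symbol_2 when available, otherwise with gene_name
# (or ensembl_gene_id in ENSG-only mode).
resolve_placeholders <- function(df, ensg_only = FALSE) {
  is_placeholder <- !is.na(df$hgnc_symbol) &
    grepl("^ENSG[0-9]+$", df$hgnc_symbol)   # assumed placeholder pattern
  fallback <- if (ensg_only) df$ensembl_gene_id else df$gene_name
  replacement <- ifelse(!is.na(df$hgnc_symbol_2), df$hgnc_symbol_2, fallback)
  df$hgnc_symbol[is_placeholder] <- replacement[is_placeholder]
  df
}

# Toy input with made-up IDs and names.
toy_ph <- data.frame(
  ensembl_gene_id = c("ENSG00000000001", "ENSG00000000002", "ENSG00000000003"),
  gene_name       = c("OLD1", "GENE2", "GENE3"),
  hgnc_symbol     = c("ENSG00000000001", "ENSG00000000002", "GENE3"),
  hgnc_symbol_2   = c("GENE1", NA, "GENE3"),
  stringsAsFactors = FALSE
)
resolved <- resolve_placeholders(toy_ph)
```

Here the first row takes its symbol from hgnc_symbol_2, the second falls back to gene_name, and the third is left untouched because it is not a placeholder.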

BioMart fallback

BioMart queries are attempted on the primary host first, then on each fallback mirror in turn. If all hosts fail, the function proceeds with AnnotationDbi results only. If both BioMart and AnnotationDbi fail entirely, the input is returned with ENSG IDs used as hgnc_symbol values.
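
The host fallback can be sketched as a loop that tries each host and returns the first successful connection, or NULL if none succeeds. The helper connect_with_fallback() and the connect argument are hypothetical; a real connector would presumably wrap a biomaRt call such as biomaRt::useMart(), but that is an assumption about the implementation. The fake connector below only simulates failures so that the example runs without network access.

```r
# Hypothetical sketch: try each host in order; return the first
# successful connection, or NULL if every host fails.
connect_with_fallback <- function(hosts, connect, verbose = FALSE) {
  for (h in hosts) {
    mart <- tryCatch(
      connect(h),
      error = function(e) {
        if (verbose) {
          message("BioMart host failed: ", h, " (", conditionMessage(e), ")")
        }
        NULL
      }
    )
    if (!is.null(mart)) return(mart)
  }
  NULL
}

# Simulated connector: every host except the last one "times out".
fake_connect <- function(host) {
  if (host != "https://useast.ensembl.org") stop("timeout")
  paste("connected to", host)
}

hosts <- c("https://www.ensembl.org",      # primary host
           "https://uswest.ensembl.org",   # fallback mirrors
           "https://asia.ensembl.org",
           "https://useast.ensembl.org")
mart <- connect_with_fallback(hosts, fake_connect)
```

A NULL result would be the signal for the caller to continue with AnnotationDbi results only.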

Value

A deduplicated data frame with unified HGNC symbols in the hgnc_symbol column. When keep_intermediates = TRUE, the hgnc_symbol_2 and ensg_2 columns from the AnnotationDbi lookups are also included.

Examples

## Not run: 
# Example input: two-column data frame with Ensembl IDs and gene symbols,
# as typically produced by a sequencing provider's count matrix annotation
my_genes <- data.frame(
  gene_id   = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000460",
                "ENSG00000012048", "ENSG00000075624", "ENSG00000111640",
                "ENSG00000141510", "ENSG00000146648"),
  gene_name = c("TSPAN6", "DPM1", "FIRRM",
                "BRCA1",  "ACTB",  "GAPDH",
                "TP53",   "EGFR"),
  stringsAsFactors = FALSE
)

# With gene symbols (full mode)
result <- unify_gene_ids(my_genes,
                         ensg_col   = "gene_id",
                         symbol_col = "gene_name",
                         verbose    = TRUE)

# ENSG-only (e.g. from count matrix row names, no symbol column available)
ensg_only <- data.frame(
  ensembl_gene_id  = my_genes$gene_id,
  stringsAsFactors = FALSE
)
result_ensg <- unify_gene_ids(ensg_only, verbose = TRUE)

## End(Not run)


convertid documentation built on April 1, 2026, 5:06 p.m.