View source: R/unify_gene_ids.R
| unify_gene_ids | R Documentation |
Takes a data frame with Ensembl gene IDs (and optionally gene symbols) and returns a deduplicated data frame with unified HGNC symbols, using a priority-based reconciliation of BioMart and AnnotationDbi results.
unify_gene_ids(
  genes,
  ensg_col = "ensembl_gene_id",
  symbol_col = NULL,
  host = "https://www.ensembl.org",
  biomart_fallback = c("https://uswest.ensembl.org", "https://asia.ensembl.org",
    "https://useast.ensembl.org"),
  keep_intermediates = FALSE,
  verbose = FALSE
)
genes |
A data frame containing at least an Ensembl gene ID column, or a character vector of Ensembl gene IDs. |
ensg_col |
Name of the column containing Ensembl gene IDs.
Default: "ensembl_gene_id". |
symbol_col |
Name of the column containing gene symbols, or NULL (the default) when only Ensembl gene IDs are available (ENSG-only mode). |
host |
BioMart host URL. Default: "https://www.ensembl.org". |
biomart_fallback |
Character vector of fallback BioMart host URLs to try
if the primary host fails. Set to |
keep_intermediates |
Logical; if |
verbose |
Logical; if |
Requires the Bioconductor packages org.Hs.eg.db and AnnotationDbi. These are not hard dependencies but will be checked at runtime with an informative error if missing.
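A runtime check of this kind can be sketched in base R as follows. The helper name check_required_packages() is hypothetical and is not part of the package; the actual check and its error message may differ.

```r
# Hypothetical helper (not part of the package): stop with an informative
# message if any of the named packages is not installed.
check_required_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  absent <- pkgs[!installed]
  if (length(absent) > 0) {
    stop(
      "Missing required package(s): ", paste(absent, collapse = ", "),
      ". Install with BiocManager::install(c(",
      paste0('"', absent, '"', collapse = ", "), ")).",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Usage: raises an error only if one of the packages is missing.
# check_required_packages(c("org.Hs.eg.db", "AnnotationDbi"))
```

Using requireNamespace() rather than library() keeps the packages as soft dependencies: nothing is attached to the search path, and the check only reports whether each package can be loaded.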
Deduplication passes
The function performs two sequential deduplication passes via the internal
dedup_gene_ids() function:
1. Deduplicate by gene_name (if available) or ensembl_gene_id,
   resolving multiple ENSG IDs mapping to the same gene name.
2. Deduplicate by hgnc_symbol, resolving cases where multiple
   gene names resolve to the same symbol.
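The two-pass structure can be illustrated with a simplified stand-in. The helper dedup_by() below is hypothetical and simply keeps the first row of each group; the internal dedup_gene_ids() applies the priority rules documented on this page instead, and its signature is not shown here. All IDs and symbols in the toy data are made up.

```r
# Simplified stand-in for one deduplication pass: group rows by `key` and keep
# one row per group, chosen by `pick` (default: the first row of the group).
# Note: split() drops rows whose key is NA; a real implementation would need
# to handle those explicitly.
dedup_by <- function(df, key, pick = function(x) x[1, , drop = FALSE]) {
  groups <- split(df, df[[key]], drop = TRUE)
  out <- do.call(rbind, lapply(groups, pick))
  rownames(out) <- NULL
  out
}

toy <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D"),
  gene_name       = c("G1", "G1", "G2", "G3"),
  hgnc_symbol     = c("S1", "S1", "S2", "S2"),
  stringsAsFactors = FALSE
)

pass1 <- dedup_by(toy, "gene_name")      # one row per gene_name
pass2 <- dedup_by(pass1, "hgnc_symbol")  # one row per hgnc_symbol
pass2$ensembl_gene_id                    # "ENSG_A" "ENSG_C"
```

The first pass collapses ENSG_A and ENSG_B (same gene name); the second collapses ENSG_C and ENSG_D (different gene names, same symbol).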
Symbol assignment priority
The guiding principle is that AnnotationDbi confirmation outranks BioMart ordering. AnnotationDbi (org.Hs.eg.db) reflects a stable, versioned annotation database, while BioMart returns the current Ensembl release, which may be ahead of the annotations used to build real-world count matrices. Preferring AnnotationDbi-confirmed IDs therefore maximises compatibility with count matrices from sequencing providers whose pipelines are not frequently updated.
Within each group of rows sharing a gene_name, the following priority
order is applied until a single row is selected:
1. Pre-filter: If any row has hgnc_symbol_2 == gene_name
   (AnnotationDbi confirms the symbol), rows with hgnc_symbol_2 == NA
   are discarded first. This ensures that an AnnotationDbi-confirmed row is
   never passed over in favour of an unconfirmed one merely because the
   latter happens to have hgnc_symbol == gene_name from BioMart.
2. BioMart symbol match: Rows where hgnc_symbol == gene_name
   (and is not a raw ENSG placeholder).
3. AnnotationDbi symbol match: Rows where
   hgnc_symbol_2 == gene_name (and is not a raw ENSG placeholder).
4. Both sources agree: Rows where
   hgnc_symbol == hgnc_symbol_2, indicating cross-source confirmation.
5. BioMart ENSG confirmation: Rows whose ensembl_gene_id
   matches the first entry in the ensg_2 ///-separated list
   returned by AnnotationDbi. Note that ensg_2 list ordering is not
   considered a reliable preference signal on its own; this filter is
   intentionally placed after source-agreement filters.
6. Drop ENSG placeholders: Rows where hgnc_symbol is
   still a raw ENSG ID are deprioritised.
7. Last resort: When all disambiguation fields
   (hgnc_symbol_2, ensg_2) are NA across the entire
   group, the first row is taken. When rows are otherwise identical in all
   metadata, the newer ENSG ID (as returned by BioMart) is preferred as the
   more current annotation.
The second pass (by hgnc_symbol) applies the same principle but
additionally prefers rows whose hgnc_symbol matches gene_name,
and uses AnnotationDbi ENSG confirmation as a tiebreaker before falling back
to the first row of the group (x[1, ]).
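The first-pass priority order described above can be sketched as a chain of filters, each applied only if it leaves at least one row. This is an illustrative reading of the documentation, not the package's internal code: the helpers is_ensg(), narrow() and pick_row() are hypothetical, the placeholder pattern is an assumption, the "newer ENSG ID" tiebreak is omitted, and all values in the toy group are made up.

```r
# Assumed placeholder test: a symbol that looks like a raw Ensembl gene ID.
is_ensg <- function(x) !is.na(x) & grepl("^ENSG[0-9]+", x)

# Keep only rows where `keep` is TRUE, unless that would empty the group.
# NA comparisons count as "no match".
narrow <- function(x, keep) {
  keep <- !is.na(keep) & keep
  if (any(keep)) x[keep, , drop = FALSE] else x
}

# Select one row from a group of rows sharing the same gene_name.
pick_row <- function(x) {
  # Pre-filter: if AnnotationDbi confirms the symbol for any row,
  # discard rows without an AnnotationDbi symbol.
  if (any(!is.na(x$hgnc_symbol_2) & x$hgnc_symbol_2 == x$gene_name)) {
    x <- x[!is.na(x$hgnc_symbol_2), , drop = FALSE]
  }
  # BioMart symbol match
  x <- narrow(x, x$hgnc_symbol == x$gene_name & !is_ensg(x$hgnc_symbol))
  # AnnotationDbi symbol match
  x <- narrow(x, x$hgnc_symbol_2 == x$gene_name & !is_ensg(x$hgnc_symbol_2))
  # Both sources agree
  x <- narrow(x, x$hgnc_symbol == x$hgnc_symbol_2)
  # ENSG confirmation: first entry of the "///"-separated ensg_2 list
  first_ensg <- trimws(sub("///.*$", "", x$ensg_2))
  x <- narrow(x, x$ensembl_gene_id == first_ensg)
  # Deprioritise rows whose symbol is still a raw ENSG placeholder
  x <- narrow(x, !is_ensg(x$hgnc_symbol))
  # Last resort: first remaining row
  x[1, , drop = FALSE]
}

# Toy group: two rows sharing gene_name "G1" (all values are made up).
grp <- data.frame(
  ensembl_gene_id = c("ENSG_X1", "ENSG_X2"),
  gene_name       = c("G1", "G1"),
  hgnc_symbol     = c("G1", "G1X"),
  hgnc_symbol_2   = c(NA, "G1"),
  ensg_2          = c(NA, "ENSG_X2 /// ENSG_X9"),
  stringsAsFactors = FALSE
)
chosen <- pick_row(grp)
chosen$ensembl_gene_id  # "ENSG_X2"
```

In the toy group the first row matches gene_name via BioMart but has no AnnotationDbi symbol, so the pre-filter removes it and the AnnotationDbi-confirmed second row is selected.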
ENSG placeholder resolution
After the filter chain, any remaining rows where hgnc_symbol is a raw
ENSG placeholder are fixed: if hgnc_symbol_2 is available it is used;
otherwise gene_name is used (or ensembl_gene_id in ENSG-only
mode). This allows rows with ENSG placeholders from BioMart to be correctly
resolved in the second pass via their hgnc_symbol_2 value.
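A minimal sketch of this resolution step, assuming a placeholder is any hgnc_symbol that looks like a raw Ensembl gene ID. The function resolve_placeholders() and its fallback_col argument are hypothetical, and the lookup values in the toy data are invented for illustration.

```r
# Replace raw ENSG placeholders in hgnc_symbol: prefer hgnc_symbol_2 when it
# is available, otherwise fall back to `fallback_col` (gene_name in full mode,
# ensembl_gene_id in ENSG-only mode).
resolve_placeholders <- function(df, fallback_col = "gene_name") {
  is_placeholder <- !is.na(df$hgnc_symbol) &
    grepl("^ENSG[0-9]+$", df$hgnc_symbol)
  use_adbi     <- is_placeholder & !is.na(df$hgnc_symbol_2)
  use_fallback <- is_placeholder & is.na(df$hgnc_symbol_2)
  df$hgnc_symbol[use_adbi]     <- df$hgnc_symbol_2[use_adbi]
  df$hgnc_symbol[use_fallback] <- df[[fallback_col]][use_fallback]
  df
}

# Toy input: the hgnc_symbol and hgnc_symbol_2 values are invented.
toy <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000460"),
  gene_name       = c("TSPAN6", "DPM1", "FIRRM"),
  hgnc_symbol     = c("TSPAN6", "ENSG00000000419", "ENSG00000000460"),
  hgnc_symbol_2   = c("TSPAN6", "DPM1", NA),
  stringsAsFactors = FALSE
)
fixed <- resolve_placeholders(toy)
fixed$hgnc_symbol  # "TSPAN6" "DPM1" "FIRRM"
```

The second row is fixed from hgnc_symbol_2; the third has no AnnotationDbi symbol and falls back to gene_name.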
BioMart fallback
BioMart queries are attempted first against the primary host and then, on
failure, against each fallback mirror in turn.
If all hosts fail, the function proceeds with AnnotationDbi results only.
If both BioMart and AnnotationDbi fail entirely, the input is returned with
ENSG IDs used as hgnc_symbol values.
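The host fallback can be sketched as a loop that tries each host in order and returns the first successful result. The helper try_hosts() and its connect argument are hypothetical; the real function's connection and query code is not shown on this page.

```r
# Try each host in order; return the first successful result, or NULL if
# every host fails. `connect` is any function that takes a host URL and
# either returns a connection object or signals an error.
try_hosts <- function(hosts, connect, verbose = FALSE) {
  for (h in hosts) {
    res <- tryCatch(connect(h), error = function(e) {
      if (verbose) message("Host failed: ", h, " (", conditionMessage(e), ")")
      NULL
    })
    if (!is.null(res)) return(res)
  }
  NULL
}

# Demonstration with a fake connector that only "reaches" one mirror.
fake_connect <- function(h) {
  if (h == "https://asia.ensembl.org") paste("connected:", h) else stop("unreachable")
}
hosts <- c("https://www.ensembl.org", "https://uswest.ensembl.org",
           "https://asia.ensembl.org", "https://useast.ensembl.org")
conn <- try_hosts(hosts, fake_connect)
conn  # "connected: https://asia.ensembl.org"
```

Assuming the biomaRt package is used for the queries, connect could be something like function(h) biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl", host = h). A NULL return then signals the caller to continue with AnnotationDbi results only.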
Returns a deduplicated data frame with unified HGNC symbols in the
hgnc_symbol column, plus the hgnc_symbol_2 and ensg_2
columns from the AnnotationDbi lookups.
## Not run:
# Example input: two-column data frame with Ensembl IDs and gene symbols,
# as typically produced by a sequencing provider's count matrix annotation
my_genes <- data.frame(
  gene_id = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000460",
              "ENSG00000012048", "ENSG00000075624", "ENSG00000111640",
              "ENSG00000141510", "ENSG00000146648"),
  gene_name = c("TSPAN6", "DPM1", "FIRRM",
                "BRCA1", "ACTB", "GAPDH",
                "TP53", "EGFR"),
  stringsAsFactors = FALSE
)

# With gene symbols (full mode)
result <- unify_gene_ids(my_genes,
                         ensg_col = "gene_id",
                         symbol_col = "gene_name",
                         verbose = TRUE)

# ENSG-only (e.g. from count matrix row names, no symbol column available)
ensg_only <- data.frame(
  ensembl_gene_id = my_genes$gene_id,
  stringsAsFactors = FALSE
)
result_ensg <- unify_gene_ids(ensg_only, verbose = TRUE)
## End(Not run)