translate_ids: Translate gene, protein and small molecule identifiers

translate_ids    R Documentation

Translate gene, protein and small molecule identifiers

Description

Translates a vector of identifiers, resulting in a new vector, or translates a column of identifiers in a data frame by adding another column with the target identifiers.

Usage

translate_ids(
  d,
  ...,
  uploadlists = FALSE,
  ensembl = FALSE,
  hmdb = FALSE,
  ramp = FALSE,
  chalmers = FALSE,
  entity_type = NULL,
  keep_untranslated = TRUE,
  return_df = FALSE,
  organism = 9606,
  reviewed = TRUE,
  complexes = NULL,
  complexes_one_to_many = NULL
)

Arguments

d

Character vector or data frame.

...

At least two arguments, with or without names. The first of these arguments describes the source identifier, the rest of them describe the target identifier(s). The values of all these arguments must be valid identifier types as shown in Details. The names of the arguments are column names. In case of the first (source) ID the column must exist. For the rest of the IDs new columns will be created with the desired names. For ID types provided as arguments without names, the name of the ID type will be used for column name.
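The naming rule described above can be sketched in base R. The helper below, resolve_id_columns, is hypothetical and not part of OmnipathR; it only mimics how named and unnamed arguments could be paired with column names, assuming ID types are passed as bare symbols as in the Examples.

```r
# Hypothetical helper, NOT part of OmnipathR: pairs each argument in `...`
# with a column name, following the rule described for translate_ids.
resolve_id_columns <- function(...) {
  # Capture the arguments unevaluated, so bare symbols such as `uniprot`
  # do not have to exist as variables.
  exprs <- as.list(substitute(list(...)))[-1]
  id_types <- vapply(exprs, deparse, character(1), USE.NAMES = FALSE)
  col_names <- names(exprs)
  if (is.null(col_names)) col_names <- rep("", length(exprs))
  # Unnamed arguments: the ID type itself serves as the column name.
  unnamed <- col_names == ""
  col_names[unnamed] <- id_types[unnamed]
  setNames(id_types, col_names)
}

# First argument: source column `uniprot_id` holds `uniprot` IDs.
# Second argument: target ID type `genesymbol`, column named after the type.
cols <- resolve_id_columns(uniprot_id = uniprot, genesymbol)
print(cols)
```

Here the names of the result are the column names and the values are the ID types.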

uploadlists

Force using the uploadlists service from UniProt. By default the plain query interface is used (implemented in uniprot_full_id_mapping_table in this package). If any of the provided ID types is only available in the uploadlists service, it will be automatically selected. The plain query interface is preferred because in the long term, with caching, it requires less download and data storage.

ensembl

Logical: use data from Ensembl BioMart instead of UniProt.

hmdb

Logical: use HMDB ID translation data.

ramp

Logical: use RaMP ID translation data.

chalmers

Logical: use ID translation data from Chalmers Sysbio GEM.

entity_type

Character: the type of the molecular entity. "gene" stands for genes and proteins, "smol" for small molecules. Several other synonyms are also accepted.

keep_untranslated

Logical: if the output is a data frame, keep the records where the source identifier could not be translated. For these records the target identifier will be NA.
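The effect of keep_untranslated resembles the difference between a left join and an inner join. The base R sketch below uses a toy mapping table with placeholder IDs (S1, T1, ...), not real identifiers, and is not the package's implementation.

```r
# Toy input and mapping table; all IDs are placeholders.
d <- data.frame(src = c("S1", "S2", "S3"), stringsAsFactors = FALSE)
mapping <- data.frame(
  src = c("S1", "S2"),
  tgt = c("T1", "T2"),
  stringsAsFactors = FALSE
)

# Analogous to keep_untranslated = TRUE: S3 stays, with NA as its target.
kept <- merge(d, mapping, by = "src", all.x = TRUE)

# Analogous to keep_untranslated = FALSE: S3 is dropped.
dropped <- merge(d, mapping, by = "src")

print(kept)
print(dropped)
```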

return_df

Return a data frame even if the input is a vector.

organism

Character or integer, name or NCBI Taxonomy ID of the organism (by default 9606 for human). Matters only if uploadlists is FALSE.

reviewed

Translate only reviewed (TRUE), only unreviewed (FALSE) or both (NULL) UniProt records. Matters only if uploadlists is FALSE.

complexes

Logical: translate complexes by their members. Only complexes where all members can be translated will be included in the result. If NULL, the option omnipathr.complex_translation will be used.

complexes_one_to_many

Logical: allow combinatorial expansion or use only the first target identifier for each member of each complex. If NULL, the option omnipathr.complex_translation_one_to_many will be used.
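The two complex-related options can be illustrated with a small base R sketch. The function translate_complex, the member names and the underscore-joined output format are all assumptions made for illustration; the package's own logic may differ.

```r
# Hypothetical helper, NOT part of OmnipathR. `mapping` is a named list:
# each source member ID maps to one or more target IDs.
translate_complex <- function(members, mapping, one_to_many = TRUE) {
  targets <- lapply(members, function(m) mapping[[m]])
  # A complex is dropped if any of its members cannot be translated.
  if (any(vapply(targets, is.null, logical(1)))) return(character(0))
  # Without one-to-many expansion, keep only the first target per member.
  if (!one_to_many) targets <- lapply(targets, `[`, 1)
  # All combinations of member translations.
  combos <- expand.grid(targets, stringsAsFactors = FALSE)
  unname(apply(combos, 1, paste, collapse = "_"))
}

# Placeholder IDs: member A is ambiguous, member C has no translation.
mapping <- list(A = c("a1", "a2"), B = "b1")

expanded <- translate_complex(c("A", "B"), mapping, one_to_many = TRUE)
first_only <- translate_complex(c("A", "B"), mapping, one_to_many = FALSE)
untranslatable <- translate_complex(c("A", "C"), mapping)
```

With these toy inputs, expansion yields one translated complex per combination of member targets, while the restricted variant yields a single complex.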

Details

Depending on the uploadlists parameter, this function uses either the uploadlists service of UniProt or plain UniProt queries to obtain identifier translation tables. The possible values for the source and target ID types are the identifier type abbreviations used in the UniProt API; please refer to the table here: https://www.uniprot.org/help/api_idmapping. In addition, simple synonyms are available which provide a uniform API for the uploadlists and UniProt query based backends. These are the following:

OmnipathR       Uploadlists           UniProt query       Ensembl BioMart
uniprot         ACC                   id                  uniprotswissprot
uniprot_entry   ID                    entry name
trembl          reviewed = FALSE      reviewed = FALSE    uniprotsptrembl
genesymbol      GENENAME              genes(PREFERRED)    external_gene_name
genesymbol_syn                        genes(ALTERNATIVE)  external_synonym
hgnc            HGNC_ID               database(HGNC)      hgnc_symbol
entrez          P_ENTREZGENEID        database(GeneID)
ensembl         ENSEMBL_ID                                ensembl_gene_id
ensg            ENSEMBL_ID                                ensembl_gene_id
enst            ENSEMBL_TRS_ID        database(Ensembl)   ensembl_transcript_id
ensp            ENSEMBL_PRO_ID                            ensembl_peptide_id
ensgg           ENSEMBLGENOME_ID
ensgt           ENSEMBLGENOME_TRS_ID
ensgp           ENSEMBLGENOME_PRO_ID
protein_name                          protein names
pir             PIR                   database(PIR)
ccds                                  database(CCDS)
refseqp         P_REFSEQ_AC           database(refseq)
ipro                                                      interpro
ipro_desc                                                 interpro_description
ipro_sdesc                                                interpro_short_description
wikigene                                                  wikigene_name
rnacentral                                                rnacentral
gene_desc                                                 description
wormbase                              database(WormBase)
flybase                               database(FlyBase)
xenbase                               database(Xenbase)
zfin                                  database(ZFIN)
pbd             PBD_ID                database(PDB)       pbd

For a complete list of ID types and their synonyms, including metabolite and chemical ID types which are not shown here, see id_types.

The mapping between identifiers can be ambiguous. In this case one row or element of the input yields multiple rows or elements in the returned data frame or vector(s).
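The expansion caused by ambiguous mappings can be reproduced with a toy example in base R. The mapping table below uses placeholder IDs and stands in for a real translation table; it is a sketch of the behaviour, not of the package internals.

```r
# Two source IDs; S1 maps to two targets, S2 to one (placeholder IDs).
ids <- c("S1", "S2")
mapping <- data.frame(
  src = c("S1", "S1", "S2"),
  tgt = c("T1", "T2", "T3"),
  stringsAsFactors = FALSE
)

# Joining on the source ID repeats the S1 row once per matching target.
expanded <- merge(data.frame(src = ids, stringsAsFactors = FALSE),
                  mapping, by = "src")

# The translated vector is therefore longer than the input vector.
translated <- expanded$tgt
print(expanded)
```

Code that relies on the output having the same length or row order as the input should account for this.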

Value

  • Data frame: if the input is a data frame or the input is a vector and return_df is TRUE.

  • Vector: if the input is a vector, there is only one target ID type and return_df is FALSE.

  • List of vectors: if the input is a vector, there is more than one target ID type and return_df is FALSE. The names of the list will be the ID types (as if they were column names, see the description of the ... argument), and the list will also include the source IDs.

See Also

  • translate_ids_multi

  • uniprot_id_mapping_table

  • uniprot_full_id_mapping_table

  • ensembl_id_mapping_table

  • hmdb_id_mapping_table

  • id_types

  • ensembl_id_type

  • uniprot_id_type

  • uploadlists_id_type

  • hmdb_id_type

  • chalmers_gem_id_type

Examples

d <- data.frame(uniprot_id = c('P00533', 'Q9ULV1', 'P43897', 'Q9Y2P5'))
d <- translate_ids(d, uniprot_id = uniprot, genesymbol)
d
#   uniprot_id genesymbol
# 1     P00533       EGFR
# 2     Q9ULV1       FZD4
# 3     P43897       TSFM
# 4     Q9Y2P5    SLC27A5


saezlab/OmnipathR documentation built on Oct. 16, 2024, 11:49 a.m.