proteinToGenome: Map protein-relative coordinates to genomic coordinates

proteinToGenomeR Documentation

Map protein-relative coordinates to genomic coordinates


proteinToGenome is a generic function for mapping ranges of protein-relative positions to the genome.

NOTE: This man page is for the proteinToGenome S4 generic function and methods defined in the GenomicFeatures package, which are (loosely) modeled on the proteinToGenome function from the ensembldb package. See ?ensembldb::proteinToGenome for the latter.


## S4 generic function:
proteinToGenome(x, db, ...)  # dispatch is on 2nd argument 'db'

## S4 method for signature 'ANY'
proteinToGenome(x, db)

## S4 method for signature 'GRangesList'
proteinToGenome(x, db)



A named IRanges object (or derivative) containing ranges of protein-relative positions (protein-relative positions are positions relative to a protein sequence).

The names on x must be transcript names present in db. More precisely, for the default proteinToGenome() method, names(x) must be a subset of:

  mcols(transcripts(db, columns="tx_name"))$tx_name

And for the method for GRangesList objects, names(x) must be a subset of:


For the default proteinToGenome() method: A TxDb object or any object that supports transcripts() and cdsBy() (e.g. an EnsDb object from the ensembldb package).

For the method for GRangesList objects: A named GRangesList object (or derivative) where each list element is a GRanges object representing a CDS (the ranges in the GRanges object must represent the CDS parts ordered by ascending exon rank).


Further arguments to be passed to specific methods.


The proteinToGenome() method for GRangesList objects is the workhorse behind the default method. Note that the latter is a thin wrapper around the former, which simply does the following:

  1. Use cdsBy() to extract the CDS from db. The CDS are returned in a GRangesList object with transcript names on it.

  2. Call proteinToGenome() on x and the GRangesList object returned by cdsBy().


A named GRangesList object parallel to x (the transcript names on x are propagated). The i-th list element in the returned object is the result of mapping the range of protein-relative positions x[i] to the genome.

Note that a given range in x can only be mapped to the genome if the name on it is the name of a coding transcript. If it's not (i.e. if it's the name of a non-coding transcript), then an empty GRanges object is placed in the returned object to indicate the impossible mapping, and a warning is issued.

Otherwise, if a given range in x can be mapped to the genome, then the result of the mapping is represented by a non-empty GRanges object. Note that this object represents the original CDS associated to x, trimmed on its 5' end or 3' end, or on both. Furthermore, this object will have the same metadata columns as the GRanges object representing the original CDS, plus the 2 following ones:

  • protein_start: The protein-relative start of the mapping.

  • protein_end: The protein-relative end of the mapping.


Unlike ensembldb::proteinToGenome() which can work either with Ensembl protein IDs or Ensembl transcript IDs on x, the default proteinToGenome() method described above only accepts transcript names on x.

This means that, if the user is in possession of protein IDs, they must first replace them with the corresponding transcript IDs (referred to as transcript names in the context of TxDb objects). How to do this exactly depends on the origin of those IDs (UCSC, Ensembl, GTF/GFF3 file, FlyBase, etc...)


H. Pag├Ęs, using ensembldb::proteinToGenome() for inspiration and design.

See Also

  • The proteinToGenome function in the ensembldb package, which the proteinToGenome() generic and methods documented in this man page are (loosely) modeled on.

  • TxDb objects.

  • EnsDb objects (TxDb-like objects) in the ensembldb package.

  • transcripts for extracting transcripts from a TxDb-like object.

  • cdsBy for extracting CDS from a TxDb-like object.

  • IRanges objects in the IRanges package.

  • GRanges and GRangesList objects in the GenomicRanges package.


## ---------------------------------------------------------------------
## ---------------------------------------------------------------------
CDS1 <- GRanges(c("chrX:11-60:+", "chrX:101-125:+"))
CDS2 <- GRanges(c("chrY:201-230:-", "chrY:101-125:-", "chrY:11-60:-"))
cds_by_tx <- GRangesList(TX1=CDS1, TX2=CDS2)

x1 <- IRanges(start=8, end=20, names="TX1")
proteinToGenome(x1, cds_by_tx)

x2 <- IRanges(start=c(1, 18), end=c(25, 20), names=c("TX1", "TX1"))
proteinToGenome(x2, cds_by_tx)

x3 <- IRanges(start=8, end=15, names="TX2")
proteinToGenome(x3, cds_by_tx)

x4 <- c(x3, x2)
proteinToGenome(x4, cds_by_tx)

## ---------------------------------------------------------------------
## ---------------------------------------------------------------------
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene

## The first transcript (FBtr0309810) is non-coding:
x <- IRanges(c(FBtr0309810="11-55", FBtr0306539="90-300"))
res <- proteinToGenome(x, txdb)

Bioconductor/GenomicFeatures documentation built on Nov. 4, 2022, 11:03 a.m.