proteinToTranscript: Map protein-relative coordinates to positions within the...
In jotsetung/ensembldb: Utilities to create and use Ensembl-based annotation databases

proteinToTranscript

R Documentation

Map protein-relative coordinates to positions within the transcript

Description

proteinToTranscript maps protein-relative coordinates to positions within the encoding transcript. Note that the returned positions are relative to the complete transcript length, which includes the 5' UTR.

The regions within the protein sequence need to be provided as a named IRanges object with the names being protein identifiers and the start and end coordinates (within these proteins) defined by the IRanges object. As an alternative to the IRanges' names, protein identifiers can also be provided through a metadata column (see details below).

Similar to the proteinToGenome() function, proteinToTranscript compares for each protein whether the length of its sequence matches the length of the encoding CDS and throws a warning if that is not the case. Incomplete 3' or 5' CDS of the encoding transcript are the most common reasons for a mismatch between protein and transcript sequences.

Usage

proteinToTranscript(x, db, ...)

## S4 method for signature 'CompressedGRangesList'
proteinToTranscript(x, db, id = "name", idType = "protein_id", fiveUTR)

Arguments

`x`	`IRanges` with the coordinates within the protein(s). The object has also to provide some means to identify the protein (see details).
`db`	For the method for `EnsDb` objects: An `EnsDb` object to be used to retrieve genomic coordinates of encoding transcripts. For the method for `CompressedGRangesList` objects: A `CompressedGRangesList` object generated by `cdsBy()` where by = 'tx' and columns = c('tx_id' ,'protein_id','uniprot_id','protein_sequence').
`...`	Further arguments to be passed on.
`id`	`character(1)` specifying where the protein identifier can be found. Has to be either `"name"` or one of `colnames(mcols(x))`.
`idType`	`character(1)` defining what type of IDs are provided. Has to be one of `"protein_id"` (default), `"uniprot_id"` or `"tx_id"`.
`fiveUTR`	A `CompressedGRangesList` object generated by `fiveUTRsByTranscript()`.

Details

Protein identifiers (supported are Ensembl protein IDs or Uniprot IDs) can be passed to the function as names of the x IRanges object, or alternatively in any one of the metadata columns (mcols) of x.

Value

IRangesList, each element being the mapping results for one of the input ranges in x. Each element is a IRanges object with the positions within the encoding transcript (relative to the start of the transcript, which includes the 5' UTR). The transcript ID is reported as the name of each IRanges. The IRanges can be of length > 1 if the provided protein identifier is annotated to more than one Ensembl protein ID (which can be the case if Uniprot IDs are provided). If the coordinates can not be mapped (because the protein identifier is unknown to the database) an IRanges with negative coordinates is returned.

The following metadata columns are available in each IRanges in the result:

"protein_id": the ID of the Ensembl protein for which the within-protein coordinates were mapped to the genome.
"tx_id": the Ensembl transcript ID of the encoding transcript.
"cds_ok": contains TRUE if the length of the CDS matches the length of the amino acid sequence and FALSE otherwise.
"protein_start": the within-protein sequence start coordinate of the mapping.
"protein_end": the within-protein sequence end coordinate of the mapping.

Note

While mapping of Ensembl protein IDs to Ensembl transcript IDs is 1:1, a single Uniprot identifier can be annotated to several Ensembl protein IDs. proteinToTranscript calculates in such cases transcript-relative coordinates for each annotated Ensembl protein.

Mapping using Uniprot identifiers needs also additional internal checks that can have a significant impact on the performance of the function. It is thus strongly suggested to first identify the Ensembl protein identifiers for the list of input Uniprot identifiers (e.g. using the proteins() function and use these as input for the mapping function.

Author(s)

Johannes Rainer

Examples


library(EnsDb.Hsapiens.v86)
## Restrict all further queries to chromosome x to speed up the examples
edbx <- filter(EnsDb.Hsapiens.v86, filter = ~ seq_name == "X")

## Define an IRange with protein-relative coordinates within a protein for
## the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToTranscript(syp, edbx)
res
## Positions 4 to 17 within the protein span are encoded by the region
## from nt 23 to 64.

## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids

res <- proteinToTranscript(prngs, edbx, idType = "uniprot_id")

## The result is a list, same length as the input object
length(res)
names(res)

## No protein/encoding transcript could be found for the last one
res[[3]]

## The first protein could be mapped to multiple Ensembl proteins. The
## region within all transcripts encoding the region in the protein are
## returned
res[[1]]

## The result for the region within the second protein
res[[2]]

## Meanwhile, this function can be called in parallel processes if you preload
## the CDS data with desired data columns and fiveUTR data

cds <- cdsBy(edbx,columns = c(listColumns(edbx,'tx'),'protein_id','uniprot_id','protein_sequence'))
# cds <- cdsBy(edbx,columns = c(listColumns(edbx,'tx'),'protein_id','protein_sequence'))
# cds <- cdsBy(edbx,columns = c('tx_id','protein_id','protein_sequence'))

fiveUTR <- fiveUTRsByTranscript(edbx)

## Define an IRange with protein-relative coordinates within a protein for
## the gene SYP
syp <- IRanges(start = 4, end = 17)
names(syp) <- "ENSP00000418169"
res <- proteinToTranscript(syp, cds, fiveUTR = fiveUTR)
res
## Positions 4 to 17 within the protein span are encoded by the region
## from nt 23 to 64.

## Perform the mapping for multiple proteins identified by their Uniprot
## IDs.
ids <- c("O15266", "Q9HBJ8", "unexistant")
prngs <- IRanges(start = c(13, 43, 100), end = c(21, 80, 100))
names(prngs) <- ids

res <- proteinToTranscript(prngs, cds, idType = "uniprot_id", fiveUTR = fiveUTR)

jotsetung/ensembldb documentation built on Aug. 21, 2024, 11:23 a.m.

jotsetung/ensembldb index

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

jotsetung/ensembldb
Utilities to create and use Ensembl-based annotation databases

proteinToTranscript: Map protein-relative coordinates to positions within the...
In jotsetung/ensembldb: Utilities to create and use Ensembl-based annotation databases

Map protein-relative coordinates to positions within the transcript

Description

Usage

Arguments

Details

Value

Note

Author(s)

See Also

Examples

Related to proteinToTranscript in jotsetung/ensembldb...

R Package Documentation

Browse R Packages

We want your feedback!

jotsetung/ensembldb Utilities to create and use Ensembl-based annotation databases

proteinToTranscript: Map protein-relative coordinates to positions within the... In jotsetung/ensembldb: Utilities to create and use Ensembl-based annotation databases

Map protein-relative coordinates to positions within the transcript

Description

Usage

Arguments

Details

Value

Note

Author(s)

See Also

Examples

Related to proteinToTranscript in jotsetung/ensembldb...

R Package Documentation

Browse R Packages

We want your feedback!

jotsetung/ensembldb
Utilities to create and use Ensembl-based annotation databases

proteinToTranscript: Map protein-relative coordinates to positions within the...
In jotsetung/ensembldb: Utilities to create and use Ensembl-based annotation databases