This vignette describes how the
r Biocpkg("ensembldb") package can be used to
retrieve protein annotations and to map peptides within protein sequences to the
EnsDb databases and packages created by the
package version > 1.99.0 can provide protein annotations, but don't necessarily
have too. Only
EnsDb packages created using the Ensembl Perl API contain
protein annotations, while databases created from GTF or GFF files or from
GRanges objects or using the
r Biocpkg("AnnotationHub") don't.
Below we load the
EnsDb object with for all human genes defined in the Ensembl
database version 86 and use the
hasProteinData method to evaluate whether
protein annotations are available.
library(ensembldb) library(EnsDb.Hsapiens.v86) ## Make a shortcut to the object edb <- EnsDb.Hsapiens.v86 hasProteinData(edb)
ensembldb provides the
proteins method to fetch protein annotations from an
EnsDb database (along eventual transcript or gene annotations) and, depending
on the value of the
return.type parameter, return the results as a
AAStringSet. Below we use the function to
retrieve all proteins encoded by the gene ZBTB16. We use a
select entries for that gene. In the simple example below only protein
annotations were retrieved (i.e. the Ensembl protein ID, the amino acid
sequence, the ID of the transcript encoding the protein and the gene name; the
latter two stored in the
EnsDbs provide in
addition also the mapping between Ensembl protein IDs and Uniprot IDs and all
protein domains within the protein sequence. For more information on available
filters, methods and annotation columns see the
prts <- proteins(edb, filter = GenenameFilter("ZBTB16"), return.type = "AAStringSet") prts ## Get access to the additional annotation data mcols(prts)
Proteins method is also implemented for
EnsDb objects that
enables to load a
Proteins object from an
EnsDb database and optionally load
all protein domains as peptide ranges in the
pranges slot (using the
loadProteinDomains parameter which is by default
TRUE). Below we use also a
filter expression in form of a
formula instead of an
library(Pbase) zbtb16 <- Proteins(edb, filter = ~ genename == "ZBTB16")
The proteins' sequences are stored in the
aa slot and the protein domains in
## Get the protein sequences. aa(zbtb16) ## Get the protein domains stored as peptide features. pranges(zbtb16)
Proteins method supports to retrieve additional annotation columns
from the database. Below we repeat the call but fetch in addition also the
Uniprot IDs for the proteins. With that a property of the provided annotation
becomes apparent: while the mapping between Ensembl transcript ID and Ensembl
protein ID is always 1:1, each Ensembl protein can be annotated to more than one
Uniprot ID and each Uniprot ID can be assigned one or more Ensembl protein IDs.
## Loading in addition Uniprot ID annotations zbtb16_2 <- Proteins(edb, columns = c("uniprot_id", "uniprot_db", "uniprot_mapping_type"), filter = GenenameFilter("ZBTB16")) acols(zbtb16_2)
As we can see two transcript IDs (ENST00000335953 and ENST00000392996) and hence
Ensembl protein IDs are each annotated to two Uniprot IDs. To avoid some of
these multi-mappings, we can use an
UniprotMappingTypeFilter to retrieve only
presumably higher quality annotations by selecting only Uniprot IDs mapped to
Ensembl protein IDs using the
"DIRECT" mapping type. We get hence again unique
Ensembl protein IDs respectively transcript IDs.
acols(Proteins(edb, columns = c("uniprot_id", "uniprot_db"), filter = AnnotationFilterList( GenenameFilter("ZBTB16"), UniprotMappingTypeFilter("DIRECT"))))
Note that without specifying a filter we can also retrieve all proteins from the database.
EnsDb databases provide both, protein annotations and gene/transcript
annotations including their genomic coordinates. Peptide features along amino
acid sequences within a
Proteins objects can hence be mapped to the genome
directly using the
mapToGenome method by providing an
EnsDb instance with
genome parameter. The
Proteins object has to provide IDs that allow to
identify the encoding transcripts. Supported IDs are Ensembl protein ID (
= "protein_id"), transcript ID (
idType = "tx_id") or Uniprot ID (
"uniprot_id") which can be provided either as the
names of the
object, or in one of the
acols metadata columns.
In the example below we use the
Proteins object with all proteins of the gene
ZBTB16 that we fetched from the database in the previous section. To identify
the encoding transcripts, we use the Ensembl protein IDs that are provided with
the names of the object.
## We use the Ensembl protein IDs to identify the transcripts in the database names(zbtb16) ## Map all peptide features within the object to the genome. zbtb16_map <- mapToGenome(zbtb16, edb, idType = "protein_id", id = "name") zbtb16_map
The result is a
GRangesList one element for each protein with
the genomic positions of the mapped peptide features (in our example protein
domains within the protein sequence). Below we plot the genomic alignments for
the protein domains of the first protein.
library(Gviz) ## Define a genome axis track gat <- GenomeAxisTrack() ## Get the transcript ID of the first transcript: txid <- acols(zbtb16)$tx_id ## Get a GRanges for the first transcript trt <- getGeneRegionTrackForGviz(edb, filter = TxIdFilter(txid)) ## Add the protein domain name as column "id" to the mcols so it ## can be used in the plot to identify the features. map_1 <- zbtb16_map[] map_1$id <- names(map_1) ## Plotting the transcript and the mapped protein domains. plotTracks(list(gat, GeneRegionTrack(trt, name = "tx"), AnnotationTrack(map_1, groupAnnotation = "id", just.group = "above", name = "Protein domains")), transcriptAnnotation = "transcript")
mapToGenome,Proteins,EnsDb method used above performs all of the
mapping and supports mapping of multiple proteins, we could also fetch the
genomic positions of the encoding transcripts' CDS using the
cdsBy method and
provide the resulting
GRangesList to the
In addition to Ensembl transcript or protein IDs, it is also possible to provide
Uniprot IDs. While the mapping between Ensembl protein IDs and transcript IDs is
1:1, multiple Ensembl protein IDs can be assigned to a single Uniprot ID hence
the mappoing between Uniprot IDs and Ensembl transcript IDs is also 1:n. In
cases in which a Uniprot ID is annotated to multiple transcript IDs, the
mapToGenome method will select automatically the best suited transcript by
comparing the length of the protein sequence with the length of the transcripts'
coding sequences. In the example below we use the test
Proteins data from the
Pbase package and perform the mapping of the peptide features using the
provided Uniprot IDs of the proteins.
data(p) ## We use the Uniprot IDs provided as names for the mapping names(p) res <- mapToGenome(p, edb, idType = "uniprot_id")
Apart from one of the Uniprot IDs, P04075-2, for which no corresponding transcript could be identified, all peptide features could be mapped successfully. As detailed above, the method did select for each Uniprot ID the transcript with the best matching CDS. To illustrate this, we select below all transcripts that are annotated to the first Uniprot ID.
txs <- transcripts(edb, filter = UniprotFilter(names(p))) names(txs)
While there are 6 transcripts annotated to the Uniprot ID from only one the coding sequence matches the length of the protein:
## Fetch the CDS for each transcript cdss <- cdsBy(edb, filter = TxIdFilter(names(txs))) ## The protein sequence length: width(aa(p)) ## The difference between protein sequence and CDS length width(aa(p)) - sum(width(cdss)) / 3
Thus, the lenght of the coding region of the last transcript matches the length of the protein's amino acid sequence (the stop codon, being part of the CDS, is not coding for an amino acid).
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.