Mapping experimental MS data to genomic coordinates

BiocStyle::markdown()

Package: Pbase
Author: Laurent Gatto
Last compiled: r date()
Last modified: r file.info("mapping.Rmd")$mtime

library("Pbase")

Introduction

The aim of this vignette is to document the mapping of proteins and the tandem mass spectrometry-derived peptides to genomic locations.

Pbase:::mapplot()

Protein and genome data

We will use a small Proteins object from the r Biocpkg("Pbase") package to illustrate how to retrieve genome coordinates and map a peptides back to genomic coordinates. See the Pbase-data vignette for an introduction to Proteins data. The main information needed in this vignette consists of protein UniProt identifiers and a set of peptides positions along the protein sequence.

library("Pbase")
data(p)
p
seqnames(p)
pcols(p)[1, c("start", "end")]

We will also require an identifier relating the protein feature of interest to the genome. Below, we use r Biocpkg("biomaRt") to query the matching Ensembl transcript identifier. We start by create a Mart object that stores the connection details to the latest human Ensembl biomart server.

library("biomaRt")
ens <- useMart("ensembl", "hsapiens_gene_ensembl")
ens

bm <- select(ens, keys = seqnames(p),
             keytype = "uniprot_swissprot",
             columns = c(
                 "uniprot_swissprot",
                 "hgnc_symbol",
                 "ensembl_transcript_id"))

bm

As can be seen, there can be multiple transcripts for one protein accession. We have defined the transcripts of interest for our proteins in p; they are stored as protein elements metadata:

acols(p)$ENST

Genomic transcript coordinates

The etrid2grl function takes our transcript identifiers and will query the Ensembl biomart server (note the ens argument) and return a GRangesList object. For each of Ensembl transcript identifiers provided as input, we have the genomic coordinates of that transcript's exons as well as additional information such as the type of exons (protein coding or untranslated region).

grl <- etrid2grl(acols(p)$ENST, ens)
all.equal(names(grl), acols(p)$ENST,
          check.attributes=FALSE)
grl

We also need to retain only coding exons and discard untranslated regions for later peptide mapping, using the proteinCoding function.

pcgrl <- proteinCoding(grl)
pcgrl

Visualisation with r Biocpkg("Gviz") and r Biocpkg("Pviz")

Peptides along proteins

pp <- p[5]
n <- elementNROWS(pranges(pp))
pepstring <- paste(unique(pcols(pp)[[1]]$pepseq), collapse = ", ")

The peptides that have been experimentally observed are available as ranges (coordinates) along the protein sequences. For example, below, we isolate and visualise the r n peptides (r pepstring) have been identified for our protein of interest r seqnames(p)[5].

sort(pranges(p)[5])
plot(p[5])

Exons along the genome

We can also plot the transcript regions inluding (grl) or exclusing (pcgrl) the untranslated regions.

plotAsGeneRegionTrack(grl[[5]], pcgrl[[5]])

Mapping peptides back to the genome

The aim of this document is to document the mapping of peptides, i.e. ranges along a protein sequence to ranges along the genome reference. In other words, our aim is the convert protein coordinates to genome coordinates.

Comparing protein and translated DNA sequences

The first check that we want to implement is to verify that we can regenerate the protein amino acid sequence from the genome regions that we have extracted.

We also need the actual genome sequence (so far, we have only dealt with regions and features). The exons coordinates have been retrieved from the latest Ensembl release, which is based on the human genome assembly GRCh38. We will use a genome package that is based on the same reference genome, namely r Biocannopkg("BSgenome.Hsapiens.NCBI.GRCh38").

We need to make sure that the chromosomes are named the same way in the genome sequence data and our genomics ranges ("chrX", as seen above).

library("BSgenome.Hsapiens.NCBI.GRCh38")
head(seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
if (!"chr1" %in% seqnames(BSgenome.Hsapiens.NCBI.GRCh38))
    seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23] <-
        paste0("chr", seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[1:23])
seqnames(BSgenome.Hsapiens.NCBI.GRCh38)[21:27]

Once we have extracted the actual sequences, we must also make sure that we we reverse the sequences in case out genomic features are on the reverse strand. We the combine (unlist) the exons (coding sequences only, pcgrl) and translate then into a protein sequence.

s <- getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl[[5]])
s

if (isReverse(pcgrl[[5]]))
    s <- rev(s)

aaseq <- translate(unlist(s))
aaseq

We verify that the translated genome sequence and the protein squence we started with match by aligning them.

writePairwiseAlignments(pairwiseAlignment(aa(p[5]), aaseq))

Calculating new coordinates

We can now calculate the peptide coordinate along the genome using the position of the peptides along the protein (in p) and the position of the exons of the protein's transcript along the genome (in pcgrl) using the mapToGenome function.

res <- mapToGenome(p[5], pcgrl[5])
res[[1]]

Plotting

Based on the new peptide genomic coordinates, it is now straightforward to create a new AnnotationTrack and add it the the track visualisation.

plotAsAnnotationTrack(res[[1]], grl[[5]])

Detailed annotation track

Finally, we customise the figure by adding a track with the $MS^2$ spectra. The raw data used to search the protein database an create p are available as an MSnExp object.

data(pms)

library("ggplot2")
details <- function(identifier, ...) {
    p <- plot(pms[[as.numeric(identifier)]], full=TRUE, plot=FALSE) + ggtitle("")
    p <- p + theme_bw() + theme(axis.text.y = element_blank(),
                                axis.text.x = element_blank()) +
                                labs(x = NULL, y = NULL)
    print(p, newpage=FALSE)
}

res <- res[[1]]
deTrack <- AnnotationTrack(start = start(res),
                           end = end(res),
                           genome = "hg38", chromosom = "chrX",
                           id = pcols(p)[[5]]$acquisitionNum,
                           name = "MS2 spectra",
                           stacking = "squish", fun = details)

grTrack <- GeneRegionTrack(grl[[5]],
                           name = acols(p)$ENST[5])
ideoTrack <- IdeogramTrack(genome = "hg38",
                           chromosome = "chrX")
axisTrack <- GenomeAxisTrack()

plotTracks(list(ideoTrack, axisTrack, deTrack, grTrack),
           add53 = TRUE, add35 = TRUE)

Mapping all peptide sets in the Proteins object

Above, we have demonstrated the mapToGenome functionality on one protein only. The same operation can be performed on all the r length(p) of the p object using the pmapToGenome equivalent using the r length(pcgrl) ranges calculated with etrid2grl. The pmapToGenome will map the peptides of i-th protein to the i-th genomic location of pcgrl.

pres <- pmapToGenome(p, pcgrl)
pres

Dealing with multiple transcipts per protein

k <- 6
seqnames(p)[k]

In the code chunk below, we remind ourselves that, querying the Ensembl Biomart server for r seqnames(p)[k], we obtain several possible transcript identifiers, including the identifier of interest r acols(p)$ENST[k].

sel <- bm$uniprot_swissprot == seqnames(p)[k]
bm[sel, ]
acols(p)$ENST[k]

Let's fetch the coordinates of all possible transcipts, making sure that the names of the Ensembl identifiers are used to name the grl ranges (using use.names = TRUE).

eid <- bm[sel, 3]
names(eid) <- bm[sel, 1]
eid

grl <- etrid2grl(eid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
plotAsGeneRegionTrack(pcgrl)

Descriminating transcripts

We extract the transcript sequences, translate them into protein sequences and align each to our protein sequence (originally imported from the fasta database, see ?Proteins for the construction of p).

lseq <- lapply(getSeq(BSgenome.Hsapiens.NCBI.GRCh38, pcgrl),
               function(s) translate(unlist(s)))

laln <- sapply(lseq, pairwiseAlignment, aa(p[k]))
sapply(laln, nmatch)/width(aa(p[k]))
ki <- which.max(sapply(laln, nmatch))

We see that transcript number r ki, r eid[ki], perfectly aligns with our protein sequence. This is also the transcipt that corresponds to the curated Ensembl transcript in acols(p)$ENST.

ki <- which.max(sapply(laln, nmatch))
stopifnot(eid[ki] == acols(p)$ENST[k])
res <- pmapToGenome(p[k], pcgrl[ki])

As shown on the next figure, peptides that span over exon junctions are grouped together and, below, colour-coded.

plotAsAnnotationTrack(res[[1]], pcgrl[[ki]])

One can also apply a many-to-one mapping approach to all proteins in the p object and all the transcripts identifiers fetched with etrid2grl as shown below.

alleid <- bm[, 3]
names(alleid) <- bm[, 1]
grl <- etrid2grl(alleid, ens, use.names = TRUE)
pcgrl <- proteinCoding(grl)
res <- mapToGenome(p, pcgrl)
length(res)

The messages indicate that one protein accession number was not found in the pcgrl ranges (no transcript was found) and several mapping failed. In total, we obtain r length(res) mapping for r length(unique(names(pcgrl))) protein accession numbers.

Session information

sessionInfo()


Try the Pbase package in your browser

Any scripts or data that you put into this service are public.

Pbase documentation built on Nov. 17, 2017, 9:03 a.m.