Mapping between different Genome references

BiocStyle::markdown()

Package: Pbase
Author: Laurent Gatto
Last compiled: r date()
Last modified: r file.info("ensucsc.Rmd")$mtime

This vignette described how to convert coordinates between different genome references. We will use transcript ENST00000373316 and GRCh38 and GRCh37 as working example.

tr <- "ENST00000373316"

GRCh38

Here is use r Biocpkg("Gviz") to query the latest Ensembl biomart and extract the transcript of interest.

suppressMessages(library("Gviz"))
suppressMessages(library("biomaRt"))

h38 <- useMart("ensembl", "hsapiens_gene_ensembl")
tr38 <- BiomartGeneRegionTrack(biomart = h38,
                               transcript = tr)
tr38 <- split(tr38, transcript(tr38))
tr38 <- ranges(tr38[[tr]])
tr38

Note the starting position of the transcript is r min(start(tr38)).

GRCh37

Below, I repeat the same operation without using my own ens Mart instance. As far as I understand, Gviz queries the UCSC genome reference by default.

h37 <- useMart(host = "feb2014.archive.ensembl.org",
                biomart = "ENSEMBL_MART_ENSEMBL",
                dataset = "hsapiens_gene_ensembl")
tr37 <- BiomartGeneRegionTrack(biomart = h37,
                               transcript = tr)
tr37 <- split(tr37, transcript(tr37))
tr37 <- ranges(tr37[[tr]])
tr37

Note the starting position of the transcript is r min(start(tr37)).

These differences seem to stem from different genome builds. Ensembl release 78 uses GRCh38, while UCSC uses GRCh37. Indeed, r Biocpkg("Gviz") sets the Ensembl biomart server to Feb.2014 GRCh37.p13.

Coordinates conversion

We will use the coordinate mapping infrastructure described in the January 2015 Bioconductor Newletter and the Changing genomic coordinate systems with rtracklayer::liftOver workflow.

First, we query r Biocpkg("AnnotationHub") for a chain file to perform the operation we want.

library("AnnotationHub")
hub <- AnnotationHub()
query(hub, 'hg19ToHg38')
chain <- query(hub, 'hg19ToHg38')[[1]]

The liftOver function from the r Biocpkg("rtracklayer") package will use the chain and translate the coordinates of a GRanges object into a new GRangesList object.

library("rtracklayer")
res <- liftOver(tr37, chain)
res <- unlist(res)

## set annotation
names(res) <- NULL
genome(res) <- "hsapiens_gene_ensembl"

all.equal(res, tr38)

Session information

sessionInfo()


Try the Pbase package in your browser

Any scripts or data that you put into this service are public.

Pbase documentation built on Nov. 17, 2017, 9:03 a.m.