knitr::opts_chunk$set(cache=TRUE)
# This chunk contains some setup which we do not want to display in the final
# document, hence `echo=FALSE` is used.
# When you’re reading this document, it may be easier to ignore this chunk for
# now, and continue with the document below, starting at “Introduction”

# Make library loading silent
library = function (...) suppressMessages(base::library(...))

Using biomart to download genomic annotations in R

This is document provides instructions on how to download Ensembl gene annotations from biomart using R.

First, we need to load all neccessary R packages.

library("biomaRt")
library("dplyr")

Downloading transcript metadata

First, let's define which mart and dataset we want to use.

ensembl_mart = useMart("ENSEMBL_MART_ENSEMBL", host = "dec2014.archive.ensembl.org")
ensembl_dataset = useDataset("hsapiens_gene_ensembl",mart=ensembl_mart)
ensembl_dataset

The host helps to make sure that we get the annotations from a specific ensembl version. For example, Ensembl 78 correseponds to host="dec2014.archive.ensembl.org". You can use the Ensembl Archives website to check which host name corresponds to desired Ensembl version. More information using specific ensembl versions with biomaRt can be found in the biomaRt vignette.

We can see all available attributes with the listAttributes command.

attributes = listAttributes(ensembl_dataset)
head(attributes)

Now, let's select gene id, gene name, transcript_id and strand from the biomart and download the corresponding columns.

selected_attributes = c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name", "strand", "gene_biotype", "transcript_biotype")
data = getBM(attributes = selected_attributes, mart = ensembl_dataset)
head(data)

Finally, we need to rename the columns

data = dplyr::rename(data, transcript_id = ensembl_transcript_id, gene_id = ensembl_gene_id, gene_name = external_gene_name)
head(data)

We can now save the metadata into a file to avoid downloading it every time we need to use it.

saveRDS(data, "transcript_metadata.rds")

Next time that we need to access the data we can load it directly from disk.

transcript_metadata = readRDS("transcript_metadata.rds")
head(transcript_metadata)

Downloading transcript and exon coordinates

First, we load the GenomicFeatures packages to download transcript and exon coordinates directly from biomaRt.

library("GenomicFeatures")

Next, we use the makeTranscriptDbFromBiomart function to download a sepcifc version of the the Ensembl annotations, in this case Ensembl 78. Please note that as the database is quite big this can take at least a couple of minutes.

#txdb78 = makeTranscriptDbFromBiomart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host="dec2014.archive.ensembl.org")

We can also save the database to disk to avoid re-downloading it every time we want to use it.

#saveDb(txdb78, "TranscriptDb_GRCh38_78.db")
txdb78 = loadDb("TranscriptDb_GRCh38_78.db")

Finally, we can extract exon coordinates for all annotated transcripts from the database. This command will produce a a list of GRanges objects, each one containing the exons of a single transcript.

exons = exonsBy(txdb78, by = "tx", use.names = TRUE)
exons[["ENST00000392477"]]

References

  1. biomaRt vignette
  2. GenomicFeatures
  3. GRanges


kauralasoo/seqUtils documentation built on May 20, 2019, 7:42 a.m.