knitr::opts_chunk$set(cache=TRUE)
# This chunk contains some setup which we do not want to display in the final # document, hence `echo=FALSE` is used. # When you’re reading this document, it may be easier to ignore this chunk for # now, and continue with the document below, starting at “Introduction” # Make library loading silent library = function (...) suppressMessages(base::library(...))
This is document provides instructions on how to download Ensembl gene annotations from biomart using R.
First, we need to load all neccessary R packages.
library("biomaRt") library("dplyr")
First, let's define which mart and dataset we want to use.
ensembl_mart = useMart("ENSEMBL_MART_ENSEMBL", host = "dec2014.archive.ensembl.org") ensembl_dataset = useDataset("hsapiens_gene_ensembl",mart=ensembl_mart) ensembl_dataset
The host
helps to make sure that we get the annotations from a specific ensembl version. For example, Ensembl 78 correseponds to host="dec2014.archive.ensembl.org"
. You can use the Ensembl Archives website to check which host name corresponds to desired Ensembl version. More information using specific ensembl versions with biomaRt can be found in the biomaRt vignette.
We can see all available attributes with the listAttributes
command.
attributes = listAttributes(ensembl_dataset) head(attributes)
Now, let's select gene id, gene name, transcript_id and strand from the biomart and download the corresponding columns.
selected_attributes = c("ensembl_transcript_id", "ensembl_gene_id", "external_gene_name", "strand", "gene_biotype", "transcript_biotype") data = getBM(attributes = selected_attributes, mart = ensembl_dataset) head(data)
Finally, we need to rename the columns
data = dplyr::rename(data, transcript_id = ensembl_transcript_id, gene_id = ensembl_gene_id, gene_name = external_gene_name) head(data)
We can now save the metadata into a file to avoid downloading it every time we need to use it.
saveRDS(data, "transcript_metadata.rds")
Next time that we need to access the data we can load it directly from disk.
transcript_metadata = readRDS("transcript_metadata.rds") head(transcript_metadata)
First, we load the GenomicFeatures packages to download transcript and exon coordinates directly from biomaRt.
library("GenomicFeatures")
Next, we use the makeTranscriptDbFromBiomart
function to download a sepcifc version of the the Ensembl annotations, in this case Ensembl 78. Please note that as the database is quite big this can take at least a couple of minutes.
#txdb78 = makeTranscriptDbFromBiomart(biomart = "ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host="dec2014.archive.ensembl.org")
We can also save the database to disk to avoid re-downloading it every time we want to use it.
#saveDb(txdb78, "TranscriptDb_GRCh38_78.db") txdb78 = loadDb("TranscriptDb_GRCh38_78.db")
Finally, we can extract exon coordinates for all annotated transcripts from the database. This command will produce a a list of GRanges objects, each one containing the exons of a single transcript.
exons = exonsBy(txdb78, by = "tx", use.names = TRUE) exons[["ENST00000392477"]]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.