library(rGWIPS)
rGWIPS package is a set of functions to automate tasks of getting RiboSeq data for particular genome segments.
GWIPS-viz database aims to provide on-line tools for the analysis and visualization of ribo-seq data obtained with the ribosome profiling technique. This is the source of data required for rGWIPS database.
Let's start with the RiboSeq data for E. coli. GWIPS database hosts data mapped to the most recent assembly version of E. coli K12 sub. MG1655, named ASM584v2. Unfortunately, this genome version isn't available in Bioconductor, so, to avoid mapping errors, let's first build the package for this genome.
The seed file (saved as seedfile
) looks as follows:
Package: BSgenome.Ecoli.Ensembl.ASM584v2 Title: Full genome sequence for Escherichia coli K12 substr. MG1655 ASM584v2 Description: Full genome sequence for Escherichia coli K12 substr. MG1655 ASM584v2 and stored in Biostrings objects. Version: 1.0 organism: Escherichia coli common_name: E. coli provider: Ensembl provider_version: ASM584v2 release_date: Aug. 2014 release_name: ASM584v2, INSDC Assembly GCA_000005845.2 source_url: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-36/ fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/ organism_biocview: Escherichia_coli BSgenomeObjname: Ecoli seqfile_name: Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.2bit SrcDataFiles: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-36/fasta/ bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/ Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa.gz seqs_srcdir: /path/to/fasta/files
2BIT file was generated from FASTA file using faToTwoBit
executable from Kent utils.
Building and installing package:
library(BSgenome) forgeBSgenomeDataPkg("seedfile") #The following command were typed outside of R: R CMD build BSgenome.Ecoli.Ensembl.ASM584v2 R CMD INSTALL BSgenome.Ecoli.Ensembl.ASM584v2_1.0.tar.gz
The GWIPS database offers its mapped data as a downloadable bigWig files at: http://gwips.ucc.ie/downloads/index.html
Let's download data for forward and reverse strands from the study by Liu et. al, 2013. Files required are: Liu13_All.RiboProElongFor.bw
and Liu13_All.RiboProElongRev.bw
.
for_bw = system.file(package = "rGWIPS", "extdata", "Liu13_All.RiboProElongFor.bw") rev_bw = system.file(package = "rGWIPS", "extdata", "Liu13_All.RiboProElongRev.bw") riboseqlist <- loadGWIPSdata(forward = for_bw, reverse = rev_bw) #GWIPS data are now available as global objects gwips_forw and gwips_rev summary(riboseqlist["forward"]) summary(riboseqlist["reverse"])
Now we can load genome and its annotation. The GFF3 file corresponding to this genome build is available from Ensembl.
library(BSgenome.Ecoli.Ensembl.ASM584v2) #if you haven't build the package yourself, the tar.gz archive is in extdata folder gff_file <- system.file(package="rGWIPS", "extdata", "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.32.gff3.txt") features <- import.gff3(gff_file) summary(features)
Let's assume that we want to get data for the beginning of the genes in E.coli. First we need to construct a new Genomic Ranges object, containing only the ranges that are of interest to us. However, before we need to make sure that seqlevels
in features
object is the same as in riboseqlist
object.
seqlevels(features) seqlevels(riboseqlist["forward"]) # obviously there are differences which will create problems downstream. Let's fix it. riboseqlist["forward"] <- renameSeqlevels(riboseqlist["forward"], "Chromosome") riboseqlist["reverse"] <- renameSeqlevels(riboseqlist["reverse"], "Chromosome") seqlevels(features) seqlevels(riboseqlist["forward"])
It's all fine, so let's select some data.
sel <- selectFeaturesGR(features, type="start", width=15, feature="gene") summary(sel)
You can see that sel
object is much smaller than the original features
. It is important to work on the smallest data set possible, because GRanges
objects can take large amounts of memory.
Now it's time to pull the data out. The format of the output data frame is as follows:
data_forw <- overlapsRiboSeq(selected = sel, riboseq = riboseqlist["forward"], str = "+") data_rev <- overlapsRiboSeq(selected = sel, riboseq = riboseqlist["reverse"], str = "-")
The resulting data frames contain lots of overlapping data. Let's aggregate them.
fd_forw <- aggregateRiboSeq(data_forw, Ecoli, str = "+") fd_rev <- aggregateRiboSeq(data_rev, Ecoli, str = "-")
The format of results is as follows:
Column 1: gene/ID Column 2: start Column 3: end Column 4: seqnames (chromosomes) Column 5-19: RiboSeq data at each position 1 to 15 (we selected width of 15) Column 20-34: nucleotides at each position 1 to 15
Now, once we have the data out, let's play a bit with it. Let's check if the mean ribosome occupany at the beginning of E. coli genes has some interesting pattern along the sequence.
#let's melt the data from the original format, for use with ggplot2 library(data.table) fd_forw_final <- melt(fd_forw, id.vars = c("ID"), measure.vars=c("RS1", "RS2", "RS3","RS4", "RS5", "RS6","RS7", "RS8", "RS9","RS10", "RS11", "RS12","RS13", "RS14", "RS15")) library(ggplot2) ggplot(fd_forw_final, aes(x=variable, y=mean(as.numeric(value)), group=variable)) + geom_point()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.