In freesci/rGWIPS: working with RiboSeq data from GWIPS database

library(rGWIPS)

rGWIPS package is a set of functions to automate tasks of getting RiboSeq data for particular genome segments.

GWIPS-viz database aims to provide on-line tools for the analysis and visualization of ribo-seq data obtained with the ribosome profiling technique. This is the source of data required for rGWIPS database.

Example data

Let's start with the RiboSeq data for E. coli. GWIPS database hosts data mapped to the most recent assembly version of E. coli K12 sub. MG1655, named ASM584v2. Unfortunately, this genome version isn't available in Bioconductor, so, to avoid mapping errors, let's first build the package for this genome.

The seed file (saved as seedfile) looks as follows:

Package: BSgenome.Ecoli.Ensembl.ASM584v2
Title: Full genome sequence for Escherichia coli K12 substr. MG1655 ASM584v2
Description: Full genome sequence for Escherichia coli K12 substr. MG1655 ASM584v2 and stored in Biostrings objects.
Version: 1.0
organism: Escherichia coli
common_name: E. coli
provider: Ensembl
provider_version: ASM584v2
release_date: Aug. 2014
release_name: ASM584v2, INSDC Assembly GCA_000005845.2
source_url: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-36/
            fasta/bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/
organism_biocview: Escherichia_coli
BSgenomeObjname: Ecoli
seqfile_name: Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.2bit
SrcDataFiles: ftp://ftp.ensemblgenomes.org/pub/bacteria/release-36/fasta/
              bacteria_0_collection/escherichia_coli_str_k_12_substr_mg1655/dna/
              Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.chromosome.Chromosome.fa.gz
seqs_srcdir: /path/to/fasta/files

2BIT file was generated from FASTA file using faToTwoBit executable from Kent utils.

Building and installing package:

library(BSgenome)
forgeBSgenomeDataPkg("seedfile")

#The following command were typed outside of R:

R CMD build BSgenome.Ecoli.Ensembl.ASM584v2 
R CMD INSTALL BSgenome.Ecoli.Ensembl.ASM584v2_1.0.tar.gz

GWIPS data

The GWIPS database offers its mapped data as a downloadable bigWig files at: http://gwips.ucc.ie/downloads/index.html

Let's download data for forward and reverse strands from the study by Liu et. al, 2013. Files required are: Liu13_All.RiboProElongFor.bw and Liu13_All.RiboProElongRev.bw.

for_bw = system.file(package = "rGWIPS", "extdata", "Liu13_All.RiboProElongFor.bw")
rev_bw = system.file(package = "rGWIPS", "extdata", "Liu13_All.RiboProElongRev.bw")

riboseqlist <- loadGWIPSdata(forward = for_bw, reverse = rev_bw)
#GWIPS data are now available as global objects gwips_forw and gwips_rev
summary(riboseqlist["forward"])
summary(riboseqlist["reverse"])

Genome annotation

Now we can load genome and its annotation. The GFF3 file corresponding to this genome build is available from Ensembl.

library(BSgenome.Ecoli.Ensembl.ASM584v2)
#if you haven't build the package yourself, the tar.gz archive is in extdata folder

gff_file <- system.file(package="rGWIPS", "extdata", "Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.32.gff3.txt")
features <- import.gff3(gff_file)
summary(features)

Selecting ranges

Let's assume that we want to get data for the beginning of the genes in E.coli. First we need to construct a new Genomic Ranges object, containing only the ranges that are of interest to us. However, before we need to make sure that seqlevels in features object is the same as in riboseqlist object.

seqlevels(features)
seqlevels(riboseqlist["forward"])
# obviously there are differences which will create problems downstream. Let's fix it.

riboseqlist["forward"] <- renameSeqlevels(riboseqlist["forward"], "Chromosome")
riboseqlist["reverse"] <- renameSeqlevels(riboseqlist["reverse"], "Chromosome")

seqlevels(features)
seqlevels(riboseqlist["forward"])

It's all fine, so let's select some data.

sel <- selectFeaturesGR(features, type="start", width=15, feature="gene")
summary(sel)

You can see that sel object is much smaller than the original features. It is important to work on the smallest data set possible, because GRanges objects can take large amounts of memory.

Pulling the data out

Now it's time to pull the data out. The format of the output data frame is as follows:

data_forw <- overlapsRiboSeq(selected = sel, riboseq = riboseqlist["forward"], str = "+")
data_rev <- overlapsRiboSeq(selected = sel, riboseq = riboseqlist["reverse"], str = "-")

The resulting data frames contain lots of overlapping data. Let's aggregate them.

fd_forw <- aggregateRiboSeq(data_forw, Ecoli, str = "+")
fd_rev <- aggregateRiboSeq(data_rev, Ecoli, str = "-")

The format of results is as follows:

Column 1: gene/ID
Column 2: start
Column 3: end
Column 4: seqnames (chromosomes)
Column 5-19: RiboSeq data at each position 1 to 15 (we selected width of 15)
Column 20-34: nucleotides at each position 1 to 15

Figures

Now, once we have the data out, let's play a bit with it. Let's check if the mean ribosome occupany at the beginning of E. coli genes has some interesting pattern along the sequence.

#let's melt the data from the original format, for use with ggplot2
library(data.table)
fd_forw_final <- melt(fd_forw, id.vars = c("ID"), measure.vars=c("RS1", "RS2", "RS3","RS4", "RS5", "RS6","RS7", "RS8", "RS9","RS10", "RS11", "RS12","RS13", "RS14", "RS15"))

library(ggplot2)
ggplot(fd_forw_final, aes(x=variable, y=mean(as.numeric(value)), group=variable)) + geom_point()