In ohlerlab/RiboseQC: RiboseQC, a Comprehensive Ribo-Seq Analysis Tool

knitr::opts_chunk$set(echo = TRUE)

Introduction

Ribo-seQC is a package that performs quality control analysis of small RNA-seq data (in .bam format), with a focus on Ribo-seq [1] and related techniques. Thanks to syntax and functions present in packages like GenomicFeatures, rtracklayer or BSgenome, this package can perform comprehensive analyses of alignments on a variety of genomic regions. This vignette illustrates the usage of the package using example data and annotation from three different experiments performed in three different organisms.

Warning: While we encourage users to get familiar with each outlined step, executing all the code on your machine will generate a lot of data (~550 Mb) and it might take some time (5-10 minutes for each analysis and report creation). Please make sure to have enough disk space if you decide to execute code from all the sections.

As a first step, let's load the package:

suppressPackageStartupMessages(library("RiboseQC"))

Exploring the annotation

Let's now download annotation for Homo sapiens: a subset of the GENCODE 25 annotation (https://www.gencodegenes.org/human/release_25.html) and the corresponding genome sequences.

download_testfile = function(filename){
    download.file(paste0("http://bimsbstatic.mdc-berlin.de/ohler/dharnet/riboseqc_testfiles/",filename),destfile = filename)
}
download_testfile(filename = "test_human.2bit")

download_testfile(filename = "test_human.gtf")

To parse the rich information present in our .gtf file we use the prepare_annotation_files function. Such a function creates a TxDb and a compressed Rdata file containing several regions of interest and additional information. Moreover, such function reads a genome file in .2bit to forge a BSgenome package for fast and efficient query of genomic sequences. A .2bit file can be obtained from a .fasta file using the faToTwoBit software from UCSC: https://genome.ucsc.edu/goldenpath/help/twoBit.html - http://hgdownload.soe.ucsc.edu/admin/exe/ )

prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_human.2bit",
                         gtf_file = "test_human.gtf",scientific_name = "Human.test",
                         annotation_name = "genc25_22M",export_bed_tables_TxDb = F,forge_BSgenome = T,create_TxDb = T)

We can now read such information using the load_annotation function:

load_annotation("test_human.gtf_Rannot")

Two objects have now been created: a genome_seq object links to the BSgenome package we just created and loaded (containing genome sequences), and a GTF_annotation object containing important information present in our .gtf file. For instance, we can access genomic sequences using commands as:

genome_seq[["chr22"]]
genome_seq[["chrM"]]

Transcript annotation and CDS annotations can be accessed as follows:

GTF_annotation$exons_txs

GTF_annotation$cds_txs

The genomic sequences corresponding to such genomic regions can be easily extracted:

getSeq(genome_seq,GTF_annotation$cds_txs[[4]])

A list of annotated start and stop codons, including the transcripts they map to, can be accessed using:

GTF_annotation$start_stop_codons

CDS annotation in transcript-level coordinates is also reported:

GTF_annotation$cds_txs_coords

A list of gene ids, transcript ids, together with their symbols and biotypes, can be accessed with:

GTF_annotation$trann

The genetic codes used for each chromosomes are accessed using:

GTF_annotation$genetic_codes
getGeneticCode(GTF_annotation$genetic_codes["chr22","genetic_code"])
getGeneticCode(GTF_annotation$genetic_codes["chrM","genetic_code"])

Annotation and genome sequences are linked together in the annotation creation step. The BSgenome package corresponding to the .gtf file used is reported in the GTF_annotation object:

GTF_annotation$genome_package

Create the html report (human)

Let's now download a subset of a Ribo-seq dataset in HEK293 cells [2].

download_testfile(filename = "test_human_hek.bam")

We now perform different sets of calculations on our data (including read-length and organelle-specific metagene analyses), using the annotation we previously created. Several files (including automatically generated P_sites positions, and other output from the analysis pipeline) will be automatically created. Furthermore, an html dynamic report which illustrated the analysis results is created:

RiboseQC_analysis(annotation_file="test_human.gtf_Rannot",bam_files = "test_human_hek.bam",report_file = "test_human_hek.html",write_tmp_files = F)

The html report can be opened by different browsers such as firefox, chrome etc... Moreover, all the generated plots are available in .pdf format and as RDS files in the same folder. For example, the profile of P-site positions along the CDS can be visualized using:

readRDS("test_human_hek.html_plots/rds/sample1_nucl_4_profiles_P_sites_metagene_subcodon_all")[[1]]

Let's now use Ribo-seQC to analyze different samples at once.

Analyze multiple samples (yeast)

Let's now create annotation for Saccharomyces cerevisiae. We will use a custom gtf file coming from the annotation in Ensembl 91, supplied with transcript boundaries from [3].

download_testfile('test_yeast.2bit')
download_testfile('test_yeast.gtf')


prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_yeast.2bit",
                         gtf_file = "test_yeast.gtf",scientific_name = "yeast.test",
                         annotation_name = "yeast_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)

As before, we can explore different sequences and regions extracted from annotation.

load_annotation("test_yeast.gtf_Rannot")

genome_seq[["II"]]
genome_seq[["Mito"]]
GTF_annotation$cds_txs
GTF_annotation$trann

We will now download three files corresponding to sample data from a TCP-seq experiment [4], where the three experiments aim at extracting RNA fragments covered by scanning ribosomes (SSU), elongating ribosomes (RS) and input material (input). Ribo-seQC analysis is followed by the creation of a report containing results for all the three experiments.

download_testfile(filename = "test_yeast_TCP_input.bam")

download_testfile(filename = "test_yeast_TCP_RS.bam")

download_testfile(filename = "test_yeast_TCP_SSU.bam")

bam_filepath_y<-c("test_yeast_TCP_input.bam","test_yeast_TCP_RS.bam","test_yeast_TCP_SSU.bam")

RiboseQC_analysis(annotation_file="test_yeast.gtf_Rannot",bam_files = bam_filepath_y,fast_mode = T,report_file = "test_yeast_TCP.html",sample_names = c("input","RS","SSU"),dest_names = c("input","RS","SSU"),write_tmp_files = F)

The report allows to compare different samples side-by-side using tabs for different samples. For example, the 5'end profiles of different read lengths around start and stop codons can be visualized for different experiments. For the single plots:

as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/SSU_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])

as_ggplot(readRDS("test_yeast_TCP.html_plots/rds/RS_nucl_4_profiles_fivepr_metagene_subcodon_log2")[[1]])

Analyze multiple samples (Arabidopsis)

We now create the annotation files for Arabidopsis thaliana. We will use a custom gtf file containing annotation of the Araport11 project (https://www.araport.org/), and non-coding RNA species from TAIR10.

download_testfile("test_arabidopsis.2bit")
download_testfile("test_arabidopsis.gtf.gz")

prepare_annotation_files(annotation_directory = ".",
                         twobit_file = "test_arabidopsis.2bit",
                         gtf_file = "test_arabidopsis.gtf.gz",scientific_name = "arabidopsis.test",
                         annotation_name = "araport11_custom",export_bed_tables_TxDb = T,forge_BSgenome = T,create_TxDb = T)

We will now download two sample datasets from a Ribo-seq experiments in Arabidopsis roots and shoots [5].

annotation="test_arabidopsis.gtf.gz_Rannot"


download_testfile(filename = "test_arabidopsis_root.bam")
download_testfile(filename = "test_arabidopsis_shoot.bam")

bam_filepath=c("test_arabidopsis_root.bam","test_arabidopsis_shoot.bam")

RiboseQC_analysis(annotation_file=annotation,bam_files = bam_filepath,fast_mode = T,report_file = "test_root_shoots.html",dest_names = c("root","shoots"),sample_names = c("root","shoots"),write_tmp_files = F)

Despite the lack of a dedicated treatment to purify chloroplast ribosomes in these Ribo-seq experiments (e.g. using chloramphenicol), it is possible to detect footprints from distinct organelles, enriched in the shoots sample rather than roots one.

For instance, we can visualize the amount of footprints mapping to different organelles and biotypes in the two samples:

readRDS("test_root_shoots.html_plots/rds/all_samples_1_readlocdist")[[1]]

or visualize z-scored positional codon usage values, calculated using A-site positions for chloroplast ribosomes in the shoots sample:

as_ggplot(readRDS("test_root_shoots.html_plots/rds/shoots_ChrC_7_codonusage_positional_A-sites_per_codon_all_zscore")[[1]])

Much more information is available in the html reports, including read length distributions per biotype and compartment, statistics on codon usage, and analysis of the highest mapping positions in the genome.

References

1) Ingolia, N.T. et al. (2009) Genome-wide analysis in vivo of resolution using ribosome profiling. Science 324, 218–223

2) Calviello, L. et al. (2015) Detecting actively translated open reading frames in ribosome profiling data. Nat. Methods 13, 1–9

3) Daechan P. et al. (2014) Simultaneous mapping of transcript ends at single-nucleotide resolution and identification of widespread promoter-associated non-coding RNA governed by TATA elements, Nucleic Acids Research, Volume 42, Issue 6, 1 April 2014, Pages 3736-3749

4) Archer, S.K. et al. (2016) Dynamics of ribosome scanning and recycling revealed by translation complex profiling. Nature 535, 570–574

5) Hsu, P.Y. et al. (2016) Super-resolution ribosome profiling reveals novel translation events in Arabidopsis. PNAS, 113, E7126–E7135