In zslastman/RiboStan: Estimates Isoform Specific RPF Footprint Densities

Introduction

Ribo-seq is a specific form of RNA-seq expression assay in which the sequenced fragments are footprints of actively translating ribosomes. Ribo-seq experiments of necessity have much shorter read-lenegths than RNA-seq experiments, which can complicate quantification. In principle, ribo-seq experiments provide nucleotide resolution information about the location of ribosomes, and can thus be used to elucidate the dynamics of ribosomal elongation and initiation. In practice, determining where the ribosome's A/P-site is in relation to the footprint is complicated by random variations in footprint size and location, which "blur" the positions of ribosomes. Ribostan is a collection of tools for the analysis of ribo-seq data which includes isoform-aware quantification, P/A-site alignment via multiple methods, and uORF identification via multitaper periodicity test (similar to ORFquant and RiboTaper).

Setup

First, let's prepare some working data: a GTF annotation for human chromosome 22, its fasta sequence, and ribo-seq reads mapped in transcript-space to chromosome 22.

library(Ribostan)

# Prepare GTF annotation
anno_file <- here::here('test.gc32.gtf')
if(!file.exists(anno_file)){ 
    library(AnnotationHub)
    ah <- AnnotationHub()
    gencode32 <- ah[['AH75191']]
    seqlevels(gencode32) <- 'chr22'
    rtracklayer::export(gencode32, anno_file)
}

# Prepare fasta sequence
fafile <- here::here('chr22.fa')
if(!file.exists(fafile)){
    library(BSgenome.Hsapiens.UCSC.hg38)
    seq <- Biostrings::DNAStringSet(BSgenome.Hsapiens.UCSC.hg38[['chr22']])
    names(seq) <- 'chr22'
    Biostrings::writeXStringSet(seq, fafile)
}

# Prepare ribo-seq alignments
testbam <- system.file('extdata', 'nchr22.bam', package='Ribostan',
    mustWork=TRUE)

Loading Data

Ribostan includes various functionalities for a) filtering ORFs in an existing annotation and b) finding potential uORFs. By default, load_annotation() will keep only those ORFs which are multiples of 3 bp long, and which begin with a start codon and end wth a stop codon. Ribostan will also search for uORFs for those ORFs, by default allowing uORFs that are as short as 2 bp. The function load_annotation() carries out these filterng steps and creates an object with an attached fasta file for use by other Ribostan functions. The function get_readgr() loads ribosomal footprint alignments (including multimappers).

# Load our annotation, filter out suspect ORFs, and find potential uORFs
chr22_anno <- load_annotation(anno_file, fafile, add_uorfs=TRUE)
rpfs <- get_readgr(testbam, chr22_anno)

From RPF Alignments to P-sites

An RPF alignment is ambiguous with respect to the position of the underlying ribosome because a) it maybe be multimapping and its actual origin is thus uncertain, and b) stochastic processes underlying footprint size and location mean that the precise location of the P-site must be determined. Various methods exist to do this. Ribostan makes use of the method described by Ahmed et al 2019, in which for each phase and read length, an offset is chosen that maximizes the number of reads within the CDS. The function get_offsets() creates a data frame describing the optimal offsets using this process. And the function get_psite_gr() applies these offsets to the alignments, annotating their ORF of origin (which is chosen randomly in the rare case where a footprint's P-site plausibly overlaps more than one ORF since more than one phase/offset is possible).

# Determine offsets by maximum CDS occupancy
offsets_df <- get_offsets(rpfs, chr22_anno)

# Use our offsets to determine P-site locations
psites <- get_psite_gr(rpfs, offsets_df, chr22_anno)

Verifying Offsets with KL-divergence

Most methods of determinig P-site offsets, including the one above, are vulnerable to error when unusual patterns of footprints exist at the start/end of the CDS. An orthogonal means of determining A/P-site offsets is to plot KL-divergence in "metacodon" profiles. KL-divergence measures the degree to which the underlying codon predicts density at a given location relative to it, and will typically have two large peaks due to cut-site bias at 0 and -read_length, along with a peak between these corresponding to the influence of codon-specific dwell time at the A- and P-site (see O'Connor et al 2016). The function get_metacodon_profs() derives average, normalized profiles around each codon, while get_kl_df() derives the KL-divergence per read_length/sample/location. Plotting these provides an orthogonal means of verifying P-site offsets.

# Determine P-site offsets per read length
covgrs <- list(sample1 = rpfs)
metacodondf <- get_metacodon_profs(covgrs, chr22_anno)

kl_df <- get_kl_df(metacodondf, chr22_anno)
kl_offsets <- select_offsets(kl_df)
kl_offsets %>% dplyr::select(p_offset, length=nreadlen) %>%
    readr::write_tsv('offsets_rustvar.tsv')

# Plot KL-divergence
kl_div_plot <- plot_kl_dv(kl_df, kl_offsets)

# Generate data frame with A/P/E-site occupancies
allcodondt <- export_codon_dts(metacodondf, kl_offsets)

Periodicity and Quantification

ORF periodicity is a good diagnostic tool for identifying bona fide translation (non periodic noise such as RPFs with ribosome-like footprints can generate ribo-seq signal in untranslated regions). The function periodicity_filter_uORFs() filters the uORFs found by load_annotation() by searching for periodicity in P-sites. The function get_ritpms() furthermore carries out optimization of ribosome densities, and similarly to software like Salmon or RSEM, performs isoform-aware quantification. Multimapping P-sites can then be sampled according to these TPMs, to give an estimate of true ribosome locations.

# Use our P-sites to filter the annotation 
chr22_anno <- periodicity_filter_uORFs( psites, chr22_anno, remove=TRUE)

# Use Stan to estimate normalized P-site densities (ritpms),
ritpms <- get_ritpms(psites, chr22_anno)

# and at the gene level (ignoring uORFs)
gritpms <- gene_level_expr(ritpms, chr22_anno)

# Remove multimapped alignments from the P-sites
psites <- psites[psites$orf %in% names(chr22_anno$trspacecds)]
psites <- sample_cov_gr(psites, chr22_anno, ritpms)

knitr::kable(tibble::enframe(head(ritpms)))
knitr::kable(head(gritpms))

Estimating Codon Occupancy

With P-site locations determined, the nucleotide resolution information on ribosome positioning can be used to estimate which codons show high or low occupancy, which given relative independence between codons will be linearly proportional to dwell time in a given sample. These codon level occupancies can be averaged for all codons in an ORF to predict which ORFs are fast or slow.

# Get codon-level occupancies using RUST (at least 1000 should be used 
# for n_genes on  a real run)
rust_codon_occ_df <- get_codon_occs(psites, offsets_df, chr22_anno,
    n_genes = 20, method = 'RUST_glm')

# Get predicted mean elongation rates for ORFs,
orf_elong <- get_orf_elong(chr22_anno, rust_codon_occ_df)

# and at the gene level (ingoring uORFs)
gn_elong = gene_level_elong(orf_elong, ritpms, chr22_anno)