In sa-lee/superintronic: Tools for Working with Intron Signal

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(superintronic)
library(magrittr)
library(plyranges)

superintronic centers around exploring coverage over genomic regions via computing simple summary statistics and visualisations. The aim is to provide an extremely modular worklfow via an interface built on top of the plyranges package. This means that you can modify any of the steps provided with the plyranges grammar or just use our defaults.

The basic workflow consists of three steps

Setting up annotations
Computing coverage
Visualising coverage results

Setting up annotation GRanges

We assume you are starting from a GTF/GFF file for your given organism. The reading of the GFF file can be done externally from this package via rtracklayer::import() or plyranges::read_gff().

library(superintronic)
library(magrittr)
library(plyranges)

features <- system.file("extdata", 
                        "Homo_sapiens.GRCh37.75_subset.gtf", 
                        package = "airway") %>% 
  collect_parts() %>% 
  filter(source == "protein_coding", n_olaps == 1)

features

The result is a GRanges object with number of rows equal to genes, and columns containing the intronic and exonic features (as list columns) and the number of times a gene overlaps anyother feature (as well as any associated gene information from the gtf file). Any further transformations can be passed to the plyranges grammar, for example you can use plyranges::filter() to select the criteria for genes you're interested in.

If you would prefer not to read in an entire GFF file into memory as a GRanges object, collect_parts() accepts any of the following as inputs:

character vector containing a path to GFF file
GFF/GTFFile class exported from rtracklayer
TxDb and related Bioconductor objects

Computing coverage over BAM(s)

Making coverage scores into a tidy GRanges

Here we compute a long form GRanges containing coverage scores in parallel over a set of BAM files. We have structured it so you specify a data frame, (like you would get from targets in limma), that contains a column identified by source, that specifies the BAM file names. Other options include specifying a GRanges that represents the genome build, an optional target GRanges for restricting coverage (requiring the BAM to be indexed), an argument for dropping entire regions that have zero coverage, and an argumet for parallel computations.

# signature (dispatch on spec and source)
compute_coverage_long(spec,
                      source,
                      .target = NULL,
                      .genome_info = NULL,
                      .drop_empty = TRUE
                      .parallel = BiocParallel::bpparam()
                      )

This function automatically propagates, the metadata associated with a design onto the resulting GRanges, however a user may also just provide a vector of BAM files and then add any relevant design variables later with join_design().

As an example, let's use BAM files from the airway data package.

design <- read.csv(system.file("extdata", 
                               "sample_table.csv", 
                               package = "airway")) %>% 
  dplyr::select(sample_id = X, cell, dex, albut) %>% 
  dplyr::mutate(bam = dir(system.file("extdata", package = "airway"), 
                          pattern = "*.bam",
                          full.names = TRUE)
  )

cvg <- compute_coverage_long(design, source = "bam")
cvg

Once the coverage is in long form, we can then merge overlapping genomic features over the experimental design via an intersect overlap join and nesting over the union ranges of an index index (usually gene_id).

cvg_over_features <- join_parts(cvg, features) 
cvg_over_features

Rangewise diagnostics

We can then compute a bunch of rangenostics over a given (key and index) accross a column representing the coverage score.

mean
min, median, max
quantile
variance
max_variance_shift
sum_if
lumpiness
stability
peakiness
near_feature_covnostic

Includes functions for purrr like mapping for sliding/tiling/stretching over genomic windows (using Bioconductor's Views infrastructure).

rango(cvg_over_features, y = score, wt = width, .funs, ...)

Visualising coverage scores

Options for visualising coverage over a given range

cvg_over_features <- join_parts(cvg, features) 

cvg_over_features %>% 
  filter(gene_id == "ENSG00000116649") %>% 
  view_coverage(score = score, colour = feature_type)

cvg_over_features %>% 
  filter(gene_id == "ENSG00000116649") %>% 
  view_coverage(score = score, colour = feature_type, 
                facets = dplyr::vars(dex))

This returns a regular old ggplot object, so segments can be overlaid via the / operator:

cov_track <- cvg_over_features %>% 
  filter(gene_id == "ENSG00000116649") %>% 
  view_coverage(score = score, colour = feature_type, 
                facets = dplyr::vars(dex))
annotation <- features %>% 
  filter(gene_id == "ENSG00000116649") %>% 
  unnest_parts() %>% 
  view_segments(colour = feature_type)

cov_track / annotation