knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  warnings = FALSE,
  error = FALSE,
  message = FALSE
)

superintronic

lifecycle Build Status AppVeyor Build Status Codecov test coverage

superintronic centers around exploring coverage over genomic regions via computing simple summary statistics and visualisations. The aim is to provide a modular worklfow via an interface built on top of the plyranges package and emphasising simple

Quick start

Often, as quality control check in sequencing analysis you would like to visualise coverage within each sample over the exon/intron parts of gene of interest. This is simple with superintronic!

As an example, we will use the sample BAM files from the airway data package in Bioconductor and our target gene is SRM

The basic workflow consists of three steps

  1. Setting up your gene region of interest
  2. Computing coverage
  3. Visualising coverage results

(Note that we are considering a single gene for illustrative purposes and in principle you can look at coverage over any genomic region you're interested in!)

1. Setting up annotation GRanges

You can start from a GTF/GFF file for your given organism and collect_parts() to gather up the intronic/exonic regions for each gene. You can then use the plyranges grammar to select your target gene.

library(superintronic)
suppressPackageStartupMessages(library(plyranges))

features <- system.file("extdata", 
                        "Homo_sapiens.GRCh37.75_subset.gtf", 
                        package = "airway") %>% 
  collect_parts() %>% 
  filter(gene_name == "SRM")

features

2. Computing coverage over samples

You can compute a long form GRanges containing coverage scores in parallel over a set of BAM files. All you need to begin is a character vector containing the path the BAM files or a data.frame representing the experimental design that contains a column of BAM files.

design <- read.csv(system.file("extdata", 
                               "sample_table.csv", 
                               package = "airway")) %>% 
  dplyr::select(sample_id = X, cell, dex, albut) %>% 
  dplyr::mutate(bam = dir(system.file("extdata", package = "airway"), 
                          pattern = "*.bam",
                          full.names = TRUE)
  )

cvg <- compute_coverage_long(design, source = "bam")

cvg 

This function automatically propagates, the metadata associated with a design onto the resulting GRanges. You can speed up this computation if your BAM files are indexed by providing a target GRanges so coverage will be restricted to that set of ranges only.

Once the coverage has been computed as a GRanges object, we can then intersect overlapping gene parts (from collect_parts())

cvg_over_features <- cvg %>% 
  select(-bam) %>% 
  join_parts(features)
cvg_over_features

Further summaries can then be computed using rangewise diagnostics, see the rangle() function for details.

3. Visualising coverage scores

Coverage over a gene or a genomic region can be constructed via the view_coverage() function. The coverage score is disjoined and aggregated over all samples that overlaps the target region:

library(ggplot2)
p <- cvg_over_features %>% 
  mutate(strand = feature_strand) %>% 
  view_coverage(score = score, colour = feature_type) + 
  scale_color_brewer(palette = "Dark2") +
  guides(colour = FALSE) +
  labs(title = "Coverage over SRM")
p

By default, the coverage is aligned from the 5' to the 3' end of the target region, and exon regions are coloured green and intron parts are colour orange (if you have set the strand on the input data, otherwise coverage is oriented 3' to 5').

The function returns a ggplot object, so extra annotations or layers can be added via the ggplot2 library.

We can also create track plots to include the gene body by adding in segments with

gene_track <- view_segments(unnest_parts(features), 
                            colour = feature_type)
gene_track

And via the patchwork, generate tracks:

p / gene_track

Coverage scores can also be facetted if additional experimental design information has been added. For example, we can split the coverage by each sample or treatment group or cell type

p <- cvg_over_features %>% 
  mutate(strand = feature_strand) %>% 
  view_coverage(score = score, 
                colour = feature_type, 
                facets = vars(cell)) + 
  scale_color_brewer(palette = "Dark2") +
  guides(colour = FALSE) +
  labs(title = "Coverage over SRM")
p / gene_track + patchwork::plot_layout(heights = c(3, 1))

Learning more

superintronic provides a general API to compute not only coverage, but also aggregate and visualise other measurements of interest over any arbirtrary genomic region. See the vignette for details and the plyranges documentation.



sa-lee/superintronic documentation built on Feb. 18, 2020, 10:36 a.m.