knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
ragg_png = function(..., res = 192) {
  ragg::agg_png(..., res = res, units = "in")
}
knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")
options("warn"=-1);

Overview

This vignette is intended to describe how to create a Sashimi plot using RNA-seq data.

Requirements

There are three basic requirements for a Sashimi plot:

  1. Gene-exon structure
  2. RNA-seq coverage data
  3. exon-exon splice junction data

Gene-exon structure

There are several sources of gene-exon structures:

The requirement for Sashimi plots is to flatten exons to disjoint (non-overlapping) exons across the gene region of interest. The exons are used to determine the genomic regions to use for sequence coverage, and splice junction reads.

The exons are also used to define the genome coordinate transformation function used to compress introns to more manageable (smaller) size.

Use of detected transcripts

The recommended workflow is to use detected transcripts derived from splicejam::defineDetectedTx() or some other means. This step typically reduces the total transcripts by 60%, which greatly simplifies the resulting gene structure, and focuses the gene structure on the experimental data.

Gene transcriptome data resources are increasingly comprehensive, as they are derived from numerous tissue samples and cell types. In ideal cases, the data should represent the superset of possible gene transcripts, of which only a fraction are experimentally observed.

In addition, we observed questionable transcript annotations, with no apparent supporting data -- possible artifacts of an automated annotation pipeline. While these apparent errors are found in low percentage, their presence can completely disrupt a gene structure, sometimes combining several genes (70+) into one extremely long gene locus longer than 5 megabases.

As a practical matter, we found it beneficial that many of these transcripts get removed during the splicejam::defineDetectedTx() process, conveniently simplifying the resulting gene-exon models. Note this process currently uses data from an analysis tool such as Kallisto (Bray et al 2016, https://pachterlab.github.io/kallisto/about) or Salmon (Patro et al 2017, https://combine-lab.github.io/salmon/) since these tools estimate transcript isoform abundance.

Starting with TxDb

This example uses Bioconductor data package TxDb.Mmusculus.UCSC.mm10.knownGene which represents UCSC known genes for the mouse mm10 genome assembly.

First we assemble all exons by transcript, then all CDS exons by transcript. We will use these exons to assemble annotated gene models.

options("warn"=-1);
suppressPackageStartupMessages(library(splicejam));
suppressPackageStartupMessages(library(jamba));
suppressPackageStartupMessages(library(kableExtra));
if (suppressPackageStartupMessages(require(TxDb.Mmusculus.UCSC.mm10.knownGene))) {

   # First obtain exons by transcript
   exonsByTxMm10 <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene,
      by="tx",
      use.names=TRUE);
   values(exonsByTxMm10@unlistData)$feature_type <- "exon";
   values(exonsByTxMm10@unlistData)$subclass <- "exon";

   # For added insight, obtain CDS exons by transcript (optional)
   cdsByTxMm10 <- cdsBy(TxDb.Mmusculus.UCSC.mm10.knownGene,
      by="tx",
      use.names=TRUE);
   values(cdsByTxMm10@unlistData)$feature_type <- "cds";
   values(cdsByTxMm10@unlistData)$subclass <- "cds";
}

The TxDb objects do not permit storing gene symbols, so the example uses the Bioconductor annotation package org.Mm.eg.db to assign gene_name values. For values not found in org.Mm.eg.db, we use the format LOC# to indicate the Entrez gene ID, for example Entrez gene ID 1234 would be represented LOC1234.

if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) {

   # Now prepare tx_name, gene_id, gene_name data.frame,
   # surprisingly difficult
   tx2geneMm10 <- suppressMessages(
      AnnotationDbi::select(TxDb.Mmusculus.UCSC.mm10.knownGene,
         keys(TxDb.Mmusculus.UCSC.mm10.knownGene, "GENEID"),
         columns=c("GENEID","TXNAME"),
         keytype="GENEID"
      )
   );
   tx2geneMm10 <- renameColumn(tx2geneMm10,
      from=c("GENEID", "TXNAME", "TXTYPE"),
      to=c("gene_id", "transcript_id", "transcript_type"));

   # add gene_name using org.Mm.eg.db
   if (suppressPackageStartupMessages(require(org.Mm.eg.db))) {
      gene_ids <- values(genes(TxDb.Mmusculus.UCSC.mm10.knownGene))$gene_id;
      gene_namesL <- mget(gene_ids,
         org.Mm.egSYMBOL,
         ifnotfound=NA);
      ## Convert list to vector taking the first gene_name each
      ## (All genes should only have one SYMBOL but there is no
      ## hard constraint so we should make absolutely sure to
      ## use only one value per gene.)
      gene_names <- unlist(heads(S4Vectors::List(gene_namesL), 1));
      ## Replace NA with LOC# format
      ## Note we use gsub() to ensure the data fits the expected format
      if (any(is.na(gene_names))) {
         gene_na <- which(is.na(gene_names));
         gene_names[gene_na] <- gsub("^([0-9]+)$", "LOC\\1",
            names(gene_names[gene_na]));
      }
      tx2geneMm10$gene_name <- gene_names[as.character(tx2geneMm10$gene_id)];
   } else {
      ## If we have no gene annotations, use the gene_id values
      tx2geneMm10$gene_name <- as.character(tx2geneMm10$gene_id);
   }
   # print the first 20 rows to show the content
   print(head(tx2geneMm10, 20));
}

Next we flatten transcript exons to the gene level using the function splicejam::flattenExonsBy().

We also flatten exons to the transcript level, which has the effect of combining CDS exons alongside non-CDS exons. The result is used when plotting transcript exon structures.

In both cases, the argument genes=c("Gria1", "Ntrk2") is used as a convenience to produce data for only these two genes.

We recommend using detected transcripts here, with the argument detectedTx.

if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) {
   # flatten exons to the gene level
   # for speed, we will only process "Gria1", and "Ntrk2"
   flatExonsByGeneMm10 <- flattenExonsBy(exonsByTx=exonsByTxMm10,
      cdsByTx=cdsByTxMm10,
      by="gene",
      genes=c("Gria1", "Ntrk2"),
      tx2geneDF=tx2geneMm10,
      verbose=FALSE);

   # to be fancy, also flatten transcripts, to include CDS ranges   
   flatExonsByTxMm10 <- flattenExonsBy(exonsByTx=exonsByTxMm10,
      cdsByTx=cdsByTxMm10,
      tx2geneDF=tx2geneMm10,
      by="tx",
      genes=c("Gria1", "Ntrk2"));
}

Plot gene-exon models using gene2gg()

Once the gene-transcript-exon data is prepared, the exon structure can be plotted using the function splicejam::gene2gg():

if (require(TxDb.Mmusculus.UCSC.mm10.knownGene)) {
   # Pull out Gria1
   grlGria1 <- flatExonsByGeneMm10[["Gria1"]];

   ## Plot a basic gene-exon structure
   ggGria1exons <- gene2gg(gene="Gria1",
      flatExonsByGene=flatExonsByGeneMm10,
      exonLabelSize=6);
   print(ggGria1exons + ggtitle("Gria1 exons"));

   ## Compare to the gene structure without compressing introns
   gg1full <- gene2gg(gene="Gria1",
      flatExonsByGene=flatExonsByGeneMm10,
      compressGaps=FALSE)
   print(gg1full);

   ## Plot a slightly more detailed gene-transcript-exon structure
   ggGria1exonsTx <- gene2gg(gene="Gria1",
      flatExonsByGene=flatExonsByGeneMm10,
      flatExonsByTx=flatExonsByTxMm10,
      tx2geneDF=tx2geneMm10);
   print(ggGria1exonsTx + ggtitle("Gria1 (compressed introns)"));

   ## Notice how difficult it is to see exon15 and exon16 are
   ## mutually exclusive exons
   gg2full <- gene2gg(gene="Gria1",
      flatExonsByGene=flatExonsByGeneMm10,
      flatExonsByTx=flatExonsByTxMm10,
      tx2geneDF=tx2geneMm10,
      compressGaps=FALSE);
   print(gg2full + ggtitle("Gria1 (uncompressed introns)"));

}

Coverage data

There are currently two main sources of coverage data:

The next step is to define biological samples for each set of coverage data. For this step, we use a data.frame called "filesDF", with these colnames:

Coverage in bigWig format

A bigWig coverage file can only store coverage for one strand, when using strand-specific RNA-seq. To provide negative strand data, either provide bigWig coverage scores which are negative values (never above zero, and containing at least one negative value), or supply a negative value for the "scale_factor" field in filesDF.

The example below defines a data.frame named filesDF which contains bigWig coverage, and splice junction data.

## assemble a data.frame
baseurl <- "https://orio.niehs.nih.gov/ucscview/farrisHub/mm10/";

# BED files with junction reads
bedext <- ".STAR_mm10.combinedJunctions.bed";
bwext <- c("492_1.sickle.merged.cutadapt.STAR_mm10.pos.bw",
   "492_1.sickle.merged.cutadapt.STAR_mm10.neg.bw");
c1 <- c("CA1", "CA2");
r1 <- c("CB", "DE");
bedsamples <- paste0(rep(c1, each=2), "_", r1);
bedurls <- paste0(baseurl,
   bedsamples,
   bedext);

# bigWig files with strand-specific read coverage
bwsamples <- paste0(rep(c1, each=4),
   rep(r1, each=2));
bwsamples1 <- paste0(rep(c1, each=4),
   "_",
   rep(r1, each=2));
bwurls <- paste0(baseurl, "NS50211",
   bwsamples,
   bwext);

# Assemble into a data.frame
filesDF <- data.frame(stringsAsFactors=FALSE,
   check.names=FALSE,
   url=c(bedurls, bwurls),
   type=rep(c("junction", "bw"), c(4,8)),
   sample_id=c(bedsamples, bwsamples1),
   scale_factor=rep(c(1,3), c(length(bedurls), length(bwurls))));
filesDF;

Coverage in GRanges format

Example coverage data is provided as data, to demonstrate the format expected. Note that in this case, there is no coverage defined for introns. The bigWig workflow by default adds gaps between exons, and retrieves coverage data for introns as well.

The GRanges object below uses an example set of exons, test_exon_wide_gr, and corresponding coverage GRanges is test_exon_wide_gr. Note that both share the same names and genomic coordinates.

Ultimately, coverage data should be split by the same GRanges used to define exons (and introns) for the Sashimi plot.

data(test_exon_wide_gr);
test_exon_wide_gr;
data(test_cov_wide_gr);
test_cov_wide_gr;

Plotting exon coverage data

Some basic plotting functions are provided, in fact they are rolled into the Sashimi plot itself. They can help visualize the coverage by itself.

The function splicejam::exoncov2polygon() converts coverage to polygons, by default it returns a format convenient for geom_polygon() or ggforce::geom_shape(), but if coord_style="base" then it returns a matrix with blank rows separating each polygon, suitable for R base graphics using graphics::polygon().

The argument covNames must contain one of more columns in colnames(values(test_cov_wide_gr)).

# To plot a simple GRanges object
widecovdf <- exoncov2polygon(test_cov_wide_gr, covNames="sample_A");
suppressPackageStartupMessages(require(ggplot2));
ggWide3 <- ggplot(widecovdf,
      aes(x=x, y=y, group=gr, fill=gr, color=gr)) +
   ggforce::geom_shape(alpha=0.7) +
   colorjam::theme_jam() +
   colorjam::scale_fill_jam() +
   colorjam::scale_color_jam();
print(ggWide3);

Compressing intron coordinates

The plot above demonstrates default genomic coordinates, and shows how large intron regions are compared to typical exons (in mammals anyway.) For better visualization of the transcript exons, we compress introns using the function splicejam::make_ref2compresssed(). This function returns a list with several useful components, one of which is trans_grc which is a ggplot2 transformation function.

The argument nBreaks=10 defines 10 fixed x-axis label positions, which by default use exon boundaries. To label every exon boundary, use nBreaks=Inf.

The key step when using ggplot2 is to use the function ggplot2::scale_x_continuous(), using argument trans=ref2c$trans_grc. This argument defines the x-axis compression, x-axis breaks, x-axis minor breaks, and x-axis labels.

Alternatively, you can use ggplot2::coord_trans(x=ref2c$trans_grc) however, that function only compresses the x-axis coordinates and does not define breaks or labels.

# Now compress the introns keeping axis labels
ref2c <- make_ref2compressed(test_cov_wide_gr,
   nBreaks=10);

# Repeat ggplot using ref2c$trans_grc
ggWide3c <- ggWide3 +
   scale_x_continuous(trans=ref2c$trans_grc) +
   xlab("chr1 (compressed introns)") +
   ggtitle("exons (compressed introns)");
print(ggWide3c);

Splice junction data

Splice junction data can be provided in two forms:

Currently, the rtracklayer::import() function does not support bigBed format, and there is no apparent plan to provide that capability. It is recommended to use the UCSC tool bigBedToBed to convert bigBed files to BED format.

A set of sample splice junction data is provided to show the expected format of junction data. Note it has two columns "score" and "sample_id".

data(test_junc_wide_gr);
test_junc_wide_gr;

Plotting splice junction data

Some basic plotting functions are provided, which are rolled into the Sashimi plot. They can help visualize splice junction data directly, as needed.

The key function here is splicejam::grl2df(...,shape="junction") which configures wide junction arcs suitable for splicejam::geom_diagonal_wide_arc(). (Note that ggplot2::geom_polygon() would also work.)

# To plot junctions, use grl2df(..., shape="junction")
junc_wide_df <-grl2df(test_junc_wide_gr,
   shape="junction");

ggWide1 <- ggplot(junc_wide_df,
      aes(x=x, y=y, group=gr_name, fill=gr_name, color=gr_name)) +
   splicejam::geom_diagonal_wide_arc() +
   colorjam::theme_jam() +
   colorjam::scale_fill_jam(alpha=0.7) +
   colorjam::scale_color_jam() +
   xlab("chr1") +
   ggtitle("junctions (full intron width)")
print(ggWide1);

For more insight into creating a Sashimi plot manually, see the help for data splicejam::test_junc_wide_gr.

Prepare Sashimi Data

The main function prepareSashimi() is intended to assemble the sources of data together into one list of data.frame results which can be used to produce ggplot2 visualizations.

Data obtained from external files

For this example, we use the Gria1 gene in mouse mm10, pulling from a web-accessible track hub. Note that the bigBed files were already converted to BED12 format using the UCSC tool bigBedToBed.

if (exists("flatExonsByGeneMm10")) {
   shGria1 <- prepareSashimi(gene="Gria1",
      flatExonsByGene=flatExonsByGeneMm10,
      minJunctionScore=100,
      sample_id=c("CA1_CB", "CA2_CB"),
      filesDF=filesDF);
}

Data provided using GRanges objects

Alternatively, you can provide data in the form of GRanges objects, and not point to specific files.

Note that filesDF is still required for coverage in GRanges format, in order to define the "sample_id" for each coverage.

data(test_exon_wide_gr);
data(test_junc_wide_gr);
data(test_cov_wide_gr);
testfilesDF <- data.frame(url="sample_A",
   type="coverage_gr",
   sample_id="sample_A",
   scale_factor=1);
testfilesDF;

# Now prepare sashimi data
sh1 <- prepareSashimi(GRangesList(TestGene1=test_exon_gr),
   filesDF=testfilesDF,
   gene="TestGene1",
   sample_id="sample_A",
   covGR=test_cov_gr,
   juncGR=test_junc_gr);

At this point, we can just plot the result.

# Plot sample Sashimi data
plotSashimi(sh1);

Plot Sashimi Data

Once the Sashimi data is prepared, the splicejam::plotSashimi() function is used to create a visualization that can be customized in several ways.

The output of splicejam::plotSashimi() is a ggplot object, which can be printed to create a plot, or can be modified as needed.

if (exists("shGria1")) {
   ggGria1 <- plotSashimi(shGria1,
      junc_color=alpha2col("goldenrod1", 0.4),
      junc_fill=alpha2col("goldenrod1", 0.4),
      show=c("coverage", "junction", "junctionLabels"),
      fill_scheme="sample_id");
   print(ggGria1);
}

Adding gene-exon model to a Sashimi plot

To add a gene-exon model, and even transcript-exon models, we will use the cowplot package to arrange multiple ggplot objects together.

Some guidelines:

The example below demonstrates combining a gene-exon plot with a Sashimi plot, meeting the required guidelines.

if (suppressPackageStartupMessages(require(cowplot)) &&
      exists("ggGria1")) {
   cpGria1 <- cowplot::plot_grid(
      ggGria1 + theme(axis.text.x=element_blank()) + xlab(NULL),
      ggGria1exonsTx + ggtitle(NULL) + expand_limits(y=-2),
      ncol=1,
      align="v", 
      axis="lr",
      rel_heights=c(4,4));
   print(cpGria1);
}

Sashimi plot with unscaled introns

if (require(cowplot) &&
      exists("ggGria1")) {
   cpWide <- cowplot::plot_grid(ncol=1,
      ggGria1 + scale_x_continuous() + theme(axis.text.x=element_blank()) + xlab(NULL),
      gg2full,
      axis="lr",
      align="v");
   print(cpWide);
}

Zoom into specific exons

There are two ways to zoom into a region of interest:

  1. Prepare one Sashimi panel, and use ggforce::facet_zoom()
  2. Prepare Sashimi data for a specific set of exons.

Zoom one panel using facet_zoom

The example below shows the ggforce::facet_zoom() capability.

Note the most important step is to transform the x-axis values when using xlim to subset the range. The splicejam::prepareSashimi() function returns x$ref2c$transform() which transforms genomic coordinates into compressed coordinates.

plotSashimi(sh1) + 
   ggforce::facet_zoom(xlim=sh1$ref2c$transform(c(370, 410)));

Zoom multiple panels

In order to zoom multiple panels, a few steps must be followed:

if (exists("flatExonsByGeneMm10")) {
   zoomStart <- start(flatExonsByGeneMm10[["Gria1"]]["Gria1_exon14"]);
   zoomEnd <- end(flatExonsByGeneMm10[["Gria1"]]["Gria1_exon18a"]);

   # Now prepare sashimi data
   shGria1L <- lapply(c("CA1_CB", "CA2_CB"), function(iSample){
      # devtools::load_all("/Users/wardjm/Projects/Ward/splicejam")
      shX <- prepareSashimi(gene="Gria1",
         flatExonsByGene=flatExonsByGeneMm10,
         minJunctionScore=100,
         sample_id=iSample,
         doStackJunctions=FALSE,
         filesDF=filesDF);
   });
   ggGriaL <- lapply(shGria1L, function(shX){
         ggX <- plotSashimi(shX) +
            ggforce::facet_zoom(xlim=shX$ref2c$transform(c(zoomStart, zoomEnd)));
   });

   # now use cowplot
   if (require(cowplot)) {
      cpMulti <- do.call(cowplot::plot_grid, ggGriaL);
      print(cpMulti);
   }
}


jmw86069/splicejam documentation built on Nov. 4, 2024, 10:53 a.m.