knitr::opts_chunk$set(echo = TRUE, fig.width = 7, fig.height = 5)

Package goals

The GenomicWidgets package enables the creation of interactive visualizations for functional genomics data in R. These interactive visualizations can be included in stand-alone Rmarkdown HTML documents, or can be linked in shiny applications. These visualization are intended to highlight the data at different scales, from a single locus to a few loci to hundreds of regions along the genome, and to enable the views at different scales to be linked together.

Packages for running vignette

In addition to GenomicWidgets, we will load a few other packages for use in this vignette.

library(GenomicWidgets)
library(iheatmapr)
library(GenomicRanges)
library(SummarizedExperiment)

# If TxDb.Hsapiens.UCSC.hg19.knownGene is not installed, you can install it via
# source("https://bioconductor.org/biocLite.R")
# biocLite("TxDb.Hsapiens.UCSC.hg19.knownGene")
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

We will also use data from the genomationData package. If not installed, can be installed via:

source("https://bioconductor.org/biocLite.R")
biocLite("genomationData")

Genomics Heatmaps

The GenomicWidgets package builds on the iheatmapr package to enable creation of complex, interactive heatmaps for genomics data.

Expression heatmap

GenomicWidgets includes an iheatmap method for SummarizedExperiment objects, which are often used for storing expression data. Compared to the iheatmap method for a matrix, the method for SummarizedExperiment has some different default settings that are often appropriate for expression or epigenomics data.

The package includes a small subset of the expression data from the Roadmap Epigenomics project.

data("rpkm_chr21")

We'll subset further to use just the first 50 genes, and only samples from Primary Culture or Tissue.

rpkm_chr21 <- 
  rpkm_chr21[1:50,which(colData(rpkm_chr21)$TYPE %in% c("PrimaryCulture",
                                                        "PrimaryTissue"))]

If we call the iheatmap method on a SummarizedExperiment object, we obtain the following heatmap:

iheatmap(rpkm_chr21,"rpkm")

If instead we had applied the iheatmap method applied to the plain matrix of RPKM values, the basic iheatmap method is used with defaults that are not as suitable, leading to a less interesting heatmap:

iheatmap(assays(rpkm_chr21)[["rpkm"]])

For SummarizedExperiment, hierarchical clustering and standardization of rows is the default.

We can further enhance this heatmap by passing along additional components of the SummarizedExperiment object to additional arguments to the iheatmap method.

iheatmap(rpkm_chr21, "rpkm",
         x = colData(rpkm_chr21)$STD_NAME, 
         y = rowData(rpkm_chr21)$SYMBOL, 
         col_annotation = colData(rpkm_chr21)[,c("TYPE","SEX")])

There is also an add_iheatmap method defined for SummarizedExperiment for adding a heatmap based on a SummarizedExperiment to an existing heatmap. Here we will separate our samples into two groups to demonstrate the add_iheatmap function:

rpkm_group1 <- rpkm_chr21[,which(colData(rpkm_chr21)$TYPE == "PrimaryTissue")]
rpkm_group2 <- rpkm_chr21[,which(colData(rpkm_chr21)$TYPE == "PrimaryCulture")]

iheatmap(rpkm_group1, "rpkm",
         x = colData(rpkm_group1)$STD_NAME, 
         y = rowData(rpkm_group1)$SYMBOL, 
         col_title = "Primary Tissue") %>% 
  add_iheatmap(rpkm_group2, "rpkm",
         x = colData(rpkm_group2)$STD_NAME, 
         col_title = "Primary Culture")

Note that in practice this might not make sense to do, as the rows are now scaled separately. The add_iheatmap function can be useful for juxtaposing a different data type. If a name for the colorbar is not given (via the name argument), the two main heatmaps are put on the same color scale.

Coverage heatmaps

GenomicWidgets also has methods for making coverage heatmaps.

The data

For demonstration of coverage heatmaps, we'll use ChIP-seq data for the H1 ESC cell line from the ENCODE project included in the genomationData Bioconductor package. We'll first read in a sample table and get the correct paths to the files.

genomation_dir <- system.file("extdata", package = "genomationData")

samp.file <- file.path(genomation_dir,'SamplesInfo.txt')
samp.info <- read.table(samp.file, header=TRUE, sep='\t', 
                       stringsAsFactors = FALSE)
samp.info$fileName <- file.path(genomation_dir, samp.info$fileName)

Then we'll read in the location of CTCF peaks. We'll plot the coverage at these peaks.

ctcf.peaks <- genomation::readBroadPeak(system.file("extdata",
                         "wgEncodeBroadHistoneH1hescCtcfStdPk.broadPeak.gz",
                         package = "genomationData"))
ctcf.peaks <- ctcf.peaks[seqnames(ctcf.peaks) == "chr21"]
ctcf.peaks <- ctcf.peaks[order(-ctcf.peaks$signalValue)]
ctcf.peaks <- resize(ctcf.peaks, width = 501, fix = "center")

We'll limit ourselves to considering 50 of the peaks.

ctcf.peaks <- ctcf.peaks[1:50]

Making coverage matrices

We first will make coverage matrices. These will be returned as RangedSummarizedExperiment objects.

ctcf_mats <- make_coverage_matrix(samp.info$fileName[1:5], 
                                  ctcf.peaks, 
                                  input_names = samp.info$sampleName[1:5],
                                  up = 250, 
                                  down = 250, 
                                  binsize = 25)

With bam input, the make_coverage_matrix function computes the coverage based on the basepairs covered by the actual reads; no extending or shifting is done. With bigwig input, the function assumes the bigwig file contains the coverage track.

You can also use your own preferred method for making a coverage matrix. The coverage_heatmap method shown below can accept either a SummarizedExperiment, a plain matrix, a list of matrices, or a ScoreMatrix or ScoreMatrixList object (objects from the genomation package). The genomation package includes the ScoreMatrix, ScoreMatrixBin, and ScoreMatrixList functions to read in coverage and make coverage matrices.

Single coverage

Here we will create a single coverage heatmap for the CTCF ChIP-seq.

coverage_heatmap(ctcf_mats, "Ctcf")

Scaling methods

By default, the coverage matrix shown in the heatmap is scaled by the root mean squared value of the row. This can be adjusted by modifying the scale_method argument. For example, to scale based off the 95th percentile value in the matrix:

coverage_heatmap(ctcf_mats, "Ctcf", scale_method = "PercentileMax")

You can obained the scaled coverage matrices by using the normalize_coverage_matrix function.

normed_ctcf_mats <- normalize_coverage_matrix(ctcf_mats, method = "PercentileMax")

Multiple coverage heatmaps

coverage_heatmap(ctcf_mats, c("Ctcf","Znf143"))

You can also add multiple coverage heatmaps at once to an existing heatmap using add_coverage_heatmaps. We will demonstrate that in the next section.

Combining expression heatmap with coverage heatmap

Get coverage matrix at tss for the Chip-Seq

tss <- promoters(rowRanges(rpkm_chr21), up = 1, down = 1)

tss_mats <- make_coverage_matrix(samp.info$fileName[1:5], 
                                  tss, 
                                  input_names = samp.info$sampleName[1:5],
                                  up = 500, 
                                  down = 500, 
                                  binsize = 25)

Make expression heatmap and then add coverage heatmap for TSS for two of the factors:

iheatmap(rpkm_chr21, "rpkm",
         x = colData(rpkm_chr21)$STD_NAME, 
         y = rowData(rpkm_chr21)$SYMBOL, 
         col_annotation = colData(rpkm_chr21)[,c("TYPE","SEX")]) %>% 
  add_coverage_heatmap(tss_mats, c("P300","Suz12"))

Interactive genome tracks

GenomicWidgets also has functions for interactive local views of genomic signals.

Set track parameters

For plotting genomic signals as tracks, first you specify the source of the data and other options.

track_params <- set_track_parameters(samp.info$fileName[1:3], 
                                    annotation = 
                                      TxDb.Hsapiens.UCSC.hg19.knownGene, 
                                    track_names = samp.info$sampleName[1:3], 
                                    share_y = TRUE)

Those parameters can then be passed along with a range to generate the interactive genome track plot.

plot_tracks(resize(ctcf.peaks[2], width = 5000, fix = "center"), track_params)

Multiple regions

You can easily plot multiple regions by passing along several ranges rather than a single range.

plot_tracks(resize(ctcf.peaks[1:4], width = 5000, fix = "center"), track_params)

Options

Passing more arguments

You can over-ride parameters set by set_track_parameters by passing them again to plot_tracks

plot_tracks(resize(ctcf.peaks[2], width = 5000, fix = "center"),
            track_params,
              showlegend = FALSE)

x axis labels for multiple windows

When plotting mutliple windows, by default the x axis reflects the relative position to the center of the window. This can be modified by setting the offset argument, which tells the function what position should be used as the center anchor. The xaxis title can also be modified via the xtitle argument.

plot_tracks(resize(ctcf.peaks[1:4], width = 5000, fix = "center"), 
            track_params, 
            offset = 0,
            xtitle = "CTCF peaks")

Annotation position

plot_tracks(resize(ctcf.peaks[2], width = 5000, fix = "center"),
            track_params,
              annotation_position = "top")

Grouping

You can group multiple signal together on one set of axes by using the groups argument.

track_params_grouped <- 
  set_track_parameters(samp.info$fileName[1:4], 
                       annotation = TxDb.Hsapiens.UCSC.hg19.knownGene, 
                       track_names = samp.info$sampleName[1:4], 
                       share_y = TRUE,
                       groups = c("CTCF & cohesin","P300","Suz12",
                                  "CTCF & cohesin"))

plot_tracks(resize(ctcf.peaks[2], width = 5000, fix = "center"), 
            track_params_grouped)

Note that at the moment grouping is ignored when plotting multiple loci -- all tracks for the locus will be grouped together.

Locus names

When plotting multiple regions, by default the name is taken from "name" column in the ranges (if it exists). You can alter that by passing something to the locus_names argument.

plot_tracks(resize(ctcf.peaks[1:2], width = 5000, fix = "center"),
            track_params,
            locus_names = c("Peak 1","Peak 2"))

Colors

You can change the colors of the tracks by passing a colors argument.

plot_tracks(resize(ctcf.peaks[1], width = 5000, fix = "center"),
            track_params,
            colors = c("red","blue","black"))

Adjusting layout parameters

You can alter plotly layout parameters by passing a list to the layout argument. Use caution with this argument, as this could disrupt the intended layout if used incorrectly. Consult the plotly.js documentation for layout options. Here we will demonstrate adjusting two of the margins:

plot_tracks(resize(ctcf.peaks[1], width = 5000, fix = "center"),
            track_params,
            layout = list(margin = list(r = 200, t = 20)))

Add in gene or locus level summaries

We can add a boxplot next to each region summarizing the gene expression levels (or any other feature of the locus) for some group of samples. We do that by first specifying the parameters to generate such a summary based on a SummarizedExperiment object. We pass along to the group argument the name of a column in the colData of the SummarizedExperiment object that we will use to group the expression values along the x axis. The ranges argument tells what ranges correspond to what rows of the SummarizedExperiment. By default, if the SummarizedExperiment is a RangedSummarizedExperiment then the rowRanges are used.

summary_params <- 
  set_summary_parameters(rpkm_chr21,
                         groups = "GROUP",
                         ranges = resize(tss, width = 5000, fix = "center"))

We pass the summary parameters to our track parameters, via the summary argument.

track_plus_summary_params <-  set_track_parameters(samp.info$fileName[1:3], 
                                    annotation = 
                                      TxDb.Hsapiens.UCSC.hg19.knownGene, 
                                    track_names = samp.info$sampleName[1:3], 
                                    share_y = TRUE,
                                    summary = summary_params)

Calling plot_tracks with our new track parameters including the summary parameters creates a plot of the ChIP-seq signal around the TSS of those genes as well as a boxplot of the gene expression for different samples for that gene on the right. We can call it using GenomicRanges, but they must match ranges we gave to set_summary_parameters.

plot_tracks(resize(tss, width = 5000, fix = "center")[1:3], 
            track_plus_summary_params)

We can also instead call plot_tracks with rownames of the SummarizedExperiment we passed to set_summary_parameters.

plot_tracks(rownames(rpkm_chr21)[1:3], track_plus_summary_params)

Options

The properties of the summary plots can be altered by adding a named list of new parameter values to the summary_args argument. For example, we can alter the color:

plot_tracks(resize(tss, width = 5000, fix = "center")[1:3], 
            track_plus_summary_params,
            summary_args = list(colors = "black"))

Sometimes there can be issues with spacing of legend and margin. These can be addressed by passing margin and legend positions to layout argument:

plot_tracks(rownames(rpkm_chr21)[1:2], track_plus_summary_params, 
            layout = list(margin = list(r = 200), legend = list(x = 1.2)))

Apps linking heatmaps to genome tracks

We can link our heatmap views of many genes/regions with the more detailed track views by making a shiny app that links the two through a click event. We will need both our heatmap object, a track parameters object, and a link function that helps link the heatmap to the track view. The link function takes as input the heatmap as well as the potential input for plot_tracks. That input should be ordered according to how the rows were ordered in the input data for the heatmap.

hm <- iheatmap(rpkm_chr21, "rpkm",
         x = colData(rpkm_chr21)$STD_NAME, 
         y = rowData(rpkm_chr21)$SYMBOL, 
         col_annotation = colData(rpkm_chr21)[,c("TYPE","SEX")]) %>% 
  add_coverage_heatmap(tss_mats, c("P300","Suz12"))

link_fn <- heatmap_click(hm, rownames(rpkm_chr21))

Note: The following code has to be run interactively!

heatmap_to_tracks_shiny(hm, track_plus_summary_params, link_fn)


skummerf/GenomicWidgets documentation built on May 31, 2019, 6:16 p.m.