In blaserlab/blaseRtools: R Tools for Blaser Lab Data Analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

Certain genomic assays are best represented as genomic regions. Representing a genomic region minimally requires identification of the relevant sequence as well as a start and end position for the region of interest. Strand information is often included as well. There are multiple plain-text and binary formats to store this information. In R, the implementaion for storing genomic region data is provided by the GenomicRanges package in the form of a GRanges object. GRanges objects are flexible and powerful containers for genomic region data. In addition to these required values, GRanges allows you to store arbitrary metadata associated with each genomic region or interval. This is very useful.

This is not a tutorial on how to use GenomicRanges. Please see this instead. Another important resource not discussed in detail here is the plyranges package which provides a dplyr-like interface for manipulating GRanges objects. Here instead we present the Trace object and associated functions for visualizing single-cell or bulk genomic region data (ATAC, CHIP-seq etc.). The Trace object is an extension of GRanges with the following differences:

Trace objects have 5 "slots". Each slot is a GRanges object. The slots are
trace_data: The main track you want to show. Usually it will depict binding or accessibility as a continuous variable across a genomic region.
peaks: Defined regions of accessibility or binding represented as binary values (peak present or not) across a genomic region. Usually defined by a peak caller such as MACS.
links: Regions of chromatin that are co-accessible. This is a probabilistic determination based on how often two regions of chromatin are acessbible in multiple single cells. Does not necessarily represent a physical interaction between these regions. Not applicable for bulk assays.
gene_model: The gene model from the reference genome.
plot_range: The overall range of the Trace object.
Validity rules: trace_data, peaks, links and gene model must all come from the same genome and fit within plot_range.
Metadata: The metadata columns are not required to be the same in all of the slots. This is different from the GRangesList object.

An example of each of these slots:

library(blaseRtools)
# trace_data
Trace.data(zf_prkcda_trace)

# peaks
Trace.peaks(zf_prkcda_trace)

# links
Trace.links(zf_prkcda_trace)

# gene model
Trace.gene_model(zf_prkcda_trace)

# plot range
Trace.plot_range(zf_prkcda_trace)

The Trace object is suitable for holding such information centered about a gene of interest. It is as small as possible and can quickly provide data for very informative plots as we will show.

Making a Trace Object

Trace objects can be made either from a Signac/Seurat object or from a GRanges object.

Signac is a package that stores and reads single-cell genomic range data (accessibility, peaks, etc) to and from a Seurat object. Although Signac/Seurat objects are powerful and a standard in the community, they are large, complicated and have lots of hidden slots that are confusing and error prone for the user. So Trace objects can be used to extract the required information for visualization in a structured way and to store the data with a much smaller footprint.

Bulk chromatin analysis outputs data in bigwig files which can be read into R using the import.bw function from rtracklayer. This will generate a GRanges object. It will be extremely large and so Trace objects are much lighter-weight alternative.

You make a Trace object from a Signac/Seurat object like so:

# note this code will not run
zf_prkcda_trace <- bb_makeTrace(obj = zf,
                             genome = "danRer11",
                             gene_to_plot = "prkcda",
                             extend_left = 5000,
                             extend_right = 5000)

# loaded by blaseRtools
zf_prkcda_trace

Peaks and Links, if included in the Seurat/signac object will be transferred to the Trace object.

You make a Trace object from bulk data like so. Filtering the full GRange will speed up processing considerably.

# note this code will not run
e4_PRKCD_trace <- bb_makeTrace(obj = plyranges::filter(e4_huvec_GRange_full.pkc_cxcl8, seqnames == "chr3"),
                              gene_to_plot = "PRKCD",
                              peaks = e4_huvec_peaks.pkc_cxcl8,
                              genome = "hg38",
                              extend_left = 5000,
                              extend_right = 5000)

# loaded by blaseRtools
e4_PRKCD_trace

When you import a bigwig into a GRanges object, it will not necessarily have the same group and coverage metadata. It probably will have only a numeric column called score. If this is the case, the constructor function will rename it coverage and add a group variable with the value "bulk". You can change these options if you bigwig is different, but leaving them as default (group and coverage) will allow the plotting functions to work with defaults, so this is what is recommended.

Plotting Component Tracks

You will make the plots using a set of functions which I describe below. Each track is made separately and composed into a single panel using the patchwork package. Each track is a ggplot object and can be edited using layers like any other ggplot. Like most plotting functions, the first argument is always the data object.

Ploting trace data

You can plot the trace data using bb_plot_trace_data. Without any other arguments, it looks like this:

library(cowplot)
theme_set(theme_cowplot(font_size = 12))
bb_plot_trace_data(zf_prkcda_trace)

For this function, the default y variable is "coverage" and the facet and color variables are "group". These are the default because this is how Signac/Seurat objects hold the data. You set the "group" variable values by using the Idents function to operate on the Signac object before making the Trace object. If there are other variables you wish to display you should add them to the group variable

dat <- Trace.data(zf_prkcda_trace) |> 
  plyranges::mutate(group_alt = paste0(group, "_alt"))  
zf_prkcda_trace <- Trace.setData(zf_prkcda_trace, gr = dat)
bb_plot_trace_data(zf_prkcda_trace, facet_var = "group_alt", color_var = "group_alt")

You can set the color palette:

dat <- Trace.data(zf_prkcda_trace) |> 
  plyranges::mutate(group_alt = paste0(group, "_alt"))  
zf_prkcda_trace <- Trace.setData(zf_prkcda_trace, gr = dat)
cols <- RColorBrewer::brewer.pal(n = 5, name = "Set1")
alt_pal <- c("Erythroid_alt" = cols[1],
             "Neutrophil_alt" = cols[2],
             "Progenitor 1_alt" = cols[3],
             "Progenitor 2_alt" = cols[4],
             "Renal_alt" = cols[5]
             )
bb_plot_trace_data(zf_prkcda_trace, 
                   facet_var = "group_alt", 
                   color_var = "group_alt", 
                   pal = alt_pal)

You can stack the traces if you want:

bb_plot_trace_data(zf_prkcda_trace, 
                   facet_var = NULL, 
                   color_var = "group_alt", 
                   pal = alt_pal, 
                   legend_pos = "right")

For bulk data, you may have imported the values you want to plot under another variable name

Trace.data(e4_PRKCD_trace)

Plotting the Gene Model

This plots the underlying gene model from the reference genome. By default it picks the APPRIS prinicpal1 transcript for each gene. UTRs are shown in half the height of coding exons.

bb_plot_trace_model(zf_prkcda_trace)

Or you can show a specific transcript. There is only 1 transcript for this gene.

bb_plot_trace_model(zf_prkcda_trace, select_transcript = "ENSDART00000033819")

Here is the same gene in humans with multiple transcripts:

# prinicpal1 transcript
bb_plot_trace_model(e4_PRKCD_trace)

# an alternative transcript
bb_plot_trace_model(e4_PRKCD_trace, select_transcript = "ENST00000651505")

If you want you can change the fill color of the exon icons:

# prinicpal1 transcript
bb_plot_trace_model(e4_PRKCD_trace, icon_fill = "orange")

Plotting Peaks

This is relatively straightforward:

bb_plot_trace_peaks(zf_prkcda_trace)

Plotting Links

This function plots bezier curves between the center of linked peaks. The color in the figure indicates the link score.

bb_plot_trace_links(zf_prkcda_trace)

In this case, the color scale could be scaled differently to show the differences more clearly.

bb_plot_trace_links(zf_prkcda_trace, link_range = c(0.4,0.8))

You can also change the colors.

bb_plot_trace_links(zf_prkcda_trace, 
                    link_range = c(0.4,0.8), link_low_color = "purple", link_high_color = "green")

If there are link scores outside the range, they will be shown in dark grey which is a default for ggplot:

bb_plot_trace_links(zf_prkcda_trace, 
                    link_range = c(0.5,0.8), 
                    link_low_color = "purple", 
                    link_high_color = "green")

You can also trim those using the cutoff parameter (default is 0, no cutoff)

bb_plot_trace_links(zf_prkcda_trace, 
                    link_range = c(0.5,0.8), 
                    link_low_color = "purple", 
                    link_high_color = "green", 
                    cutoff = 0.5)

Note: links that extend outside the plot range are also trimmed.

Plotting the Axis

The default shows the plot range in kb, indicating the genome and the chromosome:

bb_plot_trace_axis(zf_prkcda_trace)

You can add a custom axis title if you want. You can do this in the function or as a separate layer.

bb_plot_trace_axis(zf_prkcda_trace, xtitle = "My Custom Title")
bb_plot_trace_axis(zf_prkcda_trace) + labs(x = "Another Custom Title")

As we will see in the next section, putting these modifications within the function calls, rather than as separate layers makes the code look cleaner.

Putting the Plot Together

We use patchwork to compose the tracks into a full plot:

library("patchwork")
bb_plot_trace_data(zf_prkcda_trace) / 
bb_plot_trace_model(zf_prkcda_trace) / 
bb_plot_trace_peaks(zf_prkcda_trace) / 
bb_plot_trace_links(zf_prkcda_trace) / 
bb_plot_trace_axis(zf_prkcda_trace)

This needs some help, so we add a plot layout function call:

bb_plot_trace_data(zf_prkcda_trace) / 
bb_plot_trace_model(zf_prkcda_trace) / 
bb_plot_trace_peaks(zf_prkcda_trace) / 
bb_plot_trace_links(zf_prkcda_trace, link_range = c(0.4,0.8)) / 
bb_plot_trace_axis(zf_prkcda_trace) + 
plot_layout(heights = c(5,1,1,1,0.01), guides = "collect")

This looks pretty good, but the Coverage plot y axes are a little cramped at this size:

(bb_plot_trace_data(zf_prkcda_trace) + scale_y_continuous(breaks = c(0,20))) / 
bb_plot_trace_model(zf_prkcda_trace) / 
bb_plot_trace_peaks(zf_prkcda_trace) / 
bb_plot_trace_links(zf_prkcda_trace, link_range = c(0.4,0.8)) / 
bb_plot_trace_axis(zf_prkcda_trace) + 
plot_layout(heights = c(5,1,1,1,0.01), guides = "collect")

This looks better. Note that I have to enclose the ggplot layers in parentheses to keep from confusing patchwork. Not the end of the world.

It also works for bulk data:

bb_plot_trace_data(e4_PRKCD_trace) / 
bb_plot_trace_model(e4_PRKCD_trace) / 
bb_plot_trace_peaks(e4_PRKCD_trace) / 
bb_plot_trace_axis(e4_PRKCD_trace) + 
plot_layout(heights = c(3,1,1,0.01))

This is good, but the facet variable is weird. The "bulk" term is a placeholder that was introduced when I made the object. I can silence the facet variable by setting facet_var = NULL to make this go away.

bb_plot_trace_data(e4_PRKCD_trace, facet_var = NULL) / 
bb_plot_trace_model(e4_PRKCD_trace) / 
bb_plot_trace_peaks(e4_PRKCD_trace) / 
bb_plot_trace_axis(e4_PRKCD_trace) + 
plot_layout(heights = c(3,1,1,0.01))

Or I can put some meaningful information in the facet variable to serve as a helpful title:

dat <- Trace.data(e4_PRKCD_trace) |> 
  plyranges::mutate(group = "E4-HUVEC")
e4_PRKCD_trace_alt <- Trace.setData(e4_PRKCD_trace, dat)
bb_plot_trace_data(e4_PRKCD_trace_alt) / 
bb_plot_trace_model(e4_PRKCD_trace_alt) / 
bb_plot_trace_peaks(e4_PRKCD_trace_alt) / 
bb_plot_trace_axis(e4_PRKCD_trace_alt) + 
plot_layout(heights = c(3,1,1,0.01))