knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.height = 5, fig.width = 6, fig.align = "center" )
In this vignette, we demonstrate the workflow for finding and visualizing differential loops between two biological conditions using the hictoolsr
, DESeq2
, and plotgardener
packages. First, data pre-processing is required to obtain .hic
files for each biological replicate and condition along with looping interactions. Loops are then merged between conditions and interaction frequency counts are extracted between loop anchors for each condition and replicate. Differential loops can then be called with DESeq2
and the results can be visualized as aggregate peak analysis (APA) plots.
As an example, we will use data from the paper, "Phase separation drives aberrant chromatin looping and cancer development", by Ahn et al. 2021 using GEO links GSE143465 and GSE143465. This paper explores the oncogenic mechanism of a rare fusion protein in acute myeloid leukemia (AML). The fusion protein, NUP98-HOXA9 (NHA9), contains a DNA-binding domain fused to an intrisically disordered region (IDR) which forms phase-separated condensates and leads to changes in 3D chromatin structure and deregulated gene expression. To explore how phase separation leads to changes in chromatin structure the IDR of NHA9 was made incapable of phase separation by mutating phenylalanine (F) amino acid residues to serines (S), then expressed in HEK293T cells (FS mutant) along with the wildtype (WT). Hi-C, ChIP-seq, and RNA-seq was performed in both WT and FS cell lines to compare chromatin structure, NHA9 binding, and gene expression in response to phase separation.
Here, we demonstrate how to find differential loops between the WT and FS biological conditions and visualize the results with aggregate peak analysis plots.
Before we can find differential loops, we must first process raw .fastq
files into .hic
files and identify significant looping interactions. These are pre-processing steps that are conducted outside of hictoolsr
. The following sections outline how to process and call loops from Hi-C data using available tools.
dietJuicer
Raw Hi-C fastqs can be converted into .hic
format using the dietJuicer
pipeline. The data should be processed at two levels:
.hic
files for each. This is necessary for count extraction used as input to DESeq
. Note the exception that sequencing replicates should always be merged, even for count extraction. Follow the directions under the dietJuicerCore
workflow to process these samples..hic
file. These .hic
files are used as input for loop calling. Follow the directions under the dietJuicerMerge
workflow to create merged .hic maps.After creating merged .hic
files with dietJucierMerge
, SIP
(Significant Interaction Peak caller) can be used to identify looping interactions.
Usage:
java -jar SIP_HiC.jar hic <hicFile> <chrSizeFile> <Output> <juicerToolsPath> [options]
Example (submitting job to UNC's longleaf cluster):
sbatch -p general -t 4320 --mem=8G --wrap="java -jar /proj/phanstiel_lab/software/SIP/SIP_HiC_v1.6.1.jar hic /path/to/file.hic /proj/phanstiel_lab/software/resources/hg19_chromSizes_filt.txt /path/to/output/directory /proj/phanstiel_lab/software/juicer/scripts/juicer_tools.jar -g 2.0 -t 2000 -fdr 0.05"
Another loop caller, HiCCUPS, was used to call loops in the data at GSE143465 from Ahn et al. 2021. Since merging functions in hictoolsr
require columns that are only generated through SIP
, loops called with SIP
are included as example data in hictoolsr
.
We called loops in both conditions to identify loops that are unique to each dataset. However, there are often duplicate loops that are present in both datasets. It is important to merge these together to avoid testing duplicate loops and to catch all unique loops. The mergeBedpe()
function does this, using DBSCAN
to combine duplicate loops that are shifted slightly between conditions.
## Load packages library(hictoolsr) library(dbscan) ## Define WT and FS loop file paths wt_loops <- system.file("extdata/WT_5kbLoops.txt", package = "hictoolsr") fs_loops <- system.file("extdata/FS_5kbLoops.txt", package = "hictoolsr") ## Merge loops and convert to GInteractions loops <- mergeBedpe(bedpeFiles = c(wt_loops, fs_loops), res = 10e3) |> as_ginteractions() head(loops)
DESeq2
requires a counts table from replicate .hic
files to call differential loops. The code below shows how to extract these counts remotely using the GEO links to each replicate .hic
file. Alternatively, these files can be downloaded and the paths to each file can be supplied (recommended due to internet instability).
## Hi-C file paths from GEO hicFiles <- c("https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259896/suppl/GSM4259896_HEK_HiC_NUP_IDR_WT_A9_1_1_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259897/suppl/GSM4259897_HEK_HiC_NUP_IDR_WT_A9_1_2_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259898/suppl/GSM4259898_HEK_HiC_NUP_IDR_WT_A9_2_1_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259899/suppl/GSM4259899_HEK_HiC_NUP_IDR_WT_A9_2_2_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259900/suppl/GSM4259900_HEK_HiC_NUP_IDR_FS_A9_1_1_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259901/suppl/GSM4259901_HEK_HiC_NUP_IDR_FS_A9_1_2_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259902/suppl/GSM4259902_HEK_HiC_NUP_IDR_FS_A9_2_1_inter_30.hic", "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM4259nnn/GSM4259903/suppl/GSM4259903_HEK_HiC_NUP_IDR_FS_A9_2_2_inter_30.hic") ## Extract Hi-C counts between loop pixels loopCounts <- extractCounts(bedpe = loops, hic = hicFiles, chroms = c(1:22, "X"), res = 10e3, norm = 'NONE', matrix = 'observed')
Since extracting counts can take some time, an example dataset has been packaged with pre-extracted counts for the code shown above.
data("loopCounts")
Extract counts takes the basename of the filepath that is supplied as the column. Lets simplify the column names:
## Load package library(InteractionSet) ## Simplify column names colnames(mcols(loopCounts)) <- gsub(pattern = "GSM.*_IDR_(WT|FS)_A9_(1|2)_(1|2)_.*", replacement = "\\1_\\2_\\3", x = colnames(mcols(loopCounts))) head(loopCounts)
The following code demonstrates how to call differential loops with DESeq2
using the loopCounts
object created above. First we isolate the count matrix from loop counts:
## Load package library(DESeq2) ## Isolate count matrix cnts <- mcols(loopCounts)[grep("WT|FS", colnames(mcols(loopCounts)))] |> as.matrix() head(cnts)
Then we create column data using the column names from the counts matrix:
## Create colData from column names colData <- do.call(rbind, strsplit(x = colnames(cnts), split = "_")) |> as.data.frame(stringsAsFactors = TRUE) |> `colnames<-`(value = c("condition", "biorep", "techrep")) colData
Next we can build a DESeq data set and compare differential loops between the "WT" and "FS" conditions:
## Build DESeq data set dds <- DESeqDataSetFromMatrix(countData = cnts, colData = colData, design = ~techrep + biorep + condition) ## Run DEseq analysis res <- DESeq(dds) |> lfcShrink(coef = "condition_WT_vs_FS", type="apeglm") summary(res)
We can explore the differential results with an MA-plot:
plotMA(res)
or a PCA plot:
plotPCA(vst(dds), intgroup = "condition") + ggplot2::theme(aspect.ratio = 1)
Let's add the differential output from DESeq2 back to our loopCounts
object, then separate WT-specific and FS-specific loops use a BH-adjusted pvalue of 0.01 and log2FoldChange above or below 0:
## Attach DESeq2 results mcols(loopCounts) <- cbind(mcols(loopCounts), res) ## Separate WT/FS-specific loops wtLoops <- loopCounts[loopCounts$padj <= 0.01 & loopCounts$log2FoldChange > 0] fsLoops <- loopCounts[loopCounts$padj <= 0.01 & loopCounts$log2FoldChange < 0] summary(wtLoops) summary(fsLoops)
Now that we have identified differential loops between "WT" and "FS" conditions we can perform an aggregate (or average) peak analysis (APA) to visualize the quality of loop calls. APAs are pileup plots of contact frequency matrices from Hi-C data surrounding the central loop pixels.
APAs extract and aggregate matrices around a Hi-C pixel at a given resolution (res
) and number of pixels in either direction (buffer
). For example, if you want to extract a 21x21 matrix at 10-kb resolution you should set res = 10e3
and buffer = 10
. "Short" interactions that are too close to the diagonal must be filtered out to avoid aggregation errors. The filterBedpe()
function calculates which interactions would intersect the diagonal and filters them out. In the code below, we assemble our loop categories into a list and apply this filtering for a resolution of 10-kb and a buffer of 10:
## Assemble all, WT, and FS loops into a list loopList <- list(allLoops = loopCounts, wtLoops = wtLoops, fsLoops = fsLoops) ## Define resolution and buffer (pixels from center) res <- 10e3 buffer <- 10 ## Filter out short loop interactions filteredLoops <- lapply(X = loopList, FUN = filterBedpe, res = res, buffer = buffer) |> `names<-`(value = names(loopList)) lapply(filteredLoops, summary)
You will notice that many of these interactions are close to the diagonal and are filtered out. The next block of code demonstrates how to apply the calcApa()
function to the list of filtered loops to extract and aggregate counts from the replicate-merged Hi-C files of the "WT" and "FS" conditions:
## Hi-C file paths from GEO wtHicPath <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE143nnn/GSE143465/suppl/GSE143465_HEK_HiC_NUP_IDR_WT_A9_megaMap_inter_30.hic" fsHicPath <- "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE143nnn/GSE143465/suppl/GSE143465_HEK_HiC_NUP_IDR_FS_A9_megaMap_inter_30.hic" ## Calculate APA matrices for loops from WT Hi-C data loopApaWtHic <- lapply(X = filteredLoops, FUN = calcApa, hic = wtHicPath, norm = "KR", buffer = buffer) ## Calculate APA matrices for loops from FS Hi-C data loopApaFsHic <- lapply(X = filteredLoops, FUN = calcApa, hic = fsHicPath, norm = "KR", buffer = buffer)
Since calculating the APA matrices can take some time, an example dataset has been packaged with pre-computed APA matrices for the code shown above.
data("loopApaWtHic") data("loopApaFsHic") lapply(loopApaWtHic, dim) lapply(loopApaFsHic, dim)
The last step before visualizing these matrices is to normalize the summed values to the number of loops in each category so that the interpretation becomes the average signal per loop. This will also put the plots on a similar scale for visualizing together.
## Get the number of loops for each condition nLoops <- lapply(filteredLoops, length) ## Divide each matrix by nLoops loopApaWtHic <- Map("/", loopApaWtHic, nLoops) loopApaFsHic <- Map("/", loopApaFsHic, nLoops)
ggplot2
To visualize the results in ggplot2
we must first convert the matrix to long format.
## Convert matrix to long-format long <- loopApaWtHic$allLoops |> as.table() |> as.data.frame() |> setNames(c('rows', 'cols', 'counts')) ## Visualize with ggplot2 library(ggplot2) ggplot(data = long, mapping = aes(x = rows, y = cols, fill = counts)) + scale_fill_distiller(palette = 'YlGnBu', direction = 1) + geom_tile() + theme(aspect.ratio=1, axis.text.x = element_text(angle = 45, hjust=1))
Notice that we can flip the matrix by using rev()
on either the rows or columns to change the orientation of the Hi-C diagonal:
## Flip the matrix library(ggplot2) ggplot(data = long, mapping = aes(x = rev(rows), y = cols, fill = counts)) + scale_fill_distiller(palette = 'YlGnBu', direction = 1) + geom_tile() + theme(aspect.ratio=1, , axis.text.x = element_text(angle = 45, hjust=1))
We can apply this to all matrices in our list and combine the datasets into a "tidy" data form for visualizing with ggplot2
:
## Define function to convert a matrix to long format toLong <- \(x) { x |> as.table() |> as.data.frame() |> setNames(c('rows', 'cols', 'counts')) } ## Apply function to convert all matrices to long format apas <- list(WT = lapply(loopApaWtHic, toLong), FS = lapply(loopApaFsHic, toLong)) ## Add loopType to each data.frame and combine apas <- lapply(apas, \(x) do.call(rbind, Map(cbind, x, loopType = names(x)))) ## Add hicMap to each data.frame and combine apas <- do.call(rbind, Map(cbind, apas, hicMap = names(apas))) ## Reorder factors apas$loopType <- factor(x = apas$loopType, levels = c("allLoops", "wtLoops", "fsLoops")) apas$hicMap <- factor(x = apas$hicMap, levels = c("WT", "FS")) ## Visualize with ggplot2 library(ggplot2) ggplot(data = apas, mapping = aes(x = rows, y = cols, fill = counts)) + scale_fill_distiller(palette = 'YlGnBu', direction = 1) + facet_grid(hicMap~loopType, scales = "free") + geom_tile() + theme(aspect.ratio=1, axis.text.x = element_text(angle = 45, hjust=1))
plotgardener
plotgardener
is a genomics plotting package that allows for more flexibility than ggplot2
. As part of the plotgardener
ecosystem, hictoolsr
provides a plotApa()
function that is compatible with other plotgardener
functions. Additionally, plotApa()
can works on matrices and doesn't first require converting to long-format.
Here is an example of quickly visualizing with plotApa()
using a palette from RColorBrewer
:
library(RColorBrewer) plotApa(apa = loopApaWtHic$allLoops, palette = colorRampPalette(brewer.pal(9, 'YlGnBu')))
By supplying positional information (e.g. x
, y
, width
, height
, etc...) plotgardener
will switch to multi-figure mode and allow multiple plot arrangements on a pgPage
. Let's visualize all APA results using functions from hictoolsr
and plotgardener
:
library(plotgardener) library(purrr) ## Initiate plotgardener page pageCreate(width = 4.25, height = 3) ## Define shared parameters p <- pgParams(x = 0.5, y = 0.5, width = 1, height = 1, space = 0.075, zrange = c(0, max(unlist(c(loopApaWtHic, loopApaFsHic)))), palette = colorRampPalette(brewer.pal(9, 'YlGnBu'))) ## Define grid of coordinate positions xpos <- c(p$x, p$x + p$width + p$space, p$x + (p$width + p$space)*2) ypos <- c(p$y, p$y + p$height + p$space, p$y + (p$height + p$space)*2) ## Plot row of WT APAs wt_plots <- pmap(list(loopApaWtHic, xpos, ypos[1]), \(a, x, y) { plotApa(params = p, apa = a, x = x, y = y) }) ## Plot row of FS APAs fs_plots <- pmap(list(loopApaFsHic, xpos, ypos[2]), \(a, x, y) { plotApa(params = p, apa = a, x = x, y = y) }) ## Add legend annoHeatmapLegend(plot = wt_plots[[1]], x = p$x + (p$width + p$space)*3, y = ypos[1], width = p$space, height = p$height*0.75, fontcolor = 'black') ## Add text labels plotText(label = c("All loops", "WT loops", "FS loops"), x = xpos + p$width / 2, y = ypos[1] - p$space, just = c('center', 'bottom')) plotText(label = c("WT", "FS"), x = xpos[1] - p$space, y = ypos[1:2] + p$height / 2, rot = 90, just = c('center', 'bottom')) ## Remove Guides pageGuideHide()
As you can see, in some ways plotgardener
can be more verbose, but this comes with the added flexibility to specificy exactly where and how you would like to visualize your genomic data. Just remember, with freedom comes great responsibility!
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.