In cqgd/pky: Capture Hi-C calling algorithm

Purpose

Peaky maps physical contacts between different regions of DNA by recognizing peaks in a Capture Hi-C (CHi-C) signal for a specific region of interest. It accounts for technical and biological noise, and subsequently finds regions around which the residual signal peaks and then decays with distance. The algorithm finds sets of regions at which peaks of varying strengths combinedly explain the normalized CHi-C signal. The probability of these regions contacting a region of interest is then quantified.

Input

Peaky requires two input files. The first contains the raw CHi-C signal: readcounts (N, must be >=1) corroborating interactions between regions of interest (baits) and their putative interaction partners (preys). The second describes the physical location of each region (fragment) in the genome. These will be referred to as the interactions file and the fragments file, respectively.

An example of each is included with the package and used for the analysis below:

library(peaky)

base = system.file("extdata",package="peaky")

interactions_file = paste0(base,"/counts.tsv")
fragments_file = paste0(base,"/fragments.bed")

interactions = data.table::fread(interactions_file)
fragments = data.table::fread(fragments_file)

print(head(interactions))
print(head(fragments))

Initial normalization of counts

Peaky will initially adjust the readcounts for each pair of fragments based on the linear genomic distance between them, their lengths, and transchromosomal bait activity. The effects these characteristics have on the readcounts can vary between short and long-range interactions, and they are therefore estimated multiple times from samples of interactions spanning various ranges. The first step is thus to bin interactions based on distance using bin_interactions():

BI = bin_interactions(interactions, fragments, bins=5)
print(BI$bins)

The 5 distance bins created above all contain an equal number of interactions. Note that the linear genomic distance between two fragments is not defined when they are from two different chromosomes. Transchromosomal interactions (bottom row of the table) are therefore not assigned a distance bin.

Models that predict readscounts under the null (i.e. in the case where an interaction is not biologically significant) can then be generated for each distance bin based on a random sample of the interactions they contain using model_bin():.

models = by(BI$interactions, BI$interactions$dist.bin, model_bin, subsample_size=1000)

plot(models[[1]]$fit)

Taking the observed readcounts and adjusting for the effects of biological and technical noise in the models gives us normalized readcounts: they quantify the extent to which the observed readcounts exceed those predicted under the null. In practice, the normalized readcounts are the randomized quantile residuals of each model. Note that the Q-Q plot shows a series of outliers: unexpectedly high adjusted read counts.

The adjusted readcounts for all putative interaction are extracted from the models below. They are tacked onto the interactions, which are all grouped together again using split_baits(). Note that the p-values calculated here mark the significance of the observed readcounts before local patterns across fragments are considered (such patterns will be considered later, using peaky()).

residuals = lapply(models, "[[", "residuals")
bins = split(BI$interactions, BI$interactions$dist.bin)
BTS = split_baits(bins, residuals)

Plotting the adjusted readcounts associated with a single bait (up to 1Mb around it) reveals that there is still an autocorrelated signal:

relevant_bait = BTS[baitID==618421]
zoom = relevant_bait[abs(relevant_bait$dist)<1e6]
plot(x=zoom$dist, 
     y=zoom$residual, 
     xlab="Distance from bait (bp)", 
     ylab="Adjusted readcount",
     main=paste("Bait",unique(zoom$baitID)))

A central assumption of Peaky is that this signal stems from few direct contacts, regions above which the CHi-C signal peaks. The signal then decays amongst neighboring fragments, with which contacts are merely collateral.

Joint analysis of neighboring fragments

Next, Peaky identifies sets of direct contacts that explain the peaks in the adjusted readcounts. It takes a single bait and proposes direct contacts with several of its prey fragments simultaneously, at varying strengths.

To determine where direct contacts occur, Peaky must be told what a peak in the adjusted readcounts looks like: how quickly would the signal decay as we consider fragments further and further away from a fragment that is directly contacted? The steepness of the expected decay is set via the parameter $\omega$:

\begin{equation} y = \beta \exp{(-|\omega * d|)} \end{equation}

where $y$ is the contribution of a specific peak (direct contact) to the adjusted readcount we observe, $\beta$ represents peak height and $d$ the distance from the center of the peak in bp.

We can look at several isolated peaks to see what would be an appropriate choice of $\omega$. For example, from the decay seen around the highest peak of the bait above, we can estimate that $\omega \approx 10^{-5}$:

#Isolate the highest peak and adjusted readcounts within 40 kbp
peak_center = zoom$dist[which.max(zoom$residual)]
zoom$distance_from_peak = abs(zoom$dist-peak_center)
single_peak = zoom[distance_from_peak<4e5,]

#Define the distance function and find the best fit for omega
dist_exp = function(offset, strength, omega_power){
  omega = 10^omega_power
  strength * exp(-(abs(offset * omega)))
}

fit_omega = nls(residual~dist_exp(distance_from_peak,strength,omega_power), 
    data=single_peak, 
    start=list(strength=5, omega_power=-3))

{
plot(single_peak$distance_from_peak, single_peak$residual, 
     main=paste0("Decay of signal for an isolated peak\nEstimate of omega = 10^",
                round(coefficients(fit_omega)["omega_power"],3)),
     xlab="Distance from center of peak (bp)",
     ylab="Adjusted readcount")
lines(single_peak$distance_from_peak, fitted.values(fit_omega), col="red", lwd=3)
}

At this point, Peaky is ready to determine where peaks, or direct contacts, occur in the signal for the bait above. It does so iteratively, by proposing different sets (i.e. different models) of hypothetical peaks, and settling more often on more likely sets (through reversible-jump Markov chain Monte Carlo estimation). The number of sets proposed is controlled by the iterations parameter of peaky(). It is recommended to increase the value until the results (e.g. MPPC, see below) are highly correlated between different runs of peaky() (note that its output is not deterministic). Raw models and their likelihoods are stored in PKS, but these do not have to be inspected manually (although this is possible, e.g. to remove burn-in models or to construct trace plots). Instead, summary statistics can be calculated with interpret_peaky(). For each fragment X, these are stored in the following columns:

rjmcmc: Proportion of sets featuring a peak at X
rjmcmc_pos: Proporion of sets featuring an upward peak at X
- referred to as the marginal posterior probability of contact (MPPC) in the main paper
rjmcmc_neg: Proportion of sets featuring a downward peak at X
rjmcmc_pos_present: Proportion of peaks at X which were upward
rjmcmc_neg_present: Proportion of peaks at X which were downward
beta: Median contact strength at X across all sets (includes values of 0 whenever there was no peak at X)
beta_mean: Mean contact strength at X across all sets (includes values of 0 whenever there was no peak at X)
beta_present: Median contact strength at X whenever a peak was present
beta_mean_present: Mean contact strength at X whenever a peak was present
predicted: Reconstructed signal from all peaks, based on beta_mean
predicted_present: Reconstructed signal from all peaks, based on beta_mean_present
- can be undefined when beta_mean_present cannot be calculated, i.e. when a fragment has never been considered as a peak location

relevant_bait = BTS[baitID==618421]
omega_power = -5
PKS = peaky(relevant_bait, omega_power, iterations=1e6)
P = interpret_peaky(relevant_bait, PKS, omega_power)

par(mfrow=c(3,1))
zoom = P[abs(P$dist)<1e6]
plot(x=zoom$dist, xlab="Distance from bait (bp)",
     y=zoom$residual, ylab="Adjusted readcount")

plot(x=zoom$dist, xlab="Distance from bait (bp)",
     y=zoom$beta_mean, ylab="Mean contact strength",
     col="green")

plot(x=zoom$dist, xlab="Distance from bait (bp)",
     y=zoom$rjmcmc_pos, ylab="MPPC",
     col="blue")

While many fragments originally had high (adjusted) readcounts, the peaks in the CHi-C signal can be explained by only 3-4 direct contacts, and MPPC values jump up where these occur.

Filesystem version

To make the software more convenient for use on a cluster, Peaky has wrapper functions that interact directly with the filesystem. These wrappers generally take integer arguments specifying which inputs (out of some directory) to work with, and save results directly to disk.

The workflow is similar to (and arguably simpler than) that of the interactive version, and allows for parallelization at the following computationally intensive steps:

Removing initial noise (going from raw readcounts to adjusted readcounts) with model_bin_fs
Separating direct from collateral contacts with peaky_fs()

These two functions are demonstrated within for-loops in the example below, but would normally be run in parallel (e.g. via a scheduler).

base = system.file("extdata",package="peaky")
interactions_file = paste0(base,"/counts.tsv")
bins_dir = paste0(base,"/bins")
fragments_file = paste0(base,"/fragments.bed")

bin_interactions_fs(interactions_file, fragments_file, output_dir=bins_dir)

fits_dir = paste0(base,"/fits")

for(bin_index in 1:5){
  model_bin_fs(bins_dir,bin_index,output_dir=fits_dir,subsample_size=1000)
}

baits_dir = paste0(base,"/baits")

split_baits_fs(bins_dir,residuals_dir = fits_dir, indices=1:5, output_dir = baits_dir)

baitlist = paste0(baits_dir,"/baitlist.txt")
rjmcmc_dir = paste0(base,"/rjmcmc")
omega_power = -5

for(i in 1:1){
 peaky_fs(baitlist,i,output_dir=rjmcmc_dir,omega_power=omega_power)
}

rjmcmc_list(rjmcmc_dir)

rjmcmclist = paste0(rjmcmc_dir,"/rjmcmclist.txt")
baits_rjmcmc_dir = paste0(base,"/baits_rjmcmc")
interpret_peaky_fs(rjmcmclist,1,baits_dir,baits_rjmcmc_dir,omega_power)