```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```r
library(locdiffr)
```
We will run a preliminary analysis with some toy Hi-C data. There are two replicates each from two conditions. If you have more replicates, or differing numbers of replicates from each condition, that's okay. The package will automatically select which replicates to retain based on sequencing depth.
To begin the analysis, you need file paths to all of the data, split into two groups. The data should be in the following tab-delimited format:
| Loc1 | Loc2   | Counts |
|------|--------|--------|
| 0    | 0      | 100    |
| 0    | 50000  | 88     |
| 0    | 100000 | 40     |
Here, Loc1 is the start of the first bin, Loc2 is the start of the second bin, and Counts is the read count for the interactions between those two bins. The value in Loc1 should always be less than or equal to the value in Loc2. An individual dataset is intended to be Hi-C data from an entire chromosome. Chromosomes should be analyzed separately.
Let's use the toy data pre-loaded into this package. There are four datasets: two from a knockout group and two from a parent group. These data are already in the correct format.
```r
# locate the data
all_data <- list.files(system.file("extdata", package = "locdiffr"))
print(all_data)
```
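If you want to confirm that your own files follow this layout, a quick check like the following can help. This is just a sketch, not part of the package workflow: it assumes tab-delimited files with a header row matching the table above.

```r
# Sketch only: peek at the first dataset to confirm the three-column
# Loc1 / Loc2 / Counts layout described above.
# Assumes a tab-delimited file with a header row (an assumption, not guaranteed).
example_file <- system.file("extdata", all_data[1], package = "locdiffr")
head(read.table(example_file, header = TRUE, sep = "\t"))
```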
Once we have defined file paths to two groups, the rest of the analysis is straightforward and proceeds as follows:
```r
# Define paths to data from the two conditions
ko_data <- system.file(
  "extdata",
  all_data[grepl(pattern = "KO", x = all_data)],
  package = "locdiffr"
)
parent_data <- system.file(
  "extdata",
  all_data[grepl(pattern = "parent", x = all_data)],
  package = "locdiffr"
)

# Compute the SCC scan statistics
run_scc_scan(
  infiles1 = ko_data,
  infiles2 = parent_data,
  outpath = "../inst/extdata/output/z.rds",
  resolution = 40000,
  winsizes = c(6, 12, 18),
  parallel = FALSE,
  offset = TRUE,
  min_count = 1,
  proportion_below_min_count = 0.25
)

# Fit the nearest-neighbor Gaussian process model via MCMC
fit_nngp(
  infile = "../inst/extdata/output/z.rds",
  outpath = "../inst/extdata/output/fit.rds",
  num_neighbors = 1,
  iters = 1000,
  parallel = FALSE
)

# Use posterior samples to make new draws from the nearest-neighbor
# Gaussian process, for hypothesis testing
sample_new_nngps(
  scc_scan_file = "../inst/extdata/output/z.rds",
  mcmc_fit_file = "../inst/extdata/output/fit.rds",
  outpath = "../inst/extdata/output/pred.rds",
  stationary_iterations = 200:1000,
  parallel = FALSE,
  BOOT = 100
)
```
Note that windows can be filtered during the `run_scc_scan` step using the `min_count` and `proportion_below_min_count` options. In this example, `min_count = 1` and `proportion_below_min_count = 0.25`, meaning that if more than 25% of the bins within a window have fewer than 1 measured interaction, the window is removed from downstream analyses. To disable filtering, set `proportion_below_min_count = 1`.
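To make the filtering rule concrete, here is a small illustration. This is not package code; the window counts are made up for the example.

```r
# Hypothetical read counts for the bins in one window (made-up numbers)
window_counts <- c(0, 3, 0, 7, 2, 0, 5, 1)
min_count <- 1
proportion_below_min_count <- 0.25

# Proportion of bins with fewer than `min_count` measured interactions
prop_below <- mean(window_counts < min_count)        # 3/8 = 0.375

# The window is filtered when this proportion exceeds the threshold
filtered <- prop_below > proportion_below_min_count  # TRUE: window is dropped
```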
Another note is that the `locdiffr` package currently only supports the exponential covariance function, so it is not necessary to use more than one neighbor. To see that this is a reasonable choice, check the partial autocorrelation function (PACF) of the transformed SCC sliding-window statistic. In this case, the partial autocorrelation mostly disappears after lag 1, supporting the choice of the exponential covariance function (and only 1 nearest neighbor!) on these data.
```r
library(magrittr)
z <- readRDS("../inst/extdata/output/z.rds") %>%
  purrr::map(1) %>%
  purrr::map("z_s")
pacfs <- purrr::map2(
  z, names(z),
  ~ forecast::ggPacf(.x, lag.max = 12) +
    ggplot2::ggtitle(paste0("window size = ", .y))
)
cowplot::plot_grid(plotlist = pacfs, nrow = 1, ncol = 3)
```
Results from the model fitting are used to test whether or not a region contains significantly differential interactions. To test, we need to compute $\theta^b(s_w) := \text{I}\big[Z^{b}(s_w) < \mu^b(s_w)\big]$, the indicator that the nearest-neighbor Gaussian process sampled at iteration $b$ at location $s_w$ is less than the mean process at iteration $b$ at the same location. This is done within the `sample_new_nngps` function. Note that testing with the recommended weighted false discovery exceedance (wFDX) quantity requires some bootstrapping. You can specify the number of bootstrap replicates with the `BOOT` argument of `sample_new_nngps`. The larger the value of `BOOT`, the more accurate your estimate of wFDX, at the cost of a slight increase in computational burden.
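For intuition, the indicator $\theta^b(s_w)$ can be computed from posterior draws roughly as follows. This is only a sketch, not the package's internal code; `Z_draws` and `mu_draws` are hypothetical iteration-by-location matrices.

```r
# Hypothetical posterior draws: rows are MCMC iterations b,
# columns are window locations s_w (made-up numbers for illustration)
set.seed(1)
Z_draws  <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)  # sampled NNGP Z^b(s_w)
mu_draws <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)  # mean process mu^b(s_w)

# theta^b(s_w) = I[ Z^b(s_w) < mu^b(s_w) ]
theta <- Z_draws < mu_draws

# Averaging over iterations b gives, for each location, the posterior
# probability that the process falls below its mean there
colMeans(theta)
```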
Output from `sample_new_nngps` is used to test for significantly differential regions based on either the weighted false discovery rate (wFDR) or the weighted false discovery exceedance (wFDX) criterion:
```r
# Compute rejections based on weighted false discovery rate
test_by_wFDR(
  scc_scan_file = "../inst/extdata/output/z.rds",
  sampled_nngps_file = "../inst/extdata/output/pred.rds",
  outpath = "../inst/extdata/output/wfdr.rds",
  alpha = 0.025
)

# Compute rejections based on weighted false discovery exceedance
test_by_wFDX(
  scc_scan_file = "../inst/extdata/output/z.rds",
  sampled_nngps_file = "../inst/extdata/output/pred.rds",
  outpath = "../inst/extdata/output/wfdx.rds",
  alpha = 0.025,
  beta = 0.025
)
```
You can easily use the testing output from `test_by_wFDR` and `test_by_wFDX` to make two BED files of all examined loci: one containing loci that lie in a rejected region, and another containing loci that do not. This is done with the function `make_BED_from_rejections`. In the toy example used in this vignette, we only look at one chromosome; in practice, you will have several. `infiles` is a vector of paths to testing output from several chromosomes, and `chromosomes` is a vector (integer or character string) listing all the chromosomes in the same order as the files in `infiles`. It is absolutely essential that the order of `infiles` matches the order of `chromosomes`, since this function writes BED files containing the three essential BED columns (chromosome, start, and stop), pairing each input file with the corresponding chromosome.
```r
make_BED_from_rejections(
  infiles = "../inst/extdata/output/wfdr.rds",
  chromosomes = 19,
  null_outfile = "../inst/extdata/output/null_wfdr.bed",
  rejected_outfile = "../inst/extdata/output/rejected_wfdr.bed",
  resolution = 40000
)
```
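With several chromosomes, the call would look something like the sketch below. The per-chromosome file names and output paths are hypothetical; the important point is that `infiles` and `chromosomes` are given in the same order.

```r
# Sketch only: hypothetical per-chromosome testing output files
make_BED_from_rejections(
  infiles = c("output/wfdr_chr18.rds",
              "output/wfdr_chr19.rds"),
  chromosomes = c(18, 19),             # same order as infiles
  null_outfile = "output/null_wfdr.bed",
  rejected_outfile = "output/rejected_wfdr.bed",
  resolution = 40000
)
```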
You can easily view the data and the output of the transformed SCC scan, verify that the model fitting is proceeding as expected, and check rejection locations with `plot_cond_vs_cond` and `plot_rejections_along_process`.
`plot_cond_vs_cond` shows a Hi-C heatmap in which each half displays data from one condition. You can specify the `sub_range` argument to zoom in on a given region. NOTE: this function downsamples the datasets to equal read counts from each condition so that comparisons are made on the same scale.
```r
cond_vs_cond <- plot_cond_vs_cond(
  infiles1 = ko_data,
  infiles2 = parent_data,
  resolution = 40000,
  condition_names = c("knockout", "parent")
)
cond_vs_cond_zoom <- plot_cond_vs_cond(
  infiles1 = ko_data,
  infiles2 = parent_data,
  resolution = 40000,
  condition_names = c("knockout", "parent"),
  sub_range = c(4800000, 10000000)
)
cowplot::plot_grid(cond_vs_cond, cond_vs_cond_zoom, nrow = 1, ncol = 2)
```
`plot_rejections_along_process` shows the transformed SCC scan statistic with rejected loci marked. In the plot below, z1 and z2 are the observed data. z_star is the average over all of the newly sampled nearest-neighbor Gaussian processes, which we expect to take values close to the pointwise average of the observed processes. The mean function is the average mean function across all MCMC iterations. Rejections based on the wFDR and wFDX criteria are marked along the process.
```r
p <- plot_rejections_along_process(
  scc_scan_file = "../inst/extdata/output/z.rds",
  mcmc_fit_file = "../inst/extdata/output/fit.rds",
  sampled_nngps_file = "../inst/extdata/output/pred.rds",
  rejection_files = c("../inst/extdata/output/wfdr.rds",
                      "../inst/extdata/output/wfdx.rds"),
  rejection_names = c("wFDR", "wFDX")
)
cowplot::plot_grid(plotlist = p, nrow = 3, ncol = 1)
```
Likewise, you can see where these rejections fall along the Hi-C contact matrix with `plot_rej_vs_diffs`. This function plots the absolute differences between pooled replicates from both conditions against the rejected regions. As with `plot_cond_vs_cond`, you can use the `sub_range` argument to zoom in on a given location.
```r
diff_v_rej <- plot_rej_vs_diffs(
  infiles1 = ko_data,
  infiles2 = parent_data,
  rejections_file = "../inst/extdata/output/wfdr.rds",
  resolution = 40000,
  condition_names = c("knockout", "parent")
)
diff_v_rej_zoom <- plot_rej_vs_diffs(
  infiles1 = ko_data,
  infiles2 = parent_data,
  rejections_file = "../inst/extdata/output/wfdr.rds",
  resolution = 40000,
  condition_names = c("knockout", "parent"),
  sub_range = c(0, 4000000)
)
cowplot::plot_grid(diff_v_rej, diff_v_rej_zoom, nrow = 1, ncol = 2)
```