In FrietzeLabUVM/ssvQC: QC for enrichment based NGS

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Objective

To illustrate the uses of ssvQC at a basic, intermediate, and advanced level.

This will be done using a variety of different datasets to demonstrate the flexibility of ssvQC.

1) Basic replicate comparison 2) Reprodicible peaks comparison between two different marks.

library(ssvQC)
SQC_OPTIONS$SQC_FORCE_CACHE_OVERWRITE = TRUE

Simple Replicate Comparison

The first task is to locate the files we need. ssvQC requires two types of files "signal" (bam or bigwig) and "feature" (any bed-like file with special support for narrowPeak and other common types). UCSC has an excellent resource for learning about different file types.

The code below also demonstrates basic usage of dir() coupled with Regex. Regex is shorthand for Regular Expression and it's a specialized syntax for matching text. Regex below is the odd looking text passed to dir as the "pattern" argument. Regex gets used again in grepl(). Here are a couple good Regex resources.

in_dir = "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7.repeak/"
data_paths = dir(in_dir, pattern = "H4K[5].+rep", full.names = TRUE)
bam_files = dir(data_paths, pattern = "bam$", full.names = TRUE)

np_files = dir(data_paths, pattern = ".+narrowPeak$", full.names = TRUE)
np_files = np_files[!grepl("loose", np_files)]

With bam_files for our "signal" and narrowPeak files for our "features" we can create a ssvQC object which will link the two together.

sqc1 = ssvQC(np_files, bam_files)

As we'll see in the next section, this isn't a very good ssvQC configuration but it will work, especially for this simplest case simply comparing 2 replicates.

Note the use of the $ accessor. This is the primary intended way to access, meaning get or set, data in ssvQC object.

plot(sqc1$signal_config)
plot(sqc1$features_config)

The names are rather long and don't match between the signal and config files. The colors are also automatically selected but work fine with just 2 color groups like this.

Regardless, we can run ssvQC with this data just fine. We'll start slow and just look at feature overlaps.

ssvQC has 2 families of functions ssvQC.prep and ssvQC.plot. In general you only want to run ssvQC.prep steps once and the results may take some time to calculate so by default they are cached. Meaning, if you run an ssvQC.prep step again using the same signal and feature files the previous results are simply loaded. Feature selection uses random sampling so it's important to use set.seed() prior to processing data to benefit from caching and is just generally a good idea for reproducibility.

Most ssvQC.plot functions will run the matching ssvQC.prep function as needed so it isn't actually necessary to call ssvQC.prep* independently as we do here.

set.seed(0)
#NEED TO TWEAK DEFAULT BEHAVIOR SO THIS IS SIMPLER
sqc1$features_config$run_by = "All" #default would run the 2 sets of peaks separately, this will overlap and compare together
sqc1$features_config$to_run = "All"
sqc1$features_config$meta_data$rep = c("rep1", "rep2")
sqc1$features_config$color_by = "rep" 
sqc1$features_config$n_peaks = 100
sqc1$features_config$consensus_fraction
sqc1$features_config$consensus_n

sqc1$signal_config$meta_data$rep = c("rep1", "rep2")
sqc1$signal_config$to_run = "All"
sqc1$signal_config$color_by = "rep" #default would run the 2 sets of peaks separately, this will overlap and compare together

undebug(ssvQC:::prepFeatures)
sqc1 = ssvQC.prepFeatures(sqc1) #ssvQC.plotFeatures would call this automatically.
sqc1 = ssvQC.plotFeatures(sqc1)

This doesn't actually output a plot. All plots are stored in the "plots" slot of the ssvQC object. Again we use the $ accessor.

Feature plots

sqc1$plots$features$count$All_features
sqc1$plots$features$binary_heatmap$All_features
sqc1$plots$features$UpSet$All_features
sqc1$plots$features$venn$All_features
sqc1$plots$features$euler$All_features

This kind of "feature" analysis actually doesn't rely on "signal" files at all. It would have been valid to create the ssvQC object with only the feature files and we could have generated the same plots.

ssvQC can do a lot more than this though. With signal files we can perform clustering and generate heatmaps, etc. Especially with bam files, as opposed to bigwigs, there are lots of QC metrics we can generate. The function ssvQC.runAll will perform all possible analyses on our ssvQC object.

runAll

SQC_OPTIONS$mc.cores = 10 #equivalent to `options(mc.cores = 10)`
sqc1 = ssvQC.runAll(sqc1)

This has called a series of ssvQC.plot (and their supporting ssvQC.prep) functions.

The following sets of plots are now available:

names(sqc1$plots)

Heatmap

First let's look at the average read pileup across assessed peaks.

sqc1$plots$signal$lines$All_features$All_signal

Now the signal clustered:

cowplot::plot_grid(
  sqc1$plots$signal$heatmaps$All_features$All_signal,
  sqc1$plots$signal$heatmaps.lines$All_features$All_signal
)

The default is to plot using RPM values and scale the plot to 0-10. We can tweak a couple setting that impact the color scale limits of the heatmap and y-axis limits of the line plots.

sqc1$signal_config$lineplot_free_limits = FALSE
sqc1$signal_config$plot_value = SQC_SIGNAL_VALUES$raw
sqc1$signal_config$cluster_value = SQC_SIGNAL_VALUES$raw
sqc1$signal_config$sort_value = SQC_SIGNAL_VALUES$raw

sqc1$signal_config$heatmap_limit_values = c(0, 50)

sqc1 = ssvQC.plotSignal(sqc1)

cowplot::plot_grid(
  sqc1$plots$signal$heatmaps$All_features$All_signal,
  sqc1$plots$signal$heatmaps.lines$All_features$All_signal
)

See how the raw value heatmap show what might look like differences between the reps? This is due to the read depth being substanitally different. We can see this in the reads plot.

sqc1$plots$reads$All_signal

SCC

cowplot::plot_grid(ncol = 2,
                   sqc1$plots$SCC$curves$All_features$All_signal,
                   sqc1$plots$SCC$dots$All_features$All_signal
)

Neither peak set has peaks with high correlation at read length. This is good! That kind of peak is typically an artifact from over-amplification or just mapping to artifact prone regions of the genome (i.e. peri-centromere, rDNA, other repetitive or high copy-number regions).

The dots plot shows that the peaks have a spread of correlations. This visualization can help reveal the presence of peaks that should be removed.

FRIP

sqc1$plots$reads$All_signal
sqc1$plots$features$count$All_features
sqc1$plots$FRIP$reads_per_peak
sqc1$plots$FRIP$per_peak

The FRIP (Fraction of Reads in Peaks) series of plots assesses the strenght of the signal enrichment. Keep in mind we're only looking at a small sampling of peaks in this case so the total FRIP will be lower than expected. If the FRIP per peak plots show rough equivalence, that suggests the samples are of similar quality to one another despite differences in read depth. If that's true, and the read count per peak plots show an imbalance, that strongly indicates that more reads would be a benefit to the weaker replicate.

Batch Replicate Comparison

For this next example we'll be working with more of the same dataset. Where before we only considered the H4K5AC data, here we'll look at 2 addtional H4 marks and ATAD2B, a transcription factor.

in_dir = "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7.repeak"
data_paths = dir(in_dir, pattern = "((H4K[5812]+)|(ATAD)|(input)).+rep", full.names = TRUE)
data_paths

bam_files = dir(data_paths, pattern = "bam$", full.names = TRUE)

np_files = dir(data_paths, pattern = ".+narrowPeak$", full.names = TRUE)
np_files = np_files[!grepl("loose", np_files)]
np_files = c(np_files,
             "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7/MCF7_ATAD2B_rep1/MCF7_ATAD2B_rep1_peaks.narrowPeak")

The following block illustrates one method for controlling how these files get handled and outputs get organized.

We can derive "mark" and "rep" information from the file names provided and add these to the meta_data. This is currently rather clunky and may be improved later in ssvQC.

Once added to the meta_data, we can specify "run_by" and "color_by" as these two variables respectively. "run_by" provides grouping to the data for peak overlaps and when producing various plots. Likewise, "color_by" decides how items will be colored.

While "run_by" is shared in the signal and feature configurations, it's impact is a bit different. For feature configurations it compares how feature sets are split up for overlap comparison. For each of these feature sets, groups specified by "run_by" in the signal configuration will be analyzed.

"to_run" determines which items in "run_by" will actually be run. Here we will not run items with the "mark" value of "input", since we have no peaks called on the input samples alone. We will however specify "input" at the "to_run_reference" which will include input signal in every appropriate feature group analysis.

sig_config = QcConfigSignal.files(bam_files)
sig_config$meta_data$mark = sapply(strsplit(sig_config$meta_data$group, "[_\\.]"), function(x)x[2])
sig_config$meta_data$rep = sapply(strsplit(sig_config$meta_data$group, "[_\\.]"), function(x)x[3])
sig_config$run_by = "mark"
sig_config$color_by = "rep"
sig_config$to_run = setdiff(sig_config$meta_data[[sig_config$run_by]], "input")
sig_config$to_run_reference = "input"

# plot(sig_config)

feat_config = QcConfigFeatures.files(np_files, n_peaks = 200)
feat_config$meta_data$mark = sapply(strsplit(feat_config$meta_data$group, "[_\\.]"), function(x)x[2])
feat_config$meta_data$rep = sapply(strsplit(feat_config$meta_data$group, "[_\\.]"), function(x)x[3])
feat_config$run_by = "mark"
feat_config$color_by = "rep"
feat_config$to_run = setdiff(feat_config$meta_data[[feat_config$run_by]], "input")
feat_config$to_run_reference = character()
sqc2 = ssvQC(feat_config, sig_config)

sqc2$signal_config$cluster_value = SQC_SIGNAL_VALUES$RPM
sqc2$signal_config$sort_value = SQC_SIGNAL_VALUES$RPM
sqc2$signal_config$plot_value = SQC_SIGNAL_VALUES$RPM
sqc2$signal_config$lineplot_free_limits = TRUE
sqc2$signal_config$heatmap_limit_values = c(0, 5)
sqc2$signal_config$color_by = "mark"
sqc2$signal_config$color_mapping
sqc2$signal_config$color_mapping$input = "gray"
sqc2$signal_config$color_mapping$ATAD2B = "green"
sqc2$signal_config$color_mapping$H4K12AC = "red"
sqc2$signal_config$color_mapping$H4K5AC = "blue"
sqc2$signal_config$color_mapping$H4K8AC = "purple"

Because we've extensively tweaked these configuations, let's save them for later use.

QcConfigFeatures.save_config(sqc2$features_config, file = "qc_feature_config.csv")
QcConfigSignal.save_config(sqc2$signal_config, file = "qc_signal_config.csv")

Remember to run it!

sqc2 = ssvQC.runAll(sqc2)

Let's first repeat some plots from before.

cowplot::plot_grid(
  sqc2$plots$signal$heatmaps$H4K5AC_signal,
  sqc2$plots$signal$heatmaps.lines$H4K5AC_features$H4K5AC_signal
)

Now it's quite clear that certain H4K5AC peaks are problematic. I'd recommend removing peaks where the input signal is high.

cowplot::plot_grid(
sqc2$plots$FRIP$per_peak$H4K5AC_features$H4K5AC_signal,
sqc2$plots$FRIP$per_peak$H4K8AC_features$H4K8AC_signal,
sqc2$plots$FRIP$per_peak$H4K12AC_features$H4K12AC_signal,
sqc2$plots$FRIP$per_peak$ATAD2B_features$ATAD2B_signal
)

I think this is one of the most powerful tools for quickly assessing the quality of ChIP-seq/cuut&run/ATAC-seq experiment.

sqc2$plots$SCC$curves$H4K8AC_features$H4K8AC_signal

sqc2$plots$SCC$curves$ATAD2B_features$ATAD2B_signal

The SCC maxima at read length for ATAD2B is conclusively bad.

Between Mark Comparison

qc_features = QcConfigFeatures.parse("qc_feature_config.csv", process_features = TRUE)
qc_signal = QcConfigSignal.parse("qc_signal_config.csv")
sqc3 = ssvQC(qc_features, qc_signal)