knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
To illustrate the uses of ssvQC at a basic, intermediate, and advanced level.
This will be done using a variety of different datasets to demonstrate the flexibility of ssvQC.
1) Basic replicate comparison 2) Reprodicible peaks comparison between two different marks.
library(ssvQC) SQC_OPTIONS$SQC_FORCE_CACHE_OVERWRITE = TRUE
The first task is to locate the files we need. ssvQC requires two types of files "signal" (bam or bigwig) and "feature" (any bed-like file with special support for narrowPeak and other common types). UCSC has an excellent resource for learning about different file types.
The code below also demonstrates basic usage of dir() coupled with Regex. Regex is shorthand for Regular Expression and it's a specialized syntax for matching text. Regex below is the odd looking text passed to dir as the "pattern" argument. Regex gets used again in grepl(). Here are a couple good Regex resources.
in_dir = "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7.repeak/" data_paths = dir(in_dir, pattern = "H4K[5].+rep", full.names = TRUE) bam_files = dir(data_paths, pattern = "bam$", full.names = TRUE) np_files = dir(data_paths, pattern = ".+narrowPeak$", full.names = TRUE) np_files = np_files[!grepl("loose", np_files)]
With bam_files for our "signal" and narrowPeak files for our "features" we can create a ssvQC object which will link the two together.
sqc1 = ssvQC(np_files, bam_files)
As we'll see in the next section, this isn't a very good ssvQC configuration but it will work, especially for this simplest case simply comparing 2 replicates.
Note the use of the $ accessor. This is the primary intended way to access, meaning get or set, data in ssvQC object.
plot(sqc1$signal_config) plot(sqc1$features_config)
The names are rather long and don't match between the signal and config files. The colors are also automatically selected but work fine with just 2 color groups like this.
Regardless, we can run ssvQC with this data just fine. We'll start slow and just look at feature overlaps.
ssvQC has 2 families of functions ssvQC.prep and ssvQC.plot. In general you only want to run ssvQC.prep steps once and the results may take some time to calculate so by default they are cached. Meaning, if you run an ssvQC.prep step again using the same signal and feature files the previous results are simply loaded. Feature selection uses random sampling so it's important to use set.seed() prior to processing data to benefit from caching and is just generally a good idea for reproducibility.
Most ssvQC.plot functions will run the matching ssvQC.prep function as needed so it isn't actually necessary to call ssvQC.prep* independently as we do here.
set.seed(0) #NEED TO TWEAK DEFAULT BEHAVIOR SO THIS IS SIMPLER sqc1$features_config$run_by = "All" #default would run the 2 sets of peaks separately, this will overlap and compare together sqc1$features_config$to_run = "All" sqc1$features_config$meta_data$rep = c("rep1", "rep2") sqc1$features_config$color_by = "rep" sqc1$features_config$n_peaks = 100 sqc1$features_config$consensus_fraction sqc1$features_config$consensus_n sqc1$signal_config$meta_data$rep = c("rep1", "rep2") sqc1$signal_config$to_run = "All" sqc1$signal_config$color_by = "rep" #default would run the 2 sets of peaks separately, this will overlap and compare together undebug(ssvQC:::prepFeatures) sqc1 = ssvQC.prepFeatures(sqc1) #ssvQC.plotFeatures would call this automatically. sqc1 = ssvQC.plotFeatures(sqc1)
This doesn't actually output a plot. All plots are stored in the "plots" slot of the ssvQC object. Again we use the $ accessor.
sqc1$plots$features$count$All_features sqc1$plots$features$binary_heatmap$All_features sqc1$plots$features$UpSet$All_features sqc1$plots$features$venn$All_features sqc1$plots$features$euler$All_features
This kind of "feature" analysis actually doesn't rely on "signal" files at all. It would have been valid to create the ssvQC object with only the feature files and we could have generated the same plots.
ssvQC can do a lot more than this though. With signal files we can perform clustering and generate heatmaps, etc. Especially with bam files, as opposed to bigwigs, there are lots of QC metrics we can generate. The function ssvQC.runAll will perform all possible analyses on our ssvQC object.
SQC_OPTIONS$mc.cores = 10 #equivalent to `options(mc.cores = 10)` sqc1 = ssvQC.runAll(sqc1)
This has called a series of ssvQC.plot (and their supporting ssvQC.prep) functions.
The following sets of plots are now available:
names(sqc1$plots)
First let's look at the average read pileup across assessed peaks.
sqc1$plots$signal$lines$All_features$All_signal
Now the signal clustered:
cowplot::plot_grid( sqc1$plots$signal$heatmaps$All_features$All_signal, sqc1$plots$signal$heatmaps.lines$All_features$All_signal )
The default is to plot using RPM values and scale the plot to 0-10. We can tweak a couple setting that impact the color scale limits of the heatmap and y-axis limits of the line plots.
sqc1$signal_config$lineplot_free_limits = FALSE sqc1$signal_config$plot_value = SQC_SIGNAL_VALUES$raw sqc1$signal_config$cluster_value = SQC_SIGNAL_VALUES$raw sqc1$signal_config$sort_value = SQC_SIGNAL_VALUES$raw sqc1$signal_config$heatmap_limit_values = c(0, 50) sqc1 = ssvQC.plotSignal(sqc1) cowplot::plot_grid( sqc1$plots$signal$heatmaps$All_features$All_signal, sqc1$plots$signal$heatmaps.lines$All_features$All_signal )
See how the raw value heatmap show what might look like differences between the reps? This is due to the read depth being substanitally different. We can see this in the reads plot.
sqc1$plots$reads$All_signal
cowplot::plot_grid(ncol = 2, sqc1$plots$SCC$curves$All_features$All_signal, sqc1$plots$SCC$dots$All_features$All_signal )
Neither peak set has peaks with high correlation at read length. This is good! That kind of peak is typically an artifact from over-amplification or just mapping to artifact prone regions of the genome (i.e. peri-centromere, rDNA, other repetitive or high copy-number regions).
The dots plot shows that the peaks have a spread of correlations. This visualization can help reveal the presence of peaks that should be removed.
sqc1$plots$reads$All_signal sqc1$plots$features$count$All_features sqc1$plots$FRIP$reads_per_peak sqc1$plots$FRIP$per_peak
The FRIP (Fraction of Reads in Peaks) series of plots assesses the strenght of the signal enrichment. Keep in mind we're only looking at a small sampling of peaks in this case so the total FRIP will be lower than expected. If the FRIP per peak plots show rough equivalence, that suggests the samples are of similar quality to one another despite differences in read depth. If that's true, and the read count per peak plots show an imbalance, that strongly indicates that more reads would be a benefit to the weaker replicate.
For this next example we'll be working with more of the same dataset. Where before we only considered the H4K5AC data, here we'll look at 2 addtional H4 marks and ATAD2B, a transcription factor.
in_dir = "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7.repeak" data_paths = dir(in_dir, pattern = "((H4K[5812]+)|(ATAD)|(input)).+rep", full.names = TRUE) data_paths bam_files = dir(data_paths, pattern = "bam$", full.names = TRUE) np_files = dir(data_paths, pattern = ".+narrowPeak$", full.names = TRUE) np_files = np_files[!grepl("loose", np_files)] np_files = c(np_files, "/slipstream/galaxy/uploads/working/qc_framework/output_SF_mcf7/MCF7_ATAD2B_rep1/MCF7_ATAD2B_rep1_peaks.narrowPeak")
The following block illustrates one method for controlling how these files get handled and outputs get organized.
We can derive "mark" and "rep" information from the file names provided and add these to the meta_data. This is currently rather clunky and may be improved later in ssvQC.
Once added to the meta_data, we can specify "run_by" and "color_by" as these two variables respectively. "run_by" provides grouping to the data for peak overlaps and when producing various plots. Likewise, "color_by" decides how items will be colored.
While "run_by" is shared in the signal and feature configurations, it's impact is a bit different. For feature configurations it compares how feature sets are split up for overlap comparison. For each of these feature sets, groups specified by "run_by" in the signal configuration will be analyzed.
"to_run" determines which items in "run_by" will actually be run. Here we will not run items with the "mark" value of "input", since we have no peaks called on the input samples alone. We will however specify "input" at the "to_run_reference" which will include input signal in every appropriate feature group analysis.
sig_config = QcConfigSignal.files(bam_files) sig_config$meta_data$mark = sapply(strsplit(sig_config$meta_data$group, "[_\\.]"), function(x)x[2]) sig_config$meta_data$rep = sapply(strsplit(sig_config$meta_data$group, "[_\\.]"), function(x)x[3]) sig_config$run_by = "mark" sig_config$color_by = "rep" sig_config$to_run = setdiff(sig_config$meta_data[[sig_config$run_by]], "input") sig_config$to_run_reference = "input" # plot(sig_config) feat_config = QcConfigFeatures.files(np_files, n_peaks = 200) feat_config$meta_data$mark = sapply(strsplit(feat_config$meta_data$group, "[_\\.]"), function(x)x[2]) feat_config$meta_data$rep = sapply(strsplit(feat_config$meta_data$group, "[_\\.]"), function(x)x[3]) feat_config$run_by = "mark" feat_config$color_by = "rep" feat_config$to_run = setdiff(feat_config$meta_data[[feat_config$run_by]], "input") feat_config$to_run_reference = character() sqc2 = ssvQC(feat_config, sig_config) sqc2$signal_config$cluster_value = SQC_SIGNAL_VALUES$RPM sqc2$signal_config$sort_value = SQC_SIGNAL_VALUES$RPM sqc2$signal_config$plot_value = SQC_SIGNAL_VALUES$RPM sqc2$signal_config$lineplot_free_limits = TRUE sqc2$signal_config$heatmap_limit_values = c(0, 5) sqc2$signal_config$color_by = "mark" sqc2$signal_config$color_mapping sqc2$signal_config$color_mapping$input = "gray" sqc2$signal_config$color_mapping$ATAD2B = "green" sqc2$signal_config$color_mapping$H4K12AC = "red" sqc2$signal_config$color_mapping$H4K5AC = "blue" sqc2$signal_config$color_mapping$H4K8AC = "purple"
Because we've extensively tweaked these configuations, let's save them for later use.
QcConfigFeatures.save_config(sqc2$features_config, file = "qc_feature_config.csv") QcConfigSignal.save_config(sqc2$signal_config, file = "qc_signal_config.csv")
Remember to run it!
sqc2 = ssvQC.runAll(sqc2)
Let's first repeat some plots from before.
cowplot::plot_grid( sqc2$plots$signal$heatmaps$H4K5AC_signal, sqc2$plots$signal$heatmaps.lines$H4K5AC_features$H4K5AC_signal )
Now it's quite clear that certain H4K5AC peaks are problematic. I'd recommend removing peaks where the input signal is high.
cowplot::plot_grid( sqc2$plots$FRIP$per_peak$H4K5AC_features$H4K5AC_signal, sqc2$plots$FRIP$per_peak$H4K8AC_features$H4K8AC_signal, sqc2$plots$FRIP$per_peak$H4K12AC_features$H4K12AC_signal, sqc2$plots$FRIP$per_peak$ATAD2B_features$ATAD2B_signal )
I think this is one of the most powerful tools for quickly assessing the quality of ChIP-seq/cuut&run/ATAC-seq experiment.
sqc2$plots$SCC$curves$H4K8AC_features$H4K8AC_signal
sqc2$plots$SCC$curves$ATAD2B_features$ATAD2B_signal
The SCC maxima at read length for ATAD2B is conclusively bad.
qc_features = QcConfigFeatures.parse("qc_feature_config.csv", process_features = TRUE) qc_signal = QcConfigSignal.parse("qc_signal_config.csv") sqc3 = ssvQC(qc_features, qc_signal)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.