knitr::opts_chunk$set(fig.width = 6, cache = FALSE) library("happyCompare") library("tidyverse") library("ggplot2") theme_set(theme_bw())
In the present vignette we will be comparing stratified recall between a set of PCR-Free vs. Nano builds from NA12878. We will be relying on pre-processed data that we have generated by running hap.py on the VCFs from BaseSpace Data Central:
hap.py ${TRUTH_VCF} ${QUERY_VCF} -o ${OUTPUT_PREFIX} -f ${CONFIDENT_REGIONS} \ --threads 40 --write-counts --stratification filtered_beds/stratification_config.txt --reference ${REF} \ --roc QUAL --roc-filter LowQual --no-json
And then imported into R using the read_samplesheet()
function from happyCompare
:
# do not run samplesheet_path <- "/path/to/happyCompare-master/pcrfree_vs_nano.csv" happy_compare <- read_samplesheet(samplesheet_path, lazy = TRUE)
happy_compare <- readRDS("stratified_counts.Rds")
We will eventually end up with a happy_compare
object (a list):
class(happy_compare) sapply(happy_compare, class)
If we furhter inspect the contents of happy_results
, we can see that each of its elements matches the data structures described in happyR
, e.g.:
names(happy_compare$happy_results[[1]])
happy_compare
objectsOnce we have a happy_compare
object we can rely on the extract_metrics
function to access our metrics of interest. Since hap.py saves stratified counts under *extended.csv
files, we can use the following:
stratified_counts <- extract_metrics(happy_compare, table = "extended") %>% # focus on PASS calls in level 0 subsets filter(Subtype == "*", Filter == "PASS", Subset.Level == 0, !grepl(pattern = "TS*", Subset))
This will conveniently merge hap.py results with samplesheet metadata, making downstream analysis easier:
stratified_counts
Finally, we can estimate highest density intervals to account for uncertainty in the stratified counts:
hdi <- stratified_counts %>% estimate_hdi(successes_col = "TRUTH.TP", totals_col = "TRUTH.TOTAL", group_cols = c("Group.Id", "Subset", "Type"), aggregate_only = FALSE)
And visualise the results using ggplot2
:
hdi %>% mutate(Subset = factor(Subset, rev(sort(unique(Subset))))) %>% filter(replicate_id == ".aggregate") %>% ggplot(aes(x = estimated_p, y = Subset, group = Subset)) + geom_point(aes(color = Group.Id), size = 2) + geom_errorbarh(aes(xmin = lower, xmax = upper, color = Group.Id), height = 0.4) + facet_grid(. ~ Type) + theme(legend.position = "bottom") + ggtitle("Recall estimates in L0 subsets") + ylab("")
We can also evaluate the significance of the differences highlighted by our plot with compare_hdi
:
hdi_diff <- compare_hdi(happy_hdi = hdi) hdi_diff %>% mutate(Subset = factor(Subset, levels = rev(sort(unique(Subset))))) %>% ggplot(aes(x = estimated_diff, y = Subset, group = Subset)) + geom_point(aes(color = significant), size = 2) + geom_errorbarh(aes(xmin = lower, xmax = upper, color = significant), height = 0.4) + geom_vline(xintercept = 0, lty = 2, lwd = 0.25) + facet_grid(. ~ Type) + theme(legend.position = "bottom") + ggtitle("Recall differences in Nano vs PCR-Free") + ylab("")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.