In Illumina/happyCompare: Reporting toolbox for hap.py output

knitr::opts_chunk$set(fig.width = 6, cache = FALSE)

library("happyCompare")
library("tidyverse")
library("ggplot2")
theme_set(theme_bw())

Set up

In the present vignette we will be comparing stratified recall between a set of PCR-Free vs. Nano builds from NA12878. We will be relying on pre-processed data that we have generated by running hap.py on the VCFs from BaseSpace Data Central:

hap.py ${TRUTH_VCF} ${QUERY_VCF} -o ${OUTPUT_PREFIX} -f ${CONFIDENT_REGIONS} \
    --threads 40 --write-counts --stratification filtered_beds/stratification_config.txt --reference ${REF} \
    --roc QUAL --roc-filter LowQual --no-json

And then imported into R using the read_samplesheet() function from happyCompare:

# do not run
samplesheet_path <- "/path/to/happyCompare-master/pcrfree_vs_nano.csv"
happy_compare <- read_samplesheet(samplesheet_path, lazy = TRUE)

happy_compare <- readRDS("stratified_counts.Rds")

We will eventually end up with a happy_compare object (a list):

class(happy_compare)
sapply(happy_compare, class)

If we furhter inspect the contents of happy_results, we can see that each of its elements matches the data structures described in happyR, e.g.:

names(happy_compare$happy_results[[1]])

Extracting data from `happy_compare` objects

Once we have a happy_compare object we can rely on the extract_metrics function to access our metrics of interest. Since hap.py saves stratified counts under *extended.csv files, we can use the following:

stratified_counts <- extract_metrics(happy_compare, table = "extended") %>% 
  # focus on PASS calls in level 0 subsets
  filter(Subtype == "*", Filter == "PASS", Subset.Level == 0, !grepl(pattern = "TS*", Subset))

This will conveniently merge hap.py results with samplesheet metadata, making downstream analysis easier:

stratified_counts

Downstream analysis of stratified counts

Finally, we can estimate highest density intervals to account for uncertainty in the stratified counts:

hdi <- stratified_counts %>% 
  estimate_hdi(successes_col = "TRUTH.TP", totals_col = "TRUTH.TOTAL", 
               group_cols = c("Group.Id", "Subset", "Type"), aggregate_only = FALSE)

And visualise the results using ggplot2:

hdi %>% 
  mutate(Subset = factor(Subset, rev(sort(unique(Subset))))) %>% 
  filter(replicate_id == ".aggregate") %>% 
  ggplot(aes(x = estimated_p, y = Subset, group = Subset)) +
    geom_point(aes(color = Group.Id), size = 2) +
    geom_errorbarh(aes(xmin = lower, xmax = upper, color = Group.Id), height = 0.4) +
    facet_grid(. ~ Type) +
    theme(legend.position = "bottom") +
    ggtitle("Recall estimates in L0 subsets") +
    ylab("")

We can also evaluate the significance of the differences highlighted by our plot with compare_hdi:

hdi_diff <- compare_hdi(happy_hdi = hdi)

hdi_diff %>% 
  mutate(Subset = factor(Subset, levels = rev(sort(unique(Subset))))) %>% 
  ggplot(aes(x = estimated_diff, y = Subset, group = Subset)) +
    geom_point(aes(color = significant), size = 2) +
    geom_errorbarh(aes(xmin = lower, xmax = upper, color = significant), height = 0.4) +
    geom_vline(xintercept = 0, lty = 2, lwd = 0.25) +
    facet_grid(. ~ Type) +
    theme(legend.position = "bottom") +
    ggtitle("Recall differences in Nano vs PCR-Free") +
    ylab("")