```r
knitr::opts_chunk$set(collapse = TRUE, cache = FALSE, comment = "#>")
```
One of the features of hap.py is the ability to produce TP, FP and FN counts in specific genomic subsets, as specified by the user. This opens up the possibility to compare performance metrics across genomic loci, as well as across analysis groups. However, replicate variability and variable subset sizes pose a challenge when interpreting this type of result. For example, if we observe the counts in the table below across 4 different replicates, can we confidently claim that there is a difference in recall across conditions for subset A? And can we say that there is a difference across subsets for condition 2?
| SUBSET | TP.CONDITION_1 | TP.CONDITION_2 | TRUTH.TOTAL |
|--------|----------------|----------------|-------------|
| A      | 81,82,82,82    | 74,75,76,75    | 82          |
| B      | 2,2,2,2        | 2,2,2,2        | 2           |
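For reference, the recall point estimates behind these counts are roughly 0.99-1.00 for subset A under condition 1 (81-82 TPs out of 82) versus 0.90-0.93 under condition 2 (74-76 out of 82), and exactly 1.0 for subset B in both conditions (2 out of 2 in every replicate).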
In this vignette we will describe a method to calculate credible intervals for stratified counts that takes into account the challenges stated above. In particular, we will be comparing stratified recall between a set of PCR-Free and Nano builds of NA12878.
```r
library(happyR)
library(tidyverse)
theme_set(theme_minimal())
set.seed(42)

## helper functions for HDI calculation
# source("stratified_counts_util.R")
```
We will be relying on a pre-processed dataset that we have generated by running hap.py on VCFs from BaseSpace Data Central - "NovaSeq S2: TruSeq PCR-Free 350 (Replicates of NA12878)" and "NovaSeq S2: TruSeq Nano 550 - (Replicates of NA12878)":
```sh
hap.py ${TRUTH_VCF} ${QUERY_VCF} -o ${OUTPUT_PREFIX} -f ${CONFIDENT_REGIONS} \
    --threads 40 --write-counts --reference ${REF} \
    --stratification ${STRATIFICATION_CONFIG} \
    --roc QUAL --roc-filter LowQual --no-json
```
The example dataset consists of a set of replicates for NA12878, sequenced on NovaSeq S2 and analysed using WGSv8. Six of the replicates were processed using TruSeq PCR-Free, and eight using TruSeq Nano. Our goal will be to compare recall across both library preparation workflows, both at a genome-wide level and within specific regions (coding exons in ACMG genes, high AT/GC regions and a RepeatMasker track). Note that, for speed, we have focused our analyses on chr22, and have subsetted the truth VCF and stratification BEDs accordingly.
Let's first load hap.py outputs into R using a happyR samplesheet:
```r
# not run
data_dir <- "demo_data.pcrfree_vs_nano/filtered_happy/"
samplesheet <- readr::read_csv(
"group_id,replicate_id,happy_prefix
PCR-Free,NA12878-I30,NA12878-I30_S1
PCR-Free,NA12878-I33,NA12878-I33_S1
PCR-Free,NA12878-I47,NA12878-I47_S1
PCR-Free,NA12878-I67,NA12878-I67_S1
PCR-Free,NA12878-I82,NA12878-I82_S1
PCR-Free,NA12878-I85,NA12878-I85_S1
Nano,NA12878-R1,NA12878-R1_S1
Nano,NA12878-R2,NA12878-R2_S1
Nano,NA12878-R3,NA12878-R3_S1
Nano,NA12878-R4,NA12878-R4_S1
Nano,NA12878-R5,NA12878-R5_S1
Nano,NA12878-R6,NA12878-R6_S1
Nano,NA12878-R7,NA12878-R7_S1
Nano,NA12878-R8,NA12878-R8_S1
") %>%
  mutate(happy_prefix = sprintf("%s/%s", data_dir, happy_prefix))
hap_samplesheet <- read_samplesheet_(samplesheet)
saveRDS(hap_samplesheet, file = "demo_data.Rds")
```
```r
# load the demo dataset from a pre-saved Rds
hap_samplesheet <- readRDS("demo_data.Rds")
class(hap_samplesheet)
```
Once we have a `happy_samplesheet` object, we can rely on the `extract_results` function to access our metrics of interest. Since hap.py saves stratified counts in `*extended.csv` files, we can use the following:
```r
stratified_counts <- extract_results(hap_samplesheet$results, table = "extended") %>%
  # focus on PASS calls in level 0 subsets
  filter(Subtype == "*", Filter == "PASS", Subset.Level == 0,
         !grepl(pattern = "TS*", Subset)) %>%
  inner_join(hap_samplesheet$samplesheet) %>%
  mutate(.type_group = paste(Type, group_id))
stratified_counts %>% head
```
Let's inspect the number of true events that overlap with each of our regions of interest:
```r
stratified_counts %>%
  select(Subset, Type, Subset.Size, TRUTH.TOTAL) %>%
  unique() %>%
  spread(key = Type, value = TRUTH.TOTAL) %>%
  rename(TRUTH.TOTAL.INDEL = INDEL, TRUTH.TOTAL.SNP = SNP)
```
We can see that some subsets contain a large number of events (e.g. `*` and `repeat.masker`), whilst others barely contain any observations (e.g. `coding.exons.acmg.ensembl87`).
In order to reliably compare TP counts across subsets and analysis groups, we propose a method to compute highest-posterior density intervals (HDIs) from hap.py outputs. Our method is able to account for variable subset sizes and replicate variability, and is based on the following steps:
1) Model counts using Negative Binomial sampling (see the sketch after this list):

$$k_{ij} \sim NB(\rho_{i} \cdot n_{ij}, \sigma)$$

where:

* $k_{ij}$: observed successes for subset $i$ in replicate $j$
* $n_{ij}$: total counts for subset $i$ in replicate $j$
* $\rho_{i}$: success rate for subset $i$, with prior $\rho_{i} \sim Beta(1, 1)$
* $\sigma$: dispersion parameter, with prior $\sigma \sim Normal(\mu_{0}, \sigma_{0}^{2})$

2) Use MCMC to obtain posterior distributions for the per-subset success rate $\rho_{i}$.

3) Calculate highest-posterior density intervals (HDIs) from the posterior draws.
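For concreteness, the sketch below shows one way to express this model in Stan via `rstan`. This is an illustration only: the data layout, variable names and the hyperparameters of the $\sigma$ prior are assumptions made here, and the implementation actually used in this vignette lives in `stratified_counts_util.R`.

```r
library(rstan)

# Negative Binomial model with a per-subset success rate (sketch only;
# prior hyperparameters below are assumed, not taken from the helpers)
nb_model_code <- "
data {
  int<lower=1> N;                   // observations (subset x replicate)
  int<lower=1> S;                   // number of subsets
  int<lower=1, upper=S> subset[N];  // subset index per observation
  int<lower=0> k[N];                // successes, e.g. TRUTH.TP
  vector<lower=0>[N] n;             // totals, e.g. TRUTH.TOTAL
}
parameters {
  vector<lower=0, upper=1>[S] rho;  // per-subset success rate
  real<lower=0> sigma;              // dispersion
}
model {
  rho ~ beta(1, 1);                 // flat prior on the success rate
  sigma ~ normal(0, 10);            // assumed prior hyperparameters
  for (i in 1:N)
    k[i] ~ neg_binomial_2(rho[subset[i]] * n[i], sigma);
}
"

# fit <- stan(model_code = nb_model_code, data = stan_data)
# where stan_data is a list matching the data block above
```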
We provide a set of helper functions attached to this vignette (`stratified_counts_util.R`) that make it easy to calculate HDIs from the extended metrics table that we previously imported; e.g. for stratified recall:
```r
# not run
groups <- stratified_counts %>% select(.type_group) %>% unique() %>% unlist()
demo_hdi <- lapply(seq_along(groups), function(i) {
  sel_group_type <- groups[i]
  # initialise model and sample from the posterior distribution
  stan_result <- sample_posterior(m = stratified_counts %>%
                                    filter(.type_group == sel_group_type),
                                  successes_field = "TRUTH.TP",
                                  totals_field = "TRUTH.TOTAL")
  # calculate 95% HDIs from posterior observations
  estimate_hdi(r = stan_result, credMass = 0.95) %>%
    mutate(.type_group = sel_group_type)
}) %>%
  bind_rows()
demo_hdi %>% head
write_csv(demo_hdi, path = "demo_hdi.csv")
```
```r
# load pre-computed dataset
demo_hdi <- read_csv("demo_hdi.csv", col_types = cols())
```
Finally, we can visualise our recall HDIs as follows:
```r
dodge_width <- 0.7
demo_hdi %>%
  ggplot(aes(x = subset, color = .type_group)) +
  # observed
  geom_errorbar(aes(ymin = obs.min, ymax = obs.max),
                alpha = 0.2, size = 5, lty = 1, width = 0,
                position = position_dodge(width = dodge_width)) +
  # estimated
  geom_errorbar(aes(ymin = posterior.hdi.low, ymax = posterior.hdi.high),
                size = 0.5, alpha = 1, width = 0.6,
                position = position_dodge(width = dodge_width)) +
  geom_point(aes(y = posterior.mean), size = 2, shape = 1,
             position = position_dodge(width = dodge_width)) +
  xlab("") +
  ylab("METRIC.Recall") +
  ylim(0, 1) +
  coord_flip() +
  ggtitle("Recall across genomic subsets in NA12878")
```
We can see that, whilst the point estimates from hap.py suggest a difference in INDEL recall between PCR-Free and Nano in high GC regions, the credible intervals are too wide to draw meaningful conclusions. Conversely, we can confidently claim that, compared to PCR-Free, Nano reduces our ability to recall INDELs in repeat and high AT regions.
One way to evaluate the accuracy of our HDIs is through simulated data, where we know what the true success rate is and, as a result, we can measure how often that value is captured by our HDI estimates. However, we do not want to rely on a simulation approach that is based solely on the model that we have used for estimation, since it might not be fully representative of real data.
Let us take recall measurements across $r$ replicates as an example. For a subset with $n$ total variants we can classify each variant into one of these 3 groups:
1) the variant is consistently called across all of the replicates (always seen - A)
2) the variant is only called in a subset of the replicates (sometimes seen - S)
3) the variant is never called (never seen - N)
We can then define our true recall $\rho$ as follows:
$$\rho = \frac{n q_A x_A + n q_S x_S + n q_N x_N}{n} = q_A x_A + q_S x_S + q_N x_N$$
where:

* $n$: total number of variants in the subset
* $q_A$: probability of a variant falling in group A; $x_A$: recall in group A (always 1)
* $q_S$: probability of a variant falling in group S; $x_S$: recall in group S (variable)
* $q_N$: probability of a variant falling in group N; $x_N$: recall in group N (always 0)
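As a quick worked example (with made-up numbers): if $q_A = 0.6$, $q_S = 0.3$, $q_N = 0.1$ and $x_S = 0.5$, then $\rho = 0.6 \cdot 1 + 0.3 \cdot 0.5 + 0.1 \cdot 0 = 0.75$, regardless of $n$.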
From these parameters, we can simulate TP counts in one sample by drawing from a multinomial distribution that assigns each of the $n$ variants to one of the groups with probabilities $\{q_A, q_S, q_N\}$. We can next simulate replicates from this sample by performing random binomial draws from each group with recalls $\{x_A, x_S, x_N\}$, which in practice will be $\{1, x_S, 0\}$. Note that the number of true positives for groups A and N will be fixed across replicates by definition, but we can still use $x_S$ to simulate replicate variability. As a result of this variability, our observed recall will differ slightly from the theoretical expectation $\rho$.
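To make this concrete before the full implementation, here is a minimal sketch of the procedure for a single subset, using `rmultinom` and `rbinom` directly (all parameter values below are made up for illustration):

```r
set.seed(1)
n   <- 5000                          # total variants in the subset
q   <- c(A = 0.6, S = 0.3, N = 0.1)  # group probabilities {q_A, q_S, q_N}
x_s <- 0.5                           # recall in the "sometimes seen" group

# one multinomial draw assigns each variant to a group;
# this assignment stays fixed across replicates
counts <- rmultinom(1, size = n, prob = q)[, 1]

# per-replicate TP counts: group A is always recalled, group N never,
# and group S contributes a fresh Binomial(counts["S"], x_s) draw each time
r  <- 4
tp <- counts["A"] + rbinom(r, size = counts["S"], prob = x_s)

tp / n                         # observed recall per replicate
unname(q["A"] + q["S"] * x_s)  # theoretical expectation, rho
```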
We can implement the simulation strategy discussed above as follows:
```r
# not run
# simulate a sample with N subsets, by randomly assigning values
# to {q_A, q_S, q_N}, x_S and n
n_subsets <- 1000
sim_sample <- lapply(1:n_subsets, function(i) {
  q <- runif(3, min = 0, max = 1)
  q <- round(q / sum(q), 4)
  data.frame(i = i, qa = q[1], qs = q[2], qn = q[3])
}) %>%
  bind_rows() %>%
  mutate(x = sample(x = seq(0, 1, 1e-4), size = n_subsets, replace = TRUE)) %>%
  mutate(n = sample(x = seq(1e3, 1e4, 1), size = n_subsets, replace = TRUE)) %>%
  mutate(expected_rho = round((n * qa + n * qs * x) / n, 4)) %>%
  mutate(subset_qa_qs_qn_x_n = sprintf("S%s_%s_%s_%s_%s_%s", i, qa, qs, qn, x, n))
sim_sample %>% head

# simulate r replicates from the sample by drawing counts in each subset,
# with per-group recalls {1, x, 0}
get_counts <- function(subset_qa_qs_qn_x_n, r) {
  # recover the simulation parameters encoded in the subset name
  s <- data.frame(subset_qa_qs_qn_x_n) %>%
    separate(subset_qa_qs_qn_x_n,
             into = c("subset", "qa", "qs", "qn", "x", "n"), sep = "_")
  # assign each of the n variants to a group (fixed across replicates)
  counts <- table(sample(factor(c("always", "sometimes", "never")),
                         size = as.numeric(s$n),
                         prob = as.numeric(c(s$qa, s$qs, s$qn)),
                         replace = TRUE))
  always <- counts["always"]
  # group S successes vary across replicates via binomial draws
  sometimes <- rbinom(as.numeric(r), counts["sometimes"], as.numeric(s$x))
  data.frame(
    subset_qa_qs_qn_x_n = subset_qa_qs_qn_x_n,
    replicate_id = paste("R", 1:r, sep = ""),
    successes = always + sometimes,
    stringsAsFactors = FALSE
  )
}
sim_replicates <- lapply(sim_sample$subset_qa_qs_qn_x_n,
                         function(s) get_counts(s, r = 5)) %>%
  bind_rows()
sim_replicates %>% head

# combine the sample and replicate datasets
sim_data <- sim_sample %>%
  inner_join(sim_replicates) %>%
  mutate(observed_rho = round(successes / n, 4))
sim_data %>% head
write_csv(sim_data, path = "sim_data.csv")

# calculate HDIs
stan_result <- sample_posterior(m = sim_data %>% rename(Subset = subset_qa_qs_qn_x_n),
                                successes_field = "successes",
                                totals_field = "n")
sim_hdi <- estimate_hdi(r = stan_result) %>%
  separate(subset, into = c("subset", "qa", "qs", "qn", "x", "n"), sep = "_") %>%
  mutate(n = as.numeric(n)) %>%
  mutate(obs.range = obs.max - obs.min) %>%
  mutate(hdi.range = posterior.hdi.high - posterior.hdi.low)
sim_hdi %>% head
write_csv(sim_hdi, path = "sim_hdi.csv")
```
```r
# load pre-computed datasets
sim_data <- read_csv(file = "sim_data.csv", col_types = cols())
sim_data %>% head
sim_hdi <- read_csv(file = "sim_hdi.csv", col_types = cols())
sim_hdi %>% head
```
By comparing the expected vs. observed success rates ($\rho$) from our simulated dataset, we can confirm that our simulation has indeed produced the intended variability:
```r
sim_data %>%
  ggplot() +
  geom_abline(slope = 1, color = "gray70", lty = 2, lwd = 0.25) +
  geom_point(aes(x = expected_rho, y = observed_rho, color = replicate_id),
             shape = 1, size = 1) +
  ggtitle("Replicate variability in our simulated dataset")
```
We can also inspect whether our HDIs respond to the variables of interest, that is, subset size and replicate variability:
```r
p1 <- sim_hdi %>%
  ggplot(aes(x = n, y = hdi.range)) +
  geom_point(alpha = 0.5) +
  ggtitle("HDI range decreases with subset size")
p2 <- sim_hdi %>%
  ggplot(aes(x = obs.range, y = hdi.range)) +
  geom_point(alpha = 0.5) +
  ggtitle("HDI range increases with replicate variability")
gridExtra::grid.arrange(p1, p2, nrow = 1)
```
And we can confirm that our HDIs contain the simulated per-replicate probabilities by inspecting a few subsets:
```r
i <- sample(seq(1, dim(sim_hdi)[1]), size = 10, replace = FALSE)
sel_subsets <- paste("S", i, sep = "")
dodge_width <- 0.7
sim_hdi %>%
  filter(subset %in% sel_subsets) %>%
  ggplot(aes(x = subset)) +
  # observed
  geom_errorbar(aes(ymin = obs.min, ymax = obs.max),
                alpha = 0.4, size = 5, lty = 1, width = 0,
                position = position_dodge(width = dodge_width)) +
  # estimated
  geom_errorbar(aes(ymin = posterior.hdi.low, ymax = posterior.hdi.high),
                size = 0.5, alpha = 1, width = 0.6,
                position = position_dodge(width = dodge_width)) +
  geom_point(aes(y = posterior.mean), size = 2, shape = 1,
             position = position_dodge(width = dodge_width)) +
  ylab("METRIC.Recall") +
  xlab("") +
  ylim(0, 1) +
  coord_flip() +
  ggtitle("HDI estimates in a random selection of simulated subsets")
```
Last, we can extend the evaluation to all subsets and quantify what proportion of HDIs contain all 5 replicate observations (should be close to 95%):
```r
obs_in_hdi <- sim_hdi %>%
  mutate(obs_min_in_hdi = posterior.hdi.low <= obs.min,
         obs_max_in_hdi = posterior.hdi.high >= obs.max,
         all_obs_in_hdi = obs_min_in_hdi & obs_max_in_hdi)
obs_in_hdi %>%
  group_by(all_obs_in_hdi) %>%
  summarise(n = n())
```
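The same check can be summarised as a single proportion:

```r
# fraction of subsets whose HDI contains all replicate observations
mean(obs_in_hdi$all_obs_in_hdi)
```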