```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
options(crayon.enabled = FALSE)
```

```r
library(CNAqc)
```
CNAqc can compute a CCF per mutation, reporting a measure of uncertainty of the estimate.
To compute CCFs one needs to "phase" the multiplicity of every input mutation, i.e. to compute the number of copies of the mutation that map to its copy number segment. This is a difficult task, and mistakes can lead to erroneous CCF estimates.
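Once the multiplicity $m$ of a mutation is phased, its CCF follows from the standard purity- and ploidy-aware relation, stated here for reference ($v$ is the observed VAF, $\pi$ the tumour purity and $c$ the total copy number of the segment):
$$\text{CCF}(v) = \frac{v\,\left[c\,\pi + 2(1-\pi)\right]}{m\,\pi}.$$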
Note: phasing multiplicities is the same as computing "early" and "late" clonal events in the terminology of Gerstung et al.
CNAqc computes CCFs for simple clonal CNA segments, offering two algorithms to phase mutations directly from VAFs.
data("example_PCAWG", package = 'CNAqc') print(example_PCAWG)
The entropy-based approach flags mutations for which the multiplicity cannot be phased from VAFs with certainty; the CCFs of these mutations should be manually checked and, unless strictly needed, discarded.
The model uses the entropy $H(x)$ of a VAF mixture with two Binomial distributions to detect mutations that happened before and after aneuploidy; the densities of the two components are defined below.
CNAqc assumes that the input CNAs are simple and clonal, and that the tumour purity estimate is reliable: under these assumptions, the expected VAF of a mutation is determined by its multiplicity, the segment copy number and the purity.
**Mixture construction.** CNAqc computes two Binomial densities $\text{Bin}_1$ and $\text{Bin}_2$ over the set of values $[0, n]$, with $n$ trials and success probabilities $v_1$ and $v_2$, i.e.
$$\text{Bin}_i = p(s \mid n; v_i)$$
the probability of finding $s$ reads with the mutant allele out of $n$ (the median depth), given the expected mutation frequency $v_i$; recall that $v_i$ includes the effect of tumour purity.
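As a minimal sketch of this construction (the helper `expected_vaf`, the purity, the depth and the `2:2` karyotype are illustrative assumptions, not CNAqc's internal code):

```r
# Sketch of the two-Binomial construction (illustrative values only).
# Expected VAF of a clonal mutation with multiplicity m on a segment
# with total copy number cn, given the tumour purity:
expected_vaf = function(m, cn, purity) {
  (m * purity) / (cn * purity + 2 * (1 - purity))
}

purity = 0.9  # assumed tumour purity
n = 80        # assumed median sequencing depth

v1 = expected_vaf(1, 4, purity)  # m = 1 on a 2:2 (tetraploid) segment
v2 = expected_vaf(2, 4, purity)  # m = 2 on the same segment

s = 0:n  # number of reads with the mutant allele
bin1 = dbinom(s, size = n, prob = v1)
bin2 = dbinom(s, size = n, prob = v2)
```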
To finalize the densities, CNAqc determines the mixing proportions by counting how many mutations fall under each Binomial density, restricted to its 1% and 99% quantiles. Two ranges $I_1 = [v_1^{1}; v_1^{99}]$ and $I_2 = [v_2^{1}; v_2^{99}]$ are computed, and $n_1$ and $n_2$ count the mutations whose VAF falls inside $I_1$ and $I_2$, respectively.

The counts are normalized to create a mixture
$$M = \mu_1 \text{Bin}_1 + (1 - \mu_1) \text{Bin}_2$$
where $\mu_1 = n_1/(n_1+n_2)$ and $\mu_2 = 1 - \mu_1$ are the normalized proportions.
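Continuing the sketch above with simulated VAFs (again illustrative, not CNAqc's implementation), the weights could be estimated as:

```r
# Simulated VAFs, for illustration only
set.seed(42)
vaf = c(rbinom(300, n, v1), rbinom(100, n, v2)) / n

# 1%-99% quantile ranges of each Binomial, rescaled to VAF space
I1 = qbinom(c(0.01, 0.99), size = n, prob = v1) / n
I2 = qbinom(c(0.01, 0.99), size = n, prob = v2) / n

n1 = sum(vaf >= I1[1] & vaf <= I1[2])  # mutations falling inside I1
n2 = sum(vaf >= I2[1] & vaf <= I2[2])  # mutations falling inside I2

mu_1 = n1 / (n1 + n2)                      # normalized proportion
mixture = mu_1 * bin1 + (1 - mu_1) * bin2  # the mixture M over s = 0..n
```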
**Entropy-based assignments.** Assigning multiplicities is difficult at the crossing of the two densities, where mutations could have multiplicity $m=1$ or $m=2$. If mistaken, these mutations can create artificial peaks in the CCF distribution and compromise downstream subclonal deconvolution.
Mutations with VAF below $v_1$ have $m=1$, and those above $v_2$ have $m=2$; for all others, CNAqc uses the entropy
$$H(x) = -\sum_{i=1}^{2} p_i(x) \log p_i(x)$$
defined from the mixture's latent variables, where $p_i(x)$ is the posterior probability that a mutation with VAF $x$ belongs to component $i$. The entropy is highest where the two densities overlap and the assignment of $m$ is most uncertain; where one density dominates, $m$ can be estimated confidently. CNAqc uses `peakPick` to find two peaks $\rho_1$ and $\rho_2$ in the entropy profile, sets $m=1$ for $x < \rho_1$ and $m=2$ for $x > \rho_2$, and assigns `NA` for $\rho_1 \leq x \leq \rho_2$ (the crossing region).
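A sketch of the entropy profile, computed from the responsibilities of the mixture built above (illustrative; CNAqc's implementation additionally runs peak detection with `peakPick` on this profile):

```r
# Responsibilities (latent variables) of each component at every VAF
z1 = (mu_1 * bin1) / (mu_1 * bin1 + (1 - mu_1) * bin2)
z2 = 1 - z1

# Entropy of the latent variable, guarding the 0 * log(0) = NaN case
ent = function(z) ifelse(z > 0, -z * log(z), 0)
H = ent(z1) + ent(z2)

# H(x) peaks where the two densities cross, i.e. where the multiplicity
# assignment is most uncertain
plot(s / n, H, type = "l", xlab = "VAF", ylab = "H(x)")
```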
CNAqc computes a QC score for CCFs: a karyotype is considered `PASS` if less than $p$% (default 10%) of its mutations are not assignable (CCF = `NA`).
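In code, the stated rule amounts to something like this minimal sketch (`qc_ccf` is a hypothetical helper, not part of the CNAqc API):

```r
# Hypothetical helper mirroring the stated rule: PASS if fewer than
# p% of the mutations of a karyotype have CCF = NA
qc_ccf = function(ccf, p = 0.10) {
  if (mean(is.na(ccf)) < p) "PASS" else "FAIL"
}
```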
```r
# We do not assemble the fits
plot_CCF(example_PCAWG, assembly_plot = F)
```
The plot shows the quantities described above, arranged from left to right. The rightmost CCF histogram is surrounded by a green or red square reflecting the QC status: `PASS` (green) or `FAIL` (red).
A method is also available to compute CCFs without using the entropy $H(x)$. From the 2-class Binomial mixture, CNAqc uses the means of the Binomial parameters to determine a hard split of the data.
Define $p_1$ and $p_2$ as the two means; the cutoff $\hat{p}$ is the proportion-weighted point between them
$$\hat{p} = p_1 + (p_2 - p_1)\,\mu_1$$
where $\mu_1 = n_1/(n_1+n_2)$ is the normalized proportion of the first component.
We set $m=1$ for $x \leq \hat{p}$, and $m=2$ otherwise. Since there are no `NA` assignments, the computation is always scored `PASS` for QC purposes; for this reason this computation is more "rough" than the entropy-based one.
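Reusing the quantities from the earlier sketch, the rough split could look like this (illustrative, not CNAqc's internal code):

```r
# Proportion-weighted cutoff between the two Binomial means (VAF space)
p_hat = v1 + (v2 - v1) * mu_1

# Hard split: every mutation gets a multiplicity, no NA assignments
m = ifelse(vaf <= p_hat, 1, 2)
table(m)
```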
We work with the template dataset.
```r
# Dataset available with the package
data('example_dataset_CNAqc', package = 'CNAqc')

x = CNAqc::init(
  mutations = example_dataset_CNAqc$mutations,
  cna = example_dataset_CNAqc$cna,
  purity = example_dataset_CNAqc$purity,
  ref = 'hg19')

print(x)
```
We run the function `compute_CCF`; without extra parameters, the entropy method is used.
Note: one should compute CCFs only for segments that pass peaks-based QC.
```r
x = compute_CCF(x)

# Print new object, it informs us of the available computation
print(x)
```
A tibble of CCF values (getter function `CCF`) reports the computed values in column `CCF`, and the phased multiplicities in column `mutation_multiplicity`.
```r
# Tibble
CCF(x) %>% dplyr::select(VAF, mutation_multiplicity, CCF)
```
Fit plot as above.
```r
# We just omit empty plots with empty_plot = FALSE
plot_CCF(x, empty_plot = FALSE)
```
The CCF histogram can be plotted with `plot_data_histogram`, specifying `which = 'CCF'` as parameter.
```r
plot_data_histogram(x, which = 'CCF')
```
We can compare the two CCF methods on the same data.
```r
x_r = compute_CCF(x, method = 'ROUGH')
print(x_r)
```
The cutoff is annotated in the panel with the mutation multiplicity $m$ assigned to each observed VAF value.
```r
plot_CCF(x_r, empty_plot = FALSE)
```
Notice that for `'2:0'` and `'2:1'` the difference between the mutation multiplicity histograms (second panel) obtained from the two methods is minimal. The result for `'2:2'` is instead quite different: there are few mutations acquired after genome doubling (leftmost peak, $m=1$), and therefore the weighted split of the rough method tends to assign most mutations to the rightmost peak ($m=2$).
A visual comparison is obtained using the internal functions `CNAqc:::plot_mutation_multiplicity_entropy` and `CNAqc:::plot_mutation_multiplicity_rough`.
```r
ggpubr::ggarrange(
  CNAqc:::plot_mutation_multiplicity_entropy(x, '2:2'),
  CNAqc:::plot_mutation_multiplicity_rough(x_r, '2:2'),
  nrow = 2
)
```
We can compare the two computed CCF distributions, and observe that the difference in the estimates for karyotype `'2:2'` creates a small bump at about CCF = 0.5.
```r
ggpubr::ggarrange(
  plot_data_histogram(x, which = 'CCF'),
  plot_data_histogram(x_r, which = 'CCF'),
  nrow = 1
)
```
Because these are high-quality data, the results of the two methods are quite similar. This is, however, not always true. The uncertainty of the entropy method stems from the variance of the mixture components; because these are Binomial distributions, the variance depends on the number of trials, i.e. on the sequencing coverage.
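As a quick numeric check (with an illustrative expected VAF), the standard deviation of a VAF estimated from $n$ reads is $\sqrt{v(1-v)/n}$, which grows as coverage drops:

```r
# Standard deviation of a VAF estimate at decreasing coverages
v = 0.45  # an illustrative expected VAF
sapply(c(120, 80, 48), function(n) sqrt(v * (1 - v) / n))
```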
If we take the data of `x` and scale the read counts by a constant, we preserve the same VAF distribution but increase the variance, because we decrease the coverage. Let's see this in practice, retaining 60% of the actual read counts.
```r
x_s = init(
  mutations = x$mutations %>% dplyr::mutate(DP = DP * .6, NV = NV * .6),
  cna = x$cna,
  purity = x$purity)

# Recompute CCF values
y_e = compute_CCF(x_s, method = 'ENTROPY', karyotypes = '2:2')
y_r = compute_CCF(x_s, method = 'ROUGH', karyotypes = '2:2')

CCF(y_e) %>% dplyr::filter(!is.na(CCF))
CCF(y_r) %>% dplyr::filter(!is.na(CCF))
```
Observe that there are about 400 fewer mutations with CCF values when we run the entropy method.
```r
ggpubr::ggarrange(
  CNAqc:::plot_mutation_multiplicity_entropy(y_e, '2:2'),
  CNAqc:::plot_mutation_multiplicity_rough(y_r, '2:2'),
  nrow = 2
)
```
```r
ggpubr::ggarrange(
  plot_data_histogram(y_e, which = 'CCF', karyotypes = '2:2'),
  plot_data_histogram(y_r, which = 'CCF', karyotypes = '2:2'),
  nrow = 1
)
```