neutralitytestr: Testing a neutral evolutionary model on cancer sequencing data
In neutralitytestr: Test for a Neutral Evolutionary Model in Cancer Sequencing Data

Load package

After you have installed the package, you can load the package via the normal command. This will also load some test data.

library(neutralitytestr)

The test data files include 2 simulated VAF distributions, one under a neutral evolutionary model and one with a non neutral evolutionary model. The test data are vectors of variant allele frequency, and are named VAFselection and VAFneutral. First of all we print the first few elements of one of these vectors.

head(VAFselection)

The test data were generated using an evolutionary model of cancer which produces synthetic sequencing data. The test data here were generated to approximately mimic whole genome sequencing data. There are $\sim 5000$ mutations in both datasets and the data was "sequenced" to 100X. Also included was normal contamination of 20%, that is why we observe the peak at roughly 0.4 ($2 \times 0.4 = 0.8$) which represents the clonal mutations in the sample, that is mutations present in every cancer cell. For both data the effective mutation rate (or equivalently the per tumour doubling mutation rate) is 200 corresponding to a WGS base per rate of $\sim 7 \times 10^{-8}$. For the VAFneutral data a neutral model where all cells have the same fitness was used, for the VAFselection data there is a single subclone at frequency 0.44 hence the peak at around 0.2 in the VAF spectrum (as will be seen below).

`neutralitytest` object

The basic functionality of the neutralitytestr package is achieved by creating a neutralitytest object. The neutralitytest object contains a range of metrics to test for neutrality, and makes plotting histograms and cumulative distributions to visualize the output easy. The neutralitytest function takes a vector of VAFs and an upper and lower limit for the frequency range over which we wish to test whether the data is consistent with a neutral model, and then calculates all 4 metrics.

s <- neutralitytest(VAFselection, fmin = 0.1, fmax = 0.25)

We can the print a summary of the neutralitytest object for the synthetic data with selection. This prints out all values and associated p-values for all the metrics. The p-values are the p-values under the null model of neutral evolution.

summary(s)

Rather than manually inputting the minimum and maximum of the integration range we can also input values for the read depth ($D$), the cellularity (% tumour content, c) of the sample, overdispersion of the sequencing data (rho) and the ploidy ($\pi$). This will calculate the expected standard deviation of the clonal cluster using the following equation: $$ SD = \sqrt{\frac{1 + (D-1)\times\rho}{D}}$$ We expect the mean of the cluster to be at the following frequency: $$ M = \frac{c}{\pi}$$ Then the the maximum of the integration range, fmax will be set to $M - 2\times SD$.

We'll now test the model on some synthetic data from a neutral evolutionary model using by inputting values for the read depth, cellularity, overdispersion parameter rho and ploidy and allowing the limits to be calculated automatically.

n <- neutralitytest(VAFneutral, read_depth = 100.0, cellularity = 0.8, rho = 0.0, ploidy = 2)
summary(n)

Plotting

The neutralitytest object makes plotting simple. Plotting uses ggplot, so plots can easily modified using the usual ggplot syntax. First of all we can plot a histogram of the VAFs.

vaf_histogram(n)

We can also plot the cumulative distribution along with the least squares best fit line, from which we get the $R^2$ value and the estimated mutation rate.

lsq_plot(n)

And finally the normalized cumulative distribution, which is used to calculate the kolmogorov distance and the area metrics.

normalized_plot(n)

Invoking the plot command will plot all 3 plots together, all these plots are generated using ggplot2, they can therefore be modified as any other ggplot object would be including saving etc.

gout <- plot_all(s)
gout

Summary of neutrality metrics

neutralitytestr.R calculates values for 4 different metrics from which we can deduce whether a given dataset is likely to be driven by a neutral evolutionary process or not. These metrics are all based the cumulative distributions of mutations, M(f), which under a neutral model follows the following equation: \begin{equation} M(f) = \frac{\mu}{\beta} \left (\frac{1}{f} - \frac{1}{f_{max}} \right) \end{equation} where $f$ is the frequency of mutations, $\mu$ is the mutation rate, $\beta$ is the proportion of divisions that results in 2 surviving offspring and $f_{max}$ is the maximum frequency over which we conduct the analysis. The first metric we calculate is the $R^2$ value from the best fit line of the linear model described by equation (1).

The other metrics are based on a normalized version of equation (1) which removes the mutation rate dependency. The equation for the normalized $M(f)$ is \begin{equation} M(f) = \frac{\frac{1}{f}-\frac{1}{f_{max}}} {\frac{1}{f_{min}} - \frac{1}{f_{max}}} \end{equation} Theoretically, any dataset can be compared to the curve described by equation (2). We can then calculate the area between the data and this curve, the kolmogorov distance between the two and the mean distance of all points and this curve. For further details on the theoretical background see Williams, Werner et al. Nat. Gen. 2016.

Input data

For the best results we would advise filtering mutations for a specific ploidy, ie only including mutations that fill in diploid regions and attempting to estimate the cellularity. A good heuristic for estimating cellularity is looking at the VAF histogram for mutations in diploid only regions and observing where the mean of the clonal cluster falls. @ times this value gives a good estimate of the cellularity.

Notes on the p-values

Note that the p-values should be interpreted with care and are meant to serve as an approximation to guide the interpretation of the test statistics. These p-values were generated empirically from a simulated cohort of cancers with known ground truth and are derived from the same data that generated the ROC curves in supplementary figure 3 from Williams et al BioRxiv. This cohort of simulated tumours were "sequenced" to 100X and thus if a sample you are analysing is sequenced to much higher or lower depth the p-values may no longer be valid. We have also developed a Bayesian alternative to identifying neutral and non-neutral tumours called SubClonalSelection.jl in julia which complements the approach in this package. Note that this Bayesian method is much more computationally expensive and can take upwards of 10 hours per sample.