dS models to quantify natural selection in somatic evolution

The dNdScv R package is a suite of maximum-likelihood dN/dS methods designed to quantify selection in cancer and somatic evolution (Martincorena et al., 2017). The package contains functions to quantify dN/dS ratios for missense, nonsense and essential splice mutations, at the level of individual genes, groups of genes or at whole-genome level. The dNdScv method was designed to detect cancer driver genes (i.e. genes under positive selection in cancer) on datasets ranging from a few samples to thousands of samples, in whole-exome/genome or targeted sequencing studies.

Unlike traditional implementations of dN/dS, the dNdScv package uses trinucleotide context-dependent substitution models to avoid common mutation biases affecting dN/dS (Greenman et al., 2006, Martincorena et al., 2017). The package includes two different dN/dS models. dNdSloc, like traditional dN/dS implementations, uses the number of synonymous mutations in a gene to infer the local mutation rate, without exploiting additional information from other genes. dNdScv offers a much more powerful alternative, combining local information (synonymous mutations in the gene) and global information (variation of the mutation rate across genes, exploiting epigenomic covariates) to estimate the background mutation rate. dNdScv should be preferred in most situations.

This vignette shows how to perform driver discovery and selection analyses with dNdScv in cancer sequencing data. By default, dNdScv assumes human data from the GRCh37/hg19 assembly, but the latest version of the package provides a function (buildref) to generate the necessary reference file to run dNdScv on others species or assembly (see the relevant tutorial). Although designed for cancer genomic studies, dNdScv can be also used to quantify selection in other resequencing studies, such as SNP analyses, mutation accumulation studies in bacteria or for the discovery of mutations causing developmental disorders using data from human trios.

To cite this package please use: Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell.

Driver discovery (positive selection) in cancer exomes/genomes

The simplest way to run dNdScv

library("dndscv")

data("dataset_simbreast", package="dndscv")
dndsout = dndscv(mutations)

Inputs and default parameters

For this example, we have used a simulated dataset of 196 breast cancer exomes provided in the package. The simplest way to run the dndscv function is to provide a table of mutations. Mutations are provided as a data.frame with five columns (sampleID, chromosome, position, reference base and mutant base). It is important that only unique mutations are provided in the file. Multiple instances of the same mutation in related samples (for example, when sequencing multiple samples of the same tumour) should only be listed once.

head(mutations)

With the example dataset provided, the function should take about one minute to run. In this example, the function issues a warning as it detects the same mutation in more than one sample, requesting the user to verify that the input table of mutations only contains independent mutation events. In this case, each sample corresponds to a different patient and so the warning can be safely ignored.

We have run dNdScv with default parameters. This includes removing ultra-hypermutator samples and subsampling mutations when encountering too many mutations per gene in the same sample. These were designed to protect against loss of sensitivity from ultra-hypermutators and from clustered artefacts in the input mutation table, but there are occasions when the user will prefer to relax these (see Example 2).

dndscv outputs: Table of significant genes

The output of the dndscv function is a list of objects. For an analysis of exome or genome data, the most relevant output will often be the result of neutrality tests at gene level. P-values for substitutions are obtained by Likelihood-Ratio Tests as described in (Martincorena et al, 2017) and q-values are obtained by Benjamini-Hodgberg's multiple testing correction. The table also includes information on the number of substitutions of each class observed in each gene, as well as maximum-likelihood estimates (MLEs) of the dN/dS ratios for each gene, for missense (wmis), nonsense (wnon), essential splice site mutations (wspl) and indels (wind). The global q-value integrating all mutation types are available in the qglobal_cv and qallsubs_cv columns for analyses with and without indels, respectively.

sel_cv = dndsout$sel_cv
print(head(sel_cv), digits = 3)
signif_genes = sel_cv[sel_cv$qglobal_cv<0.1, c("gene_name","qglobal_cv")]
rownames(signif_genes) = NULL
print(signif_genes)

Note in the table that the dN/dS ratios of significant cancer genes are typically extremely high, often >10 or even >100. This contains information about the fraction of mutations observed in a gene that are genuine drivers under positive selection. For example, for a highly significant gene, a dN/dS of 10 indicates that there are 10 times more non-synonymous mutations in the gene than neutrally expected, suggesting that at least around 90% of these mutations are genuine drivers (Greenman et al, 2006; Martincorena et al, 2017).

Users can calculate confidence intervals for the dN/dS ratios per gene using the geneci function in the package. To use this function you need to run dndscv with the optional argument outmats=T and then use ci = geneci(dndsout). For more details see ? geneci.

dndscv outputs: Global dN/dS estimates

Another output that can be of interest is a table with the global MLEs for the dN/dS ratios across all genes. dN/dS ratios with associated confidence intervals are calculated for missense, nonsense and essential splice site substitutions separately, as well as for all non-synonymous substitutions (wall) and for all truncating substitutions together (wtru), which include nonsense and essential splice site mutations.

print(dndsout$globaldnds)

Global dN/dS ratios in somatic evolution of cancer, and seemingly of healthy somatic tissues, appear to show a near-universal pattern of dN/dS~1, with exome-wide dN/dS ratios typically slightly higher than 1 (Martincorena et al., 2017). Only occasionally, I have found datasets with global dN/dS<1, but upon closer examination, this has typically been caused by contamination of the catalogue of somatic mutations with germline SNPs. An exception are melanoma tumours, which show a bias towards slight underestimation of dN/dS due to the signature of ultraviolet-induced mutations extending beyond the trinucleotide model (Martincorena et al., 2017). In my personal experience, datasets of somatic mutations with global dN/dS<<1 have always reflected a problem of SNP contamination or an inadequate substitution model, and so the evaluation of global dN/dS values can help identify problems in certain datasets.

Other useful outputs

The dndscv function also outputs other results that can be of interest, such as an annotated table of coding mutations (annotmuts), MLEs of mutation rate parameters (mle_submodel), lists of samples and mutations excluded from the analysis and a table with the observed and expected number of mutations per gene (genemuts), among others.

head(dndsout$annotmuts)

dNdScv relies on a negative binomial regression model across genes to refine the estimated background mutation rate for a gene. This assumes that the variation of the mutation rate across genes that remains unexplained by covariates or by the sequence composition of the gene can be modelled as a Gamma distribution. This model typically works well on clean cancer genomic datasets, but not all datasets may be suitable for this model. In particular, very low estimates of $\theta$ (the overdispersion parameter), particularly $\theta<1$, may reflect problems with the suitability of the dNdScv model for the dataset.

print(dndsout$nbreg$theta)

dNdSloc: local neutrality test

An additional set of neutrality tests per gene are performed using a more traditional dN/dS model in which the local mutation rate for a gene is estimated exclusively from the synonymous mutations observed in the gene (dNdSloc) (Wong, Martincorena, et al., 2014). This test is typically only powered in very large datasets. For example, in the dataset used in this example, comprising of 196 simulated breast cancer exomes, this model only detects ARID1A as significantly mutated.

signif_genes_localmodel = as.vector(dndsout$sel_loc$gene_name[dndsout$sel_loc$qall_loc<0.1])
print(signif_genes_localmodel)

Driver discovery in targeted sequencing data

The dndscv function can take a list of genes as an input to restrict the analysis of selection. This is strictly required when analysing targeted sequencing data, and might also be used to obtain global dN/dS ratios for a particular group of genes.

To exemplify the use of the dndscv function on targeted data, we can use another example dataset provided with the dNdScv package:

library("dndscv")
data("dataset_normalskin", package="dndscv")
data("dataset_normalskin_genes", package="dndscv")
dndsskin = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf)

This dataset comprises of 3,408 unique somatic mutations detected by ultra-deep (~500x) targeted sequencing of 74 cancer genes in 234 small biopsies of normal human skin (epidermis) from four healthy individuals. Note that all of the mutations listed in the input table are genuinely independent events and so, again, we can safely ignore the two warnings issued by dndscv. For more details on this study see:

Martincorena I, et al. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6. doi: 10.1126/science.aaa6806.

In the paper above, we described a strong evidence of positive selection on somatic mutations occurring in normal human skin throughout life. These mutations are detected as microscopic clones of mutant cells in normal skin. The dNdScv analysis below recapitulates some of the key analyses of selection in this study:

sel_cv = dndsskin$sel_cv
print(sel_cv[sel_cv$qglobal_cv<0.1,c(1:10,19)], digits = 3)
print(dndsskin$globaldnds, digits = 3)

Using different substitution models

Classic maximum-likelihood implementations of dN/dS use a simple substitution model with a single rate parameter. Mutations are classified as either transitions (C<>T, G<>A) or transversions, and the single rate parameter is a transition/transversion (ts/tv) ratio reflecting the relative frequency of both classes of substitutions (Goldman & Yang, 1994). The dndscv function can take a different substitution model as input. The user can choose from existing substitution models provided in the data directory as part of the package or input a different substitution model as a matrix:

library("dndscv")
# 192 rates (used as default)
data("submod_192r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)
# 12 rates (no context-dependence)
data("submod_12r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)
# 2 rates (classic ts/tv model)
data("submod_2r_3w", package="dndscv")
colnames(substmodel) = c("syn","mis","non","spl")
head(substmodel)

We can fit a traditional ts/tv model to the skin dataset using the code below:

library("dndscv")
data("dataset_normalskin", package="dndscv")
data("dataset_normalskin_genes", package="dndscv")
dndsskin_2r = dndscv(m, gene_list=target_genes, max_muts_per_gene_per_sample = Inf, max_coding_muts_per_sample = Inf, sm = "2r_3w")
print(dndsskin_2r$mle_submodel)
sel_cv = dndsskin_2r$sel_cv
print(head(sel_cv[sel_cv$qglobal_cv<0.1, c(1:10,19)]), digits = 3)

In general, the full trinucleotide model is recommended for cancer genomic datasets as it typically provides the least biased dN/dS estimates. The impact of using simplistic mutation models can be considerable on global dN/dS ratios (see Martincorena et al., 2017), and can lead to false signals of negative or positive selection. In general, the impact of simple substitution models on gene-level inferences of selection tends to be smaller. AIC model selection can be easily used:

AIC(dndsskin$poissmodel)
AIC(dndsskin_2r$poissmodel)

References

Martincorena I, et al. (2017) Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. 171(5):1029-1041.
Goldman N, Yang Z. (1994). A codon-based model of nucleotide substitution for protein-coding DNA sequences. Molecular biology and evolution. 11:725-736.
Greenman C, et al. (2006) Statistical analysis of pathogenicity of somatic mutations in cancer. Genetics. 173(4):2187-98.
Wong CC, Martincorena I, et al. (2014) Inactivating CUX1 mutations promote tumorigenesis. Nature Genetics. 46(1):33-8.
Martincorena I, et al. (2015) High burden and pervasive positive selection of somatic mutations in normal human skin. Science. 348(6237):880-6.