Warning: this function is in testing. Users are advised to interpret the results with caution.
The importance of recurrently mutated hotspots is widely appreciated in cancer. This tutorial shows how to apply the new sitednds function provided in the latest version of the dNdScv package to estimate dN/dS ratios at single-site level. Sitewise dN/dS estimation has a rich history in comparative genomics (e.g. Massingham and Goldman, 2005) but it has only been used in cancer studies occasionally (e.g. Martincorena et al., 2015). Yet, studying the relative strength of selection at single sites can be valuable, as emphasised by a recent study (Cannataro et al., 2018).
The new sitednds function allows the user to compute maximum-likelihood dN/dS estimates for recurrently mutated sites, as well as p-values against neutrality. Sitewise dN/dS ratios reflect the ratio between the number of observed mutations and the number expected under neutrality, while controlling for trinucleotide rates and for variable mutation rates across genes. In sparse datasets, point estimates for lowly-recurrent sites are likely to be underestimated, but p- and q-values provide a measure of their significance.
An important aspect is that mutation rates can vary considerably across sites, even after correcting for these known mutational biases. sitednds models the observed mutation counts across synonymous sites as following a negative binomial distribution. This effectively controls for Poisson noise in the mutation counts per site and fits a Gamma distribution to the unexplained variation in mutation rate across sites. P-values for site recurrence are calculated using the fitted negative binomial distribution. These p-values should be more conservative and reliable than only considering Poisson variation or non-parametric bootstrapping, but they still rely on the assumption than the Gamma distribution appropriately captures the unexplained variation across sites.
A major limitation is the fact that mapping artefacts and SNP contamination are common problems in cancer genomic datasets, and these tend to lead to recurrent false positive mutation calls. In noisy datasets, the results of sitednds can be dominated by artefacts. Users trying sitednds should be very critical of the results. In the context of cancer genomic studies, a considerable number of synonymous recurrently mutated sites among the significant hits in sitednds most certainly indicates a problem with the variant calling. This is exemplified in this tutorial analysing two real datasets.
As a small example, in this tutorial we will use public somatic mutation calls from bladder cancers from TCGA. To reduce the risk of false positives and increase the signal to noise ratio, this example will only consider mutations in Cancer Gene Census genes (v81).
library("dndscv") data("dataset_tcgablca", package="dndscv") # Loading the bladder cancer data data("cancergenes_cgc81", package="dndscv") # Loading the genes in the Cancer Gene Census (v81) dndsout = dndscv(mutations, outmats=T, gene_list=known_cancergenes)
The sitednds function takes the output of dndscv as input. In order for the dndsout object to be compatible with sitednds, users must use the "outmats=T" argument in dndscv. After running dndscv, we can evaluate the results at the gene level as explained in the main tutorial of dndscv:
sel_cv = dndsout$sel_cv print(head(sel_cv, 10), digits = 3) # Printing the top 10 genes by q-value
The table above reveals a problem with this dataset. The gene MLLT3 appears as significant in dndscv (i.e. it violates the neutral null model of dN/dS=1), but due to a very large excess of synonymous mutations (notice the high number of synonymous mutations and the very low dN/dS values). We can further confirm that the low dN/dS value in this gene is due to an excess of synonymous mutations and not genuine negative selection by comparing the observed number of synonymous mutations in the gene (43) and the expected number (exp_syn and exp_syn_cv columns below):
print(dndsout$genemuts[dndsout$genemuts$gene_name=="MLLT3",])
Thus, MLLT3 is a false positive, most likely due to recurrent artefacts or SNP contamination in the gene. A careful examination of all statistically significant genes in the dataset reveals other likely false positives. As we will see below, this will also affect the sitewise dN/dS analysis.
To run the sitewise dN/dS model on this dataset, we only need to input the dndsout object into the sitednds function. By default, sitednds will calculate sitewise dN/dS ratios and p-values for sites mutated at least two times (use the argument min_recurr to control this). While p-values are only provided for recurrently mutated sites, false discovery adjustment corrects for all possible changes.
hotspots = sitednds(dndsout) # Running sitewise dN/dS print(hotspots$theta) # Overdispersion (unexplained variation of the mutation rate across sites)
You can see that the maximum-likelihood estimate of theta is very low. This reflects considerable variation in the mutation rate across sites, not explained by the trinucleotide context or by the estimated relative mutation rate of the gene. sitednds takes this into account when calculating p-values. If there is large uncertainty in the estimation of theta, users can choose to use the lower bound estimate of theta instead of the maximum-likelihood estimate, when calculating p-values (use the argument theta_option="conservative" in sitednds).
The main output of sitednds is a table with all hotspots studied, including their position, the gene affected, the aminoacid change induced, the number of times that the mutation was observed, the expected number of mutations at this site by chance under neutrality (mu) and the dN/dS ratio. The table also contains p-values and q-values for the probability of observing that many mutations at the site by chance. Again, please treat these p-values with caution.
print(head(hotspots$recursites,10)) # First 10 lines of the table of hotspots
We can choose a significance cutoff (e.g. q-value<0.05) to list the significant hotspots in the dataset:
signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ] print(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], digits=5)
Careful examination of the significant hotspots reveals many well-known cancer-driver hotspots, including in FGFR3 (e.g. S249, Y375, G372), TP53 (e.g. R248), PIK3CA (e.g. E542, H1047), HRAS (Q61), KRAS (G12), ERBB2 (S310), ERBB3 (V104), etc. Note that the exact aminoacid position affected depends on the exact protein isoform used for annotation (see Ensembl protein IDs in dndsout$annotmuts).
However, the table of significant hotspots also contains a considerable number of likely false positives, including multiple synonymous sites in MLLT3. A proper analysis of these data would require careful reevaluation and improvement of the mutation calls, before repeating this analysis. Significant improvements to somatic mutation calls against recurrent artefacts can be achieved by using an unmatched normal panel and by more stringent filtering of germline SNP contamination.
The TCGA mutation calls used in this example are an old version and it is likely that more recent versions are much less affected by artefacts. However, I decided to use this dataset as an example to highlight the importance of critically examining the results and the impact of recurrent artefacts on driver discovery at gene and site level.
As a final note, users with whole-genome or whole-exome data can run sitednds on all genes. However, given the frequent presence of recurrent artefacts and the sparsity of cancer datasets, the signal-to-noise ratio can be considerably increased by running sitednds on a list of known cancer genes. To do so, I recommend running dndscv on all genes and then running sitednds on a list of genes of interest using the optional gene_list argument in sitednds. Running dndscv on all genes ensures that mutations from all genes are used to estimate the trinucleotide mutation rates, typically increasing their accuracy.
In a recent study, we sequenced 844 small biopsies of normal oesophageal epithelium from 9 transplant donors to study the extent of mutation and selection in a normal tissue (Martincorena et al., 2018). In this part of the tutorial, we will reanalyse this dataset using dndscv and sitednds. We first run dndscv using the settings from the analysis of normal skin data described in the main dNdScv tutorial.
library("dndscv") data("dataset_normaloesophagus", package="dndscv") # Loading the mutations in normal oesophagus mutations = unique(mutations) # Removing duplicate mutations (more conservative) data("dataset_normalskin_genes", package="dndscv") dndsout = dndscv(mutations, outmats=T, gene_list=target_genes, max_coding_muts_per_sample=Inf, max_muts_per_gene_per_sample=Inf) # outmats=T is required to run sitednds
We can see the list of genes under positive selection and the global dN/dS values using:
sel_cv = dndsout$sel_cv print(sel_cv[sel_cv$qglobal_cv<0.05, c(1:6,19)], digits = 3) print(dndsout$globaldnds, digits = 3)
To apply the sitednds model, we simply use the following code. Only the top 30 hotspots by q-value are shown, but a total of 133 sites are identified as significant with 5% FDR.
hotspots = sitednds(dndsout) # Running sitewise dN/dS signifsites = hotspots$recursites[hotspots$recursites$qval<0.05, ] head(signifsites[,c("gene","aachange","impact","freq","dnds","qval")], 30)
Remarkably, owing to the very large number of mutant clones identified in this study, this analysis finds a large number of statistically-significant sites (n=81 for qval<0.05). Reassuringly, they are all in genes detected under positive selection in the original publication (Martincorena et al, 2018). This comprises 61 sites in NOTCH1, 17 sites in TP53, the well-known PIK3CA hotspot H1047R and one site in FAT1 and TP63.
This analysis also identifies a known driver hotspot in a synonymous site of TP53 (T125T), which is known to affect splicing of TP53. Intriguingly, it also identifies a synonymous site in NOTCH1 (V717V), which deserves careful follow-up analysis. Apart from these two synonymous sites, all other 79 significant hotspots are non-synonymous.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.