Proof of concept: filtering pipeline (Clustering quality)
In QuantumClone: Clustering Mutations using High Throughput Sequencing (HTS) Data

N.B.: This simulation illustrates the paper pipeline in the case where biologically meaningful variants that are difficult to cluster are numerous compared with the variants passing stringent filters.

Creating data

This script shows the importance of first clustering high-confidence variants and then attributing the biologically meaningful ones to the identified clusters.

# Loading libraries: install any missing package, then attach them all
if(!requireNamespace("QuantumClone", quietly = TRUE)){
  if(!requireNamespace("devtools", quietly = TRUE)){
    install.packages("devtools")
  }
  devtools::install_github(repo = "DeveauP/QuantumClone")
}
for(pkg in c("knitr", "ggplot2", "reshape2")){
  if(!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
library(QuantumClone)
library(knitr)
library(ggplot2)
library(reshape2)

# Creating reproducible example
source("reproduce.R")

set.seed(123)

All functions are stored in reproduce.R to keep the displayed code short. The values below are used throughout the test.

number_iterations <- 20
number_mutations <- 150
ndrivers <- 100

We first create a test set with QuantumCat: 6 clones, 150 variants (number_mutations), diploid loci, an average depth of 100X, and two samples with respective purities of 70% and 60% (contamination 0.3 and 0.4). We make sure these variants pass the stringent filters (i.e. depth $\geq 50$X).

toy.data<-QuantumCat_stringent(number_of_clones = 6,number_of_mutations = number_mutations,
                     ploidy = "AB",depth = 100,
                     contamination = c(0.3,0.4),min_depth = 50)

We check that all these variants pass the stringent filters (i.e. depth $\geq 50$X in both samples, so the count below is expected to be 0), and display the first six rows of the first sample:

sum(toy.data[[1]]$Depth<50 | toy.data[[2]]$Depth<50)
kable(toy.data[[1]][1:6,])
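
The same depth check can be written for any number of samples. The following is a minimal sketch on hand-made data, assuming only that each sample is a data frame with a Depth column; the `samples` list and its values are hypothetical and are not produced by QuantumCat.

```r
# Hypothetical two-sample list; only the Depth column matters here.
samples <- list(
  data.frame(id = 1:4, Depth = c(120, 75, 50, 98)),
  data.frame(id = 1:4, Depth = c(101, 49, 66, 87))
)
min_depth <- 50

# A variant fails the stringent filter if its depth is below the
# threshold in at least one sample.
fails <- Reduce(`|`, lapply(samples, function(s) s$Depth < min_depth))
n_fail <- sum(fails)
which(fails)
```

Using Reduce over the list avoids hard-coding two samples, which is convenient if more samples are simulated.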

Then we create 150 (number_mutations) mutations that pass only the permissive filters: round(number_mutations/4) mutations with a depth between 30X and 50X, round(number_mutations/2) mutations with a depth $\geq 30$X in triploid (AAB) loci, and round(number_mutations/4) mutations with a depth $\geq 30$X in tetraploid (AABB) loci.

permissive<-QuantumCat_permissive(fromQuantumCat = toy.data ,number_of_mutations = number_mutations,
                               ploidy = "AB",depth = 100,
                               contamination = c(0.3,0.4),max_depth = 50, min_depth = 30)
kable(permissive[[1]][1:6,])
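
The sizes of the three permissive groups described above can be checked with a few lines of arithmetic. This is only a sketch of the split as stated in the text; the group names are hypothetical labels, and how QuantumCat_permissive reconciles rounding internally is not shown here. Because each group size is rounded separately, the total may differ slightly from number_mutations.

```r
number_mutations <- 150

# Group sizes as described in the text (labels are illustrative only)
parts <- c(
  low_depth_AB    = round(number_mutations / 4),
  triploid_AAB    = round(number_mutations / 2),
  tetraploid_AABB = round(number_mutations / 4)
)
parts

# Compare the total with number_mutations
sum(parts) - number_mutations
```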

We now select 100 (ndrivers) drivers without replacement, using sampling weights that give each draw an initial probability of $19/20$ of falling in the permissive filters.

# Ids 1..number_mutations are filtered variants, the rest are permissive
drivers_id<-sample(1:(2*number_mutations),size = ndrivers,
                   prob = rep(c(1/(20*number_mutations),
                                19/(20*number_mutations)),
                              each = number_mutations))
drivers_id<-sort(drivers_id)
drivers_id
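
To see what the sampling weights in the call above imply, the following sketch rebuilds them on their own and sums the weight carried by the permissive half of the ids. It assumes, as the text suggests, that ids above number_mutations refer to permissive variants; the variable names `weights`, `permissive_weight`, `ids` and `n_permissive` are introduced here for illustration.

```r
number_mutations <- 150
ndrivers <- 100

# Same weights as in the sample() call: 1/(20 n) per filtered variant,
# 19/(20 n) per permissive variant.
weights <- rep(c(1 / (20 * number_mutations), 19 / (20 * number_mutations)),
               each = number_mutations)

# Total weight of the permissive half (ids number_mutations + 1 to 2 * number_mutations)
permissive_weight <- sum(weights[(number_mutations + 1):(2 * number_mutations)])
permissive_weight

# Sampling is without replacement (the default of sample()), so the
# realised share of permissive drivers can be lower than the first-draw
# probability.
set.seed(123)
ids <- sort(sample(seq_len(2 * number_mutations), size = ndrivers, prob = weights))
n_permissive <- sum(ids > number_mutations)
n_permissive / ndrivers
```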

We now cluster mutations using only the filtered mutations (paper pipeline), the filtered mutations plus the drivers (extended), or all mutations together (All), and compare the clustering quality of these methods.

ext<-extended(filtered = toy.data,
              permissive = permissive,
              drivers_id = drivers_id)

all<-All(filtered = toy.data,
         permissive = permissive,
         drivers_id = drivers_id
)

pap<-paper_pipeline(filtered = toy.data,
                    permissive = permissive,
                    drivers_id = drivers_id)

We now compare the quality of clustering using the Normalized Mutual Information (NMI), the number of clusters found (the truth being 6), and the maximal and average error in the distance between a driver's inferred position and its real position. N.B.:

NMI is computed on the 150 filtered variants (number_mutations) plus the drivers from the permissive filters;
mean.error is the average distance between the real cellularity and the inferred position for approximately number_mutations + ndrivers (250) variants;
mean.driv.error is the average distance between the real cellularity and the inferred position for drivers only.
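
The quantities listed above can be sketched in base R. This is not the code of compare_qual (which lives in reproduce.R and is not shown); it assumes that NMI is normalised by the arithmetic mean of the two entropies and that the distance is Euclidean across samples. Both choices, and the helper names `entropy`, `nmi` and `position_error`, are assumptions made for illustration.

```r
# Shannon entropy of a probability vector (zero cells are ignored)
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# NMI between true and inferred cluster labels, assuming the
# 2 * I(X;Y) / (H(X) + H(Y)) normalisation
nmi <- function(truth, found) {
  joint <- table(truth, found) / length(truth)
  h_truth <- entropy(rowSums(joint))
  h_found <- entropy(colSums(joint))
  if (h_truth + h_found == 0) return(1)  # both labellings have one cluster
  mutual_info <- h_truth + h_found - entropy(as.vector(joint))
  2 * mutual_info / (h_truth + h_found)
}

# Per-variant distance between true and inferred cellularities
# (rows = variants, columns = samples), assuming Euclidean distance
position_error <- function(true_cell, found_cell) {
  sqrt(rowSums((true_cell - found_cell)^2))
}

# Hypothetical example: two variants, two samples
true_cell  <- matrix(c(0.5, 0.5, 1.0, 1.0), ncol = 2, byrow = TRUE)
found_cell <- matrix(c(0.5, 0.5, 0.7, 0.6), ncol = 2, byrow = TRUE)
errors <- position_error(true_cell, found_cell)
mean(errors)     # analogue of mean.error
mean(errors[2])  # analogue of mean.driv.error if variant 2 were the only driver
```

NMI is invariant to cluster relabelling, which is why it suits comparisons where the inferred cluster numbers need not match the simulated ones.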

Quality<-compare_qual(paper = pap,
                      extended = ext,
                      all = all,
                      drivers_id = drivers_id)

kable(Quality)

We now repeat this test 19 more times (number_iterations - 1).

Quality<-rbind(Quality,
               reproduce(number_iterations-1,
                         number_mutations,
                         ndrivers)
               )

We can plot these results:

Melt<-melt(Quality,id.vars = "Pipeline")
Melt$value<-as.numeric(as.character(Melt$value))

ggplot(Melt,aes(x = Pipeline,y = value))+ # one boxplot per pipeline
  geom_boxplot()+
  facet_wrap( ~ variable,nrow = floor(sqrt(length(unique(Melt$variable))))+1,scales = "free_y")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90))
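
A numeric summary can complement the boxplots. The sketch below computes the median of each metric per pipeline with base R's aggregate, on a hypothetical long-format data frame with the same columns as Melt (Pipeline, variable, value). The numbers are made up for illustration and are not results of the simulation.

```r
# Hypothetical long-format results; values are illustrative only
long <- data.frame(
  Pipeline = rep(c("paper", "extended", "All"), each = 4),
  variable = rep(c("NMI", "NMI", "mean.error", "mean.error"), times = 3),
  value = c(0.90, 0.80, 0.05, 0.07,
            0.70, 0.80, 0.10, 0.12,
            0.60, 0.50, 0.20, 0.10)
)

# Median of each metric for each pipeline
medians <- aggregate(value ~ Pipeline + variable, data = long, FUN = median)
medians
```

Applied to the real Melt object, the same call would give one row per pipeline and metric.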