#' Clonal deconvolution
#'
#' Clone / Sub-clone decomposition of DNA sequencing data. This is recommended to be used for more than one sample preferably collected from
#' the same individual at different times. If the sample qualities vary, it is recommended to perform scaling first with \code{\link{seqn.scale}}.
#'
#' @param data A \code{dataframe} containing summary from DNA sequencing. It must include a column of sample IDs
#' and a corresponding column with the variant allele frequencies.
#' @param sample \code{Integer or character} of the column name or column number of the sample IDs.
#' @param vaf \code{Integer or character} of the column name or column number of the variant allele frequency.
#' @param allele.comp \code{Character} string for allelic composition of the variants. example: '1+1' or '2+3' etc.
#' @param n.clone Optional \code{integer} for number of suspected clones, default NULL.
#' @param n.subclone Optional \code{interger} for number of suspected subclones, default NULL.
#' @param optimization.method Method to find optimal number of clusters; \emph{GMM} or \emph{bootstrap}. Default is \emph{GMM}.
#' @param clustering.method Clustering methods; \emph{HKM}, \emph{bootkm} or \emph{hybrid}. Default is \emph{hkm}.
#' @param clonality Method for determining clonality of the predicted clusters; \emph{Allelic composition} (default) or \emph{density}
#' @param instruct \code{Character} input for accepting program suggestion.
#'
#' @details \code{cluster.doc} is meant to do two things, first determine the optimum number of clusters that \emph{should} be fitted and
#' second, to infer what groups the clusters thus obtained should be assigned to.
#' @details The data inputs interactively requested from the user help obtain the following information
#' @details \code{chromosomal segmentation} helps in determining the number of clone/sub-clone cloud to be expected in the data. As
#' variant alleles from different aberrant chromosomes may have similar relative frequencies but discordant clonal
#' interpretation. On the contrary convergent clonal alleles may demonstrate divergent frequencies if arisen from dissimilar aneuploidy.
#' @details \code{clouds} give the program a visual feedback from the user that assume to carry some biological interpretation of the
#' frequency distributions present in the data. This is a subjective estimate that the program later uses for cluster assignment.
#' @details Out of the two methods used for cluster optimization, \emph{GMM} stands for \emph{Gaussian Mixed Models} whereas \emph{bootstrap},
#' as the name suggests perform \emph{bootstrap} resampling of the VAFs in 50 repetitions with 20 runs each to find the most stable parameter
#' for clustering. \emph{GMM} outputs the optimization curve with \code{BIC} and \emph{AIC} against number of clusters chosen in the
#' \code{X-axis} where \emph{bootstrap} shows the \code{Smin} statistics instead in the \code{Y-axis}. Where as \code{gap}
#' calculates the gap statistics for each clustering. In all cases the statistics are to be interpreted as proxies for the \emph{entropy} of the
#' system. The maximum entropy is likely to indicate the most stable solution.
#' @details \code{clustering.method} gives the user three choices:
#' @details \code{HKM} is \emph{Heierarchical K-means clustering} which uses heierarchical clustering first to determine the cluster centers
#' that are subsequently used as the starting point for the K-means clustering.
#' \code{bootkm} performs a \emph{bootstrap} resampling of 20 fitted K-means clusters with 50 resamplings to out put the clusters.
#' \code{hybrid} performs \emph{hkm} on the principal component of the data.
#' @details \code{clonality} provides two choices for clonality assignment. The default is \emph{Allelic composition} that measures expected
#' clonality patterns according to the copy numbers. But in cases of unreliable allelic composition estimates this method may fail. In such
#' situations the clonality can be assigned without apriori assumptions with the alternate \emph{density} based method.
#'
#' @return A list of 12 objects is returned that includes all the summary statistics, diagnositics and the predictions as well as the mapping
#' internally used for clonal deconvolution.
#' @return \code{predicted.data} is necessarily an extension to the input \code{data} with the addition of the predicted clone and sub-clone
#' status of each variant for corresponding samples.
#' @return \code{density.map} is a distance matrix convoluted from cluster distances and desity departures.
#' @return \code{collapse} are clusters that are initially prredicted but later collapsed on each other dues to similarity between them.
#' @return \code{fitted.hkm, fitted bootkm or fitted.hybrid} is a vector of initial cluster assignment by the algorithm chosen.
#' Only one of these will have an output and the rest will show \code{NA}.
#' @return \code{Number of unscaled clusters} gives umber of predicted clusters before collapsing with density estimates.
#' @return \code{Number of scaled clusters} gives number of predicted clusters after collapsing (if any).
#' @return \code{cluster.diagnostics} if the optimization method was chosen to be \emph{GMM}, this is an object of \code{S3} class
#' that includes clustering diagnostics from the model-based clustering. If the chosen method was \emph{bootstrap} then this is a list.
#' @return \code{cluster centers} are the centroids of the predicted scaled clusters.
#' @return \code{cluster mapping} provides the map between scaled clusters and the clonal deconvolution assignments
#' @return \code{Dunn index} is the Dunn index for the fitted cluster.
#'
#' @seealso \code{\link{seqn.scale}} \code{\link{cluster.doubt}}
#'
#' @examples
#' \donttest{#cluster.doc(test.dat, 1, 2, optimization.method = 'GMM', clustering.method = 'HKM')}
#' @export
cluster.doc <- function (data = NULL, sample = NULL, vaf = NULL, allele.comp = NULL, n.clone = NULL, n.subclone = NULL, optimization.method = "GMM",
clustering.method = "HKM", clonality = "Allelic composition", instruct = TRUE){
if (missing(sample) | missing(vaf)) stop("Missing entries for sample or vaf")
if (typeof (sample) != typeof(vaf)) stop("input for sample and vaf should be similar object")
if (typeof(sample) == "character")
{x<-as_tibble(data[,c(sample,vaf)])} else {
x<-as_tibble(subset(data,select=c(sample,vaf)))}
user_col = colnames(x)
colnames(x)<-c("sample","vaf")
vaf.name<-colnames(data)[vaf]
if(missing(allele.comp) | (missing(n.clone) | missing(n.subclone))){
print("Insufficient entries provided!")
i.1 <- Init.input(x, al.comp = allele.comp)
} else {
h<-allele.comp
h1<-unlist(strsplit(h,"+")[1])[1]
h2<-unlist(strsplit(h,"+")[1])[3]
h1 <- as.numeric(h1)
h2 <- as.numeric(h2)
i.1 <- list(h1=h1, h2=h2, r=as.numeric(n.clone), m=as.numeric(n.subclone))
g<-dot_plot(d = x)
g<-g + theme(legend.position="none") + scale_color_brewer(palette="Pastel1")
print(g)
}
tmp.0<-i.1$r+i.1$m
######################################################################################################
################################# Number of cluster optimization #####################################
######################################################################################################
if (optimization.method == 'GMM') {
print("Optimizing number of clusters with GMM")
d_clust <- Mclust(as.matrix(x[,2]), G=2:10)##Deterministic clustering with GMM based density estimate
m.best <- dim(d_clust$z)[2] ##Optimum number of clusters
cat("Predicted optimal number of clusters:", m.best, "\n")
print("Plotting Information curve")
suppressWarnings(print(plot.bic(d_clust$BIC)))
}
else if (optimization.method == 'bootstrap') {
print("Optimizing number of clusters with Bootstrapping")
cat("Bootstrapping is like a box of chocolates. You never know what you're gonna get \n")
boot.k<-k.select(x[,2], range = 2:10, B = 50, r = 20, threshold = 0.75, scheme_2 = TRUE)
m.best <- boot.k$k
cat("Predicted optimal number of clusters:", m.best, "\n")
print("Plotting Smin curve")
suppressWarnings(print(plot.bic(boot.k)))
}
else if (optimization.method == 'gap') {
print("Optimizing number of clusters with Gap statistics")
cat("Bootstrapping is like a box of chocolates. You never know what you're gonna get \n")
gskmn <- clusGap(cbind(x[,2],x[,2]), FUN = kmeans, nstart = 20, K.max = 10, B = 50)
m.best <- which(gskmn[["Tab"]][,3]==max(gskmn[["Tab"]][,3]))
cat("Predicted optimal number of clusters:", m.best, "\n")
print("Plotting Gap Statistics")
suppressWarnings(print(plot(gskmn, main = "Gap statistics")))
}
else stop("Is that a real optimization method ? .... I felt a great disturbance in the source")
######################################################################################################
############################ Getting User input about the suggested prediction #######################
######################################################################################################
if (instruct == TRUE){
if (m.best == tmp.0) {
print("Your guess was correct! Proceeding with analysis.")
clust.num = m.best
} else {
h <- (readline("Would you like to see my suggestion instead? (Y/N): \n"))
if (h == "y" | h == "Y") {
clust.num = m.best
if((m.best %% 2) == 0) {
q = m.best/2
cat(paste0("\n I suggest ", q, " clone(s) and ", q, " subclone(s) \n"))
} else {
q = m.best-i.1$t.0
cat(paste0("\n I suggest ", i.1$t.0, " clone(s) and ", q, " subclone(s) \n"))
}
} else if (h == "n" | h == "N"){
cat("Suspected and predicted clone/subclone segregation mismatch!. \n")
cat("Proceeding with analysis. \n")
clust.num = tmp.0
} else {stop('Fool of a took! You shall not pass!')}
}
} else if (instruct == FALSE){
cat("Suspected and predicted clone/subclone segregation mismatch!. \n
Proceeding with analysis. \n")
clust.num = tmp.0
} else {stop('Unrecognized input provided!')}
if (clust.num > 5) {cat("WARNING: More than 5 clusters indicated; may result in redundancy \n")}
######################################################################################################
#################################### Prediction of clusters ##########################################
######################################################################################################
if (clustering.method == 'HKM'){ ##Predicting Clone Sub-clone status with HKM
print("Performing Heierarchical K-means clustering")
hkm<-hkmeans(x[,2],clust.num)
pred.clust<-hkm$cluster
} else if (clustering.method == 'bootkm'){ ##Predicting Clone Sub-clone status with bootcluster
print("Performing Bootstrapped K-means clustering")
bootkm<-stability(x[,2], clust.num, B=50, r=20, scheme_2 = TRUE)
pred.clust<-as.vector(bootkm$membership)
} else if (clustering.method == 'hybrid'){ ##Predicting Clone Sub-clone status with HCPC
print("Performing hybrid clustering")
hybrid<-suppressWarnings(HCPC(x[,2], iter.max=50, nb.clust = clust.num, graph=F))
pred.clust<-as.numeric(hybrid$data.clust$clust)
} else stop("That clustering method... What we've got here is failure to communicate.")
######################################################################################################
############################## Density based dimension adjustment ####################################
######################################################################################################
tmp<-x
tmp$clust<-as.numeric(pred.clust)
tmp.col<-ncol(tmp)
for (i in 1 :9) { tmp<-cbind(tmp,as.numeric(dbscan(dist(tmp$vaf),0.1+0.02*i)$cluster)) }
tmp$net<-rowSums(tmp[,(tmp.col+1) : (tmp.col+9)])
tmp<-tmp[,c(1,2,3,ncol(tmp))]
tmp1<-as.matrix(table(tmp$clust,tmp$net))
for(i in 1 : nrow(tmp1)) {
l<-sum(as.numeric(tmp1[i,]))
tmp1[i,]<-round(as.numeric(tmp1[i,])/l,2)
}
d<-round(dist(tmp1),2)
a<-vector()
for(i in 1 : nrow(tmp1)) {a<-c(a,mean(subset(tmp,tmp$clust==i)$vaf))}
mean.d<-dist(a)
net.d<-as.matrix(round(d * mean.d,3)) ##Density based scaling of cluster assignments
diag(net.d)<-1
similar.d<-sort(as.numeric(rownames(which(net.d< 0.05, arr.ind = T))))
if (is_empty(similar.d) == FALSE){
if(length(similar.d)>2) {
stop("More than one clusters are showing redundancy. I recommend lowering number of clones/sub-clones")}
else{
print("Reducing cluster dimension based on density map")
pred.clust<-replace(pred.clust,pred.clust==similar.d[2],similar.d[1])
}
}
tryCatch({
if(dim(table(pred.clust))<2) {
stop("The clusters merged to form only one cloud hence all are predicted clonal")
}
df<-scale(x[,2]) ##Dunn index for suggested clustering
clust.stats <- suppressWarnings(cluster.stats(d = dist(df), pred.clust))
Dunn.index <- as.numeric(clust.stats$dunn)
######################################################################################################
#################################### Clonality assignment ############################################
######################################################################################################
if (clonality == 'density'){
print("Performing density based clonal deconvolution")
if (is_empty(similar.d) == FALSE){
clust.d<-net.d[-similar.d[2],-similar.d[2]]
a<-a[-similar.d[2]]
} else {
clust.d<-net.d
}
if (length(a) < 2) {warning("All predicted clusters merged to form only one cloud.
Consider removing outliers or increase number of clonal / sub-clonal clouds")}
if(min(dist(a))<0.05){
warning("CAUTION: centroids of clonal and sub-clonal clusters are too close to comfort (+/- 0.05)!")
}
if(min(clust.stats$separation)<0.01){
warning("CAUTION: minimum separation between points of two different clusters is less than 0.01!")
}
clust.a<-as.data.frame(a)
clust.a$mcgfn<-1:nrow(clust.a)
rownames(clust.a)<-row.names(clust.d)
n.cl<-n.sbcl<-0
sim.pred<-vector()
cluster.map<-vector()
for (i in c(1,3,5,7,9)){
if(nrow(clust.d) > 1){
next.similar.1<-as.numeric(rownames(which(clust.d == min(clust.d), arr.ind = T)))
sim.pred[i]<-ifelse(clust.a[which(rownames(clust.a) == next.similar.1[2]),1] >
clust.a[which(rownames(clust.a) == next.similar.1[1]),1],'Sub-clone','Clone')
sim.pred[i+1]<-ifelse(clust.a[which(rownames(clust.a) == next.similar.1[2]),1] <
clust.a[which(rownames(clust.a) == next.similar.1[1]),1],'Sub-clone','Clone')
cluster.map<-c(cluster.map,next.similar.1)
n.cl<-n.cl+1
n.sbcl<-n.sbcl+1
clust.d<-as.matrix(clust.d[-which(rownames(clust.a) %in% next.similar.1), -which(rownames(clust.a) %in% next.similar.1)])
clust.a<-clust.a[!rownames(clust.a) %in% next.similar.1,]
}
if(nrow(clust.d) < 2) break
}
if (nrow(clust.d) == 1) {
if (i.1$r < i.1$m) {
sim.pred <- c(sim.pred,'Sub-clone')
} else {
sim.pred <- c(sim.pred,'Clone')
}
cluster.map <- c(cluster.map, as.numeric(rownames(clust.a)))
}
if(i.1$m == 0 | i.1$r == 0) {
if (i.1$m == 0) {
sim.pred<-rep("Clone",length(sim.pred))
cat("Fly you fools! User Suggests there's only Clones \n")
}
if (i.1$r == 0) {
sim.pred<-rep("Sub-clone",length(sim.pred))
cat("Fly you fools! User Suggests there's only Sub-clones \n")
}
}
pred.map<-as.data.frame(cluster.map)
pred.map$pred<-sim.pred
x$cluster.map<-pred.clust
x<-inner_join(x,pred.map,by='cluster.map')
x<-x[,c('sample','vaf','pred')]
} else {
print("Allelic composition based clonal deconvolution")
x$cluster.map<-pred.clust
mean.vaf <- x %>% group_by(cluster.map) %>% summarize(mean.vaf = round(mean(vaf),3))
map<-clonality(h1 = i.1$h1, h2 = i.1$h2, center=mean.vaf$mean.vaf)
map$cluster.map <- mean.vaf[order(mean.vaf[,2], decreasing = TRUE),]$cluster.map
pred.map<-map[,c(4,3)]
x<-inner_join(x,pred.map,by='cluster.map')
x<-x[,c('sample','vaf','clonality')]
colnames(x)<-c('sample','vaf','pred')
}
clone.perc<-table(x$pred)[[1]]*100/nrow(x)
if ( length(table(x$pred)) > 1 & (clone.perc < 5 | clone.perc > 95)) {
warning("Too unstable clustering as less than 5% variants are represented in a clone/subclone.
Recommended: removal of outliers or increase the number of clonal/sub-clonal cloud")
}
######################################################################################################
#################################### Creating the output #############################################
######################################################################################################
if (vaf.name == 'scaled.vaf') {
plot.data <- cbind(data[,c(sample,vaf-1)],x$pred)
colnames(plot.data)<-colnames(x) } else {
plot.data <- x
}
g<-dot_plot(plot.data)
g<-g + scale_color_manual(values=c("tan1","mediumpurple3"))
print(g)
data$'Predicted clonal structure'<-x$pred
if (exists("d_clust")==TRUE) {cd = d_clust}
else if (exists("boot.k")==TRUE) {cd = boot.k}
else {cd = NA}
invisible(list(predicted.data=data,
density.map=net.d,
collapse=similar.d,
fitted.hkm=ifelse(exists("hkm")==TRUE,hkm,NA),
fitted.bootkm=ifelse(exists("bootkm")==TRUE,bootkm,NA),
fitted.hybrid=ifelse(exists("hybrid")==TRUE,hybrid,NA),
'Number of unscaled clusters' = clust.num,
'Number of scaled clusters' = nrow(pred.map),
'cluster diagnostics' = cd,
'cluster centers' = a,
'cluster mapping' = pred.map,
'Dunn index' = Dunn.index,
'user input' = user_col))
}, error=function(e){
cat("ERROR : ",conditionMessage(e), "\n")
data1=data
data1$'Predicted clonal structure'<-"clone"
invisible(list(predicted.data=data1,
fitted.hkm=ifelse(exists("hkm")==TRUE,hkm,NA),
fitted.bootkm=ifelse(exists("bootkm")==TRUE,bootkm,NA),
fitted.hybrid=ifelse(exists("hybrid")==TRUE,hybrid,NA),
'Number of unscaled clusters' = clust.num,
'user input' = user_col))
})
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.