cluster.doc: Clonal deconvolution
In Subhayan18/CRUST: Multi-regional clonal deconvolution of tumor sequencing data


Clone and sub-clone decomposition of DNA sequencing data. The function is best used with more than one sample, preferably collected from the same individual at different times. If sample quality varies, scale the data first with seqn.scale.

cluster.doc(
  data = NULL,
  sample = NULL,
  vaf = NULL,
  allele.comp = NULL,
  n.clone = NULL,
  n.subclone = NULL,
  optimization.method = "GMM",
  clustering.method = "HKM",
  clonality = "Allelic composition",
  instruct = TRUE
)

`data`	A `dataframe` containing a summary of the DNA sequencing results. It must include a column of sample IDs and a corresponding column of variant allele frequencies.
`sample`	`Integer or character` of the column name or column number of the sample IDs.
`vaf`	`Integer or character` of the column name or column number of the variant allele frequency.
`allele.comp`	`Character` string for the allelic composition of the variants, for example '1+1' or '2+3'.
`n.clone`	Optional `integer` for number of suspected clones, default NULL.
`n.subclone`	Optional `integer` for number of suspected subclones, default NULL.
`optimization.method`	Method to find optimal number of clusters; GMM or bootstrap. Default is GMM.
`clustering.method`	Clustering method; HKM, bootkm or hybrid. Default is HKM.
`clonality`	Method for determining clonality of the predicted clusters; Allelic composition (default) or density.
`instruct`	`Character` input for accepting program suggestion.
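
As a hedged illustration of the arguments above, the sketch below builds a toy input and passes it to cluster.doc. The column names `sample_id` and `vaf`, the simulated values, and the assumption that the function is exported from a package namespace called `CRUST` are all illustrative and not taken from the package documentation. Because the function requests input interactively, the call is only attempted in an interactive session with the package installed.

```r
set.seed(1)

# Hypothetical toy input: two samples from one individual.
# The VAFs are simulated purely for illustration.
toy <- data.frame(
  sample_id = rep(c("T1", "T2"), each = 60),
  vaf = c(rnorm(60, 0.45, 0.03),   # one population in sample T1
          rnorm(30, 0.45, 0.03),   # same population in sample T2
          rnorm(30, 0.20, 0.03))   # a lower-frequency population in T2
)
toy$vaf <- pmin(pmax(toy$vaf, 0), 1)  # keep frequencies inside [0, 1]

# Assumption: the package namespace is 'CRUST' and exports cluster.doc.
if (interactive() && requireNamespace("CRUST", quietly = TRUE)) {
  fit <- CRUST::cluster.doc(
    data = toy,
    sample = "sample_id",
    vaf = "vaf",
    allele.comp = "1+1",
    optimization.method = "GMM",
    clustering.method = "HKM",
    clonality = "Allelic composition"
  )
  str(fit, max.level = 1)  # inspect the top level of the returned list
}
```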

cluster.doc is meant to do two things: first, determine the optimal number of clusters to fit, and second, infer which groups the resulting clusters should be assigned to.

The inputs requested interactively from the user help obtain the following information:

chromosomal segmentation helps determine the number of clone/sub-clone clouds to expect in the data. This matters because variant alleles from different aberrant chromosomes may have similar relative frequencies but discordant clonal interpretations. Conversely, convergent clonal alleles may show divergent frequencies if they arise from dissimilar aneuploidies.

clouds give the program visual feedback from the user, which is assumed to carry some biological interpretation of the frequency distributions present in the data. This is a subjective estimate that the program later uses for cluster assignment.

Of the two methods used for cluster optimization, GMM stands for Gaussian mixture models, whereas bootstrap, as the name suggests, performs bootstrap resampling of the VAFs in 50 repetitions with 20 runs each to find the most stable clustering parameter. GMM plots the optimization curve of BIC and AIC against the number of clusters on the X-axis, whereas bootstrap shows the Smin statistic on the Y-axis instead. gap, in turn, calculates the gap statistic for each clustering. In all cases the statistics are to be interpreted as proxies for the entropy of the system. The maximum entropy is likely to indicate the most stable solution.
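
The idea of scoring candidate cluster numbers with BIC can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: `gmm_bic` is a hypothetical helper that fits a one-dimensional Gaussian mixture by EM, and it uses the convention in which a lower BIC is better, which may differ from the sign convention of the curve the package plots.

```r
# Hypothetical helper: fit a 1-D Gaussian mixture with k components by EM
# and return its BIC (lower is better in this convention).
gmm_bic <- function(x, k, n_iter = 100) {
  n <- length(x)
  # Start EM from a k-means partition.
  km <- kmeans(x, centers = k, nstart = 10)
  mu <- as.numeric(km$centers)
  w <- tabulate(km$cluster, nbins = k) / n
  sigma <- rep(sd(x) / k, k)

  # Weighted component densities as an n x k matrix.
  mix_density <- function() {
    sapply(seq_len(k), function(j) w[j] * dnorm(x, mu[j], sigma[j]))
  }

  for (i in seq_len(n_iter)) {
    # E-step: responsibility of each component for each observation.
    dens <- mix_density()
    resp <- dens / rowSums(dens)
    # M-step: update weights, means and standard deviations.
    nk <- colSums(resp)
    w <- nk / n
    mu <- colSums(resp * x) / nk
    sigma <- sqrt(colSums(resp * outer(x, mu, "-")^2) / nk)
    sigma <- pmax(sigma, 1e-3)  # guard against collapsing components
  }

  loglik <- sum(log(rowSums(mix_density())))
  n_par <- 3 * k - 1  # k means, k variances, k - 1 free weights
  -2 * loglik + n_par * log(n)
}

set.seed(7)
# Simulated VAFs with two well-separated populations.
vaf <- c(rnorm(150, 0.20, 0.03), rnorm(150, 0.50, 0.03))
bic <- sapply(1:4, function(k) gmm_bic(vaf, k))
best_k <- which.min(bic)  # candidate number of clusters
```

Plotting `bic` against `1:4` gives an optimization curve of the kind described above, with the number of clusters on the X-axis.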

clustering.method gives the user three choices:

HKM is hierarchical K-means clustering, which first uses hierarchical clustering to determine the cluster centers and then uses those centers as the starting point for K-means clustering. bootkm performs a bootstrap resampling of 20 fitted K-means clusters with 50 resamplings to output the clusters. hybrid performs HKM on the principal component of the data.
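
The two-step HKM idea can be sketched with base R alone. The helper name `hkm_1d` and the choice of Ward linkage are assumptions made for illustration; the package may use a different linkage or a dedicated implementation.

```r
# Hypothetical helper: hierarchical K-means on a numeric vector.
hkm_1d <- function(x, k) {
  # Step 1: hierarchical clustering, cut into k groups.
  tree <- hclust(dist(x), method = "ward.D2")
  groups <- cutree(tree, k = k)
  # Step 2: the group means become the starting centers for K-means.
  init <- matrix(tapply(x, groups, mean), ncol = 1)
  kmeans(matrix(x, ncol = 1), centers = init)
}

set.seed(42)
# Simulated VAFs drawn around three frequencies.
vaf <- c(rnorm(50, 0.15, 0.02), rnorm(50, 0.30, 0.02), rnorm(50, 0.50, 0.02))
fit <- hkm_1d(vaf, k = 3)
centers <- sort(fit$centers[, 1])
```

Seeding K-means with hierarchical centers removes the dependence on random starting points, so repeated runs on the same data give the same partition.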

clonality provides two choices for clonality assignment. The default is Allelic composition, which measures expected clonality patterns according to the copy numbers. This method may fail when the allelic composition estimates are unreliable. In such situations, clonality can be assigned without a priori assumptions using the alternative density-based method.
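
To make the link between allelic composition and expected frequencies concrete, the sketch below computes the VAFs a fully clonal variant could show for a given composition string. It assumes a pure tumor sample and that the variant sits on between one copy and the larger of the two allele counts; the helpers `expected_vaf` and `is_near_clonal`, and the tolerance of 0.05, are hypothetical and not part of the package.

```r
# Hypothetical helper: expected VAFs of a clonal variant, assuming 100% purity.
expected_vaf <- function(allele.comp) {
  copies <- as.integer(strsplit(allele.comp, "+", fixed = TRUE)[[1]])
  total <- sum(copies)
  # The variant may be carried by 1 up to max(copies) chromosome copies.
  multiplicity <- seq_len(max(copies))
  multiplicity / total
}

# Hypothetical helper: flag VAFs that lie within 'tol' of an expected peak.
is_near_clonal <- function(vaf, allele.comp, tol = 0.05) {
  peaks <- expected_vaf(allele.comp)
  sapply(vaf, function(v) any(abs(v - peaks) <= tol))
}

expected_vaf("1+1")  # one mutated copy out of two
expected_vaf("2+1")  # one or two mutated copies out of three
is_near_clonal(c(0.48, 0.20), "1+1")
```

Variants far from every expected peak are candidates for sub-clonal status under this simplified model, which is also why unreliable copy-number estimates can mislead this kind of assignment.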

A list of 12 objects is returned that includes all the summary statistics, diagnostics and predictions, as well as the mapping used internally for clonal deconvolution.

predicted.data is essentially the input data extended with the predicted clone and sub-clone status of each variant for the corresponding samples.

density.map is a distance matrix convoluted from cluster distances and density departures.

collapse lists the clusters that were initially predicted but later collapsed onto each other due to their similarity.

fitted.hkm, fitted.bootkm or fitted.hybrid is a vector of the initial cluster assignments made by the chosen algorithm. Only one of these will contain output; the rest will show NA.

Number of unscaled clusters gives the number of predicted clusters before collapsing with density estimates.

Number of scaled clusters gives the number of predicted clusters after collapsing (if any).

cluster.diagnostics depends on the optimization method: if GMM was chosen, it is an S3 object containing clustering diagnostics from the model-based clustering; if bootstrap was chosen, it is a list.

cluster centers are the centroids of the predicted scaled clusters.

cluster mapping provides the map between the scaled clusters and the clonal deconvolution assignments.

Dunn index is the Dunn index for the fitted cluster.
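
For readers unfamiliar with the statistic, the Dunn index is the smallest distance between points in different clusters divided by the largest distance between points in the same cluster, so larger values indicate compact, well-separated clusters. The base R sketch below uses a hypothetical helper `dunn_index` and may differ in detail from the routine the package calls.

```r
# Hypothetical helper: Dunn index from data and a vector of cluster labels.
dunn_index <- function(x, cluster) {
  d <- as.matrix(dist(x))
  same <- outer(cluster, cluster, "==")
  min_between <- min(d[!same])  # closest pair from different clusters
  max_within <- max(d[same])    # widest pair inside any one cluster
  min_between / max_within
}

# Two tight groups of VAFs that lie far apart.
vaf <- c(0.1, 0.2, 0.5, 0.6)
labels <- c(1, 1, 2, 2)
di <- dunn_index(vaf, labels)
```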

seqn.scale, cluster.doubt