COMMUNAL: Run clustering algorithms and evaluate validation metrics.
In COMMUNAL: Robust Selection of Cluster Number K


This function runs various (user-specified) clustering algorithms on the data, for each potential number of clusters k. It then runs internal validation measures that quantify the fit of each clustering. The returned object is of class "COMMUNAL", and can be used to identify 'core' clusters in the data. Currently supported clustering algorithms are those in packages "clValid", "NMF", and "ConsensusClusterPlus".
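The loop described above (cluster at each k, then score each clustering) can be sketched with base R and the `cluster` package. This is only an illustration of the idea with a single algorithm (k-means) and a single measure (average silhouette width); it is not COMMUNAL's implementation, and the toy data and the names `toy`, `avg_sil`, and `best_k` are invented for the example.

```r
library(cluster)  # provides silhouette()

set.seed(1)
# Toy data: 2 features (rows) x 90 samples (columns), three well-separated groups
centers <- cbind(c(0, 0), c(20, 20), c(40, 0))
toy <- do.call(cbind, lapply(1:3, function(i) {
  # mean is recycled down each column, so row 1 and row 2 get their own center
  matrix(rnorm(2 * 30, mean = centers[, i], sd = 1), nrow = 2)
}))
colnames(toy) <- paste0("Sample", seq_len(ncol(toy)))

ks <- 2:5
d <- dist(t(toy))  # samples are columns, so transpose before computing distances

# For each k: cluster the samples, then compute one internal validation measure
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(t(toy), centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]  # k with the highest average silhouette width
```

COMMUNAL repeats this kind of evaluation across several algorithms and several measures, which is why a single `best_k` from one measure should be read only as a rough guide.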

The COMMUNAL algorithm is designed to be run with clusterRange, which works via calls to COMMUNAL(), although calling this function directly may still be useful to some researchers. After running clusterRange, use getGoodAlgs and getNonCorrNonMonoMeasures to select locally optimized clustering algorithms and validity measures.

To determine the optimal number of clusters, use the plotRange3D function.

COMMUNAL(data, ks, clus.methods = c("hierarchical", "kmeans", "diana",
                                    "som", "sota", "pam", "clara", "agnes"), 
         validation = c("Connectivity", "dunn", "wb.ratio", "g3", 
                        "g2", "pearsongamma", "avg.silwidth", "sindex"), 
         dist.metric = "euclidean", aggl.method = "ward", 
         neighb.size = 10, seed = NULL, parallel=F, gapBoot=20, 
         verbose=F, mc.cores=NULL, ...)

`data`	The data to cluster (numeric matrix or data frame). Columns are the items to be clustered; rows are features. If using cluster method `nmf`, all entries must be non-negative.
`ks`	A numeric vector of integers greater than 1, for the number of clusters to consider. For example, 2:4 tells the function to try clusterings with 2, 3, and 4 clusters.
`clus.methods`	Character vector of which clustering methods to use. Valid options: "`hierarchical`", "`kmeans`", "`diana`", "`fanny`", "`som`", "`model`", "`sota`", "`pam`", "`clara`", "`agnes`", "`ccp-hc`", "`ccp-km`", "`ccp-pam`", "`nmf`". In this list, "`nmf`" corresponds to "`nmf`" in package NMF, "`ccp-xx`" corresponds to "`xx`" in package ConsensusClusterPlus, and the rest match the method of the same name in package clValid.
`validation`	A character vector of the validation measures to consider. Valid options: "`Connectivity`", "`average.between`", "`g2`", "`ch`", "`sindex`","`avg.silwidth`", "`average.within`", "`dunn`", "`widestgap`", "`wb.ratio`", "`entropy`", "`dunn2`", "`pearsongamma`", "`g3`", "`within.cluster.ss`", "`min.separation`", "`max.diameter`", "`gapStatistic`". With the exception of "`Connectivity`", which is calculated by `clValid::connectivity`, and "`gapStatistic`", which is implemented by COMMUNAL based on cluster::clusGap(), these are calculated with `fpc::cluster.stats`.
`dist.metric`	Which metric to use when calculating the distance matrix. Used by clValid clustering algorithms, and in calculating validation measures. Available choices are "`euclidean`", "`correlation`", "`manhattan`".
`aggl.method`	The agglomeration method to use for "`hclust`" and "`agnes`" (if specified in `clus.methods`). Available choices are "`ward`", "`ward.D`", "`ward.D2`", "`single`", "`complete`", "`average`". The ward methods have not been implemented in clValid as of this writing.
`neighb.size`	Numeric value. The neighborhood size used for calculating the `Connectivity` validation measure.
`seed`	Numeric value. Random seed to use in ConsensusClusterPlus and NMF.
`parallel`	Logical. If TRUE, the gap statistic bootstraps are computed in parallel. Not supported on Windows.
`gapBoot`	The number of gap statistic bootstraps to perform. This recursively calls COMMUNAL for each bootstrap, though the other validation measures do not have to be calculated for each call.
`verbose`	Logical. If TRUE, prints progress output, mostly regarding the clustering algorithms and the gap statistic.
`mc.cores`	Number of cores to use for parallel computation. If NULL, uses detectCores(). Ignored if parallel=F.
`...`	Other arguments to pass down to ConsensusClusterPlus, NMF, and clValid.
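The constraints on `data`, `ks`, and `dist.metric` listed above can be checked before a long run. The helpers below are a minimal sketch and are not part of COMMUNAL: `check_inputs` and `sample_dist` are hypothetical names, and defining the "correlation" distance as one minus the Pearson correlation is an assumption made for illustration.

```r
# Hypothetical helper (not a COMMUNAL function): validate inputs up front
check_inputs <- function(data, ks, use.nmf = FALSE) {
  data <- as.matrix(data)
  stopifnot(is.numeric(data))
  # ks must be whole numbers greater than 1
  stopifnot(is.numeric(ks), all(ks == round(ks)), all(ks > 1))
  if (use.nmf && any(data < 0)) {
    stop("cluster method 'nmf' requires all entries to be non-negative")
  }
  invisible(data)
}

# Hypothetical helper: distance between samples (columns) for a given metric
sample_dist <- function(data,
                        dist.metric = c("euclidean", "correlation", "manhattan")) {
  dist.metric <- match.arg(dist.metric)
  data <- as.matrix(data)
  if (dist.metric == "correlation") {
    as.dist(1 - cor(data))            # assumption: 1 - Pearson correlation
  } else {
    dist(t(data), method = dist.metric)  # transpose: samples are columns
  }
}

set.seed(1)
m <- matrix(abs(rnorm(5 * 8)), nrow = 5,
            dimnames = list(paste0("Gene", 1:5), paste0("Sample", 1:8)))
check_inputs(m, ks = 2:4, use.nmf = TRUE)
d.euc <- sample_dist(m, "euclidean")
d.cor <- sample_dist(m, "correlation")
```

Both distance objects describe the 8 samples rather than the 5 features, matching the column-wise orientation that `data` is documented to use.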

Returns an object of class COMMUNAL. The class has a getClustering method to extract a data frame of cluster assignments. Alternatively, functions clusterKeys and returnCore are provided to identify core clusters. See examples below.
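To make the idea of 'core' clusters concrete, the sketch below assigns a sample to a cluster only when enough algorithms agree on it. It is a simplified illustration, not the logic of clusterKeys or returnCore: the function name `core_by_agreement` is hypothetical, cluster labels are assumed to be already aligned across algorithms, the threshold is assumed to be a percentage, and 0 is this sketch's own marker for samples without sufficient agreement.

```r
# Hypothetical helper: majority vote per sample, with a percent-agreement cutoff
core_by_agreement <- function(assignments, agreement.thresh = 50) {
  apply(assignments, 1, function(row) {
    counts <- table(row)
    top <- counts[which.max(counts)]          # most common label for this sample
    pct <- 100 * as.numeric(top) / length(row)
    if (pct >= agreement.thresh) as.integer(names(top)) else 0L
  })
}

# Rows are samples, columns are algorithms; labels assumed already aligned
assignments <- cbind(kmeans = c(1, 1, 2, 2, 3),
                     pam    = c(1, 1, 2, 3, 3),
                     diana  = c(1, 2, 2, 1, 3))
rownames(assignments) <- paste0("Sample", 1:5)

core.strict  <- core_by_agreement(assignments, agreement.thresh = 100)
core.lenient <- core_by_agreement(assignments, agreement.thresh = 50)
table(core.strict)
```

Raising the threshold shrinks the core clusters, since fewer samples reach the required level of agreement.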

Albert Chen and Timothy E Sweeney
Maintainer: Albert Chen <acc2015@stanford.edu>

Class "COMMUNAL". Use functions clusterKeys and returnCore to identify core clusters.

## Not run: 
## create artificial data set with 3 distinct clusters
set.seed(1)
V1 = c(abs(rnorm(100, 2)), abs(rnorm(100, 50)), abs(rnorm(100, 140)))
V2 = c(abs(rnorm(100, 2, 8)), abs(rnorm(100, 55, 4)), abs(rnorm(100, 105, 1)))
data <- t(data.frame(V1, V2))
colnames(data) <- paste("Sample", 1:ncol(data), sep="")
rownames(data) <- paste("Gene", 1:nrow(data), sep="")

## run COMMUNAL
result <- COMMUNAL(data=data, ks=seq(2,5))  # result is a COMMUNAL object
k <- 3                                # suppose optimal cluster number is 3
clusters <- result$getClustering(k)   # method to extract clusters
mat.key <- clusterKeys(clusters) # get core clusters
examineCounts(mat.key)                # help decide agreement.thresh
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters (all algs agree)
table(core) # the 'core' cluster sizes
## Note: could try a different value for k to
##  see clusters with sub-optimal k

## Can specify clustering methods and validation measures
result <- COMMUNAL(data = data, ks=c(2,3),
                      clus.methods = c("diana", "som", "pam", "kmeans", "ccp-hc", "nmf"),
                      validation=c('pearsongamma', 'avg.silwidth'))
clusters <- result$getClustering(k=3)
mat.key <- clusterKeys(clusters)
examineCounts(mat.key)
core <- returnCore(mat.key, agreement.thresh=50) # find 'core' clusters
table(core) # the 'core' clusters

## Additional arguments are passed down to clValid, NMF, ConsensusClusterPlus
result <- COMMUNAL(data=data, ks=2:5,
                      clus.methods=c("diana", "ccp-hc", "nmf"), reps=20, nruns=2)

## End(Not run)