Cluster distance matrix from subsampling
Description
Given an n x n matrix of distances, these functions try to find clusters
using the given clustering function. cluster01 and clusterK are internal
functions, and clusterD is a wrapper around them that provides a simpler
user interface. cluster01 and clusterK are not expected to be called
directly by the user, except to ease debugging of user-defined clustering
functions.
Usage
clusterD(x = NULL, diss = NULL, clusterFunction = c("hierarchical01",
  "tight", "pam", "hierarchicalK"), typeAlg = c("01", "K"),
  distFunction = NA, minSize = 1, orderBy = c("size", "best"),
  format = c("vector", "list"), clusterArgs = NULL, checkArgs = TRUE,
  returnD = FALSE, ...)

cluster01(diss, clusterFunction = c("hierarchical01", "tight"), alpha = 0.1,
  clusterArgs = NULL, checkArgs)

clusterK(diss, clusterFunction = c("pam", "hierarchicalK"),
  findBestK = FALSE, k, kRange, removeSil = FALSE, silCutoff = 0,
  clusterArgs = NULL, checkArgs)

Arguments
x 

diss 

clusterFunction 
a function that clusters an n x n matrix of dissimilarities/distances. Can also be given character values to indicate use of internal wrapper functions for default methods. See Details for the arguments the function must take and the format it must return. 
typeAlg 
character value of either '01' or 'K' determining whether the function given in clusterFunction should be called by clusterK or cluster01. Only used if clusterFunction is a user-defined function. For methods provided by the package (i.e. when the user sets clusterFunction to a character value), clusterD will determine the appropriate value of 'typeAlg' and will ignore user input. 
distFunction 
a distance function to be applied to 
minSize 
the minimum number of samples in a cluster. Clusters found below this size will be discarded and samples in the cluster will be given a cluster assignment of "-1" to indicate that they were not clustered. 
orderBy 
how to order the clusters (either by size or by maximum alpha value). 
format 
whether to return a list of indices in a cluster or a vector of clustering assignments. List is mainly for compatibility with sequential part. 
clusterArgs 
arguments to be passed directly to the clusterFunction, beyond the required input. 
checkArgs 
logical as to whether a warning should be given if arguments are supplied that don't match the clustering choices given. Otherwise, inapplicable arguments will be ignored without warning. 
returnD 
logical as to whether to return the D matrix in output. 
... 
arguments given to clusterD to be passed to cluster01 or clusterK (depending on the value of typeAlg). Examples include 'k' for clusterK or 'alpha' for cluster01. These should not be the arguments needed by clusterFunction (which should be passed via the argument 'clusterArgs') but the actual arguments of cluster01 or clusterK. 
alpha 
a cutoff value for how much similarity is needed for drawing blocks (lower values are more strict). 
findBestK 
logical, whether to find the best K based on average silhouette width (only used if clusterFunction is of type "K"). 
k 
single value to be used to determine how many clusters to find, if findBestK=FALSE (only used if clusterFunction of type "K"). 
kRange 
vector of integers. If findBestK=TRUE, this gives the range of k's to look over. Default is k-2 to k+20, subject to those values being greater than 2. Note that default values depend on the input k, so running for different choices of k and findBestK=TRUE can give different answers unless kRange is set to be the same. 
removeSil 
logical as to whether to remove samples whose silhouette width is less than silCutoff (only used if clusterFunction is of type "K"). 
silCutoff 
Requirement on minimum silhouette width to be included in cluster (only if removeSil=TRUE). 
Details
To provide a distance matrix via the argument distFunction, the function
must be defined to take the distance of the rows of a matrix (internally,
the function will call distFunction(t(x))). This is to be compatible with
the input for the dist function. as.matrix will be performed on the output
of distFunction, so if the object returned has an as.matrix method that will
convert the output into a symmetric matrix of distances, this is fine (for
example, the class dist for objects returned by dist has such a method). If
distFunction=NA, then a default distance will be calculated based on the
type of clustering algorithm of clusterFunction. For type "K" the default is
to take dist as the distance function. For type "01", the default is to take
(1-cor(x))/2.
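The two default distances described above can be sketched in base R. This is only an illustration, not the package's internal code: the toy matrix x, the helper name manhattanDist, and the assumption that samples sit in the columns of x (suggested by the call distFunction(t(x))) are all introduced here.

```r
# Toy data (assumption): 5 features in rows, 8 samples in columns.
set.seed(1)
x <- matrix(rnorm(40), nrow = 5, ncol = 8)

# Hypothetical user-supplied distFunction: like dist(), it computes
# distances between the ROWS of the matrix it receives.
manhattanDist <- function(m) dist(m, method = "manhattan")

# Mimic the described internal call: transpose so that samples become rows,
# then coerce the "dist" object to a full symmetric matrix.
D_K <- as.matrix(manhattanDist(t(x)))

# Correlation-based default described for type "01": cor() works on
# columns, so this is already an 8 x 8 matrix with values in [0, 1].
D_01 <- (1 - cor(x)) / 2

dim(D_K)
dim(D_01)
```

Any function whose return value has an as.matrix method yielding a symmetric n x n matrix could stand in for manhattanDist here.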
Types of algorithms: cluster01 is for clustering functions that expect as input a D that takes on 0-1 values (e.g. from subclustering). clusterK is for clustering functions that require an input k, the number of clusters, but accept an arbitrary distance/dissimilarity matrix. cluster01 and clusterK are given as separate functions so that the user can provide clustering functions that expect different types of input, and so that the shared processing of the results can differ between the two types of clustering methods (for example, removing low silhouette values is appropriate for clusterK clustering functions rather than cluster01 functions). It is also generally expected that cluster01 algorithms use the 0-1 nature of the input to set criteria for where to find clusters and therefore do not need a predetermined 'k'. On the other hand, clusterK functions are assumed to need a predetermined 'k' and are also assumed to assign every sample to a cluster, and therefore clusterK gives options to exclude poorly clustered samples via silhouette distances.
cluster01 required format for input and output for clusterFunction: clusterFunction should be a function that takes (as a minimum) the arguments "D" and "alpha". 0-1 clustering algorithms are expected to use the fact that the D input is in the 0-1 range to find the clusters, rather than a user-defined number of clusters; "alpha" is the parameter that tunes the finding of such clusters. For example, a candidate block of samples might be considered a cluster if all values of D are greater than or equal to 1-alpha. The output is a list with each element corresponding to a cluster and the elements of the list corresponding to the indices of the samples that are in the cluster. The list is expected to be in order of 'best clusters' (as defined by the clusterFunction), with the first being the best and the last being the worst.
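A minimal sketch of a user-defined function meeting this format follows. The name blockCluster01 is hypothetical, and the sketch assumes D is a symmetric matrix with entries in [0, 1] where larger values mean more similar; "best" is arbitrarily taken to mean "largest". It uses complete linkage so that every pairwise value of D within a returned block is at least 1-alpha.

```r
# Hypothetical clusterFunction of type "01" (not part of the package).
# Assumption: D is symmetric with entries in [0, 1], larger = more similar.
blockCluster01 <- function(D, alpha, ...) {
  # Complete linkage on 1 - D: cutting the tree at height alpha guarantees
  # that all pairwise values of D inside a group are >= 1 - alpha.
  tree <- hclust(as.dist(1 - D), method = "complete")
  groups <- cutree(tree, h = alpha)
  clusters <- split(seq_len(nrow(D)), groups)
  # Drop singletons, then order from "best" to "worst"; here best is
  # (arbitrarily) defined as largest.
  clusters <- clusters[lengths(clusters) > 1]
  clusters <- clusters[order(lengths(clusters), decreasing = TRUE)]
  unname(clusters)
}

# Toy 0-1 matrix: samples 1-3 always co-cluster, samples 4-5 co-cluster
# 95% of the time, and sample 6 rarely clusters with anyone.
D <- matrix(0.1, 6, 6)
D[1:3, 1:3] <- 1
D[4:5, 4:5] <- 0.95
diag(D) <- 1

res01 <- blockCluster01(D, alpha = 0.1)
str(res01)
```

Such a function could then be supplied as clusterFunction together with typeAlg="01".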
cluster01 methods: "tight" refers to the method of finding clusters from a subsampling matrix given internally in the tight algorithm code of Tseng and Wong. The argument for the tight method is 'minSize.core' (default=2), which sets the minimum number of samples that form a core cluster. "hierarchical01" refers to running the hclust algorithm on D and traversing down the tree until reaching a block of samples whose summary of the values is greater than or equal to 1-alpha. Arguments that can be passed to 'hierarchical01' are 'evalClusterMethod', which determines how to summarize the samples' values of D[samples,samples] for comparison to 1-alpha: "maximum" (default) takes the minimum of D[samples,samples] and requires it to be less than or equal to 1-alpha; "average" requires that each row mean of D[samples,samples] be less than or equal to 1-alpha. Arguments of hclust can also be passed via clusterArgs to control the hierarchical clustering of D.
clusterK required format for input and output for clusterFunction:
clusterFunction should be a function that takes as a minimum the arguments
'D' and 'k'. The output must be a clustering, specified by integer values.
The function silhouette will be used on the clustering to calculate
silhouette scores for each observation.
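As an illustration of this format, the sketch below defines a hypothetical function avgLinkK that takes 'D' and 'k' and returns integer cluster labels, then computes per-observation silhouette widths with silhouette from the cluster package. The toy data and the function name are assumptions made for the example.

```r
# Hypothetical clusterFunction of type "K" (not part of the package):
# average-linkage hierarchical clustering cut into k groups.
avgLinkK <- function(D, k, ...) {
  tree <- hclust(as.dist(D), method = "average")
  cutree(tree, k = k)  # integer vector of length n with labels 1..k
}

# Toy data (assumption): two well-separated groups of 10 points each.
set.seed(2)
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 6), ncol = 2))
D <- as.matrix(dist(pts))
cl <- avgLinkK(D, k = 2)

# Silhouette width of each observation, in the original sample order.
if (requireNamespace("cluster", quietly = TRUE)) {
  sil <- cluster::silhouette(cl, dmatrix = D)
  silWidth <- sil[, "sil_width"]
  print(summary(silWidth))
}
```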
clusterK methods: "pam" performs pam clustering on the input D matrix using
pam in the cluster package. Arguments to pam can be passed via
'clusterArgs', except for the arguments 'x' and 'k', which are given by D
and k directly. "hierarchicalK" performs hierarchical clustering on the
input via hclust and then applies cutree with the specified k to obtain
clusters. Arguments to hclust can be passed via clusterArgs.
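The idea behind findBestK and removeSil for a pam-based method can be sketched directly with the cluster package. This is a rough illustration under stated assumptions, not the package's internal implementation: the toy data, the loop over kRange, and the use of -1 to mark removed samples are choices made for the example.

```r
if (requireNamespace("cluster", quietly = TRUE)) {
  # Toy data (assumption): two well-separated groups of 15 points each.
  set.seed(3)
  pts <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
               matrix(rnorm(30, mean = 8), ncol = 2))
  D <- as.matrix(dist(pts))

  # Average silhouette width of a pam clustering for each candidate k.
  kRange <- 2:5
  avgSil <- sapply(kRange, function(k) {
    cluster::pam(as.dist(D), k = k)$silinfo$avg.width
  })
  bestK <- kRange[which.max(avgSil)]

  # Refit at the chosen k and mark samples with low silhouette width.
  fit <- cluster::pam(as.dist(D), k = bestK)
  silWidth <- cluster::silhouette(fit$clustering, dmatrix = D)[, "sil_width"]
  silCutoff <- 0
  assignment <- as.integer(fit$clustering)
  assignment[silWidth < silCutoff] <- -1L  # assumed code for "not clustered"
  print(table(assignment))
}
```

Because the default kRange depends on k, fixing kRange explicitly, as done here, keeps the selection comparable across runs.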
Value
clusterD returns a vector of cluster assignments (if format="vector") or a list of indices for each cluster (if format="list"). Clusters smaller than minSize are removed. If orderBy="size" the clusters are reordered by the size of the cluster, instead of by the internal ordering of the clusterFunction.
cluster01 and clusterK return a list of indices of the clusters found, with each element of the list corresponding to a cluster and consisting of a vector of indices of the samples assigned to that cluster. Indices not included in any list element are assumed to have not been clustered. The list is assumed to be ordered in terms of the 'best' cluster (as defined by the clusterFunction for cluster01 or by average silhouette for clusterK), for example in terms of most internal similarity of the elements, or average silhouette width.
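The relationship between the list format and the vector format can be illustrated with a small helper. The function listToVector is hypothetical and only sketches the conversion; it assumes that unclustered samples, including those in clusters smaller than minSize, are coded as -1.

```r
# Hypothetical helper (not part of the package): convert a list of index
# vectors into a vector of cluster assignments of length n.
listToVector <- function(clusterList, n, minSize = 1) {
  # Discard clusters smaller than minSize before numbering.
  clusterList <- clusterList[lengths(clusterList) >= minSize]
  assignment <- rep(-1L, n)  # assumed code for "not clustered"
  for (i in seq_along(clusterList)) {
    assignment[clusterList[[i]]] <- i
  }
  assignment
}

# Samples 4 and 7 appear in no cluster, so they stay at -1.
vec <- listToVector(list(c(2, 5, 6), c(1, 3)), n = 7)
vec

# With minSize = 3 the second cluster is dropped as too small.
vecMin <- listToVector(list(c(2, 5, 6), c(1, 3)), n = 7, minSize = 3)
vecMin
```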
Examples
data(simData)
cl1 <- clusterD(simData, clusterFunction="pam", k=3)
cl2 <- clusterD(simData, clusterFunction="hierarchical01")
cl3 <- clusterD(simData, clusterFunction="tight")
#change distance to manhattan distance
cl4 <- clusterD(simData, clusterFunction="pam", k=3,
  distFunction=function(x){dist(x, method="manhattan")})
#run hierarchical method for finding blocks, with method of evaluating
#coherence of block set to evalClusterMethod="average", and the hierarchical
#clustering using single linkage:
clustSubHier <- clusterD(simData, clusterFunction="hierarchical01", alpha=0.1,
  minSize=5, clusterArgs=list(evalClusterMethod="average", method="single"))
#do tight
clustSubTight <- clusterD(simData, clusterFunction="tight", alpha=0.1,
  minSize=5)
#two twists to pam
clustSubPamK <- clusterD(simData, clusterFunction="pam", silCutoff=0, minSize=5,
  removeSil=TRUE, k=3)
clustSubPamBestK <- clusterD(simData, clusterFunction="pam", silCutoff=0,
  minSize=5, removeSil=TRUE, findBestK=TRUE, kRange=2:10)
# note that passing the wrong arguments for an algorithm results in warnings
# (which can be turned off with checkArgs=FALSE)
clustSubTight_test <- clusterD(simData, clusterFunction="tight", alpha=0.1,
  minSize=5, removeSil=TRUE)
clustSubTight_test2 <- clusterD(simData, clusterFunction="tight", alpha=0.1,
  clusterArgs=list(evalClusterMethod="average"))
