Given a data matrix, this function calls clustering routines, sequentially removes the best cluster found, and iterates to find the remaining clusters.
Arguments

x
the data matrix on which to run the clustering (samples on rows).

diss
an (optional) dissimilarity matrix of the samples.
k0
the value of K at the first iteration of the sequential algorithm; see details below or the vignette.
clusterFunction 
passed to clusterDMat option 'clusterFunction' to
indicate method of clustering, see 
subsample
logical as to whether to subsample via subsampleClustering to get the distance matrix used in clustering; see details.
beta
value between 0 and 1 deciding how stable cluster membership has to be before 'finding' and removing the cluster.
top.can
only the top.can clusters (ordered by size) from each iteration will be compared to those of the previous iteration; see details.
remain.n
when only this number of samples is left (i.e. not yet clustered), the algorithm will stop.
k.min
each iteration of the sequential detection of clusters will decrease the starting K of the subsampling, but not below k.min.
k.max
the algorithm will stop if the K of an iteration increases beyond this point.
verbose
whether the algorithm should print out information about its progress.
subsampleArgs
list of arguments to be passed to subsampleClustering.
clusterDArgs
list of arguments to be passed to clusterD.
Details

This code is adapted from the code of the tightClust package of Tseng and Wong.

Each iteration of the algorithm will cluster the current set of samples. Depending on the method, the number of clusters resulting from clusterD may not be equal to the K used in the clustering of the (subsampled) data. The resulting clusters will then be compared to the clusters found in the previous iteration, which set the subsampling clustering to K-1. For computational (and other?) convenience, only the first top.can clusters of each iteration will be compared to the first top.can clusters of the previous iteration for similarity (where top.can currently refers to ordering by size, so the top.can largest clusters).
If there is a cluster in the current iteration that has overlap similarity > beta with a cluster in the previous iteration, then the cluster with the largest such similarity will be identified as a 'final' cluster and its samples will be removed from future iterations. The algorithm will then continue to the next iteration, but without these samples. Furthermore, in this case K for the next iteration will NOT be set to K+1, but will be reset to kinit-1, where kinit was the first K used after the previous 'final' cluster was removed. If kinit-1 < k.min, then K will be set to k.min.
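As an illustration, the matching step can be sketched in R as below. This is a hypothetical sketch, not the package's code: overlapSimilarity() and findStableCluster() are made-up helpers, and the intersection-over-union definition of overlap is an assumption.

```r
# Hypothetical sketch of the matching step (not the package's implementation).
# Clusters are represented as integer vectors of sample indices; the
# intersection-over-union definition of overlap is an assumption here.
overlapSimilarity <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Return the current cluster with the largest similarity > beta to any
# cluster of the previous iteration, or NULL if none qualifies.
findStableCluster <- function(prevClusters, curClusters, beta) {
  sims <- sapply(curClusters, function(cur)
    max(sapply(prevClusters, overlapSimilarity, b = cur)))
  if (max(sims) > beta) curClusters[[which.max(sims)]] else NULL
}

prev  <- list(1:10, 11:20)
cur   <- list(c(1:9, 21), 30:40)  # first current cluster overlaps prev[[1]]
found <- findStableCluster(prev, cur, beta = 0.7)  # similarity 9/11 > 0.7
```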
If no cluster among the first top.can of the current iteration has overlap similarity > beta with any in the previous iteration, then the algorithm will move to the next iteration (i.e. redo the clustering after increasing K to K+1).
If fewer than remain.n samples are left after finding a cluster and removing its samples, the algorithm will stop, as subsampling is deemed to no longer be appropriate. If K has to be increased beyond k.max without finding any pair of clusters with overlap > beta, the algorithm will also stop. Any samples not found to be part of a 'final' cluster when the algorithm stops will be classified as unclustered (given a value of -1).
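Putting the K-update and stopping rules together, the control flow of the sequential loop can be sketched as follows. This is a hypothetical skeleton, not the package's code; findCluster(remaining, K) stands in for one full clustering-and-matching iteration.

```r
# Hypothetical control-flow sketch of the sequential loop (not the package's
# code). findCluster(remaining, K) stands in for one clustering-and-matching
# iteration: it returns the indices of a stable cluster, or NULL if none.
seqSketch <- function(n, k0, k.min, k.max, remain.n, findCluster) {
  remaining <- seq_len(n)
  clusters  <- list()
  K <- kinit <- k0
  repeat {
    if (length(remaining) < remain.n) { whyStop <- "too few samples remain"; break }
    if (K > k.max) { whyStop <- "K exceeded k.max"; break }
    found <- findCluster(remaining, K)
    if (is.null(found)) {
      K <- K + 1  # no stable cluster: redo with K+1
    } else {
      clusters[[length(clusters) + 1]] <- found
      remaining <- setdiff(remaining, found)
      # reset K to kinit-1 (but not below k.min) after removing a cluster
      K <- kinit <- max(kinit - 1, k.min)
    }
  }
  list(clusters = clusters, unclustered = remaining, whyStop = whyStop)
}
```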
'subsample' controls what the D (distance) matrix used for clustering at each iteration is. If subsample=TRUE, D is given via the subsampleClustering function with k=K (with additional arguments passed via subsampleArgs). If subsample=FALSE, D is dist(x) for the samples currently considered in the iteration, and clusterFunction must be of the 'K' type (e.g. "pam"; see clusterD) or an error will be produced. The nsample x nsample matrix D is then clustered via clusterD to find clusters. The option 'clusterFunction' is passed to the argument 'clusterFunction' of clusterD to control what method is used to cluster D.
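The two possibilities for D can be illustrated with a toy stand-in. The subsampling branch below is a simplified co-clustering sketch under stated assumptions, not the actual subsampleClustering() implementation:

```r
# Toy illustration of the two ways D can be formed; the subsampling branch
# is a simplified stand-in, NOT the package's subsampleClustering().
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)  # toy data: 10 samples in rows, 4 features

# subsample=FALSE: D is simply the distance matrix of the current samples
D_dist <- as.matrix(dist(x))

# subsample=TRUE (simplified sketch): cluster random subsamples with k = K
# and let the dissimilarity of samples i and j be one minus the fraction of
# subsamples in which they land in the same cluster
K <- 3; B <- 50
co <- n <- matrix(0, 10, 10)
for (b in seq_len(B)) {
  idx <- sample(10, 7)                          # random subsample of samples
  cl  <- kmeans(x[idx, , drop = FALSE], centers = K)$cluster
  co[idx, idx] <- co[idx, idx] + outer(cl, cl, "==")
  n[idx, idx]  <- n[idx, idx] + 1
}
D_subsample <- 1 - co / pmax(n, 1)              # co-clustering dissimilarity
```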
If clusterFunction is of type 'K' (e.g. "pam"; see clusterD), the 'k' argument of clusterK called by clusterD is set to the current K of the sequential iteration, so setting 'k=' in the list given to clusterDArgs will not do anything and will produce a warning to that effect. Similarly, the current K of the iteration also determines the 'k' argument passed to subsampleClustering, so setting 'k=' in the list given to subsampleArgs will not do anything and will produce a warning to that effect.
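A minimal sketch of how the iteration's K could override a user-supplied 'k', with the documented warning; setIterationK() is a hypothetical helper, not the package's code:

```r
# Hypothetical helper (not in the package): force the iteration's K into an
# argument list, warning if the user also set 'k', since it will be ignored.
setIterationK <- function(argsList, K) {
  if ("k" %in% names(argsList))
    warning("'k' in the argument list is ignored; the sequential iteration sets k")
  utils::modifyList(argsList, list(k = K))
}

args <- suppressWarnings(setIterationK(list(k = 5, B = 100), K = 7))
# args$k is now 7, the iteration's K, regardless of the user-supplied 5
```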
If subsample=FALSE and 'findBestK=FALSE' is passed via clusterDArgs, then each iteration will run the clustering given by clusterFunction on dist(x), iterating over k. However, if subsample=FALSE, you should not set 'findBestK=TRUE' (otherwise clustering dist(x) will be essentially the same when iterating over different k, and there is no method implemented to change the choice of how to remove a cluster other than similarity as you change k); an error message will be given if this combination of options is set.
However, if clusterFunction="pam" (or is of type 'K') and subsample=TRUE, passing either 'findBestK=TRUE' or 'findBestK=FALSE' will function as expected. In particular, the iteration over K will set the number of clusters for the clustering of each subsample. If findBestK=FALSE, that same K will be used for the clustering of DMat. If findBestK=TRUE, then clusterD will search for the best k; note that the default 'kRange' over which clusterD searches when findBestK=TRUE depends on the input value of 'k' (you can change this to a fixed set of values by setting 'kRange' explicitly in the clusterDArgs list).
Value

A list with values:
clustering
a vector of length equal to nrow(x) giving the integer-valued cluster ids for each sample. The integer values are assigned in the order that the clusters were found. "-1" indicates the sample was not clustered.
clusterInfo
if clusters were successfully found, a matrix of
information regarding the algorithm behavior for each cluster (the starting
and stopping K for each cluster, and the number of iterations for each
cluster).
whyStop
a character string explaining what triggered the
algorithm to stop.
References

Tseng and Wong (2005), "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data", Biometrics, 61:10-16.
See Also

tight.clust