Given a data matrix, this function calls clustering routines, sequentially removes the best clusters, and iterates to find further clusters.
numerical matrix on which to run the clustering or a
a character vector defining what type of input is given in
the value of K at the first iteration of sequential algorithm, see details below or vignette.
logical as to whether to subsample via
value between 0 and 1 that determines how stable cluster membership has to be before 'finding' and removing the cluster.
only the top.can clusters from
when only this number of samples are left (i.e. not yet clustered) then algorithm will stop.
each iteration of sequential detection of clustering will decrease the beginning K of subsampling, but not lower than k.min.
algorithm will stop if K in iteration is increased beyond this point.
whether the algorithm should print out information as to its progress.
list of arguments to be passed to
list of arguments to be passed to
logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters.
seqCluster is not meant to be called by the user. It is only
an exported function so as to be able to clearly document the arguments for
seqCluster which can be passed via the argument
This code is adapted from the sequential portion of the code of the tightClust package of Tseng and Wong. At each iteration, the algorithm finds a set of samples that constitute a homogeneous cluster, removes them, and iterates again to find the next set of samples that form a cluster.
In each iteration, to determine the next homogeneous set of
samples, the algorithm will iteratively cluster the current set of samples
for a series of increasing values of the parameter $K$, starting at a value
kinit and increasing by 1 at each iteration, until a sufficiently
homogeneous set of clusters is found. For the first set of homogeneous samples,
kinit is set to the argument $k0$, and for iteration,
kinit is increased internally.
How the value of $K$ is used depends on the value of
subsample. If
subsample=TRUE, $K$ is the
k sent to the
clusterFunction sent to
mainClustering is run on the result of the co-occurrence matrix from
subsampleClustering with the
defined in the argument
clusterFunction set via
The number of clusters actually resulting from this run of
mainClustering may not be equal to the $K$ sent to the clustering
mainClustering is called directly on the data to determine the
clusters and $K$ set by
seqCluster for this iteration determines the
parameter of the clustering done by
clusterFunction defines the clustering of the
mainClustering step and
k is sent to that
ClusterFunction object. This means that if
clusterFunction must be of
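To make the co-occurrence matrix mentioned above concrete, here is a minimal base-R sketch of the idea, not the package's implementation. The function name coOccurrence is hypothetical, k-means stands in for whatever clustering function is used, and the handling of pairs that are never drawn together is an assumption. The argument names resamp.n and samp.p are borrowed from the example call on this page.

```r
# Hypothetical helper: for each pair of samples (rows of x), estimate the
# proportion of subsamples in which the pair was placed in the same cluster.
coOccurrence <- function(x, k, resamp.n = 50, samp.p = 0.7) {
  n <- nrow(x)
  together <- matrix(0, n, n)  # times a pair fell in the same cluster
  sampled <- matrix(0, n, n)   # times a pair was drawn in the same subsample
  for (r in seq_len(resamp.n)) {
    idx <- sample(n, size = round(samp.p * n))
    cl <- kmeans(x[idx, , drop = FALSE], centers = k)$cluster
    sampled[idx, idx] <- sampled[idx, idx] + 1
    together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
  }
  # Assumption: pairs that were never drawn together get a value of 0.
  together / pmax(sampled, 1)
}

# Toy data: two well-separated groups of 20 samples each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
D <- coOccurrence(x, k = 2, resamp.n = 20)
```

Samples from the same group should have entries of D close to 1, and samples from different groups entries close to 0. A second clustering step can then be run on D rather than on the raw data.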
In either setting of
subsample, the resulting clusters from
mainClustering for a particular $K$ will be compared to clusters
found in the previous iteration of $K-1$. For computational (and other?)
convenience, only the first
top.can clusters of each iteration will
be compared to the first
top.can clusters of previous iteration for
top.can currently refers to ordering by size, i.e. the
top.can largest clusters.
If there is no cluster of the first
top.can in the current
iteration $K$ that has overlap similarity >
beta to any in the
previous iteration, then the algorithm will move to the next iteration,
increasing to $K+1$.
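The comparison between the clusters at $K$ and at $K-1$ can be sketched as follows. This is an illustration only: the helper names are hypothetical, and the overlap similarity is assumed here to be the Jaccard index (size of the intersection over size of the union), which this page does not state explicitly.

```r
# Assumed similarity: Jaccard overlap between two sets of sample indices.
overlapSim <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

# Hypothetical helper: keep the top.can largest clusters
# from a list of index vectors.
topClusters <- function(clusters, top.can) {
  ord <- order(lengths(clusters), decreasing = TRUE)
  clusters[head(ord, top.can)]
}

# Hypothetical helper: return the current cluster with the largest overlap
# to any previous cluster, provided that overlap exceeds beta; else NULL.
findStableCluster <- function(current, previous, top.can, beta) {
  cur <- topClusters(current, top.can)
  prev <- topClusters(previous, top.can)
  # sims[i, j]: overlap of current cluster i with previous cluster j
  sims <- outer(seq_along(cur), seq_along(prev),
                Vectorize(function(i, j) overlapSim(cur[[i]], prev[[j]])))
  if (max(sims) <= beta) return(NULL)
  best <- which(sims == max(sims), arr.ind = TRUE)[1, "row"]
  cur[[best]]
}

prevK <- list(1:10, 11:15, 16:18)         # clusters found with K - 1
currK <- list(1:9, 10:13, 14:15, 16:18)   # clusters found with K
stable <- findStableCluster(currK, prevK, top.can = 2, beta = 0.8)
```

In this toy input the largest current cluster (samples 1 to 9) overlaps the largest previous cluster with similarity 0.9, so it is returned. With beta = 0.95 the function returns NULL, which corresponds to moving on to $K+1$.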
If, however, of these clusters there is a cluster in the current
iteration $K$ that has overlap similarity > beta to a cluster in the
previous iteration $K-1$, then the cluster with the largest such similarity
will be identified as a homogeneous set of samples and the samples in it
will be removed and designated as such. The algorithm will then start again
to determine the next set of homogeneous samples, but without these samples.
Furthermore, in this case (i.e. a cluster was found and removed), the value
kinit will be reset to
kinit-1; i.e. the range of
increasing $K$ that will be iterated over to find a set of homogeneous
samples will start off one value less than was the case for the previous
set of homogeneous samples. If
kinit will be set to
If there are fewer than
remain.n samples left after finding a
cluster and removing its samples, the algorithm will stop, as subsampling
is deemed to no longer be appropriate. If K has to be increased to
k.max without finding any pair of clusters with overlap >
beta, then the algorithm will stop. Any samples not found as part of a
homogeneous set of clusters at that point will be classified as unclustered
(given a value of -1).
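The overall control flow described above can be sketched in base R as below. This is a simplified illustration, not the package's code. The names seqSketch and clusterAtK are hypothetical, k-means is applied directly to the data (roughly the setting without subsampling), the overlap measure is assumed to be the Jaccard index, and the stop messages are made up. Samples are taken to be the rows of the toy matrix.

```r
# Assumed similarity: Jaccard overlap between two sets of sample ids.
overlapSim <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Hypothetical stand-in for the clustering step: k-means on the remaining
# samples, returning the top.can largest clusters as vectors of sample ids.
clusterAtK <- function(x, ids, k, top.can) {
  km <- kmeans(x[ids, , drop = FALSE], centers = k, nstart = 5)
  cl <- split(ids, km$cluster)
  unname(cl[head(order(lengths(cl), decreasing = TRUE), top.can)])
}

seqSketch <- function(x, k0, beta = 0.7, top.can = 5, remain.n = 30,
                      k.min = 3, k.max = k0 + 10) {
  clustering <- rep(-1L, nrow(x))  # -1 marks samples not (yet) clustered
  ids <- seq_len(nrow(x))          # samples still available
  kinit <- max(k0, k.min)
  nFound <- 0L
  whyStop <- NULL
  while (is.null(whyStop)) {
    if (length(ids) < remain.n) {
      whyStop <- "fewer than remain.n samples remain"
      break
    }
    k <- kinit
    prev <- clusterAtK(x, ids, k - 1, top.can)
    found <- NULL
    while (is.null(found) && is.null(whyStop)) {
      if (k > k.max) {
        whyStop <- "K exceeded k.max without finding a stable cluster"
      } else {
        cur <- clusterAtK(x, ids, k, top.can)
        # Best overlap of each current cluster with any previous cluster.
        sims <- sapply(cur, function(cl) max(sapply(prev, overlapSim, b = cl)))
        if (max(sims) > beta) {
          found <- cur[[which.max(sims)]]
        } else {
          prev <- cur
          k <- k + 1
        }
      }
    }
    if (!is.null(found)) {
      nFound <- nFound + 1L
      clustering[found] <- nFound     # ids are assigned in order found
      ids <- setdiff(ids, found)      # remove the cluster's samples
      kinit <- max(kinit - 1, k.min)  # start one lower, but not below k.min
    }
  }
  list(clustering = clustering, whyStop = whyStop)
}

# Toy data: three well-separated groups of 40 samples each.
set.seed(1)
centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- centers[rep(1:3, each = 40), ] + matrix(rnorm(240, sd = 0.3), ncol = 2)
res <- seqSketch(x, k0 = 4)
table(res$clustering)
```

The sketch keeps the two stopping rules separate so that the reason for stopping can be reported, mirroring the whyStop element of the return value.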
Certain combinations of inputs to
subsampleArgs are not allowed. See
A list with values
clustering: a vector of length equal to nrows(x) giving the
integer-valued cluster ids for each sample. The integer values are assigned
in the order that the clusters were found. "-1" indicates the sample was not
clustered.
clusterInfo: if clusters were successfully found, a matrix of
information regarding the algorithm behavior for each cluster (the starting
and stopping K for each cluster, and the number of iterations for each
cluster).
whyStop: a character string explaining what triggered the
algorithm to stop.
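As a small illustration of how such a return value might be inspected, the sketch below uses a hand-built list that only mimics the documented structure. The values and the whyStop text are made up for illustration and are not real output of seqCluster.

```r
# Hypothetical result, built by hand to mimic the documented structure.
res <- list(
  clustering = c(1L, 1L, 2L, -1L, 2L, 1L, -1L),
  whyStop = "illustrative stop message (not real output)"
)

clustered <- res$clustering != -1          # TRUE for samples assigned to a cluster
sizes <- table(res$clustering[clustered])  # cluster sizes, in order found
unclustered <- which(!clustered)           # positions of samples coded as -1
```

Because cluster ids are assigned in the order the clusters were found, the names of sizes also give the order of discovery.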
Tseng and Wong (2005), "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data", Biometrics, 61:10-16.
## Not run:
data(simData)
set.seed(12908)
clustSeqHier <- seqCluster(simData, inputType="X", k0=5, subsample=TRUE,
    beta=0.8,
    subsampleArgs=list(resamp.n=100, samp.p=0.7,
        clusterFunction="kmeans", clusterArgs=list(nstart=10)),
    mainClusterArgs=list(minSize=5, clusterFunction="hierarchical01",
        clusterArgs=list(alpha=0.1)))
## End(Not run)