Given a data matrix, this function calls clustering routines, sequentially removes the best cluster found, and iterates to find further clusters.
inputMatrix |
numerical matrix on which to run the clustering or a
|
inputType |
a character vector defining what type of input is given in
the |
k0 |
the value of K at the first iteration of the sequential algorithm; see details below or the vignette. |
subsample |
logical as to whether to subsample via
|
beta |
value between 0 and 1 to decide how stable cluster membership has to be before 'finding' and removing the cluster. |
top.can |
only the top.can clusters from |
remain.n |
when only this number of samples are left (i.e. not yet clustered), the algorithm will stop. |
k.min |
each iteration of sequential detection of clustering will decrease the beginning K of subsampling, but not lower than k.min. |
k.max |
algorithm will stop if K in iteration is increased beyond this point. |
verbose |
whether the algorithm should print out information as to its progress. |
subsampleArgs |
list of arguments to be passed to
|
mainClusterArgs |
list of arguments to be passed to
|
warnings |
logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters. |
seqCluster is not meant to be called by the user. It is exported
only so that its arguments can be clearly documented; these arguments
can be passed via the argument seqArgs in
functions like clusterSingle and clusterMany.
This code is adapted from the sequential portion of the code of the tightClust package of Tseng and Wong. At each iteration, the algorithm finds a set of samples that constitute a homogeneous cluster, removes them, and iterates again to find the next set of samples that form a cluster.
In each iteration, to determine the next set of homogeneous
samples, the algorithm will iteratively cluster the current set of samples
for a series of increasing values of the parameter $K$, starting at a value
kinit and increasing by 1 at each iteration, until a sufficiently
homogeneous set of clusters is found. For the first set of homogeneous
samples, kinit is set to the argument k0; for later sets,
kinit is adjusted internally, as described further on in this section.
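The search over increasing $K$ can be pictured as a simple loop. The following is an illustrative sketch only, not the package's code: search_k and found_cluster are hypothetical names, and found_cluster stands in for whatever check decides that a sufficiently homogeneous cluster exists at a given K.

```r
# Hypothetical helper (not part of the package): try K = kinit, kinit + 1, ...
# until `found_cluster(K)` returns TRUE or K would exceed k.max.
search_k <- function(kinit, k.max, found_cluster) {
  K <- kinit
  while (K <= k.max) {
    if (found_cluster(K)) {
      return(list(found = TRUE, K = K, nIter = K - kinit + 1))
    }
    K <- K + 1  # nothing suitable at this K: move on to K + 1
  }
  list(found = FALSE, K = NA_integer_, nIter = max(0, k.max - kinit + 1))
}

# Toy stand-in for the check: pretend a stable cluster first appears at K = 7.
res <- search_k(kinit = 5, k.max = 15, found_cluster = function(K) K >= 7)
```

Here res records the K at which the toy check first succeeds and how many values of K were tried.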
How the value of $K$ is used depends on the value of subsample.
If subsample=TRUE, $K$ is the k passed to the
cluster function clusterFunction given to
subsampleClustering via subsampleArgs; then
mainClustering is run on the co-occurrence matrix produced by
subsampleClustering, using the ClusterFunction object
defined in the argument clusterFunction set via mainClusterArgs.
The number of clusters actually resulting from this run of
mainClustering may not be equal to the $K$ sent to the clustering
done in subsampleClustering. If subsample=FALSE,
mainClustering is called directly on the data to determine the
clusters, and the $K$ set by seqCluster for this iteration determines the
parameter of the clustering done by mainClustering. Specifically,
the argument clusterFunction defines the clustering of the
mainClustering step and k is sent to that
ClusterFunction object. This means that if subsample=FALSE,
the clusterFunction must be of algorithmType "K".
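To make the subsample=TRUE path more concrete, here is a minimal base-R sketch of how a co-occurrence matrix could be built. It assumes the matrix records, for each pair of samples, the fraction of subsamples in which the pair was clustered together among those in which both were drawn. It uses stats::kmeans on toy data with samples in rows; it is not the implementation in subsampleClustering, and all variable names are illustrative.

```r
set.seed(1)
# Toy data: 20 samples in rows, 2 features, two well-separated groups.
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2))
n <- nrow(x)
K <- 2          # the K for this iteration
nResamp <- 50   # number of subsamples (illustrative value)
propSamp <- 0.7 # proportion of samples drawn each time (illustrative value)

together <- matrix(0, n, n)  # times a pair was clustered together
sampled  <- matrix(0, n, n)  # times a pair was drawn together
for (b in seq_len(nResamp)) {
  idx <- sample(n, size = floor(propSamp * n))
  cl <- kmeans(x[idx, , drop = FALSE], centers = K, nstart = 5)$cluster
  sampled[idx, idx] <- sampled[idx, idx] + 1
  together[idx, idx] <- together[idx, idx] + outer(cl, cl, "==")
}
# NaN marks pairs that were never drawn in the same subsample.
cooc <- together / sampled
```

A second clustering step would then take cooc (or a dissimilarity derived from it) as input, which is why the final number of clusters need not equal K.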
In either setting of subsample, the resulting clusters from
mainClustering for a particular $K$ will be compared to the clusters
found in the previous iteration, $K-1$. For computational
convenience, only the first top.can clusters of each iteration will
be compared to the first top.can clusters of the previous iteration for
similarity (where top.can currently refers to ordering by size, so
the top.can largest clusters).
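The comparison between consecutive values of $K$ can be sketched as follows. The exact definition of overlap similarity is not given here, so this sketch assumes the Jaccard-style ratio |A ∩ B| / |A ∪ B|; top_clusters and overlap are hypothetical helper names, not package functions.

```r
# Keep the `top.can` largest clusters from a vector of cluster labels.
# Returns a list of index vectors, one per retained cluster, largest first.
top_clusters <- function(labels, top.can) {
  sizes <- sort(table(labels), decreasing = TRUE)
  keep <- names(sizes)[seq_len(min(top.can, length(sizes)))]
  lapply(keep, function(id) which(as.character(labels) == id))
}

# ASSUMPTION: overlap similarity taken to be |A n B| / |A u B| (Jaccard).
overlap <- function(a, b) length(intersect(a, b)) / length(union(a, b))

prev <- c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3)  # toy labels at K - 1
curr <- c(1, 1, 1, 1, 2, 2, 2, 3, 3, 4)  # toy labels at K

prevTop <- top_clusters(prev, top.can = 2)
currTop <- top_clusters(curr, top.can = 2)

# Rows: retained current clusters; columns: retained previous clusters.
sim <- outer(seq_along(currTop), seq_along(prevTop),
             Vectorize(function(i, j) overlap(currTop[[i]], prevTop[[j]])))
```

Each entry of sim can then be compared against beta to decide whether any current cluster is stable enough.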
If there is no cluster of the first top.can in the current
iteration $K$ that has overlap similarity > beta to any in the
previous iteration, then the algorithm will move to the next iteration,
increasing to $K+1$.
If, however, among these clusters there is a cluster in the current
iteration $K$ that has overlap similarity > beta to a cluster in the
previous iteration $K-1$, then the cluster with the largest such similarity
will be identified as a homogeneous set of samples, and the samples in it
will be removed and designated as such. The algorithm will then start again
to determine the next set of homogeneous samples, but without these samples.
Furthermore, in this case (i.e. a cluster was found and removed), the value
of kinit will be reset to kinit-1; i.e. the range of
increasing $K$ that will be iterated over to find a set of homogeneous
samples will start one value lower than was the case for the previous
set of homogeneous samples. If kinit-1<k.min, then
kinit will be set to k.min.
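The selection rule and the reset of kinit can be written compactly. This is a sketch under the description above, with hypothetical helper names (pick_cluster, next_kinit) and a made-up similarity matrix; it is not the package's internal code.

```r
# Given a similarity matrix (rows = current clusters, columns = previous
# clusters), return the row index of the current cluster with the largest
# similarity, or NA if no entry exceeds beta.
pick_cluster <- function(sim, beta) {
  if (!any(sim > beta)) return(NA_integer_)
  as.integer(which(sim == max(sim), arr.ind = TRUE)[1, "row"])
}

# After a cluster is found, the next search starts one lower, but never
# below k.min.
next_kinit <- function(kinit, k.min) max(kinit - 1, k.min)

sim <- matrix(c(0.8, 1 / 7, 0, 0.5), nrow = 2)  # toy similarities
found  <- pick_cluster(sim, beta = 0.7)  # one cluster passes the threshold
missed <- pick_cluster(sim, beta = 0.9)  # none passes, so K would increase
k1 <- next_kinit(kinit = 5, k.min = 3)   # ordinary decrement
k2 <- next_kinit(kinit = 3, k.min = 3)   # already at the floor
```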
If there are fewer than remain.n samples left after finding a
cluster and removing its samples, the algorithm will stop, as subsampling
is deemed to no longer be appropriate. If K has to be increased
beyond k.max without finding any pair of clusters with overlap >
beta, then the algorithm will stop. Any samples not found as part of a
homogeneous set of clusters at that point will be classified as unclustered
(given a value of -1).
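The two stopping rules and the -1 convention can be illustrated with a small sketch. The helper stop_reason and its messages are hypothetical and do not reproduce the strings actually returned by the function.

```r
# Hypothetical helper: return a reason to stop, or NA to keep going.
stop_reason <- function(nRemaining, K, remain.n, k.max) {
  if (nRemaining < remain.n) return("fewer than remain.n samples left")
  if (K > k.max) return("K exceeded k.max")
  NA_character_
}

# Every sample starts as unclustered (-1); found clusters are numbered
# in the order in which they were found.
n <- 10
clustering <- rep(-1L, n)
clustering[c(1, 2, 3, 4)] <- 1L  # first cluster found
clustering[c(6, 7)] <- 2L        # second cluster found

nRemaining <- sum(clustering == -1L)
why <- stop_reason(nRemaining, K = 6, remain.n = 5, k.max = 15)
```

In this toy case four samples remain, which is below remain.n = 5, so the sketch stops and those four samples keep the label -1.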
Certain combinations of inputs to mainClusterArgs and
subsampleArgs are not allowed. See clusterSingle for
these explanations.
A list with values
clustering a vector of length equal to nrows(x) giving the
integer-valued cluster ids for each sample. The integer values are assigned
in the order that the clusters were found. "-1" indicates the sample was not
clustered.
clusterInfo if clusters were successfully found, a matrix of
information regarding the algorithm behavior for each cluster (the starting
and stopping K for each cluster, and the number of iterations for each
cluster).
whyStop a character string explaining what triggered the
algorithm to stop.
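As a rough guide to reading the returned list, the sketch below works on a hand-built stand-in with the same clustering and whyStop elements. All values are made up for illustration, and the whyStop text is a placeholder rather than a message the function actually produces.

```r
# Hand-built stand-in for a returned list (values are invented).
res <- list(
  clustering = c(1L, 1L, 2L, -1L, 2L, 1L, -1L, 3L, 3L, 2L),
  whyStop = "placeholder stop message"
)

# Sizes of the clusters that were found, ignoring unclustered samples.
clusterSizes <- table(res$clustering[res$clustering != -1L])
# Positions of the samples left unclustered.
unclustered <- which(res$clustering == -1L)
nClusters <- length(clusterSizes)
```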
Tseng and Wong (2005), "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data", Biometrics, 61:10-16.
tight.clust, clusterSingle, mainClustering, subsampleClustering
## Not run:
data(simData)
set.seed(12908)
clustSeqHier <- seqCluster(simData, inputType="X", k0=5, subsample=TRUE,
   beta=0.8, subsampleArgs=list(resamp.n=100,
   samp.p=0.7, clusterFunction="kmeans", clusterArgs=list(nstart=10)),
   mainClusterArgs=list(minSize=5, clusterFunction="hierarchical01",
   clusterArgs=list(alpha=0.1)))
## End(Not run)