ClustOmit - Cluster Stability Evaluation via Cluster Omission
Description
We provide an implementation of the ClustOmit statistic, an approach to evaluating the stability of a clustering determined by a clustering algorithm. As discussed by Hennig (2007), a stable clustering is arguably one for which a perturbation of the original data yields a similar clustering. Conversely, if a perturbation of the data yields a large change in the clustering, the original clustering is considered unstable. The ClustOmit statistic provides an approach to detecting instability via a stratified, nonparametric resampling scheme. We measure the stability of the clustering with the similarity statistic specified (by default, the Jaccard coefficient).
Usage
Arguments
x 
data matrix with 
num_clusters 
the number of clusters to find with the clustering algorithm specified in cluster_method
cluster_method 
a character string or a function
specifying the clustering algorithm that will be used.
The method specified is matched with the

similarity 
the similarity statistic that is used to compare the original clustering (after a single cluster and its observations have been omitted) to its resampled counterpart. Currently, we have implemented the Jaccard and Rand similarity statistics and use the Jaccard statistic by default. 
weighted_mean 
logical value. Should the aggregate similarity score for each bootstrap replication be weighted by the number of observations in each of the observed clusters? By default, yes (i.e., TRUE).
num_reps 
the number of bootstrap replicates to draw for each omitted cluster 
num_cores 
the number of cores to use. If 1 core
is specified, then 
... 
additional arguments passed to the function specified in cluster_method
Details
To compute the ClustOmit statistic, we first cluster the data given in x into num_clusters clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn, along with all of the observations in that cluster. For the omitted cluster, we resample from the remaining observations and cluster the resampled observations into num_clusters - 1 clusters, again using the clustering algorithm specified in cluster_method.
Next, we compute the similarity between the cluster
labels of the original data set and the cluster labels of
the bootstrapped sample. We approximate the sampling
distribution of the ClustOmit statistic using a
stratified, nonparametric bootstrapping scheme and use
the apparent variability in the approximated sampling
distribution as a diagnostic tool for further evaluation
of the proposed clusters. By default, we utilize the
Jaccard similarity coefficient in the calculation of the
ClustOmit statistic to provide a clear interpretation of
cluster assessment. The technical details of the
ClustOmit statistic can be found in our forthcoming
publication entitled "Cluster Stability Evaluation of
Gene Expression Data."
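The omit-and-resample step described above can be sketched in base R. This is only an illustrative reading of the procedure, not the package's implementation: the helper jaccard_labels, the simulated data, and the choice to resample within each remaining cluster (our interpretation of "stratified") are all assumptions made for this sketch.

```r
# Jaccard similarity between two label vectors for the same observations:
# (pairs co-clustered in both) / (pairs co-clustered in at least one).
# jaccard_labels is a hypothetical helper, not part of the package.
jaccard_labels <- function(labels1, labels2) {
  same1 <- outer(labels1, labels1, "==")
  same2 <- outer(labels2, labels2, "==")
  pairs <- upper.tri(same1)  # count each unordered pair once
  n_both <- sum(same1[pairs] & same2[pairs])
  n_either <- sum(same1[pairs] | same2[pairs])
  # Convention assumed here: two all-singleton clusterings are identical.
  if (n_either == 0) 1 else n_both / n_either
}

# Simulated data: three well-separated bivariate normal groups of 20 rows.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 4), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
num_clusters <- 3
obs_clusters <- kmeans(x, centers = num_clusters, nstart = 10)$cluster

# Omit one cluster, then bootstrap within each remaining cluster.
omit <- 1
kept <- which(obs_clusters != omit)
boot_idx <- unlist(lapply(split(kept, obs_clusters[kept]), function(idx) {
  idx[sample.int(length(idx), replace = TRUE)]
}), use.names = FALSE)

# Recluster the resampled observations into num_clusters - 1 clusters and
# compare with the original labels of those same observations.
boot_clusters <- kmeans(x[boot_idx, , drop = FALSE],
                        centers = num_clusters - 1, nstart = 10)$cluster
sim <- jaccard_labels(obs_clusters[boot_idx], boot_clusters)
```

Repeating the resampling many times for each omitted cluster would give an approximate sampling distribution of the similarity score; a wide spread of scores suggests an unstable clustering.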
The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. These guidelines establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology for evaluating the cluster omission admissibility condition.
We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least two arguments:
 x: matrix of observations to cluster
 num_clusters: the number of clusters to find
 ...: additional arguments that can be passed on
Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
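As a further illustration of the interface described above, here is a minimal sketch of a wrapper around hierarchical clustering that takes x, num_clusters, and ..., and returns only the cluster labels. The name hclust_wrapper and the linkage argument are hypothetical choices for this sketch; only base R functions (dist, hclust, cutree) are used.

```r
# Hypothetical wrapper: hierarchical clustering that returns only labels.
# Extra arguments in ... are forwarded to dist() (e.g., method = "manhattan").
hclust_wrapper <- function(x, num_clusters, linkage = "average", ...) {
  tree <- hclust(dist(x, ...), method = linkage)
  cutree(tree, k = num_clusters)  # one integer label per row of x
}

# Check the wrapper on a built-in data set.
x_iris <- as.matrix(iris[, 1:4])
labels <- hclust_wrapper(x_iris, num_clusters = 3)
table(labels)
```

A function of this shape should be usable as cluster_method in the same way as the K-means wrapper in the Examples section, although we have not verified this against the package itself.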
Value
object of class clustomit, which contains a named list with elements
 boot_aggregate: vector of the aggregated similarity statistics for each bootstrap replicate
 boot_similarity: list containing the bootstrapped similarity scores for each cluster omitted
 obs_clusters: the clustering labels determined for the observations in x
 num_clusters: the number of clusters found
 similarity: the similarity statistic used for comparison between the original clustering and the resampled clusterings
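To make the relationship between the per-cluster similarity scores and the aggregated scores concrete, the sketch below aggregates made-up scores across omitted clusters, with and without weighting by cluster size. The numbers, the object names, and the exact aggregation rule are assumptions for illustration; the package may compute boot_aggregate differently.

```r
# Hypothetical similarity scores: 3 omitted clusters, 4 bootstrap replicates
# each (made-up values, not output from the package).
boot_similarity <- list(
  c(0.90, 0.85, 0.95, 0.80),
  c(0.60, 0.70, 0.65, 0.75),
  c(1.00, 0.95, 0.90, 1.00)
)
cluster_sizes <- c(50, 30, 20)  # hypothetical observed cluster sizes

# Rows: omitted cluster; columns: bootstrap replicate.
sim_matrix <- do.call(rbind, boot_similarity)

# One aggregate score per replicate, weighted by cluster size or unweighted.
weighted_aggregate <- apply(sim_matrix, 2, weighted.mean, w = cluster_sizes)
unweighted_aggregate <- colMeans(sim_matrix)
```

With weighting, larger clusters contribute more to each replicate's aggregate score, so instability in a small cluster moves the aggregate less than instability in a large one.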
References
Fisher, L. and Van Ness, J. (1971), Admissible Clustering Procedures, _Biometrika_, 58, 1, 91-104.
Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320
Examples
# First, we create a wrapper function for the K-means clustering algorithm
# that returns only the clustering labels for each observation (row) in x.
kmeans_wrapper <- function(x, num_clusters, num_starts = 10, ...) {
  kmeans(x = x, centers = num_clusters, nstart = num_starts, ...)$cluster
}

# For this example, we generate five multivariate normal populations with the
# sim_data function.
x <- sim_data("normal", delta = 1.5, seed = 42)$x

# The clustering method can be given as a character string ...
clustomit_out <- clustomit(x = x, num_clusters = 4,
                           cluster_method = "kmeans_wrapper", num_cores = 1)

# ... or as a function.
clustomit_out2 <- clustomit(x = x, num_clusters = 5,
                            cluster_method = kmeans_wrapper, num_cores = 1)
