We provide an implementation of the ClustOmit statistic, which is an approach to evaluating the stability of a clustering determined by a clustering algorithm. As discussed by Hennig (2007), arguably a stable clustering is one in which a perturbation of the original data should yield a similar clustering. However, if a perturbation of the data yields a large change in the clustering, the original clustering is considered unstable. The ClustOmit statistic provides an approach to detecting instability via nonparametric bootstrapping. We determine the stability of the clustering via the similarity statistic specified (by default, the Jaccard coefficient).
Plots the results of a ClustOmit object.
Summarizes a clustomit object.
x | data matrix with
K | the number of clusters to find with the clustering algorithm specified in
cluster_method | a character string specifying the clustering algorithm that will be used. The method specified is matched with
similarity | the similarity statistic that is used to compare the original clustering (after a single cluster and its observations have been omitted) to its resampled counterpart. See
weighted_mean | logical value. Should the aggregate similarity score for each bootstrap replication be weighted by the number of observations in each of the observed clusters? By default, yes (i.e.,
stratified | Should the bootstrap replicates be stratified by cluster? By default, no. See Details.
num_reps | the number of bootstrap replicates to draw for each omitted cluster
num_cores | the number of cores to use. If 1 core is specified, then
... | additional arguments passed to the function specified in
To compute the ClustOmit statistic, we first cluster the data given in x into K clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn and all of the observations in that cluster. For the omitted cluster, we resample from the remaining observations and cluster the resampled observations into K - 1 clusters again using the clustering algorithm specified in cluster_method. Next, we compute the similarity between the cluster labels of the original data set and the cluster labels of the bootstrapped sample. We approximate the sampling distribution of the ClustOmit statistic using a nonparametric bootstrapping scheme and use the apparent variability in the approximated sampling distribution as a diagnostic tool for further evaluation of the proposed clusters. By default, we utilize the Jaccard similarity coefficient in the calculation of the ClustOmit statistic to provide a clear interpretation of cluster assessment. The technical details of the ClustOmit statistic can be found in our forthcoming publication entitled "Cluster Stability Evaluation via Cluster Omission."
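The procedure described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: jaccard_labels, clustomit_sketch, kmeans_labels, and the toy data x_demo are hypothetical names introduced here, the Jaccard coefficient is computed from pairwise co-membership, and the resampling is a plain bootstrap of the remaining observations with no further safeguards.

```r
# Jaccard similarity between two label vectors for the same observations,
# based on which pairs of observations are clustered together.
jaccard_labels <- function(labels1, labels2) {
  same1 <- outer(labels1, labels1, "==")
  same2 <- outer(labels2, labels2, "==")
  pairs <- upper.tri(same1)
  n11 <- sum(same1[pairs] & same2[pairs])   # together in both clusterings
  n10 <- sum(same1[pairs] & !same2[pairs])  # together only in the first
  n01 <- sum(!same1[pairs] & same2[pairs])  # together only in the second
  n11 / (n11 + n10 + n01)
}

# Hypothetical helper: omit each cluster in turn, bootstrap the remaining
# observations, recluster into K - 1 clusters, and score each replicate.
clustomit_sketch <- function(x, K, cluster_fun, num_reps = 20) {
  orig_labels <- cluster_fun(x, K)
  omitted <- sort(unique(orig_labels))
  scores <- lapply(omitted, function(omit) {
    kept <- which(orig_labels != omit)
    replicate(num_reps, {
      idx <- kept[sample.int(length(kept), replace = TRUE)]
      boot_labels <- cluster_fun(x[idx, , drop = FALSE], K - 1)
      jaccard_labels(orig_labels[idx], boot_labels)
    })
  })
  names(scores) <- omitted
  scores
}

# Toy data (assumed for the demo): three well-separated bivariate groups.
set.seed(42)
x_demo <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 6), ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)
kmeans_labels <- function(x, K) kmeans(x, centers = K, nstart = 10)$cluster
demo_scores <- clustomit_sketch(x_demo, K = 3, cluster_fun = kmeans_labels,
                                num_reps = 5)
```

Each element of demo_scores holds the bootstrapped similarity scores for one omitted cluster. Scores that stay close to 1 with little spread suggest the remaining clusters are stable when that cluster is removed.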
The bootstrap resampling employed randomly samples from the remaining observations after a cluster is omitted. By default, we ensure that one observation is selected from each remaining cluster to avoid potential situations where the resampled data set contains multiple replicates of a single observation. Optionally, by setting the stratified argument to TRUE, we employ a stratified sampling scheme, where instead we sample with replacement from each cluster. In this case, the number of observations sampled from a cluster is equal to the number of observations originally assigned to that cluster (i.e., its cluster size).
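The two resampling schemes can be illustrated with a short base R sketch. The function boot_indices is a hypothetical helper, not part of the package. How the package guarantees that every remaining cluster is represented in the default scheme is not stated here, so the sketch assumes a simple redraw-until-covered rule.

```r
# Hypothetical helper: draw bootstrap row indices after omitting one cluster.
boot_indices <- function(labels, omit, stratified = FALSE) {
  kept <- which(labels != omit)
  if (stratified) {
    # Sample with replacement within each remaining cluster, keeping each
    # cluster at its original size.
    by_cluster <- split(kept, labels[kept])
    idx <- unlist(
      lapply(by_cluster, function(i) i[sample.int(length(i), replace = TRUE)]),
      use.names = FALSE
    )
  } else {
    # Assumption: redraw until every remaining cluster appears at least once.
    repeat {
      idx <- kept[sample.int(length(kept), replace = TRUE)]
      if (setequal(labels[idx], labels[kept])) break
    }
  }
  idx
}

set.seed(1)
labels_demo <- rep(1:3, times = c(5, 10, 15))
idx_default <- boot_indices(labels_demo, omit = 1)
idx_strat <- boot_indices(labels_demo, omit = 1, stratified = TRUE)
table(labels_demo[idx_strat])  # cluster sizes match the original 10 and 15
```

With stratified = TRUE the resampled cluster sizes are fixed, whereas the default draw lets them vary from replicate to replicate.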
The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. The guidelines from Fisher and Van Ness (1971) establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology to evaluate the cluster omission admissibility condition from Fisher and Van Ness (1971).
We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least two arguments:
x: matrix of observations to cluster
K: the number of clusters to find
...: additional arguments that can be passed on
Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
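As a further illustration of this interface, a wrapper for hierarchical clustering might look like the following sketch. The name hclust_wrapper and its linkage argument are hypothetical. The wrapper relies only on dist, hclust, and cutree from the stats package and returns one label per row of x.

```r
# Hypothetical wrapper: accepts x and K, returns only the cluster labels.
hclust_wrapper <- function(x, K, linkage = "average", ...) {
  tree <- hclust(dist(x), method = linkage, ...)
  unname(cutree(tree, k = K))
}

# Toy data (assumed for the demo): two separated bivariate groups.
set.seed(1)
x_hc <- rbind(
  matrix(rnorm(40), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)
labels_hc <- hclust_wrapper(x_hc, K = 2)
table(labels_hc)
```

A wrapper of this shape could presumably be passed by name, for example cluster_method = "hclust_wrapper", in the same way the K-means wrapper is passed in the Examples.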
object of class clustomit, which contains a named list with elements
vector of the aggregated similarity statistics for each bootstrap replicate
list containing the bootstrapped similarity scores for each cluster omitted
the clustering labels determined for the observations in x
the name of the clustering method
the number of clusters found
the similarity statistic used for comparison between the original clustering and the resampled clusterings
the sample size (the number of rows of x)
the number of features (the number of columns of x)
the number of bootstrap replicates drawn for each cluster omitted
a ggplot2 object. The object is plotted in interactive sessions. In some cases, the returned object may need to be plotted by invoking plot.
Ramey, J. A., Sego, L. H., and Young, D. M. (2014), Cluster Stability Evaluation via Cluster Omission.
Fisher, L. and Van Ness, J. (1971), Admissible Clustering Procedures, _Biometrika_, 58, 1, 91-104.
Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320
## Not run:
# First, we create a wrapper function for the K-means clustering algorithm
# that returns only the clustering labels for each observation (row) in x.
kmeans_wrapper <- function(x, K, num_starts = 10, ...) {
  kmeans(x = x, centers = K, nstart = num_starts, ...)$cluster
}

# For this example, we generate five multivariate normal populations with the
# sim_data function.
set.seed(42)
x <- sim_data("normal", delta = 1.5)$x

clustomit_out <- clustomit(x = x, K = 4, cluster_method = "kmeans_wrapper",
                           num_cores = 1)
clustomit_out2 <- clustomit(x = x, K = 5, cluster_method = "kmeans_wrapper",
                            num_cores = 1)
## End(Not run)