clustomit: ClustOmit - Cluster Stability Evaluation via Cluster Omission


Description

We provide an implementation of the ClustOmit statistic, which is an approach to evaluating the stability of a clustering determined by a clustering algorithm. As discussed by Hennig (2007), a stable clustering is arguably one for which a perturbation of the original data yields a similar clustering. Conversely, if a perturbation of the data yields a large change in the clustering, the original clustering is considered unstable. The ClustOmit statistic provides an approach to detecting instability via nonparametric bootstrapping. We determine the stability of the clustering via the similarity statistic specified in similarity (by default, "adjusted_rand", the adjusted Rand index).

The plot method plots the results of a clustomit object.

The print method summarizes a clustomit object.

Usage

clustomit(x, K, cluster_method, similarity = "adjusted_rand",
  weighted_mean = FALSE, stratified = FALSE, num_reps = 250,
  num_cores = getOption("mc.cores", 2), ...)

## S3 method for class 'clustomit'
plot(x, ...)

## S3 method for class 'clustomit'
print(x, ...)

Arguments

x

data matrix with n observations (rows) and p features (columns)

K

the number of clusters to find with the clustering algorithm specified in cluster_method

cluster_method

a character string specifying the clustering algorithm that will be used. The method specified is matched with match.fun. The function given should return only clustering labels for each observation in the matrix x.

similarity

the similarity statistic that is used to compare the original clustering (after a single cluster and its observations have been omitted) to its resampled counterpart. See similarity_methods for a listing of available similarity methods. By default, the adjusted Rand index is used.

weighted_mean

logical value. Should the aggregate similarity score for each bootstrap replication be weighted by the number of observations in each of the observed clusters? By default, no (i.e., FALSE).

stratified

Should the bootstrap replicates be stratified by cluster? By default, no. See Details.

num_reps

the number of bootstrap replicates to draw for each omitted cluster

num_cores

the number of cores to use. If 1 core is specified, then lapply is used without parallelization. See the mc.cores argument in mclapply for more details.

...

additional arguments passed to the function specified in cluster_method
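The num_cores behavior described above can be pictured with a minimal sketch. This is not the package's code: run_replicates is a hypothetical helper, and the only assumptions are that the work is a function applied once per replicate and that the parallel package (shipped with R) is available.

```r
library(parallel)

# Hypothetical helper (not part of clusteval): apply `fun` to each replicate
# index, serially when num_cores is 1 and with forked workers otherwise.
run_replicates <- function(num_reps, fun, num_cores = 1) {
  if (num_cores == 1) {
    # No parallelization: plain lapply.
    lapply(seq_len(num_reps), fun)
  } else {
    # mclapply forks the R process, so mc.cores > 1 is not supported on Windows.
    mclapply(seq_len(num_reps), fun, mc.cores = num_cores)
  }
}

squares <- run_replicates(3, function(i) i^2, num_cores = 1)
```

Because mclapply relies on forking, setting num_cores = 1 is the portable choice on platforms where forking is unavailable.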

Details

To compute the ClustOmit statistic, we first cluster the data given in x into K clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn, along with all of the observations in that cluster. For each omitted cluster, we resample from the remaining observations and cluster the resampled observations into K - 1 clusters, again using the clustering algorithm specified in cluster_method. Next, we compute the similarity between the cluster labels of the original data set and the cluster labels of the bootstrapped sample. We approximate the sampling distribution of the ClustOmit statistic using a nonparametric bootstrapping scheme and use the apparent variability in the approximated sampling distribution as a diagnostic tool for further evaluation of the proposed clusters. The similarity statistic is chosen with the similarity argument, which defaults to "adjusted_rand" in the function signature; the Jaccard similarity coefficient provides a clear interpretation of cluster assessment. The technical details of the ClustOmit statistic can be found in our forthcoming publication entitled "Cluster Stability Evaluation via Cluster Omission."
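The steps above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: it uses k-means as the clustering method, a hand-rolled pairwise Jaccard coefficient as the similarity (any similarity statistic could be substituted), and a plain bootstrap of the remaining rows. The names jaccard_pairs, kmeans_labels, and omit_and_resample are hypothetical.

```r
# Pairwise Jaccard coefficient between two labelings of the same observations:
# (pairs together in both) / (pairs together in at least one).
jaccard_pairs <- function(labels1, labels2) {
  same1 <- outer(labels1, labels1, "==")
  same2 <- outer(labels2, labels2, "==")
  upper <- upper.tri(same1)
  both <- sum(same1[upper] & same2[upper])
  either <- sum(same1[upper] | same2[upper])
  if (either == 0) 1 else both / either
}

# Clustering function that returns labels only (k-means is an assumption).
kmeans_labels <- function(x, K) {
  kmeans(x, centers = K, nstart = 10)$cluster
}

# For each cluster k: omit it, bootstrap the remaining rows, recluster into
# K - 1 clusters, and compare with the original labels of the sampled rows.
omit_and_resample <- function(x, K, num_reps = 20) {
  obs_clusters <- kmeans_labels(x, K)
  lapply(seq_len(K), function(k) {
    keep <- which(obs_clusters != k)
    vapply(seq_len(num_reps), function(b) {
      idx <- keep[sample.int(length(keep), replace = TRUE)]
      boot_labels <- kmeans_labels(x[idx, , drop = FALSE], K - 1)
      jaccard_pairs(obs_clusters[idx], boot_labels)
    }, numeric(1))
  })
}

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
sims <- omit_and_resample(x, K = 3, num_reps = 5)
```

Each element of sims holds the bootstrapped similarity scores for one omitted cluster; scores that stay close to 1 across replicates suggest that the remaining clusters are recovered consistently.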

The bootstrap resampling employed randomly samples from the remaining observations after a cluster is omitted. By default, we ensure that one observation is selected from each remaining cluster to avoid potential situations where the resampled data set contains multiple replicates of a single observation. Optionally, by setting the stratified argument to TRUE, we employ a stratified sampling scheme, where instead we sample with replacement from each cluster. In this case, the number of observations sampled from a cluster is equal to the number of observations originally assigned to that cluster (i.e., its cluster size).
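The stratified scheme can be illustrated with a short sketch. It is an assumption-laden illustration rather than the package's code: stratified_indices is a hypothetical helper that samples row indices with replacement within each cluster, so that every cluster keeps its original size.

```r
# Hypothetical helper (not part of clusteval): draw a stratified bootstrap
# sample of row indices, sampling with replacement within each cluster.
stratified_indices <- function(labels) {
  by_cluster <- split(seq_along(labels), labels)
  unlist(lapply(by_cluster, function(idx) {
    # sample.int avoids the sample() surprise when a cluster has one member.
    idx[sample.int(length(idx), replace = TRUE)]
  }), use.names = FALSE)
}

# Labels of the observations that remain after one cluster has been omitted.
labels <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)
set.seed(42)
idx <- stratified_indices(labels)
table(labels[idx])  # cluster sizes match table(labels) by construction
```

Sampling within clusters guarantees that no remaining cluster disappears from the resampled data, at the cost of fixing the cluster proportions in every replicate.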

The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. The guidelines from Fisher and Van Ness (1971) establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology to evaluate the cluster omission admissibility condition from Fisher and Van Ness (1971).

We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least two arguments:

x:

matrix of observations to cluster

K:

the number of clusters to find

...

additional arguments that can be passed on

Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
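As a further illustration of this interface, a wrapper around hierarchical clustering might look like the following sketch. The name hclust_wrapper and its linkage argument are hypothetical; the sketch assumes that extra arguments in ... are meant for dist.

```r
# Hypothetical wrapper: hierarchical clustering that accepts x and K and
# returns only the cluster labels, one per row of x.
hclust_wrapper <- function(x, K, linkage = "average", ...) {
  # Extra arguments in ... are forwarded to dist() (e.g., method = "manhattan").
  tree <- hclust(dist(x, ...), method = linkage)
  # cutree() returns one integer label per observation; drop any names.
  unname(cutree(tree, k = K))
}

set.seed(42)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 6), ncol = 2))
labels <- hclust_wrapper(x, K = 2)
```

Following the convention used in the example section, such a wrapper would presumably be supplied by name, as in cluster_method = "hclust_wrapper".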

Value

object of class clustomit, which contains a named list with elements

boot_aggregate:

vector of the aggregated similarity statistics for each bootstrap replicate

boot_similarity:

list containing the bootstrapped similarity scores for each cluster omitted

obs_clusters:

the clustering labels determined for the observations in x

cluster_method:

the name of the clustering method

K:

the number of clusters found

similarity:

the similarity statistic used for comparison between the original clustering and the resampled clusterings

N:

the sample size (the number of rows of x)

p:

the number of features (the number of columns of x)

num_reps:

the number of bootstrap replicates drawn for each cluster omitted

The plot method returns a ggplot2 object. The object is plotted in interactive sessions. In some cases, the returned object may need to be plotted by invoking plot.

References

Ramey, J. A., Sego, L. H., and Young, D. M. (2014), Cluster Stability Evaluation via Cluster Omission.

Fisher, L. and Van Ness, J. (1971), Admissible Clustering Procedures, _Biometrika_, 58, 1, 91-104.

Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320

Examples

## Not run: 
# First, we create a wrapper function for the K-means clustering algorithm
# that returns only the clustering labels for each observation (row) in x.
kmeans_wrapper <- function(x, K, num_starts = 10, ...) {
  kmeans(x = x, centers = K, nstart = num_starts, ...)$cluster
}

# For this example, we generate five multivariate normal populations with the
# sim_data function.
set.seed(42)
x <- sim_data("normal", delta = 1.5)$x

clustomit_out <- clustomit(x = x, K = 4, cluster_method = "kmeans_wrapper",
                           num_cores = 1)
clustomit_out2 <- clustomit(x = x, K = 5, cluster_method = "kmeans_wrapper",
                            num_cores = 1)

## End(Not run)

ramhiser/clusteval documentation built on May 26, 2019, 10:07 p.m.