ClustOmit - Cluster Stability Evaluation via Cluster Omission

Description

We provide an implementation of the ClustOmit statistic, an approach to evaluating the stability of a clustering produced by a clustering algorithm. As discussed by Hennig (2007), a clustering is arguably stable if a perturbation of the original data yields a similar clustering. Conversely, if a perturbation of the data yields a large change in the clustering, the original clustering is considered unstable. The ClustOmit statistic detects instability via a stratified, nonparametric resampling scheme. We measure the stability of the clustering with the similarity statistic specified (by default, the Jaccard coefficient).

Usage

  clustomit(x, num_clusters, cluster_method,
    similarity = c("jaccard", "rand"),
    weighted_mean = TRUE, num_reps = 50,
    num_cores = getOption("mc.cores", 2), ...)

Arguments

x

data matrix with n observations (rows) and p features (columns)

num_clusters

the number of clusters to find with the clustering algorithm specified in cluster_method

cluster_method

a character string or a function specifying the clustering algorithm that will be used. The method specified is matched with the match.fun function. The function given should return only clustering labels for each observation in the matrix x.

similarity

the similarity statistic that is used to compare the original clustering (after a single cluster and its observations have been omitted) to its resampled counterpart. Currently, we have implemented the Jaccard and Rand similarity statistics and use the Jaccard statistic by default.

weighted_mean

logical value. Should the aggregate similarity score for each bootstrap replication be weighted by the number of observations in each of the observed clusters? By default, yes (i.e., TRUE).

num_reps

the number of bootstrap replicates to draw for each omitted cluster

num_cores

the number of cores to use. If 1 core is specified, then lapply is used without parallelization. See the mc.cores argument in mclapply for more details.

...

additional arguments passed to the function specified in cluster_method
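
To make the similarity argument concrete, the following is a minimal sketch of how pair-counting Jaccard and Rand statistics can be computed for two label vectors. The helper pair_similarity is a hypothetical name introduced for illustration only; it is not the function used internally by clustomit, and it assumes both label vectors refer to the same observations in the same order.

```r
# Hypothetical helper (an illustration, not the package's implementation):
# pair-counting similarity between two clusterings of the same observations.
pair_similarity <- function(labels1, labels2,
                            similarity = c("jaccard", "rand")) {
  similarity <- match.arg(similarity)
  stopifnot(length(labels1) == length(labels2))

  # For every pair of observations (i < j), record whether each clustering
  # places the two observations in the same cluster.
  pairs <- combn(length(labels1), 2)
  same1 <- labels1[pairs[1, ]] == labels1[pairs[2, ]]
  same2 <- labels2[pairs[1, ]] == labels2[pairs[2, ]]

  n11 <- sum(same1 & same2)    # together in both clusterings
  n10 <- sum(same1 & !same2)   # together in the first only
  n01 <- sum(!same1 & same2)   # together in the second only
  n00 <- sum(!same1 & !same2)  # apart in both clusterings

  switch(similarity,
         jaccard = n11 / (n11 + n10 + n01),
         rand = (n11 + n00) / (n11 + n10 + n01 + n00))
}

a <- c(1, 1, 2, 2)
b <- c(1, 1, 1, 2)
pair_similarity(a, b, "jaccard")  # 0.25
pair_similarity(a, b, "rand")     # 0.5
```

Both statistics depend only on which observations are grouped together, so relabeling the clusters does not change the result. The Jaccard coefficient ignores pairs that are apart in both clusterings, whereas the Rand statistic counts them as agreements.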

Details

To compute the ClustOmit statistic, we first cluster the data given in x into num_clusters clusters with the clustering algorithm specified in cluster_method. We then omit each cluster in turn, along with all of the observations in that cluster. For each omitted cluster, we resample from the remaining observations and cluster the resampled observations into num_clusters - 1 clusters, again using the clustering algorithm specified in cluster_method. Next, we compute the similarity between the original cluster labels of the resampled observations and the cluster labels obtained from the bootstrapped sample. We approximate the sampling distribution of the ClustOmit statistic using a stratified, nonparametric bootstrapping scheme and use the apparent variability in the approximated sampling distribution as a diagnostic tool for further evaluation of the proposed clusters. By default, we utilize the Jaccard similarity coefficient in the calculation of the ClustOmit statistic to provide a clear interpretation of cluster assessment. The technical details of the ClustOmit statistic can be found in our forthcoming publication entitled "Cluster Stability Evaluation of Gene Expression Data."
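
The procedure described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helpers pair_jaccard and omit_one_cluster are hypothetical names, K-means stands in for cluster_method, the bootstrap is assumed to be stratified by the remaining observed clusters, and the weighted aggregate is assumed to weight each omitted cluster's score by that cluster's size.

```r
# Hypothetical helper: pair-counting Jaccard coefficient between two labelings.
pair_jaccard <- function(labels1, labels2) {
  idx <- combn(length(labels1), 2)
  same1 <- labels1[idx[1, ]] == labels1[idx[2, ]]
  same2 <- labels2[idx[1, ]] == labels2[idx[2, ]]
  sum(same1 & same2) / sum(same1 | same2)
}

# Hypothetical helper: omit one cluster, then bootstrap the remaining data.
omit_one_cluster <- function(x, labels, omit, num_reps = 50) {
  keep <- which(labels != omit)
  k_remaining <- length(unique(labels[keep]))
  vapply(seq_len(num_reps), function(b) {
    # Assumed stratification: resample with replacement within each
    # remaining observed cluster.
    boot_idx <- unlist(lapply(split(keep, labels[keep]), function(idx) {
      idx[sample.int(length(idx), replace = TRUE)]
    }), use.names = FALSE)
    # Recluster the resampled observations into one fewer cluster.
    boot_labels <- kmeans(x[boot_idx, , drop = FALSE],
                          centers = k_remaining, nstart = 10)$cluster
    # Compare with the original labels of the same resampled observations.
    pair_jaccard(labels[boot_idx], boot_labels)
  }, numeric(1))
}

# Simulated data: three well-separated bivariate normal populations.
set.seed(42)
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 6), ncol = 2),
           matrix(rnorm(60, mean = 12), ncol = 2))
labels <- kmeans(x, centers = 3, nstart = 10)$cluster

# One column of bootstrapped similarity scores per omitted cluster.
num_reps <- 20
boot_similarity <- sapply(sort(unique(labels)), function(k) {
  omit_one_cluster(x, labels, omit = k, num_reps = num_reps)
})

# Aggregate score per replicate, weighted by observed cluster sizes.
weights <- as.vector(table(labels))
boot_aggregate <- as.vector(boot_similarity %*% weights) / sum(weights)
summary(boot_aggregate)
```

Aggregate scores concentrated near 1 suggest that removing any one cluster leaves the remaining clusters intact, while a wide spread in the scores points to an unstable clustering.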

The ClustOmit cluster stability statistic is based on the cluster omission admissibility condition from Fisher and Van Ness (1971), who provide decision-theoretic admissibility conditions that a reasonable clustering algorithm should satisfy. The guidelines from Fisher and Van Ness (1971) establish a systematic foundation that is often lacking in the evaluation of clustering algorithms. The ClustOmit statistic is our proposed methodology to evaluate the cluster omission admissibility condition from Fisher and Van Ness (1971).

We require a clustering algorithm function to be specified in the argument cluster_method. The function given should accept at least the two arguments x and num_clusters, and may accept further arguments through ...:

x

matrix of observations to cluster

num_clusters

the number of clusters to find

...

additional arguments that can be passed on

Also, the function given should return only clustering labels for each observation in the matrix x. The additional arguments specified in ... are useful if a wrapper function is used: see the example below for an illustration.
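
As a further illustration of this interface, here is a minimal sketch of a wrapper around hierarchical clustering. The name hclust_wrapper and its linkage argument are hypothetical and not part of the package; the wrapper accepts x and num_clusters, forwards ... to hclust, and returns only the cluster labels.

```r
# Hypothetical wrapper (for illustration): hierarchical clustering that
# returns only one cluster label per row of x.
hclust_wrapper <- function(x, num_clusters, linkage = "average", ...) {
  tree <- hclust(dist(x), method = linkage, ...)
  cutree(tree, k = num_clusters)
}

# Two simulated bivariate normal populations with 20 observations each.
set.seed(1)
x <- rbind(matrix(rnorm(40), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

hc_labels <- hclust_wrapper(x, num_clusters = 2)
table(hc_labels)

# match.fun resolves a character string to the function of that name, so
# a wrapper can be given either by name or as the function itself.
cluster_fun <- match.fun("hclust_wrapper")
identical(cluster_fun(x, num_clusters = 2), hc_labels)
```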

Value

object of class clustomit, which contains a named list with elements

boot_aggregate:

vector of the aggregated similarity statistics for each bootstrap replicate

boot_similarity:

list containing the bootstrapped similarity scores for each cluster omitted

obs_clusters:

the clustering labels determined for the observations in x

num_clusters:

the number of clusters found

similarity:

the similarity statistic used for comparison between the original clustering and the resampled clusterings

References

Fisher, L. and Van Ness, J. (1971), Admissible Clustering Procedures, _Biometrika_, 58, 1, 91-104.

Hennig, C. (2007), Cluster-wise assessment of cluster stability, _Computational Statistics and Data Analysis_, 52, 258-271. http://www.jstor.org/stable/2334320

Examples

# First, we create a wrapper function for the K-means clustering algorithm
# that returns only the clustering labels for each observation (row) in x.
kmeans_wrapper <- function(x, num_clusters, num_starts = 10, ...) {
  kmeans(x = x, centers = num_clusters, nstart = num_starts, ...)$cluster
}

# For this example, we generate five multivariate normal populations with the
# sim_data function.
x <- sim_data("normal", delta = 1.5, seed = 42)$x

clustomit_out <- clustomit(x = x, num_clusters = 4,
                           cluster_method = "kmeans_wrapper", num_cores = 1)
clustomit_out2 <- clustomit(x = x, num_clusters = 5,
                            cluster_method = kmeans_wrapper, num_cores = 1)