kselection: Selection of K in K-means Clustering

View source: R/pham.R

kselection R Documentation

Selection of K in K-means Clustering

Description

Selects the number of clusters k for k-means clustering, based on the method of Pham et al.

Usage

kselection(
  x,
  fun_cluster = stats::kmeans,
  max_centers = 15,
  k_threshold = 0.85,
  progressBar = FALSE,
  trace = FALSE,
  parallel = FALSE,
  ...
)

Arguments

x

numeric matrix of data, or an object that can be coerced to such a matrix.

fun_cluster

function used for clustering (e.g. kmeans). Its first parameter must be a numeric matrix and its second the number of clusters. It must return an object with a named component withinss, a numeric vector of within-cluster sums of squares (one value per cluster, as returned by kmeans).

max_centers

maximum number of clusters for evaluation.

k_threshold

threshold on f(K). If no value of f(K) falls below it, the data set is not considered to contain more than one cluster. The default value is 0.85.

progressBar

show a progress bar.

trace

display a trace of the progress.

parallel

If TRUE, use foreach in parallel to run the function that implements the kmeans algorithm. A parallel backend, such as doMC, must be registered beforehand. The progress bar is disabled when this option is selected.

...

arguments to be passed to the kmeans method.
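
As an illustration of the fun_cluster contract described above, the following is a minimal sketch of a custom clustering function built on hierarchical clustering. The name hclust_cluster is hypothetical and not part of the package, and the sketch assumes that only the withinss component of the returned object is needed and that the function is called as fun_cluster(x, k, ...).

```r
# Hypothetical clustering function (not part of the package).
# First parameter: numeric matrix; second parameter: number of clusters.
hclust_cluster <- function(x, k, ...) {
  x <- as.matrix(x)
  # Ward hierarchical clustering, cut into k groups labelled 1..k
  groups <- stats::cutree(stats::hclust(stats::dist(x), method = "ward.D2"), k = k)
  # Within-cluster sum of squared distances to each cluster mean
  withinss <- vapply(seq_len(k), function(j) {
    m <- x[groups == j, , drop = FALSE]
    sum(sweep(m, 2, colMeans(m))^2)
  }, numeric(1))
  # Extra arguments in ... are ignored in this sketch
  list(cluster = groups, withinss = withinss)
}

set.seed(1)
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)
res <- hclust_cluster(dat, 2)
res$withinss  # one within-cluster sum of squares per cluster

# Assumed usage with the package loaded:
# sol <- kselection(dat, fun_cluster = hclust_cluster)
```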

Details

This function implements the method proposed by Pham, Dimov and Nguyen for selecting the number of clusters for the K-means algorithm. In this method a function f(K) is used to evaluate the quality of the resulting clustering and help decide on the optimal value of K for each data set. The f(K) function is defined as

$$
f(K) =
\begin{cases}
1 & \text{if } K = 1 \\
\dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\
1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
$$

where S_K is the sum of the distortions of all clusters and α_K is a weight factor, defined as

$$
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1
\end{cases}
$$

where N_d is the number of dimensions of the data set.

In this definition, f(K) is the ratio of the real distortion to the estimated distortion, and it decreases when there are areas of concentration in the data distribution.

Values of K that yield f(K) < 0.85 (the default k_threshold) can be recommended for clustering. If no value of K yields f(K) < 0.85, the data set cannot be considered to contain clusters.
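
The definition above can be sketched directly in base R. This is an illustrative sketch only, not the package's implementation: the name f_k_sketch is hypothetical, S_K is taken to be the sum of the withinss values returned by stats::kmeans, and the weight recursion is assumed to start at K = 2, as in Pham et al.

```r
# Hypothetical helper (not part of the package): f(K) for K = 1..max_centers
f_k_sketch <- function(x, max_centers = 6, ...) {
  x   <- as.matrix(x)
  n_d <- ncol(x)                      # N_d: number of dimensions
  stopifnot(n_d > 1)                  # alpha_K is only defined for N_d > 1

  # S_K: total distortion for each K (sum of within-cluster sums of squares)
  s_k <- vapply(seq_len(max_centers), function(k) {
    sum(stats::kmeans(x, k, ...)$withinss)
  }, numeric(1))

  # alpha_K: weight factor, defined for K >= 2
  alpha <- rep(NA_real_, max_centers)
  if (max_centers >= 2) alpha[2] <- 1 - 3 / (4 * n_d)
  if (max_centers >= 3) {
    for (k in 3:max_centers) alpha[k] <- alpha[k - 1] + (1 - alpha[k - 1]) / 6
  }

  # f(1) = 1, and f(K) = 1 whenever S_{K-1} = 0
  f <- rep(1, max_centers)
  if (max_centers >= 2) {
    for (k in 2:max_centers) {
      if (s_k[k - 1] != 0) f[k] <- s_k[k] / (alpha[k] * s_k[k - 1])
    }
  }
  f
}

set.seed(42)
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)
f <- f_k_sketch(dat, max_centers = 6, nstart = 10, iter.max = 50)
which(f < 0.85)  # candidate values of K under the default threshold
```

The extra arguments (nstart and iter.max here) are forwarded to stats::kmeans, so results depend on the random starts; setting a seed keeps the sketch reproducible.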

Value

an object with the f(K) results.

Author(s)

Daniel Rodriguez

References

D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.

See Also

num_clusters, get_f_k

Examples

# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Execute the method
sol <- kselection(dat)

# Get the results
k   <- num_clusters(sol) # optimal number of clusters
f_k <- get_f_k(sol)      # the f(K) vector

# Plot the results
plot(sol)

## Not run: 
# Parallel
require(doMC)
registerDoMC(cores = 4)

system.time(kselection(dat, max_centers = 50, nstart = 25))
system.time(kselection(dat, max_centers = 50, nstart = 25, parallel = TRUE))

## End(Not run)


kselection documentation built on May 17, 2022, 1:07 a.m.