kselection | R Documentation |
Selection of K in k-means clustering, based on the paper by Pham et al.
kselection( x, fun_cluster = stats::kmeans, max_centers = 15, k_threshold = 0.85, progressBar = FALSE, trace = FALSE, parallel = FALSE, ... )
x |
numeric matrix of data, or an object that can be coerced to such a matrix. |
fun_cluster |
function to cluster by (e.g. |
max_centers |
maximum number of clusters for evaluation. |
k_threshold |
threshold on f(K): if no value of K yields an f(K) below this threshold, the data set cannot be considered to contain more than one cluster. The default value is 0.85. |
progressBar |
show a progress bar. |
trace |
display a trace of the progress. |
parallel |
If set to true, use parallel |
... |
arguments to be passed to the kmeans method. |
This function implements the method proposed by Pham, Dimov and Nguyen for selecting the number of clusters for the K-means algorithm. In this method a function f(K) is used to evaluate the quality of the resulting clustering and help decide on the optimal value of K for each data set. The f(K) function is defined as
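f(K) = \begin{cases}
  1 & \text{if } K = 1 \\
  \dfrac{S_K}{\alpha_K S_{K-1}} & \text{if } S_{K-1} \neq 0,\ \forall K > 1 \\
  1 & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
where S_K is the sum of the distortions of all clusters and α_K is a weight factor which is defined as
\alpha_K = \begin{cases}
  1 - \dfrac{3}{4 N_d} & \text{if } K = 2 \text{ and } N_d > 1 \\
  \alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6} & \text{if } K > 2 \text{ and } N_d > 1
\end{cases}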
where N_d is the number of dimensions of the data set.
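The definitions above can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: alpha_k and f_k_sketch are hypothetical names, the distortion S_K is assumed to be the total within-cluster sum of squares reported by stats::kmeans, and the weight recursion is assumed to start at K = 2 (f(1) is fixed at 1, so no weight is needed for K = 1).

```r
# Hypothetical helpers (not part of kselection): weight factor and f(K).
alpha_k <- function(max_k, n_d) {
  stopifnot(n_d > 1)                      # the weight is defined for N_d > 1
  alpha <- rep(NA_real_, max_k)           # alpha_1 is never used, left as NA
  if (max_k >= 2) alpha[2] <- 1 - 3 / (4 * n_d)
  if (max_k >= 3) {
    for (k in 3:max_k) alpha[k] <- alpha[k - 1] + (1 - alpha[k - 1]) / 6
  }
  alpha
}

# s[k] is assumed to hold the distortion S_k obtained with k clusters.
f_k_sketch <- function(s, n_d) {
  max_k <- length(s)
  alpha <- alpha_k(max_k, n_d)
  f <- rep(1, max_k)                      # f(1) = 1, and f(K) = 1 when S_{K-1} = 0
  if (max_k >= 2) {
    for (k in 2:max_k) {
      if (s[k - 1] != 0) f[k] <- s[k] / (alpha[k] * s[k - 1])
    }
  }
  f
}

# Usage: distortions from stats::kmeans on a two-cluster data set.
set.seed(1)
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)
s <- sapply(1:6, function(k) kmeans(dat, centers = k, nstart = 10)$tot.withinss)
f <- f_k_sketch(s, n_d = ncol(dat))
```

With two dimensions the first weight is 1 - 3/8 = 0.625, and each later weight moves one sixth of the remaining distance towards 1, so the estimated distortion shrinks less and less as K grows.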
In this definition f(K) is the ratio of the real distortion to the estimated distortion and decreases when there are areas of concentration in the data distribution.
Values of K that yield f(K) < 0.85 can be recommended for clustering. If no value of K satisfies f(K) < 0.85, the data set cannot be considered to contain clusters.
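The threshold rule can be illustrated with a small helper. This is a sketch under stated assumptions, not the logic of num_clusters: recommend_k is a hypothetical name, picking the candidate with the smallest f(K) as the single suggestion is an assumption, and so is falling back to K = 1 when no candidate exists.

```r
# Hypothetical helper (not part of kselection): apply the threshold to f(K).
recommend_k <- function(f_k, k_threshold = 0.85) {
  candidates <- which(f_k < k_threshold)  # all K with f(K) below the threshold
  if (length(candidates) == 0) {
    # Assumption: no K qualifies, so treat the data as a single cluster.
    return(list(candidates = integer(0), best = 1L))
  }
  # Assumption: among the candidates, suggest the one with the smallest f(K).
  list(candidates = candidates,
       best = candidates[which.min(f_k[candidates])])
}

# Usage with a made-up f(K) vector for K = 1..4.
res <- recommend_k(c(1, 0.32, 0.9, 1.1))
res$best        # the single suggested K
res$candidates  # every K that passes the threshold
```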
an object with the f(K) results.
Daniel Rodriguez
D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.
num_clusters, get_f_k
# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Execute the method
sol <- kselection(dat)

# Get the results
k   <- num_clusters(sol) # optimal number of clusters
f_k <- get_f_k(sol)      # the f(K) vector

# Plot the results
plot(sol)

## Not run:
# Parallel
require(doMC)
registerDoMC(cores = 4)

system.time(kselection(dat, max_centers = 50, nstart = 25))
system.time(kselection(dat, max_centers = 50, nstart = 25, parallel = TRUE))

## End(Not run)