
Selection of k in k-means clustering, based on the method described in the paper by Pham et al.


| Argument | Description |
|---|---|
| `x` | numeric matrix of data, or an object that can be coerced to such a matrix. |
| `fun_cluster` | function to cluster by (e.g. |
| `max_centers` | maximum number of clusters for evaluation. |
| `k_threshold` | maximum value of |
| `progressBar` | show a progress bar. |
| `trace` | display a trace of the progress. |
| `parallel` | If set to true, use parallel |
| `...` | arguments to be passed to the kmeans method. |

This function implements the method proposed by Pham, Dimov and Nguyen for
selecting the number of clusters for the K-means algorithm. In this method
a function *f(K)* is used to evaluate the quality of the resulting
clustering and help decide on the optimal value of *K* for each data
set. The *f(K)* function is defined as

$$
f(K) =
\begin{cases}
1, & \text{if } K = 1; \\
\dfrac{S_K}{\alpha_K S_{K-1}}, & \text{if } S_{K-1} \neq 0,\ \forall K > 1; \\
1, & \text{if } S_{K-1} = 0,\ \forall K > 1
\end{cases}
$$

where *S_K* is the sum of the distortions of all clusters and *α_K*
is a weight factor, defined as

$$
\alpha_K =
\begin{cases}
1 - \dfrac{3}{4 N_d}, & \text{if } K = 2 \text{ and } N_d > 1; \\
\alpha_{K-1} + \dfrac{1 - \alpha_{K-1}}{6}, & \text{if } K > 2 \text{ and } N_d > 1
\end{cases}
$$

where *N_d* is the number of dimensions of the data set.

In this definition *f(K)* is the ratio of the real distortion to the
estimated distortion and decreases when there are areas of concentration in
the data distribution.
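As an illustration of the definitions above, the following is a minimal sketch of how *f(K)* could be computed by hand with base R. It is not the package's internal implementation. The helper names `alpha_weights` and `f_k_from_distortion` are hypothetical, the distortion *S_K* is assumed to correspond to `tot.withinss` as returned by `stats::kmeans`, and the first weight is assigned at *K = 2*, as in Pham et al.

```r
# Hypothetical helper: weight factor alpha_K for K = 2..max_k (requires N_d > 1).
# alpha[1] is left as NA because f(1) is defined as 1 and needs no weight.
alpha_weights <- function(max_k, n_d) {
  stopifnot(max_k >= 2, n_d > 1)
  alpha <- rep(NA_real_, max_k)
  alpha[2] <- 1 - 3 / (4 * n_d)
  if (max_k > 2) {
    for (k in 3:max_k) {
      alpha[k] <- alpha[k - 1] + (1 - alpha[k - 1]) / 6
    }
  }
  alpha
}

# Hypothetical helper: f(K) from a vector of distortions s = (S_1, ..., S_max_k).
f_k_from_distortion <- function(s, n_d) {
  stopifnot(length(s) >= 2)
  max_k <- length(s)
  alpha <- alpha_weights(max_k, n_d)
  f <- numeric(max_k)
  f[1] <- 1
  for (k in 2:max_k) {
    f[k] <- if (s[k - 1] != 0) s[k] / (alpha[k] * s[k - 1]) else 1
  }
  f
}

# Two well-separated clusters in two dimensions.
set.seed(1)
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Assumption: S_K is taken to be the total within-cluster sum of squares.
s <- sapply(1:6, function(k) kmeans(dat, centers = k, nstart = 10)$tot.withinss)
f_k <- f_k_from_distortion(s, n_d = ncol(dat))
round(f_k, 3)
```

For data like this, *f(2)* should be far smaller than the other values, because moving from one cluster to two removes most of the distortion.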

The values of *K* that yield *f(K) < 0.85* can be recommended for
clustering. If no value of *K* gives *f(K) < 0.85*, the data set
cannot be considered to contain clusters.
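The decision rule can be sketched as a small function over an *f(K)* vector. This is an illustrative sketch only: `recommend_k` is a hypothetical name, the example values are made up, and returning *K = 1* when no value falls below the threshold is an assumed reading of "no clusters", not documented package behaviour.

```r
# Hypothetical helper: apply the threshold rule to a vector f_k = (f(1), ..., f(K_max)).
recommend_k <- function(f_k, k_threshold = 0.85) {
  candidates <- which(f_k < k_threshold)
  if (length(candidates) == 0) {
    # Assumption: with no f(K) below the threshold, treat the data as one group.
    return(list(candidates = integer(0), best = 1L))
  }
  # Among the admissible values, prefer the K with the smallest f(K).
  list(candidates = candidates,
       best = candidates[which.min(f_k[candidates])])
}

f_k <- c(1.00, 0.06, 1.09, 1.02, 0.98)  # made-up values for illustration
res <- recommend_k(f_k)
res$best
```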

an object with the *f(K)* results.

Daniel Rodriguez

D T Pham, S S Dimov, and C D Nguyen, "Selection of k in k-means clustering", Mechanical Engineering Science, 2004, pp. 103-119.

```
library(kselection)

# Create a data set with two clusters
dat <- matrix(c(rnorm(100, 2, .1), rnorm(100, 3, .1),
                rnorm(100, -2, .1), rnorm(100, -3, .1)), 200, 2)

# Execute the method
sol <- kselection(dat)

# Get the results
k   <- num_clusters(sol) # optimal number of clusters
f_k <- get_f_k(sol)      # the f(K) vector

# Plot the results
plot(sol)

## Not run:
# Parallel (a parallel backend must be registered first)
library(doMC)
registerDoMC(cores = 4)
system.time(kselection(dat, max_centers = 50, nstart = 25))
system.time(kselection(dat, max_centers = 50, nstart = 25, parallel = TRUE))
## End(Not run)
```

