
Multithreaded weighted Minkowski and spherical K-means via Lloyd's algorithm over a dense representation of the data, subject to cluster size (weight) constraints.

Usage

```
KMconstrained(X, centroid, Xw, clusterWeightUB, minkP,
              convergenceTail, tailConvergedRelaErr,
              maxIter = 100, maxCore = 7,
              paraSortInplaceMerge, verbose)
```

Arguments

`X`
A d x N numeric matrix where d is the dimensionality and N the number of observations. Each column is an observation.

`centroid`
A d x K numeric matrix where K is the number of clusters. Each column is an initial centroid.

`Xw`
A numeric vector of size N, the weights on the observations.

`clusterWeightUB`
An integer vector of size K. `clusterWeightUB[k]` is the upper bound on the total weight (the size, if all weights are 1) of cluster k.

`minkP`
A numeric value or a character string. If numeric, `minkP` is the power p in the Minkowski distance; the character string `"cosine"` selects the cosine dissimilarity used by spherical K-means.

`convergenceTail`
An integer. The algorithm may end up with "cyclical convergence" due to the size / weight constraints, that is, every few iterations produce the same clustering. If the cost (total in-cluster distance) of each of the last `convergenceTail` iterations stays within a relative error of `tailConvergedRelaErr` of their mean, the algorithm is deemed converged.

`tailConvergedRelaErr`
A numeric value, explained in `convergenceTail`.

`maxIter`
An integer. The maximal number of iterations. Default 100.

`maxCore`
An integer. The maximal number of threads to invoke. It should be no more than the total number of logical processors on the machine. Default 7.

`paraSortInplaceMerge`
A boolean value. `TRUE` makes the parallel merge sort merge in place, trading some speed for lower memory consumption.

`verbose`
A boolean value. `TRUE` prints iteration progress.
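The tail-convergence criterion for `convergenceTail` and `tailConvergedRelaErr` might be sketched as follows. This is plain R for illustration only; the package's exact internal criterion may differ, and `tailConverged` is a hypothetical helper name, not part of the API.

```r
# Sketch of a tail-convergence test: the run is deemed converged when
# every one of the last `tail` costs stays within relative error
# `relaErr` of their mean.
tailConverged <- function(costs, tail, relaErr) {
  if (length(costs) < tail) return(FALSE)
  recent <- costs[(length(costs) - tail + 1L):length(costs)]
  all(abs(recent - mean(recent)) / mean(recent) < relaErr)
}
```

For example, `tailConverged(c(9.1, 5.2, 3.3, 3.3, 3.3), 3, 1e-6)` returns `TRUE` because the last three costs are (numerically) identical, while a still-decreasing cost sequence returns `FALSE`.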

Details

See details in `KM()` for common implementation highlights. The weight upper bounds are implemented as follows:

In each iteration, all the (observation ID, centroid ID, distance) tuples are sorted by distance. From the first to the last tuple, the algorithm puts the observation in the cluster labeled by the centroid ID if (i) the observation has not already been assigned and (ii) the cluster's weight has not exceeded its upper bound. The actual implementation differs slightly; a parallel merge sort is crafted for computing speed.
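The greedy assignment step above can be sketched in plain R for the unweighted, single-threaded case. This is illustrative code, not the package's multithreaded implementation; `assignConstrained` is a hypothetical helper name.

```r
# Greedy size-constrained assignment: sort all (observation, centroid,
# distance) tuples by distance, then assign each observation to the
# first feasible centroid.
assignConstrained <- function(D, sizeUB) {
  # D: K x N matrix; D[k, i] is the distance from observation i to centroid k.
  K <- nrow(D); N <- ncol(D)
  tuples <- data.frame(
    obs  = rep(seq_len(N), each = K),
    ctr  = rep(seq_len(K), times = N),
    dist = as.vector(D))
  tuples <- tuples[order(tuples$dist), ]  # all tuples, sorted by distance
  label <- integer(N)                     # 0 = not yet assigned
  size  <- integer(K)
  for (i in seq_len(nrow(tuples))) {
    o <- tuples$obs[i]; k <- tuples$ctr[i]
    if (label[o] == 0L && size[k] < sizeUB[k]) {  # rules (i) and (ii)
      label[o] <- k
      size[k]  <- size[k] + 1L
    }
  }
  label
}
```

Observations squeezed out of a full nearest cluster fall through to their next-nearest feasible centroid, which is how the size bounds reshape the clustering.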

Value

A list of size `K`, the number of clusters. Each element is a list of 3 vectors:

`centroid`
a numeric vector of size d, the centroid of the cluster.

`clusterMember`
an integer vector of the indexes of observations (columns of `X`) grouped to the cluster.

`member2centroidDistance`
a numeric vector of the same size as `clusterMember`, the distances from the members to the centroid.

An empty `clusterMember` implies an empty cluster.
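For downstream use, a flat label vector can be reassembled from this list structure. This is a small helper sketch; `labelsFromResult` is not part of the package.

```r
# Rebuild a length-N vector of cluster labels from the returned list;
# observations in empty clusters (if any) stay labeled 0.
labelsFromResult <- function(rst, N) {
  label <- integer(N)
  for (k in seq_along(rst))
    label[rst[[k]]$clusterMember] <- k
  label
}
```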

Note

Although it rarely happens, divergence of K-means under a non-Euclidean distance measure (`minkP != 2`) is still a theoretical possibility. Bounding the cluster weights / sizes increases the chance of divergence.

Examples

```
N = 3000L # Number of points.
d = 500L # Dimensionality.
K = 50L # Number of clusters.
dat = matrix(rnorm(N * d) + runif(N * d), nrow = d)
# Use kmeans++ initialization.
centroidInd = GMKMcharlie::KMppIni(
X = dat, K, firstSelection = 1L, minkP = 2, stochastic = FALSE,
seed = sample(1e9L, 1), maxCore = 2L, verbose = TRUE)
centroid = dat[, centroidInd]
# Each cluster size should not be greater than N / K * 2.
sizeConstraints = as.integer(rep(N / K * 2, K))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
# Size upper bounds vary in [N / K * 1.5, N / K * 2]
sizeConstraints = as.integer(round(runif(K, N / K * 1.5, N / K * 2)))
system.time({rst = GMKMcharlie::KMconstrained(
X = dat, centroid = centroid, clusterWeightUB = sizeConstraints,
maxCore = 2L, tailConvergedRelaErr = 1e-6, verbose = TRUE)})
```
