km: km
In RBigData/pbdSHAQ: Tools for Tall Distributed Matrices


k-means via Lloyd's Algorithm.

km(x, k = 2, maxiter = 100, seed = get_random_seed())

`x`	A shaq.
`k`	The 'k' in k-means.
`maxiter`	The maximum number of iterations.
`seed`	A seed for determining the (random) initial centroids. Every process must use the same seed; otherwise the results are unpredictable. If you do not provide a seed, a good initial seed will be chosen.

Note that the function does not respect set.seed() or comm.set.seed(). For managing random seeds, use the seed parameter.
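To illustrate why a shared seed matters, here is a minimal sketch in base R with no MPI involved. It assumes the initial centroids are chosen by drawing k global row indices; that is a guess about the internals, not something this page states. The helper pick_initial_rows is hypothetical, and set.seed() is used here only to drive base R's generator for the illustration, not as a way to control km().

```r
# pick_initial_rows is a hypothetical helper, not part of the package.
# It draws k global row indices from a generator seeded with `seed`.
pick_initial_rows = function(m_global, k, seed)
{
  set.seed(seed)
  sort(sample(m_global, k))
}

m_global = 30  # total number of rows across all simulated ranks
k = 3

# Three simulated ranks that all use the same seed agree on the rows.
picks_same = lapply(1:3, function(rank) pick_initial_rows(m_global, k, seed = 1234))

# Ranks with different seeds draw independently, so agreement is not guaranteed.
picks_diff = lapply(1:3, function(rank) pick_initial_rows(m_global, k, seed = rank))
```

With a shared seed every simulated rank reports the same indices, so all ranks would start from the same centroids.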

The iterations stop either when the maximum number of iterations has been reached, or when the centers in the current iteration match the centers from the previous iteration to within 1e-8.
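The stopping rule can be seen in a serial sketch of Lloyd's algorithm written in base R. This is not the package's implementation: lloyd_sketch is a hypothetical helper, it works on an ordinary matrix rather than a shaq, and it assumes "within 1e-8" means the largest absolute change in any center coordinate.

```r
# Serial sketch of Lloyd's algorithm; lloyd_sketch is a hypothetical helper,
# not part of the package.
lloyd_sketch = function(x, k = 2, maxiter = 100, tol = 1e-8, seed = 1234)
{
  set.seed(seed)
  # Pick k distinct rows as the initial centers.
  centers = x[sample(nrow(x), k), , drop = FALSE]
  labels = integer(nrow(x))

  for (iter in seq_len(maxiter))
  {
    # Assignment step: squared distance from every row to every center.
    d = sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    labels = max.col(-d, ties.method = "first")

    # Update step: each center becomes the mean of its assigned rows.
    new_centers = centers
    for (j in seq_len(k))
    {
      members = labels == j
      if (any(members))
        new_centers[j, ] = colMeans(x[members, , drop = FALSE])
    }

    # Stop once no center coordinate moved by more than tol (assumed criterion).
    converged = max(abs(new_centers - centers)) < tol
    centers = new_centers
    if (converged)
      break
  }

  list(centers = centers, labels = labels, iterations = iter)
}

# Two well-separated groups of 20 points each.
set.seed(1)
x = rbind(matrix(rnorm(40, mean = 0), 20, 2), matrix(rnorm(40, mean = 10), 20, 2))
fit = lloyd_sketch(x, k = 2)
```

On data this well separated the loop should exit through the tolerance check long before maxiter is reached.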

For best performance, the data should be as balanced as possible across all MPI ranks.

A list containing the cluster centers (global), the observation labels, i.e. the assignments to clusters (distributed shaq), and the total number of iterations (global).

Most of the computation is local. However, at each iteration there is a length n*k and a length k allreduce call to update the centers. There is also a check at the beginning of the call to find out how many observations come before the current process's data, which is an allgather operation.
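The communication pattern described above can be simulated in base R without MPI. In this sketch a list of matrices stands in for the ranks, Reduce() stands in for allreduce, and a cumulative sum stands in for the allgather of row counts. It assumes the length n*k message carries per-cluster coordinate sums and the length k message carries per-cluster member counts; the page does not say what the messages contain.

```r
# Simulate 3 MPI ranks with a list of local row blocks (no MPI required).
set.seed(42)
n = 2   # columns
k = 2   # clusters
blocks = list(
  matrix(rnorm(8), 4, n),
  matrix(rnorm(6, mean = 5), 3, n),
  matrix(rnorm(10, mean = 5), 5, n)
)
centers = rbind(c(0, 0), c(5, 5))

# "allgather" of local row counts: each rank learns how many rows precede its own.
local_rows = sapply(blocks, nrow)
offsets = cumsum(local_rows) - local_rows

# Local work: per-rank coordinate sums (k x n) and member counts (length k).
local_stats = lapply(blocks, function(b) {
  d = sapply(seq_len(k), function(j) rowSums(sweep(b, 2, centers[j, ])^2))
  labels = max.col(-d, ties.method = "first")
  sums = matrix(0, k, n)
  for (j in seq_len(k))
    sums[j, ] = colSums(b[labels == j, , drop = FALSE])
  list(sums = sums, counts = tabulate(labels, nbins = k))
})

# "allreduce": elementwise sums across ranks (n*k values, then k values).
global_sums = Reduce(`+`, lapply(local_stats, `[[`, "sums"))
global_counts = Reduce(`+`, lapply(local_stats, `[[`, "counts"))

# Every rank can now compute the same updated centers locally.
new_centers = global_sums / global_counts
```

Only the small sums and counts cross rank boundaries; the rows themselves never move, which is why the cost per iteration does not depend on the number of observations.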

Phillips, J. Data Mining: Algorithms, Geometry, and Probability. https://www.cs.utah.edu/~jeffp/DMBook/DM-AGP.html

## Not run: 
suppressMessages(library(kazaam))
set.seed(1234)

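m.local = 10       # rows held by each process
n = 2              # columns
k = comm.size()    # one cluster per MPI rank
# Each rank's rows are centered at 10 * rank, so the groups are well separated
data = matrix(rnorm(m.local*n, mean=10*comm.rank()), m.local, n)
x = shaq(data)

# Cluster the distributed matrix and print the result
cl = km(x, k=k)
cl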

finalize()

## End(Not run)