km: km


View source: R/kmeans.r

Description

k-means via Lloyd's Algorithm.

Usage

km(x, k = 2, maxiter = 100, seed = get_random_seed())

Arguments

x

A shaq.

k

The number of clusters (the 'k' in k-means).

maxiter

The maximum number of iterations to run.

seed

A seed for determining the (random) initial centroids. Every process must use the same seed; otherwise the behavior is unpredictable. If you do not provide a seed, a good initial seed will be chosen.

Details

Note that the function does not respect set.seed() or comm.set.seed(). For managing random seeds, use the seed parameter.

The iterations stop either when the maximum number of iterations has been reached, or when the centers in the current iteration agree with the centers from the previous iteration to within 1e-8.
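The stopping rule can be illustrated with a serial sketch of Lloyd's algorithm in base R. This is not the package's implementation: lloyd_sketch is a hypothetical helper, the initial centers are assumed to be k randomly chosen rows, and "within 1e-8" is interpreted here as the largest absolute change in any center coordinate, which the documentation does not specify.

```r
# Serial sketch of Lloyd's algorithm; lloyd_sketch is a hypothetical helper,
# not part of kazaam. Assumes x is a numeric matrix with at least 2 rows.
lloyd_sketch = function(x, k, maxiter = 100, tol = 1e-8, seed = 1234) {
  set.seed(seed)
  # Assumed initialization: pick k distinct rows as the starting centers.
  centers = x[sample(nrow(x), k), , drop = FALSE]
  labels = integer(nrow(x))
  for (iter in seq_len(maxiter)) {
    # Assignment step: squared distance from every row to every center.
    d = sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    labels = max.col(-d, ties.method = "first")
    # Update step: each center becomes the mean of its assigned rows.
    new_centers = centers
    for (j in seq_len(k)) {
      members = labels == j
      if (any(members)) new_centers[j, ] = colMeans(x[members, , drop = FALSE])
    }
    # Stop once no center coordinate moved by more than tol (assumed norm).
    converged = max(abs(new_centers - centers)) < tol
    centers = new_centers
    if (converged) break
  }
  list(centers = centers, labels = labels, iterations = iter)
}

# Two groups of 20 points, centered near 0 and near 10.
set.seed(42)
x = rbind(matrix(rnorm(40, mean = 0), 20, 2),
          matrix(rnorm(40, mean = 10), 20, 2))
fit = lloyd_sketch(x, k = 2)
```

The loop exits through either condition described above: iter reaches maxiter, or two successive sets of centers are numerically indistinguishable.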

For best performance, the data should be as balanced as possible across all MPI ranks.

Value

A list containing the cluster centers (global), the observation labels, i.e. the assignments to clusters (distributed shaq), and the total number of iterations (global).

Communication

Most of the computation is local. However, at each iteration there is a length n*k and a length k allreduce call to update the centers. There is also a check at the beginning of the call to find out how many observations come before the current process's data, which is an allgather operation.
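The center update described above can be emulated without MPI to show why two reductions suffice. In this sketch a list of row chunks stands in for the MPI ranks, and Reduce with `+` stands in for the allreduce; local_stats and the chunking are hypothetical and only illustrate the idea, assuming n denotes the number of columns.

```r
# Emulate the per-iteration center update with 3 pretend ranks.
set.seed(1)
n = 2                                   # number of columns
k = 3                                   # number of clusters
x = matrix(rnorm(30 * n), 30, n)
labels = rep(seq_len(k), length.out = 30)     # a fixed assignment, for illustration
chunks = split(seq_len(30), rep(1:3, each = 10))  # rows owned by each pretend rank

# Each rank computes a k-by-n matrix of per-cluster sums and a length-k count.
local_stats = function(x_local, labels_local, k) {
  sums = matrix(0, k, ncol(x_local))
  for (j in seq_len(k)) {
    sums[j, ] = colSums(x_local[labels_local == j, , drop = FALSE])
  }
  list(sums = sums, counts = tabulate(labels_local, nbins = k))
}
stats = lapply(chunks, function(idx) local_stats(x[idx, , drop = FALSE], labels[idx], k))

# Stand-in for the two allreduce calls: elementwise sums across ranks.
global_sums = Reduce(`+`, lapply(stats, `[[`, "sums"))      # n*k values
global_counts = Reduce(`+`, lapply(stats, `[[`, "counts"))  # k values

# Every rank can now form identical centers from the reduced quantities.
centers = global_sums / global_counts

# Reference computed directly from the full data.
direct = t(sapply(seq_len(k), function(j) colMeans(x[labels == j, , drop = FALSE])))
```

Because only the k-by-n sums and the k counts cross rank boundaries, the communication volume per iteration is independent of the number of rows.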

References

Phillips, J. Data Mining: Algorithms, Geometry, and Probability. https://www.cs.utah.edu/~jeffp/DMBook/DM-AGP.html

Examples

## Not run: 
suppressMessages(library(kazaam))
set.seed(1234)

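# Each process generates its own 10 x 2 block of rows, centered at 10 * rank
m.local = 10
n = 2
# Ask for one cluster per MPI rank
k = comm.size()
data = matrix(rnorm(m.local*n, mean=10*comm.rank()), m.local, n)
# Wrap the local block in a distributed shaq
x = shaq(data)

cl = km(x, k=k)
cl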

finalize()

## End(Not run)

RBigData/pbdSHAQ documentation built on Nov. 9, 2021, 9:10 a.m.