k-means cluster analysis without the memory overhead, and possibly in parallel using shared memory.
bigkmeans(x, centers, iter.max = 10, nstart = 1)
x | a
centers | a scalar denoting the number of clusters, or for k clusters, a k by
iter.max | the maximum number of iterations.
nstart | number of random starts, to be done in parallel if there is a registered backend (see below).
The real benefit is the lack of memory overhead compared to the standard kmeans() function. Part of the overhead from kmeans() stems from the way it looks for unique starting centers, and could be improved upon. The bigkmeans() function works on either regular R matrix objects or on big.matrix objects. In either case, it requires no extra memory beyond the data, other than for recording the cluster memberships, whereas kmeans() makes at least two extra copies of the data. kmeans() is even worse if multiple starts (nstart > 1) are used.
If nstart > 1 and you are using bigkmeans() in parallel, a vector of cluster memberships will need to be stored for each worker, which could be memory-intensive for large data. This isn't a problem if you are running the multiple starts sequentially.
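As a rough back-of-the-envelope check on that per-worker cost, the sketch below uses only base R. It assumes memberships are held as one 4-byte integer per row; the row count and worker count are hypothetical values, and none of this is taken from the bigkmeans() source.

```r
# Rough estimate only: assumes one 4-byte integer membership per row.
n_rows    <- 1e6   # hypothetical number of rows
n_cols    <- 3
n_workers <- 3     # hypothetical number of parallel workers

data_bytes       <- n_rows * n_cols * 8          # doubles take 8 bytes each
membership_bytes <- n_rows * 4                   # integers take 4 bytes each
parallel_bytes   <- membership_bytes * n_workers # one vector per worker

# Compare the estimate with what R actually allocates for an integer vector.
measured <- as.numeric(object.size(integer(n_rows)))

# Sizes in MiB.
round(c(data = data_bytes,
        one_membership = membership_bytes,
        all_workers = parallel_bytes,
        measured = measured) / 2^20, 2)
```

Under these assumptions the membership vectors stay small next to the data when there are few workers, but the total grows linearly with the number of workers.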
Unless you have a really big data set (where a single run of kmeans() not only burns memory but also takes more than a few seconds), using parallel computing for multiple random starts is unlikely to be much faster than running them sequentially.
Only the algorithm by MacQueen is used here.
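For comparison on data that fits in memory, base R's kmeans() can be asked for the same algorithm through its algorithm argument. A minimal sketch on simulated data follows; the data and the seed are illustrative only.

```r
set.seed(1)
# Two well-separated groups in three dimensions (illustrative data).
y <- rbind(matrix(rnorm(300), ncol = 3),
           matrix(rnorm(300, mean = 5), ncol = 3))

# Request MacQueen's algorithm explicitly in stats::kmeans().
fit <- kmeans(y, centers = 2, iter.max = 10, algorithm = "MacQueen")

class(fit)     # "kmeans"
fit$size
fit$centers
```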
An object of class kmeans, just as produced by kmeans().
A comment should be made about the excellent package foreach. By default, it provides foreach, which is used much like a for loop, here over the nstart random starting points. Even so, there are efficiencies: each result is compared to the previous best result, rather than saving everything and doing a final comparison of all results.
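The idea of comparing each result to the previous best can be sketched in base R as follows. This is an illustration only: best_of_starts() is a hypothetical helper, not part of biganalytics, and it uses stats::kmeans() on an ordinary matrix with a plain for loop rather than foreach.

```r
# Illustrative only: a sequential multi-start loop that keeps the best
# result seen so far instead of storing every fit.
best_of_starts <- function(y, k, nstart = 5, iter.max = 10) {
  best <- NULL
  for (i in seq_len(nstart)) {
    fit <- kmeans(y, centers = k, iter.max = iter.max,
                  algorithm = "MacQueen")
    # Keep the fit with the smallest total within-cluster sum of squares.
    if (is.null(best) || sum(fit$withinss) < sum(best$withinss)) {
      best <- fit
    }
  }
  best
}

set.seed(42)
y <- rbind(matrix(rnorm(300), ncol = 3),
           matrix(rnorm(300, mean = 5), ncol = 3))
ans <- best_of_starts(y, k = 2, nstart = 5)
ans$centers
```

Only two fits are alive at any time (the current one and the best so far), so memory use does not grow with nstart.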
When a parallel backend has been registered (see packages doSNOW, doMC, and doMPI, for example), bigkmeans() automatically distributes the nstart random starting points across the available workers. This is done in shared memory on an SMP, but is distributed on a cluster only if the big.matrix is file-backed. If used on a cluster with an in-RAM big.matrix, it will fail horribly. We're considering an extra option as an alternative to the current behavior.
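A file-backed big.matrix can be created and re-attached roughly as follows. This is a sketch that assumes the bigmemory package is installed; the file names and the temporary directory are illustrative, and on a real cluster the backing path would have to be visible to every node.

```r
library(bigmemory)

# Use a fresh directory so the backing file does not already exist.
backing_dir <- tempfile("bigkmeans_demo_")
dir.create(backing_dir)

# Create a file-backed big.matrix (file names are illustrative).
x <- filebacked.big.matrix(1000, 3, type = "double", init = 0,
                           backingfile = "kmeans_demo.bin",
                           backingpath = backing_dir,
                           descriptorfile = "kmeans_demo.desc")
x[, ] <- rnorm(3000)

is.filebacked(x)   # TRUE for a file-backed matrix

# Another R process can attach the same data through the descriptor file.
x2 <- attach.big.matrix(file.path(backing_dir, "kmeans_demo.desc"))
dim(x2)
```

Because the data live in a file rather than in one process's RAM, workers can attach the matrix from the descriptor instead of receiving a copy.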
John W. Emerson <bigmemoryauthors.@gmail.com>
Hartigan, J. A. and Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics 28, 100–108.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
# Simple example (with one processor):
library(bigmemory)
library(biganalytics)  # provides bigkmeans()
x <- big.matrix(100000, 3, init=0, type="double")
x[seq(1,100000,by=2),] <- rnorm(150000)
x[seq(2,100000,by=2),] <- rnorm(150000, 5, 1)
head(x)
ans <- bigkmeans(x, 1) # One cluster isn't always allowed
# but is convenient.
ans$centers
ans$withinss
ans$size
apply(x, 2, mean)
ans <- bigkmeans(x, 2, nstart=5) # Sequential multiple starts.
class(ans)
names(ans)
ans$centers
ans$withinss
ans$size
# To use a parallel backend, try something like the following,
# assuming you have at least 3 cores available on this machine.
# Each processor does incur memory overhead for the storage of
# cluster memberships.
## Not run:
library(doSNOW)
cl <- makeCluster(3, type="SOCK")
registerDoSNOW(cl)
ans <- bigkmeans(x, 2, nstart=5)
## End(Not run)
# Both of the following are run sequentially, but with less memory overhead
# using bigkmeans(). Note that the gc() comparisons aren't completely
# fair, because the big.matrix objects aren't reflected in the gc()
# summary. But the savings are there.
gc(reset=TRUE)
time.new <- system.time(print(bigkmeans(x, 2, nstart=5)$centers))
gc()
y <- x[,]
rm(x)
gc(reset=TRUE)
time.old <- system.time(print(kmeans(y, 2, nstart=5)$centers))
gc()
# The bigkmeans() centers should match the kmeans() centers, without
# the memory overhead and running more quickly.
time.new
time.old