View source: R/madlib-kmeans.R
The wrapper function for MADlib's kmeans clustering [1]. Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In kmeans, each cluster is represented by a centroid and consists of all points to which that centroid is closest. MADlib parallelizes the computation if the connected database is a Greenplum or HAWQ database.
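To make the centroid definition concrete, here is a minimal base-R sketch of one assignment-and-update step on a small in-memory matrix. It does not call MADlib or touch a database; the helper name `assign_and_update` and the toy data are hypothetical, and squared Euclidean distance with a mean update is assumed purely for illustration.

```r
# Toy data: six 2-D points forming two obvious groups (hypothetical values).
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))
# Two initial centroids, one per row.
cent <- rbind(c(0, 0), c(10, 10))

# One Lloyd-style step: assign each point to its closest centroid
# (squared Euclidean distance), then recompute each centroid as the
# mean of the points assigned to it.
assign_and_update <- function(pts, cent) {
  # d2[i, j] = squared distance from point i to centroid j
  d2 <- sapply(seq_len(nrow(cent)), function(j) {
    rowSums(sweep(pts, 2, cent[j, ])^2)
  })
  # Index of the closest centroid for every point
  cluster <- max.col(-d2, ties.method = "first")
  new_cent <- t(sapply(seq_len(nrow(cent)), function(j) {
    colMeans(pts[cluster == j, , drop = FALSE])
  }))
  list(cluster = cluster, centers = new_cent)
}

step <- assign_and_update(pts, cent)
step$cluster   # cluster index of each point
step$centers   # updated centroid coordinates, one row per cluster
```

Repeating this step until few or no points change cluster is the basic iteration that kmeans implementations build on; the sketch ignores edge cases such as empty clusters.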
madlib.kmeans(
    x, centers, iter.max = 10, nstart = 1, algorithm = "Lloyd", key,
    fn.dist = "squared_dist_norm2", agg.centroid = "avg", min.frac = 0.001,
    kmeanspp = FALSE, seeding.sample.ratio = 1.0, ...)
x | An object of
centers | A number, a matrix or a db.data.frame object. If it is a number, it sets the number of target centroids, and the random (or kmeans++) seeding method is used. Otherwise, it supplies the initial centers: if it is a matrix, its rows denote the initial centroid coordinates; if it is a db.data.frame object, it points to a table in the connected database that contains the initial centroids.
iter.max | The maximum number of iterations allowed.
nstart | If centers is a number, this parameter specifies how many random sets should be chosen.
algorithm | The algorithm to compute the kmeans. Currently disabled (default: "Lloyd").
key | Name of the column (from the table that is pointed by
fn.dist | The distance function used by MADlib to compute the objective function.
agg.centroid | The aggregate function used by MADlib to compute the centroids.
min.frac | The minimum fraction of centroids reassigned to continue iterating.
kmeanspp | Whether to call MADlib's kmeans++ centroid seeding method.
seeding.sample.ratio | The proportion of the original dataset to subsample for the kmeans++ centroid seeding method.
... | Further arguments passed to or from other methods. Currently, no more parameters can be passed to madlib.kmeans.
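The kmeanspp and seeding.sample.ratio arguments refer to kmeans++ seeding. As a rough illustration of the general kmeans++ idea (not MADlib's implementation), the base-R sketch below picks the first centroid uniformly at random and each later centroid with probability proportional to its squared distance from the nearest centroid already chosen. The helper `kmeanspp_seed`, its `sample.ratio` argument and the toy data are hypothetical, and reading the ratio as "seed from a random subsample of the rows" is an assumption.

```r
# Hypothetical helper: kmeans++-style seeding on an in-memory matrix.
kmeanspp_seed <- function(pts, k, sample.ratio = 1.0) {
  # Assumption: the ratio selects a random subsample of rows to seed from.
  n_sub <- max(k, ceiling(sample.ratio * nrow(pts)))
  sub <- pts[sample.int(nrow(pts), n_sub), , drop = FALSE]
  # First centroid: chosen uniformly at random.
  idx <- sample.int(nrow(sub), 1)
  # d2 holds each point's squared distance to its closest chosen centroid.
  d2 <- rowSums(sweep(sub, 2, sub[idx, ])^2)
  while (length(idx) < k) {
    # Next centroid: sampled with probability proportional to d2.
    nxt <- sample.int(nrow(sub), 1, prob = d2)
    idx <- c(idx, nxt)
    d2 <- pmin(d2, rowSums(sweep(sub, 2, sub[nxt, ])^2))
  }
  sub[idx, , drop = FALSE]
}

set.seed(1)
# Toy data: two groups of 20 two-dimensional points each.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))
seeds <- kmeanspp_seed(pts, k = 2)
dim(seeds)  # one row per initial centroid
```

A matrix built this way has the same shape as the matrix form of centers described above: one row per initial centroid.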
See madlib.kmeans for more details.
For the return value of kmeans clustering see madlib.kmeans for details. MADlib kmeans clustering output is similar to the output of the kmeans function of the R package stats. madlib.kmeans also returns an object of class "kmeans", which has a print and a fitted method. It is a list with at least the following components:
cluster | A vector of integers (from
centers | A matrix of cluster centres.
withinss | Vector of within-cluster sum of squares, one component per cluster.
tot.withinss | Total within-cluster sum of squares, i.e.
size | The number of points in each cluster.
iter | The number of (outer) iterations.
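Because the return value is modeled on the "kmeans" class of the stats package, the components can be explored locally with stats::kmeans. The sketch below uses toy in-memory data (hypothetical values, no database) and shows the stats package's behavior only; it is not output from madlib.kmeans.

```r
set.seed(42)
# Toy in-memory data: two well-separated groups of 2-D points.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 8), ncol = 2))

# Rows of `init` are the initial centroid coordinates.
init <- rbind(c(0, 0), c(8, 8))
fit <- kmeans(pts, centers = init, iter.max = 10, algorithm = "Lloyd")

class(fit)          # "kmeans"
fit$cluster         # integer cluster label for each row of pts
fit$centers         # matrix with one row per cluster centre
fit$withinss        # within-cluster sum of squares, one value per cluster
fit$tot.withinss    # in stats::kmeans this equals sum(fit$withinss)
fit$size            # number of points in each cluster
fit$iter            # number of iterations performed

# The fitted method can return the cluster label of each point.
head(fitted(fit, method = "classes"))
```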
Author: Predictive Analytics Team at Pivotal Inc.
Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io
[1] Documentation of kmeans clustering in the latest MADlib release, https://madlib.apache.org/docs/latest/group__grp__kmeans.html
madlib.lm, madlib.summary, and madlib.arima are MADlib wrapper functions.
## Not run:
## set up the database connection
## Assume that .port is the port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

## wrappers for the sample data table and the initial-centroid table
dat <- db.data.frame("__madlib_km_sample__", conn.id = cid, verbose = FALSE)
cent <- db.data.frame("__madlib_km_centroids__", conn.id = cid, verbose = FALSE)

## two initial centroids, one per row
seed.matrix <- matrix(
    c(14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,
      13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050),
    byrow = TRUE, nrow = 2)

## random seeding with 2 centroids
fit <- madlib.kmeans(dat, 2, key = 'key')
fit

## kmeans++ seeding method
fit <- madlib.kmeans(dat, 2, key = 'key', kmeanspp = TRUE)
fit # display the result

## Initial centroid table
fit <- madlib.kmeans(dat, centers = cent, key = 'key')
fit

## Initial centroid matrix
fit <- madlib.kmeans(dat, centers = seed.matrix, key = 'key')
fit

db.disconnect(cid)
## End(Not run)