madlib.kmeans: Wrapper for MADlib's Kmeans clustering function
In PivotalR: A Fast, Easy-to-Use Tool for Manipulating Tables in Databases and a Wrapper of MADlib


The wrapper function for MADlib's kmeans clustering [1]. Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In kmeans, each centroid represents a cluster that consists of all points to which this centroid is closest. MADlib parallelizes the computation if the connected database is a Greenplum or HAWQ database.
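The "closest centroid" rule can be illustrated locally in base R, without a database connection. This is only a sketch on made-up toy data; the names `pts`, `cents`, `sq.dist` and `assignment` are illustrative and are not part of PivotalR or MADlib.

```r
# Toy data: four 2-D points stored row-wise, and two candidate centroids.
pts <- matrix(c(0, 0,
                0, 1,
                5, 5,
                6, 5), ncol = 2, byrow = TRUE)
cents <- matrix(c(0.0, 0.5,
                  5.5, 5.0), ncol = 2, byrow = TRUE)

# Squared Euclidean distance from every point (rows) to every centroid (columns).
sq.dist <- vapply(seq_len(nrow(cents)), function(j) {
  rowSums(sweep(pts, 2, cents[j, ])^2)
}, numeric(nrow(pts)))

# Each point joins the cluster of its closest centroid.
assignment <- apply(sq.dist, 1, which.min)
assignment  # 1 1 2 2
```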

madlib.kmeans(
  x, centers, iter.max = 10, nstart = 1, algorithm = "Lloyd", key,
  fn.dist = "squared_dist_norm2", agg.centroid = "avg", min.frac = 0.001,
  kmeanspp = FALSE, seeding.sample.ratio=1.0, ...)

`x`	An object of `db.obj` class. Currently, this parameter is mandatory. If it is an object of class `db.Rquery` or `db.view`, a temporary table will be created, and further computation will be done on that table. After the computation, the temporary table will be dropped from the corresponding database. Data points and predefined centroids (if used) are expected to be stored row-wise, and each point should be of `numeric` type.
`centers`	A number, a matrix or a `db.data.frame` object. If it is a number, it sets the number of target centroids, and the random (or kmeans++) seeding method is used. Otherwise, this parameter supplies the initial centers. If it is a matrix, its rows denote the initial centroid coordinates. If it is a `db.data.frame` object, it points to a table in the connected database that contains the initial centroids.
`iter.max`	The maximum number of iterations allowed.
`nstart`	If `centers` is a number, this parameter specifies how many random sets should be chosen.
`algorithm`	The algorithm to compute the kmeans. Currently disabled (default: “`Lloyd`”) and kept for future implementations.
`key`	Name of the column (from the table that is pointed to by `x`) that contains the ids for each point.
`fn.dist`	The distance function used by MADlib to compute the objective function.
`agg.centroid`	The aggregate function used by MADlib to compute the centroids.
`min.frac`	The minimum fraction of centroids reassigned to continue iterating.
`kmeanspp`	Whether to call MADlib's kmeans++ centroid seeding method.
`seeding.sample.ratio`	The proportion of the original dataset to subsample for the kmeans++ centroid seeding method.
`...`	Further arguments passed to or from other methods. Currently, no more parameters can be passed to madlib.kmeans.
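To make the roles of `centers` (as a matrix), `iter.max`, `min.frac`, the squared Euclidean distance and the averaging aggregate more concrete, here is a minimal local sketch of a Lloyd-style loop in base R. It is not MADlib's implementation and does not touch a database. `lloyd.sketch` is a hypothetical helper written for this illustration, and reading `min.frac` as "the fraction of points whose closest centroid changed" is an assumption of the sketch.

```r
# Hypothetical helper, not part of PivotalR or MADlib: a local Lloyd-style loop.
lloyd.sketch <- function(pts, centers, iter.max = 10, min.frac = 0.001) {
  assignment <- rep(0L, nrow(pts))
  iter <- 0L
  for (i in seq_len(iter.max)) {
    iter <- i
    # Distance step: squared Euclidean distance (cf. fn.dist = "squared_dist_norm2").
    sq.dist <- vapply(seq_len(nrow(centers)), function(j) {
      rowSums(sweep(pts, 2, centers[j, ])^2)
    }, numeric(nrow(pts)))
    new.assignment <- apply(sq.dist, 1, which.min)
    # Fraction of points that switched cluster in this pass (assumed meaning of min.frac).
    frac.moved <- mean(new.assignment != assignment)
    assignment <- new.assignment
    # Centroid step: column means of the members (cf. agg.centroid = "avg").
    for (j in seq_len(nrow(centers))) {
      members <- pts[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
    # Stop once too few points moved to justify another pass.
    if (frac.moved < min.frac) break
  }
  list(cluster = assignment, centers = centers, iter = iter)
}

# Six toy points (row-wise) and a 2-row seed matrix: one row per initial centroid.
pts <- matrix(c(0, 0,  0, 1,   1, 0,
                9, 9,  9, 10, 10, 9), ncol = 2, byrow = TRUE)
seed <- matrix(c(0, 0,
                 1, 0), ncol = 2, byrow = TRUE)
fit.local <- lloyd.sketch(pts, seed)
fit.local$cluster
fit.local$centers
fit.local$iter
```

Even with both seeds placed inside the same group, the loop separates the two groups on this toy data; the final pass changes no assignments, which is what triggers the `min.frac` stop before `iter.max` is reached.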

See madlib.kmeans for more details.

See madlib.kmeans for details on the return value of kmeans clustering.

The output of MADlib kmeans clustering is similar to the output of the kmeans function in the R package stats. madlib.kmeans also returns an object of class "kmeans", which has a print and a fitted method. It is a list with at least the following components:

`cluster`	A vector of integers (from `1:k`) indicating the cluster to which each point is allocated.
`centers`	A matrix of cluster centres.
`withinss`	Vector of within-cluster sum of squares, one component per cluster.
`tot.withinss`	Total within-cluster sum of squares, i.e. `sum(withinss)`.
`size`	The number of points in each cluster.
`iter`	The number of (outer) iterations.
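Since the returned object mirrors the one from `stats::kmeans`, the components above can be explored locally on that function's output before connecting to a database. The sketch below uses simulated data and base R only; `pts` and `local.fit` are illustrative names, and the values shown by `stats::kmeans` are not MADlib results.

```r
set.seed(42)
# Two well-separated groups of 20 two-dimensional points each, stored row-wise.
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))

local.fit <- stats::kmeans(pts, centers = 2, iter.max = 10, nstart = 1,
                           algorithm = "Lloyd")

class(local.fit)            # object class, as described above
head(local.fit$cluster)     # cluster index for each point
local.fit$centers           # one row per centroid
local.fit$withinss          # within-cluster sum of squares, per cluster
local.fit$size              # number of points per cluster
local.fit$iter              # number of iterations performed

# tot.withinss is the sum of the per-cluster values.
all.equal(local.fit$tot.withinss, sum(local.fit$withinss))
```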

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

[1] Documentation of kmeans clustering in the latest MADlib release, https://madlib.apache.org/docs/latest/group__grp__kmeans.html

madlib.lm, madlib.summary, madlib.arima are MADlib wrapper functions.

## Not run: 


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

dat <- db.data.frame("__madlib_km_sample__", conn.id = cid, verbose = FALSE)
cent <- db.data.frame("__madlib_km_centroids__", conn.id = cid, verbose = FALSE)

## initial centroids: one row per centroid
seed.matrix <- matrix(
  c(14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,
    13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050),
  byrow = TRUE, nrow = 2)

## random seeding method
fit <- madlib.kmeans(dat, 2, key = 'key')
fit

## kmeans++ seeding method
fit <- madlib.kmeans(dat, 2, key = 'key', kmeanspp = TRUE)
fit # display the result

## Initial centroid table
fit <- madlib.kmeans(dat, centers = cent, key = 'key')
fit

## Initial centroid matrix
fit <- madlib.kmeans(dat, centers = seed.matrix, key = 'key')
fit

db.disconnect(cid)

## End(Not run)