A wrapper function for MADlib's kmeans clustering [1]. Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In kmeans, each centroid represents a cluster consisting of all points to which that centroid is closest. MADlib parallelizes the computation if the connected database is a Greenplum or HAWQ database.

```
madlib.kmeans(
    x, centers, iter.max = 10, nstart = 1, algorithm = "Lloyd", key,
    fn.dist = "squared_dist_norm2", agg.centroid = "avg", min.frac = 0.001,
    kmeanspp = FALSE, seeding.sample.ratio = 1.0, ...)
```

`x` |
An object of |

`centers` |
A number, a matrix or db.data.frame object. If it is a number, this sets the number of target centroids and the random (or kmeans++) seeding method is used. Otherwise, this parameter is used for initial centers. If it is a matrix, its rows will denote the initial centroid coordinates. Else, this parameter will point to a table in the connected database that contains the initial centroids. |

`iter.max` |
The maximum number of iterations allowed. |

`nstart` |
If `centers` is a number, this parameter specifies how many random sets should be chosen. |

`algorithm` |
The algorithm to compute the kmeans. Currently disabled (default: "Lloyd"). |

`key` |
Name of the column (from the table that is pointed by |

`fn.dist` |
The distance function used by MADlib to compute the objective function. |

`agg.centroid` |
The aggregate function used by MADlib to compute the centroid of each cluster. |

`min.frac` |
The minimum fraction of centroids reassigned to continue iterating. |

`kmeanspp` |
Whether to call MADlib's kmeans++ centroid seeding method. |

`seeding.sample.ratio` |
The proportion of the original dataset to subsample for the kmeans++ centroid seeding method. |

`...` |
Further arguments passed to or from other methods. Currently, no more parameters can be passed to madlib.kmeans. |
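To make the roles of `iter.max`, `fn.dist`, `agg.centroid` and `min.frac` more concrete, here is a minimal base R sketch of a Lloyd-style iteration. It is not MADlib's implementation: the function name `lloyd_sketch` is hypothetical, the distance is hard-coded to squared Euclidean, the centroid aggregate is hard-coded to the mean, and `min.frac` is assumed to mean the fraction of points whose assignment changed in the last iteration.

```r
# Illustrative Lloyd iteration in base R (hypothetical helper, not MADlib code).
lloyd_sketch <- function(x, centers, iter.max = 10, min.frac = 0.001) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- rep(0L, nrow(x))
  iter <- 0L
  while (iter < iter.max) {
    iter <- iter + 1L
    # Squared Euclidean distance from every point to every centroid
    # (the role played by fn.dist = "squared_dist_norm2").
    d <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2)),
      nrow = nrow(x))
    # Assign each point to its closest centroid.
    new.cluster <- max.col(-d, ties.method = "first")
    frac.moved <- mean(new.cluster != cluster)
    cluster <- new.cluster
    # Recompute each centroid as the mean of its points
    # (the role played by agg.centroid = "avg").
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
    # Assumed stopping rule: too few reassignments to continue iterating.
    if (frac.moved < min.frac) break
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

# Two well-separated groups with one initial centroid near each.
x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
fit <- lloyd_sketch(x, centers = rbind(c(0, 0), c(10, 10)))
fit$cluster
fit$centers
```

The initial centers are passed as a matrix with one row per centroid, which mirrors the matrix form of the `centers` argument described above.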

See `madlib.kmeans` for more details.

For the return value of kmeans clustering see `madlib.kmeans` for details.

MADlib kmeans clustering output is similar to the output of the kmeans function of the R package `stats`. `madlib.kmeans` also returns an object of class `"kmeans"`, which has a `print` and a `fitted` method. It is a list with at least the following components:

`cluster` |
A vector of integers (from |

`centers` |
A matrix of cluster centres. |

`withinss` |
Vector of within-cluster sum of squares, one component per cluster. |

`tot.withinss` |
Total within-cluster sum of squares,
i.e. |

`size` |
The number of points in each cluster. |

`iter` |
The number of (outer) iterations. |
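Because the returned object is described as similar to the output of `stats::kmeans`, the components listed above can be explored locally without a database connection. The sketch below uses `stats::kmeans` on the built-in `iris` data purely as a stand-in; an object returned by `madlib.kmeans` may differ in details.

```r
# Local illustration with stats::kmeans; no database connection is needed.
set.seed(1)
x <- as.matrix(iris[, 1:4])
fit <- kmeans(x, centers = 3, iter.max = 50, nstart = 1, algorithm = "Lloyd")

class(fit)          # "kmeans"
head(fit$cluster)   # cluster index assigned to each of the first rows
fit$centers         # one row of coordinates per cluster
fit$withinss        # within-cluster sum of squares, one value per cluster
fit$tot.withinss    # total within-cluster sum of squares
fit$size            # number of points in each cluster
fit$iter            # number of iterations performed

# fitted() on a stats "kmeans" object returns, per point, either the
# centre of its cluster or the cluster index.
head(fitted(fit, method = "classes"))
```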

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc.

[1] Documentation of kmeans clustering in the latest MADlib release, http://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html

`madlib.lm`, `madlib.summary`, `madlib.arima` are MADlib wrapper functions.

```
## Not run:
## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)
dat <- db.data.frame("__madlib_km_sample__", conn.id = cid, verbose = FALSE)
cent <- db.data.frame("__madlib_km_centroids__", conn.id = cid, verbose = FALSE)
## Initial centroid coordinates, one row per centroid
seed.matrix <- matrix(
    c(14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,
      13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050),
    byrow = TRUE, nrow = 2)
## Random seeding with 2 target centroids
fit <- madlib.kmeans(dat, 2, key = 'key')
fit # display the result
## kmeans++ seeding method
fit <- madlib.kmeans(dat, 2, key = 'key', kmeanspp = TRUE)
fit
## Initial centroid table
fit <- madlib.kmeans(dat, centers = cent, key = 'key')
fit
## Initial centroid matrix
fit <- madlib.kmeans(dat, centers = seed.matrix, key = 'key')
fit
db.disconnect(cid)
## End(Not run)
```
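The kmeans++ seeding requested with `kmeanspp = TRUE` in the example above can be illustrated locally. The following is a generic base R sketch of kmeans++ seeding on an optional random subsample, which is the idea behind `seeding.sample.ratio`; it is not MADlib's implementation, the name `kmeanspp_seed_sketch` is hypothetical, and it assumes the data contain at least `k` distinct points.

```r
# Generic kmeans++ seeding sketch (hypothetical helper, not MADlib code).
kmeanspp_seed_sketch <- function(x, k, sample.ratio = 1.0) {
  x <- as.matrix(x)
  # Optionally seed from a random subsample of the rows.
  n.sub <- min(nrow(x), max(k, ceiling(sample.ratio * nrow(x))))
  x <- x[sample.int(nrow(x), n.sub), , drop = FALSE]
  idx <- integer(k)
  # First seed: chosen uniformly at random.
  idx[1] <- sample.int(nrow(x), 1)
  # d2 holds each point's squared distance to its nearest chosen seed.
  d2 <- rowSums(sweep(x, 2, x[idx[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Later seeds: chosen with probability proportional to d2.
    idx[j] <- sample.int(nrow(x), 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[idx[j], ])^2))
  }
  x[idx, , drop = FALSE]
}

set.seed(42)
seeds <- kmeanspp_seed_sketch(iris[, 1:4], k = 3, sample.ratio = 0.5)
dim(seeds)
```

The result has one row per seed, the same layout as `seed.matrix` in the example above.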

