madlib.kmeans: Wrapper for MADlib's Kmeans clustering function


View source: R/madlib-kmeans.R

Description

madlib.kmeans is a wrapper for MADlib's kmeans clustering function [1]. Clustering is the problem of partitioning a set of objects according to a problem-dependent measure of similarity. In kmeans clustering, each centroid represents a cluster consisting of all points that are closer to this centroid than to any other. MADlib parallelizes the computation when the connected database is a Greenplum or HAWQ database.
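The assignment rule described above can be sketched locally in plain R. This is only an illustration, not MADlib's implementation: the toy points and centroids are made up, and squared Euclidean distance is assumed as the similarity measure (matching the name of the default fn.dist, "squared_dist_norm2").

```r
# Toy data: four 2-D points and two centroids (all values are illustrative).
points <- matrix(c(0, 0,
                   0, 1,
                   5, 5,
                   6, 5), ncol = 2, byrow = TRUE)
centroids <- matrix(c(0,   0.5,
                      5.5, 5), ncol = 2, byrow = TRUE)

# Squared Euclidean distance from every point (rows) to every centroid (columns).
sq.dist <- sapply(seq_len(nrow(centroids)), function(j) {
  rowSums(sweep(points, 2, centroids[j, ])^2)
})

# Each point belongs to the cluster of its closest centroid.
cluster <- apply(sq.dist, 1, which.min)

# Objective: sum of the distances between each point and its closest centroid.
objective <- sum(apply(sq.dist, 1, min))

print(cluster)
print(objective)
```

Here the first two points fall into cluster 1 and the last two into cluster 2; kmeans looks for centroid positions that make the objective as small as possible.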

Usage

madlib.kmeans(
  x, centers, iter.max = 10, nstart = 1, algorithm = "Lloyd", key,
  fn.dist = "squared_dist_norm2", agg.centroid = "avg", min.frac = 0.001,
  kmeanspp = FALSE, seeding.sample.ratio=1.0, ...)

Arguments

x

An object of class db.obj. Currently, this parameter is mandatory. If it is an object of class db.Rquery or db.view, a temporary table is created and all further computation is done on that table. After the computation, the temporary table is dropped from the corresponding database. Data points and predefined centroids (if used) are expected to be stored row-wise, and each point should be of numeric type.

centers

A number, a matrix, or a db.data.frame object. If it is a number, it sets the number of target centroids, and the random (or kmeans++) seeding method is used. Otherwise, it supplies the initial centers: if it is a matrix, its rows are the initial centroid coordinates; if it is a db.data.frame object, it points to a table in the connected database that contains the initial centroids.

iter.max

The maximum number of iterations allowed.

nstart

If centers is a number, this parameter specifies how many random sets should be chosen.

algorithm

The algorithm used to compute the kmeans. This parameter is currently disabled (default: "Lloyd") and is kept for future implementations.

key

Name of the column (in the table that x points to) that contains the id of each point.

fn.dist

The distance function used by MADlib to compute the objective function.

agg.centroid

The aggregate function used by MADlib to compute the centroids.

min.frac

The minimum fraction of centroids reassigned to continue iterating. Iteration stops once fewer than this fraction are reassigned.

kmeanspp

Whether to call MADlib's kmeans++ centroid seeding method.

seeding.sample.ratio

The proportion of the original dataset to subsample for the kmeans++ centroid seeding method.

...

Further arguments passed to or from other methods. Currently, no more parameters can be passed to madlib.kmeans.
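How iter.max, fn.dist, agg.centroid and min.frac interact can be illustrated with a small local sketch of a Lloyd-style iteration in plain R. This is not MADlib's code: lloyd.sketch is a hypothetical helper, it hard-codes squared Euclidean distance and averaging (the defaults named above), and it assumes that the reassigned fraction is measured over points whose cluster changes between iterations.

```r
# Hypothetical local helper; MADlib runs the real computation inside the database.
lloyd.sketch <- function(x, centroids, iter.max = 10, min.frac = 0.001) {
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(iter.max)) {
    # Distance step: squared Euclidean distance to every centroid.
    sq.dist <- sapply(seq_len(nrow(centroids)), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    new.cluster <- apply(sq.dist, 1, which.min)

    # Fraction of points that changed cluster in this iteration (assumption).
    frac.reassigned <- mean(new.cluster != cluster)
    cluster <- new.cluster

    # Aggregation step: each centroid becomes the average of its points.
    for (j in seq_len(nrow(centroids))) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }

    # Stop early once too few points were reassigned.
    if (frac.reassigned < min.frac) break
  }
  list(cluster = cluster, centers = centroids, iter = iter)
}

# Two well-separated groups of three points; the first two rows seed the centroids.
x <- matrix(c(0, 0,   0, 1,   1, 0,
              9, 9,   9, 10,  10, 9), ncol = 2, byrow = TRUE)
fit.local <- lloyd.sketch(x, centroids = x[c(1, 2), ])
print(fit.local$cluster)
print(fit.local$iter)
```

With a poor seeding like this one, the loop needs a few passes before no point changes cluster; a lower iter.max or a larger min.frac would stop it earlier, possibly before the clusters have settled.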

Details

See the MADlib kmeans documentation [1] for more details.
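For readers unfamiliar with the seeding options, the idea behind kmeanspp and seeding.sample.ratio can be sketched locally as follows. This is a generic kmeans++ seeding written in plain R, not MADlib's implementation: kmeanspp.seed is a hypothetical helper, and the way the subsample is drawn here (a uniform random subset of rows) is an assumption.

```r
# Hypothetical helper: pick k initial centroids from the rows of x.
kmeanspp.seed <- function(x, k, sample.ratio = 1.0) {
  # Seed from a uniform random subsample of the rows (at least k rows).
  n.sub <- min(nrow(x), max(k, ceiling(sample.ratio * nrow(x))))
  sub <- x[sample(nrow(x), n.sub), , drop = FALSE]

  # First centroid: a uniformly random row of the subsample.
  idx <- sample(nrow(sub), 1)
  # Squared distance from every row to its closest centroid chosen so far.
  d2 <- rowSums(sweep(sub, 2, sub[idx, ])^2)

  while (length(idx) < k) {
    # Next centroid: a row drawn with probability proportional to that distance.
    nxt <- sample(nrow(sub), 1, prob = d2)
    idx <- c(idx, nxt)
    d2 <- pmin(d2, rowSums(sweep(sub, 2, sub[nxt, ])^2))
  }
  sub[idx, , drop = FALSE]
}

set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
seeds <- kmeanspp.seed(x, k = 3, sample.ratio = 0.5)
print(seeds)
```

Rows far from the centroids already chosen are more likely to be picked next, which tends to spread the initial centroids out. A smaller sample ratio reduces the amount of data the seeding has to scan.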

Value

madlib.kmeans returns an object of class "kmeans", which has a print and a fitted method. The output is similar to that of the kmeans function in the R package stats. It is a list with at least the following components:

cluster

A vector of integers (from 1:k) indicating the cluster to which each point is allocated.

centers

A matrix of cluster centres.

withinss

Vector of within-cluster sum of squares, one component per cluster.

tot.withinss

Total within-cluster sum of squares, i.e. sum(withinss).

size

The number of points in each cluster.

iter

The number of (outer) iterations.
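Because the result has the same class and component names as the output of stats::kmeans, the components can be explored without a database connection. The sketch below uses stats::kmeans on simulated data as a local stand-in; the data and the seed matrix are made up, and the values printed will differ from what madlib.kmeans returns for a real table.

```r
set.seed(1)
# Simulated data: 20 points near (0, 0) and 20 points near (10, 10).
x <- rbind(matrix(rnorm(40, mean = 0),  ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))

# Initial centroids given as a matrix, one row per centroid.
seed <- matrix(c(0,  0,
                 10, 10), ncol = 2, byrow = TRUE)
fit.local <- kmeans(x, centers = seed)

class(fit.local)         # "kmeans"
fit.local$cluster        # cluster index of each point
fit.local$centers        # one row per cluster centre
fit.local$withinss       # within-cluster sum of squares, one value per cluster
fit.local$tot.withinss   # equals sum(fit.local$withinss)
fit.local$size           # number of points in each cluster
fit.local$iter           # number of iterations performed
```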

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

References

[1] Documentation of kmeans clustering in the latest MADlib release, https://madlib.apache.org/docs/latest/group__grp__kmeans.html

See Also

madlib.lm, madlib.summary, madlib.arima are MADlib wrapper functions.

Examples

## Not run: 


library(PivotalR)

## Set up the database connection.
## Assume that .port is the port number and .dbname is the database name.
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

dat <- db.data.frame("__madlib_km_sample__", conn.id = cid, verbose = FALSE)
cent <- db.data.frame("__madlib_km_centroids__", conn.id = cid, verbose = FALSE)

## Two initial centroids, one per row
seed.matrix <- matrix(
  c(14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065,
    13.2,1.78,2.14,11.2,1,2.65,2.76,0.26,1.28,4.38,1.05,3.49,1050),
  byrow = TRUE, nrow = 2)

fit <- madlib.kmeans(dat, 2, key= 'key')
fit

## kmeans++ seeding method
fit <- madlib.kmeans(dat, 2, key= 'key', kmeanspp=TRUE)
fit # display the result

## Initial centroid table
fit <- madlib.kmeans(dat, centers= cent, key= 'key')
fit

## Initial centroid matrix
fit <- madlib.kmeans(dat, centers= seed.matrix, key= 'key')
fit

db.disconnect(cid)

## End(Not run)

PivotalR documentation built on March 13, 2021, 1:06 a.m.