computeKmeans: Perform k-means clustering on the table.


View source: R/computeKmeans.R

Description

Runs the k-means clustering algorithm in-database and returns an object compatible with kmeans that also includes arbitrary aggregate metrics computed on the resulting clusters.

Usage

computeKmeans(channel, tableName, centers, threshold = 0.0395, iterMax = 10,
  tableInfo, id, include = NULL, except = NULL,
  aggregates = "COUNT(*) cnt", scale = TRUE, persist = FALSE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, centroidTableName = NULL,
  clusteredTableName = NULL, tempTableName = NULL, schema = NULL,
  test = FALSE, version = "6.21")

Arguments

channel

connection object as returned by odbcConnect.

tableName

Aster table name. This argument is ignored if centers is a canopy object.

centers

either the number of clusters, say k, a matrix of initial (distinct) cluster centres, or an object of class "toacanopy" obtained with computeCanopy. If a number, a random set of (distinct) rows in the table is chosen as the initial centers. If a matrix, the number of rows determines the number of clusters, as each row defines one initial center. If a canopy object, the number of centers it contains determines the number of clusters, plus it provides (and overrides) the following arguments: tableName, id, idAlias, include, except, scale, where, scaledTableName, schema.

threshold

the convergence threshold. When the centroids move by less than this amount, the algorithm has converged.

iterMax

the maximum number of iterations the algorithm will run before quitting if the convergence threshold has not been met.

tableInfo

pre-built summary of data to use (required when test=TRUE). See getTableSummary.

id

column name or SQL expression containing unique table key. This argument is ignored if centers is a canopy object.

include

a vector of column names with variables (must be numeric). The model never contains variables other than those in the list. This argument is ignored if centers is a canopy object.

except

a vector of column names to exclude from variables. The model never contains variables from the list. This argument is ignored if centers is a canopy object.

aggregates

vector with SQL aggregates that define arbitrary aggregate metrics to be computed on each cluster after running k-means. Aggregates may have optional aliases, as in "AVG(era) avg_era". They are subsequently used in createClusterPlot as cluster properties.

scale

logical: if TRUE, scale each variable in-database before clustering. Scaling results in zero mean and unit standard deviation for each input variable. If FALSE, the function only removes incomplete data (rows containing NULLs) before clustering. This argument is ignored if centers is a canopy object.

persist

logical: if TRUE, the function saves the clustered data, with cluster ids assigned, in the table clusteredTableName (when defined). Aster Analytics Foundation 6.20 or earlier cannot run without persisting the clustered data and so must use persist=TRUE.

idAlias

SQL alias for table id. This is required when SQL expression is given for id. This argument is ignored if centers is a canopy object.

where

specifies criteria that the table rows must satisfy before the computation is applied. The criteria are expressed in the form of SQL predicates (inside the WHERE clause). This argument is ignored if centers is a canopy object.

scaledTableName

the name of the Aster table with results of scaling. This argument is ignored if centers is a canopy object.

centroidTableName

the name of the Aster table with centroids found by kmeans.

clusteredTableName

the name of the Aster table in which to store the clustered output. If omitted while persist = TRUE, a random table name is generated (and always saved in the resulting toakmeans object). If persist = FALSE, the name is ignored and the function does not generate a table of clustered output.

tempTableName

name of the temporary Aster table to use to store intermediate results. This table always gets dropped when function executes successfully.

schema

name of the Aster schema that tables scaledTableName, centroidTableName, and clusteredTableName belong to. When this argument is supplied, make sure that none of the table names includes a schema in its name.

test

logical: if TRUE, only show what would be done (similar to the parameter test in the RODBC functions sqlQuery and sqlSave).

version

version of Aster Analytics Foundation functions applicable when test=TRUE, ignored otherwise.
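As a small illustration of the idAlias default shown in Usage, the snippet below applies the same gsub call to a SQL expression used as id. It runs in plain R without a database connection; the id value is taken from the Examples section, and the variable names are only illustrative.

```r
# SQL expression that concatenates several columns into a unique key
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Default from Usage: every run of non-alphanumeric characters becomes "_"
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)

print(idAlias)
# [1] "playerid_stint_teamid_yearid"
```

When the derived alias is not a convenient column name, idAlias can be passed explicitly instead.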

Details

The function first scales the not-null data (if scale=TRUE) or just removes rows with NULLs without scaling. After that, the data given (table tableName, optionally filtered with where) are clustered by k-means in Aster. Next, all standard metrics of the k-means clusters, plus the additional aggregates provided with aggregates, are calculated, again in-database.
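The steps described above can be mimicked locally in base R to get a feel for what the arguments scale, threshold, iterMax, and aggregates control. This is only a sketch under assumptions: it uses the built-in iris data instead of an Aster table, a hand-written Lloyd-style loop instead of the Aster k-means function, and it assumes that centroid movement is measured as Euclidean distance, which the text above does not specify.

```r
# Local base-R analogue; computeKmeans itself does all of this in Aster.
set.seed(42)
x <- as.matrix(iris[, 1:4])

# Step 1: drop incomplete rows, then scale to zero mean and unit sd
x  <- x[complete.cases(x), , drop = FALSE]
xs <- scale(x)

# Step 2: Lloyd-style iterations, stopped by threshold or by iterMax
k         <- 3
threshold <- 0.0395   # same default as in Usage
iterMax   <- 10       # same default as in Usage
centers   <- xs[sample(nrow(xs), k), , drop = FALSE]

for (iter in seq_len(iterMax)) {
  # squared distance from every row to every center (n x k matrix)
  d <- sapply(seq_len(k), function(j) rowSums(sweep(xs, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")   # nearest center per row

  newCenters <- centers
  for (j in seq_len(k)) {
    members <- xs[cluster == j, , drop = FALSE]
    if (nrow(members) > 0) newCenters[j, ] <- colMeans(members)
  }

  # largest centroid movement in this iteration (assumed Euclidean)
  shift   <- max(sqrt(rowSums((newCenters - centers)^2)))
  centers <- newCenters
  if (shift < threshold) break
}

# Step 3: a per-cluster aggregate, analogous to "COUNT(*) cnt"
cnt <- tabulate(cluster, nbins = k)
```

A smaller threshold or a larger iterMax lets the loop run longer before it stops, which is the same trade-off the two arguments express for the in-database run.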

Value

computeKmeans returns an object of class "toakmeans" (compatible with class "kmeans"). It is a list with at least the following components:

cluster

A vector of integers (from 0 to K-1) indicating the cluster to which each point is allocated. computeKmeans leaves this component empty. Use function computeClusterSample to set this component.

centers

A matrix of cluster centres.

totss

The total sum of squares.

withinss

Vector of within-cluster sum of squares, one component per cluster.

tot.withinss

Total within-cluster sum of squares, i.e. sum(withinss).

betweenss

The between-cluster sum of squares, i.e. totss-tot.withinss.

size

The number of points in each cluster. This includes all points in the specified Aster table that satisfy the optional where condition.

iter

The number of (outer) iterations.

ifault

integer: indicator of a possible algorithm problem (always 0).

scale

logical: indicates if variable scaling was performed before clustering.

persist

logical: indicates if clustered data was saved in the table.

aggregates

Vectors (dataframe) of aggregates computed on each cluster.

tableName

Aster table name containing data for clustering.

columns

Vector of column names with variables used for clustering.

scaledTableName

Aster table containing scaled data for clustering.

centroidTableName

Aster table containing cluster centroids.

clusteredTableName

Aster table containing clustered output.

id

Column name or SQL expression containing unique table key.

idAlias

SQL alias for table id.

whereClause

SQL WHERE clause expression used (if any).

time

An object of class proc_time with user, system, and total elapsed times for the computeKmeans function call.
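Because the returned object is compatible with class "kmeans", the relationships between its sum-of-squares components can be checked on a local stats::kmeans fit. The sketch below uses the built-in iris data as a stand-in (an assumption, not toaster output) and recomputes totss, withinss, and betweenss by hand. It also shifts the cluster ids to start at 0, since stats::kmeans numbers clusters from 1 while the cluster component described above runs from 0 to K-1.

```r
set.seed(1)
xs  <- scale(as.matrix(iris[, 1:4]))
fit <- kmeans(xs, centers = 3)

# totss: squared distances of all points to the overall mean
totss <- sum(scale(xs, scale = FALSE)^2)

# withinss: squared distances of points to their own cluster mean
withinss <- sapply(seq_len(3), function(j) {
  m <- xs[fit$cluster == j, , drop = FALSE]
  sum(sweep(m, 2, colMeans(m))^2)
})

# tot.withinss and betweenss follow from the two quantities above
tot.withinss <- sum(withinss)
betweenss    <- totss - tot.withinss

# 0-based cluster ids, as used by the cluster component described above
cluster0 <- fit$cluster - 1L

print(all.equal(fit$totss, totss))
print(all.equal(fit$betweenss, betweenss))
```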

See Also

computeClusterSample, computeSilhouette, computeCanopy

Examples

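if(interactive()){
library(toaster)  # also loads RODBC, which provides odbcDriverConnect

# initialize connection to Lahman baseball database in Aster
# (paste0 keeps the connection string free of embedded line breaks)
conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
                         "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))

# cluster batting records after 2000 by games, runs, and hits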
km = computeKmeans(conn, "batting", centers=5, iterMax = 25,
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                   include=c('g','r','h'), scaledTableName='kmeans_test_scaled', 
                   centroidTableName='kmeans_test_centroids',
                   where="yearid > 2000")
km
createCentroidPlot(km)
createClusterPlot(km)

# persist clustered data
kmc = computeKmeans(conn, "batting", centers=5, iterMax = 250,
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                   include=c('g','r','h'), 
                   persist = TRUE, 
                   scaledTableName='kmeans_test_scaled', 
                   centroidTableName='kmeans_test_centroids', 
                   clusteredTableName = 'kmeans_test_clustered',
                   tempTableName = 'kmeans_test_temp',
                   where="yearid > 2000")
createCentroidPlot(kmc)
createCentroidPlot(kmc, format="bar_dodge")
createCentroidPlot(kmc, format="heatmap", coordFlip=TRUE)

createClusterPlot(kmc)

kmc = computeClusterSample(conn, kmc, 0.01)
createClusterPairsPlot(kmc, title="Batters Clustered by G, H, R", ticks=FALSE)

kmc = computeSilhouette(conn, kmc)
createSilhouetteProfile(kmc, title="Cluster Silhouette Histograms (Profiles)")

}

Example output

Loading required package: RODBC

toaster documentation built on May 30, 2017, 3:51 a.m.