computeCanopy: Perform canopy clustering on the table to determine cluster...


Description

The canopy clustering algorithm runs in-database, returns centroids compatible with computeKmeans, and pre-processes data for k-means and other clustering algorithms.

Usage

computeCanopy(channel, tableName, looseDistance, tightDistance, canopy,
  tableInfo, id, include = NULL, except = NULL, scale = TRUE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, schema = NULL, test = FALSE)

Arguments

channel

connection object as returned by odbcConnect.

tableName

Aster table name.

looseDistance

specifies the maximum distance that any point can be from a canopy center to be considered part of that canopy.

tightDistance

specifies the minimum distance that separates two canopy centers.

canopy

an object of class "toacanopy" obtained with computeCanopy.

tableInfo

pre-built summary of data to use (required when test=TRUE). See getTableSummary.

id

column name or SQL expression containing unique table key.

include

a vector of column names to use as variables (must be numeric). The model never contains variables other than those in this list.

except

a vector of column names to exclude from variables. The model never contains variables from this list.

scale

logical: if TRUE, scale each variable in-database before clustering. Scaling results in zero mean and unit standard deviation for each input variable. If FALSE, the function only removes incomplete data (rows containing NULLs) before clustering.

idAlias

SQL alias for table id. This is required when an SQL expression is given for id.

where

specifies criteria that table rows must satisfy before the computation is applied. The criteria are expressed in the form of SQL predicates (inside a WHERE clause).

scaledTableName

the name of the Aster table with the results of scaling.

schema

name of the Aster schema that tables scaledTableName, centroidTableName, and clusteredTableName belong to. When this argument is supplied, make sure that no table name includes the schema in its name.

test

logical: if TRUE, only show what would be done (similar to parameter test in RODBC functions sqlQuery and sqlSave).
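
As a small illustration of the idAlias default shown in Usage, the snippet below evaluates the same gsub call on the SQL expression used for id in the Examples section. It only demonstrates how the default alias is derived from id in plain R; it does not touch the database.

```r
# SQL expression used as the table key in the Examples section
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Default idAlias from the Usage signature: every run of
# non-alphanumeric characters collapses into a single underscore
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)
idAlias
# [1] "playerid_stint_teamid_yearid"
```

When this generated alias is not a suitable column name, pass idAlias explicitly.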

Details

Canopy clustering often precedes the k-means algorithm (see computeKmeans) or other clustering algorithms. The goal is to speed up clustering by choosing initial centroids more efficiently than randomly or naively, especially for big data applications. Note the following:

The function first scales non-null data (if scale=TRUE) or just eliminates nulls without scaling. After that, the data given (table tableName, optionally filtered with where) are clustered using the canopy algorithm in Aster. This results in

  1. set of centroids to use as initial cluster centers in k-means and

  2. pre-processed data persisted and ready for clustering with kmeans function computeKmeans.
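
To make the roles of looseDistance and tightDistance concrete, here is a minimal in-memory sketch of the general canopy idea in plain R. It is not the Aster implementation: the function name canopy_sketch is hypothetical, and the Euclidean metric, the order in which centers are picked, and the inclusive/exclusive distance comparisons are all assumptions made for illustration.

```r
# Illustrative canopy clustering on an in-memory numeric matrix.
# Hypothetical helper; NOT the in-database algorithm used by computeCanopy.
canopy_sketch <- function(x, looseDistance, tightDistance, scale = TRUE) {
  stopifnot(tightDistance >= 0, looseDistance >= tightDistance)
  x <- as.matrix(x)
  # Drop incomplete rows (the in-database step removes rows with NULLs)
  x <- x[complete.cases(x), , drop = FALSE]
  # Optionally scale to zero mean and unit standard deviation
  if (scale) x <- scale(x)

  candidates <- seq_len(nrow(x))  # points still eligible to become centers
  centers <- integer(0)
  members <- list()
  while (length(candidates) > 0) {
    center <- candidates[1]       # assumption: take the next remaining point
    # Euclidean distance from the chosen center to every point
    d <- sqrt(rowSums(sweep(x, 2, x[center, ])^2))
    centers <- c(centers, center)
    # Points within the loose distance belong to this canopy
    members[[length(centers)]] <- which(d <= looseDistance)
    # Points within the tight distance can no longer become centers
    candidates <- candidates[d[candidates] > tightDistance]
  }
  list(centers = x[centers, , drop = FALSE], members = members)
}

# Usage on a built-in data set (numeric columns only)
res <- canopy_sketch(iris[, 1:4], looseDistance = 2, tightDistance = 1)
nrow(res$centers)  # number of canopy centers found
```

In this sketch a larger tightDistance removes more candidates per iteration and so yields fewer centers, while looseDistance only controls how many points each canopy covers (canopies may overlap).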

See Also

computeClusterSample, computeSilhouette, computeCanopy

Examples

if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")
can = computeCanopy(conn, "batting", looseDistance = 1, tightDistance = 0.5,
                    id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                    include=c('g','r','h'), 
                    scaledTableName='test_canopy_scaled', 
                    where="yearid > 2000")
createCentroidPlot(can)

can = computeCanopy(conn, canopy = can, looseDistance = 2, tightDistance = 0.5)
createCentroidPlot(can)

can = computeCanopy(conn, canopy = can, looseDistance = 4, tightDistance = 1)
createCentroidPlot(can)

km = computeKmeans(conn, centers=can, iterMax = 1000, persist = TRUE, 
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   centroidTableName = "kmeans_test_centroids",
                   tempTableName = "kmeans_test_temp",
                   clusteredTableName = "kmeans_test_clustered") 
createCentroidPlot(km)

}

teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.