computeCanopy: Perform canopy clustering on the table to determine cluster...
In teradata-aster-field/toaster: Big Data in-Database Analytics that Scales with Teradata Aster Distributed Platform


The canopy clustering algorithm runs in-database, returns centroids compatible with computeKmeans, and pre-processes data for k-means and other clustering algorithms.

computeCanopy(channel, tableName, looseDistance, tightDistance, canopy,
  tableInfo, id, include = NULL, except = NULL, scale = TRUE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, schema = NULL, test = FALSE)

`channel`	connection object as returned by `odbcConnect`.
`tableName`	Aster table name.
`looseDistance`	specifies the maximum distance that any point can be from a canopy center to be considered part of that canopy.
`tightDistance`	specifies the minimum distance that separates two canopy centers.
`canopy`	an object of class `"toacanopy"` obtained with `computeCanopy`.
`tableInfo`	pre-built summary of data to use (required when `test=TRUE`). See `getTableSummary`.
`id`	column name or SQL expression containing unique table key.
`include`	a vector of column names with variables (must be numeric). The model never contains variables other than those in the list.
`except`	a vector of column names to exclude from variables. The model never contains variables from the list.
`scale`	logical: if `TRUE`, scale each variable in-database before clustering. Scaling results in zero mean and unit standard deviation for each input variable. If `FALSE`, the function only removes incomplete data (rows containing `NULL`s) before clustering.
`idAlias`	SQL alias for table id. This is required when an SQL expression is given for `id`.
`where`	specifies criteria that table rows must satisfy before the computation is applied. The criteria are expressed in the form of SQL predicates (inside a `WHERE` clause).
`scaledTableName`	the name of the Aster table with the results of scaling.
`schema`	name of the Aster schema that tables `scaledTableName`, `centroidTableName`, and `clusteredTableName` belong to. When this argument is supplied, make sure that none of the table names includes a schema in its name.
`test`	logical: if `TRUE`, only show what would be done (similar to parameter `test` in RODBC functions `sqlQuery` and `sqlSave`).
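The default value of `idAlias` shown in the usage is derived from `id` by replacing every run of non-alphanumeric characters with an underscore. The short sketch below evaluates that same `gsub` call locally, using the `id` expression from the example further down, to show what alias an SQL expression would receive if `idAlias` is left at its default.

```r
# The id expression used in the example: an SQL concatenation of four columns.
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Same expression as the default for idAlias in the usage section:
# each run of characters outside 0-9, a-z, A-Z collapses to one underscore.
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)

print(idAlias)
# [1] "playerid_stint_teamid_yearid"
```

If the derived alias is not suitable as a column name, pass `idAlias` explicitly.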

Canopy clustering often precedes the k-means algorithm (see computeKmeans) or other clustering algorithms. The goal is to speed up clustering by choosing initial centroids more efficiently than random or naive selection, especially for big data applications. Important notes:

the function does not let you specify the number of canopies (clusters); instead, it controls them with the pair of threshold arguments looseDistance and tightDistance. Adjusting them tunes computeCanopy to produce more or fewer canopies as desired.
individual data points may be part of several canopies, and cluster membership is not available as a result of the operation.
the resulting toacanopy object should be passed to computeKmeans with the canopy argument, effectively overriding arguments in the k-means function.
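To make the role of the two thresholds concrete, here is a minimal local sketch of one common formulation of canopy clustering in base R. It is not the in-database implementation that computeCanopy calls; the function name `canopy_sketch`, the use of Euclidean distance, and the toy data are assumptions made for illustration only.

```r
# Illustrative only: a local canopy pass over a small numeric matrix.
# canopy_sketch is a hypothetical helper, not part of toaster.
canopy_sketch <- function(x, looseDistance, tightDistance) {
  stopifnot(is.matrix(x), looseDistance >= tightDistance, tightDistance >= 0)
  remaining <- seq_len(nrow(x))  # points still eligible to become a center
  canopies <- list()
  while (length(remaining) > 0) {
    center <- remaining[1]
    # Euclidean distance from the chosen center to every point
    d <- sqrt(rowSums(sweep(x, 2, x[center, ])^2))
    # every point within the loose distance belongs to this canopy
    canopies[[length(canopies) + 1]] <-
      list(center = center, members = which(d <= looseDistance))
    # points within the tight distance can no longer start a new canopy
    remaining <- remaining[d[remaining] > tightDistance]
  }
  canopies
}

# Toy one-dimensional data: two well separated groups.
x <- matrix(c(0, 1, 2, 10, 11), ncol = 1)

narrow <- canopy_sketch(x, looseDistance = 3, tightDistance = 1.5)
wide   <- canopy_sketch(x, looseDistance = 3, tightDistance = 3)

length(narrow)  # 3 canopies
length(wide)    # 2 canopies
```

With the smaller tight distance, the point with value 1 falls inside the first two canopies, which illustrates why membership is not a unique assignment. Raising the tight distance leaves fewer points eligible to become centers and so yields fewer canopies.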

The function first scales the non-null data (if scale=TRUE) or just eliminates nulls without scaling. After that, the given data (table tableName, optionally filtered with where) are clustered using the canopy algorithm in Aster. This results in

a set of centroids to use as initial cluster centers in k-means, and
pre-processed data persisted and ready for clustering with the k-means function computeKmeans.
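The pre-processing step can be pictured with a local base R analogue: drop incomplete rows, then standardize each variable to zero mean and unit standard deviation. This is only a sketch of the idea; the actual work happens in-database, the data frame below is made up, and base R `scale()` uses the sample standard deviation, which may or may not match the in-database computation.

```r
# Hypothetical local data standing in for numeric columns g, r, h of a table.
batting <- data.frame(
  g = c(10, 20, NA, 40, 60),
  r = c(1, 2, 3, 5, 9),
  h = c(2, 4, 6, NA, 20)
)

# Analogue of removing rows that contain NULLs (NA in R).
complete <- batting[complete.cases(batting), ]

# Analogue of scale = TRUE: center each column and divide by its
# (sample) standard deviation.
scaled <- scale(complete)

nrow(complete)        # rows kept after dropping incomplete ones
colMeans(scaled)      # approximately 0 for every column
apply(scaled, 2, sd)  # 1 for every column
```

With scale = FALSE, only the first step (removing incomplete rows) would apply.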

computeClusterSample, computeSilhouette, computeCanopy

if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

# canopy clustering on games, runs, and hits for seasons after 2000;
# the id is an SQL expression because the table key spans several columns
can = computeCanopy(conn, "batting", looseDistance = 1, tightDistance = 0.5,
                    id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                    include=c('g','r','h'), 
                    scaledTableName='test_canopy_scaled', 
                    where="yearid > 2000")
createCentroidPlot(can)

# re-run with different thresholds, passing the toacanopy object
# from the previous call instead of the table arguments
can = computeCanopy(conn, canopy = can, looseDistance = 2, tightDistance = 0.5)
createCentroidPlot(can)

can = computeCanopy(conn, canopy = can, looseDistance = 4, tightDistance = 1)
createCentroidPlot(can)

# run k-means with the canopy object supplied as centers
km = computeKmeans(conn, centers=can, iterMax = 1000, persist = TRUE, 
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   centroidTableName = "kmeans_test_centroids",
                   tempTableName = "kmeans_test_temp",
                   clusteredTableName = "kmeans_test_clustered") 
createCentroidPlot(km)
}

teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.
