computeKmeans: Perform k-means clustering on the table.
In toaster: Big Data in-Database Analytics that Scales with Teradata Aster Distributed Platform


Runs the k-means clustering algorithm in-database and returns an object compatible with kmeans that also includes arbitrary aggregate metrics computed on the resulting clusters.

computeKmeans(channel, tableName, centers, threshold = 0.0395, iterMax = 10,
  tableInfo, id, include = NULL, except = NULL,
  aggregates = "COUNT(*) cnt", scale = TRUE, persist = FALSE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, centroidTableName = NULL,
  clusteredTableName = NULL, tempTableName = NULL, schema = NULL,
  test = FALSE, version = "6.21")
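The default for `idAlias` is derived from `id` by collapsing every run of non-alphanumeric characters into a single underscore. A quick check in plain R, using the key expression from the Examples section (no Aster connection is needed):

```r
# Key expression taken from the Examples section
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Same expression as the default value of idAlias in the usage above
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)
print(idAlias)
# [1] "playerid_stint_teamid_yearid"
```

When `id` is a plain column name such as `"playerid"`, the default alias is the column name itself.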

`channel`	connection object as returned by `odbcConnect`.
`tableName`	Aster table name. This argument is ignored if `centers` is a canopy object.
`centers`	either the number of clusters, say `k`, a matrix of initial (distinct) cluster centres, or an object of class `"toacanopy"` obtained with `computeCanopy`. If a number, a random set of (distinct) rows in the table is chosen as the initial centers. If a matrix, the number of rows determines the number of clusters, as each row defines an initial center. If a canopy object, the number of centers it contains determines the number of clusters, plus it provides (and overrides) the following arguments: `tableName`, `id`, `idAlias`, `include`, `except`, `scale`, `where`, `scaledTableName`, `schema`.
`threshold`	the convergence threshold. When the centroids move by less than this amount, the algorithm has converged.
`iterMax`	the maximum number of iterations the algorithm will run before quitting if the convergence threshold has not been met.
`tableInfo`	pre-built summary of data to use (required when `test=TRUE`). See `getTableSummary`.
`id`	column name or SQL expression containing unique table key. This argument is ignored if `centers` is a canopy object.
`include`	a vector of column names with variables (must be numeric). The model never contains variables other than those in the list. This argument is ignored if `centers` is a canopy object.
`except`	a vector of column names to exclude from variables. The model never contains variables from the list. This argument is ignored if `centers` is a canopy object.
`aggregates`	vector of SQL aggregates that define arbitrary aggregate metrics to be computed on each cluster after running k-means. Aggregates may have optional aliases, as in `"AVG(era) avg_era"`. They are subsequently used in `createClusterPlot` as cluster properties.
`scale`	logical: if `TRUE`, scale each variable in-database before clustering. Scaling results in zero mean and unit standard deviation for each input variable. When `FALSE`, the function only removes incomplete data (rows containing `NULL`s) before clustering. This argument is ignored if `centers` is a canopy object.
`persist`	logical: if `TRUE`, the function saves the clustered data, with cluster id assigned, in the table `clusteredTableName` (when defined). Aster Analytics Foundation 6.20 or earlier can't support this option and so must use `persist=TRUE`.
`idAlias`	SQL alias for table id. This is required when SQL expression is given for `id`. This argument is ignored if `centers` is a canopy object.
`where`	specifies criteria that table rows must satisfy before the computation is applied. The criteria are expressed in the form of SQL predicates (inside a `WHERE` clause). This argument is ignored if `centers` is a canopy object.
`scaledTableName`	the name of the Aster table with results of scaling. This argument is ignored if `centers` is a canopy object.
`centroidTableName`	the name of the Aster table with centroids found by kmeans.
`clusteredTableName`	the name of the Aster table in which to store the clustered output. If omitted and `persist = TRUE`, a random table name is generated (it is always saved in the resulting `toakmeans` object). If `persist = FALSE`, the name is ignored and the function does not generate a table of clustered output.
`tempTableName`	name of the temporary Aster table to use to store intermediate results. This table always gets dropped when function executes successfully.
`schema`	name of the Aster schema that the tables `scaledTableName`, `centroidTableName`, and `clusteredTableName` belong to. When this argument is supplied, make sure that none of the table names includes a schema in its name.
`test`	logical: if `TRUE`, only show what would be done (similar to the parameter `test` in the RODBC functions `sqlQuery` and `sqlSave`).
`version`	version of Aster Analytics Foundation functions applicable when `test=TRUE`, ignored otherwise.
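To make the roles of `threshold` and `iterMax` concrete, here is a minimal local sketch of a Lloyd-style k-means loop in base R. It is not the in-database Aster implementation: the helper `lloydSketch` is hypothetical, and measuring centroid movement as the largest Euclidean shift of any centroid is an assumption, since the measure used by Aster is not stated here.

```r
# Hypothetical helper: a local Lloyd-style loop showing how a convergence
# threshold and an iteration cap interact. Cluster labels here are 1-based.
lloydSketch <- function(x, centers, threshold = 0.0395, iterMax = 10) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  converged <- FALSE
  iter <- 0
  while (iter < iterMax && !converged) {
    iter <- iter + 1
    # squared distance from every point to every centroid (n x k matrix)
    d <- sapply(seq_len(nrow(centers)), function(j)
      rowSums(sweep(x, 2, centers[j, ])^2))
    # assign each point to its nearest centroid
    cluster <- max.col(-d, ties.method = "first")
    # recompute centroids; an empty cluster keeps its previous centroid
    newCenters <- centers
    for (j in seq_len(nrow(centers))) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) newCenters[j, ] <- colMeans(members)
    }
    # assumption: movement = largest Euclidean shift of any centroid
    shift <- max(sqrt(rowSums((newCenters - centers)^2)))
    centers <- newCenters
    converged <- shift < threshold
  }
  list(centers = centers, cluster = cluster, iter = iter, converged = converged)
}

# Usage on a built-in data set, with three rows as initial centers
x <- scale(iris[, 1:4])
fit <- lloydSketch(x, centers = x[c(1, 51, 101), ], iterMax = 25)
fit$iter       # number of iterations actually run (at most 25)
fit$converged  # TRUE if the threshold was met before iterMax was reached
```

If `converged` is `FALSE`, the loop stopped because it hit `iterMax`, which is the situation a larger `iterMax` or a looser `threshold` is meant to address.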

The function first scales the not-null data (if scale=TRUE) or just removes rows with NULLs without scaling. After that, the data given (table tableName, optionally filtered with where) are clustered by k-means in Aster. Finally, all standard metrics of the k-means clusters, plus the additional aggregates provided with aggregates, are calculated, again in-database.
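The three steps described above can be mimicked locally with base R. This is only an in-memory analogue under stated assumptions: the built-in `airquality` data set stands in for the Aster table, `NA` plays the role of SQL `NULL`, and `stats::kmeans` replaces the in-database clustering.

```r
# Assumption: airquality (built-in, contains NAs) stands in for the Aster table
vars <- c("Ozone", "Solar.R", "Wind", "Temp")
raw <- airquality[, vars]

# Step 1: drop incomplete rows, then scale to zero mean and unit sd
complete <- raw[complete.cases(raw), ]
scaled <- scale(complete)

# Step 2: cluster the scaled data
set.seed(42)
km <- kmeans(scaled, centers = 3, iter.max = 25)

# Step 3: per-cluster aggregates, analogous to
# aggregates = c("COUNT(*) cnt", "AVG(Temp) avg_temp")
aggs <- data.frame(
  cnt = as.vector(table(km$cluster)),
  avg_temp = as.vector(tapply(complete$Temp, km$cluster, mean))
)
aggs
```

Note that the aggregates are computed on the original (unscaled) values, grouped by the cluster assignment obtained from the scaled values.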

computeKmeans returns an object of class "toakmeans" (compatible with class "kmeans"). It is a list with at least the following components:

cluster: A vector of integers (from 0 to K-1) indicating the cluster to which each point is allocated. computeKmeans leaves this component empty. Use function computeClusterSample to set this component.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sum of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster. This includes all points in the specified Aster table that satisfy the optional where condition.
iter: The number of (outer) iterations.
ifault: integer: indicator of a possible algorithm problem (always 0).
scale: logical: indicates if variable scaling was performed before clustering.
persist: logical: indicates if clustered data was saved in the table.
aggregates: Vectors (dataframe) of aggregates computed on each cluster.
tableName: Aster table name containing data for clustering.
columns: Vector of column names with variables used for clustering.
scaledTableName: Aster table containing scaled data for clustering.
centroidTableName: Aster table containing cluster centroids.
clusteredTableName: Aster table containing clustered output.
id: Column name or SQL expression containing unique table key.
idAlias: SQL alias for table id.
whereClause: SQL WHERE clause expression used (if any).
time: An object of class proc_time with user, system, and total elapsed times for the computeKmeans function call.
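Because the sum-of-squares components follow the same definitions as in class "kmeans", their relationships can be checked locally with `stats::kmeans`. The sketch below uses the built-in `iris` data as a stand-in (an assumption; no Aster connection is involved). One difference to keep in mind: `stats::kmeans` labels clusters 1 to K, whereas the `cluster` component described above uses 0 to K-1.

```r
set.seed(7)
x <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 5)

# tot.withinss is the sum of the per-cluster withinss
stopifnot(isTRUE(all.equal(km$tot.withinss, sum(km$withinss))))

# betweenss is totss minus tot.withinss
stopifnot(isTRUE(all.equal(km$betweenss, km$totss - km$tot.withinss)))

# size counts the points in each cluster
stopifnot(sum(km$size) == nrow(x))

# share of total variance explained by the clustering
explained <- km$betweenss / km$totss
explained
```

The same ratio, `betweenss / totss`, is a common quick summary when comparing runs with different numbers of clusters.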

computeClusterSample, computeSilhouette, computeCanopy

if(interactive()){
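# load toaster (odbcDriverConnect comes from its RODBC dependency)
library(toaster)

# initialize connection to Lahman baseball database in Aster
# (the connection string is built with paste0 so it contains no line breaks)
conn = odbcDriverConnect(connection=paste0("driver={Aster ODBC Driver};",
                         "server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>"))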
                         
km = computeKmeans(conn, "batting", centers=5, iterMax = 25,
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                   include=c('g','r','h'), scaledTableName='kmeans_test_scaled', 
                   centroidTableName='kmeans_test_centroids',
                   where="yearid > 2000")
km
createCentroidPlot(km)
createClusterPlot(km)

# persist clustered data
kmc = computeKmeans(conn, "batting", centers=5, iterMax = 250,
                   aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                   id="playerid || '-' || stint || '-' || teamid || '-' || yearid", 
                   include=c('g','r','h'), 
                   persist = TRUE, 
                   scaledTableName='kmeans_test_scaled', 
                   centroidTableName='kmeans_test_centroids', 
                   clusteredTableName = 'kmeans_test_clustered',
                   tempTableName = 'kmeans_test_temp',
                   where="yearid > 2000")
createCentroidPlot(kmc)
createCentroidPlot(kmc, format="bar_dodge")
createCentroidPlot(kmc, format="heatmap", coordFlip=TRUE)

createClusterPlot(kmc)

kmc = computeClusterSample(conn, kmc, 0.01)
createClusterPairsPlot(kmc, title="Batters Clustered by G, H, R", ticks=FALSE)

kmc = computeSilhouette(conn, kmc)
createSilhouetteProfile(kmc, title="Cluster Silhouette Histograms (Profiles)")

}


toaster documentation built on May 30, 2017, 3:51 a.m.
