Runs the k-means clustering algorithm in-database, returns an object compatible with kmeans, and
includes arbitrary aggregate metrics computed on the resulting clusters.
computeKmeans(channel, tableName, centers, threshold = 0.0395, iterMax = 10,
  tableInfo, id, include = NULL, except = NULL,
  aggregates = "COUNT(*) cnt", scale = TRUE, persist = FALSE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, centroidTableName = NULL,
  clusteredTableName = NULL, tempTableName = NULL, schema = NULL,
  test = FALSE, version = "6.21")
channel |
connection object as returned by |
tableName |
Aster table name. This argument is ignored if |
centers |
either the number of clusters, say |
threshold |
the convergence threshold. When the centroids move by less than this amount, the algorithm has converged. |
iterMax |
the maximum number of iterations the algorithm will run before quitting if the convergence threshold has not been met. |
tableInfo |
pre-built summary of data to use (required when |
id |
column name or SQL expression containing unique table key. This argument is ignored if |
include |
a vector of column names with variables (must be numeric). The model never contains variables other than those in this list.
This argument is ignored if |
except |
a vector of column names to exclude from variables. The model never contains variables from this list.
This argument is ignored if |
aggregates |
vector of SQL aggregates that define arbitrary aggregate metrics to be computed on each cluster
after running k-means. Aggregates may have optional aliases like in |
scale |
logical: if TRUE, scale each variable in-database before clustering. Scaling results in zero mean and unit
standard deviation for each input variable. when |
persist |
logical: if TRUE, the function saves clustered data in the table |
idAlias |
SQL alias for table id. This is required when SQL expression is given for |
where |
specifies criteria that table rows must satisfy before the computation
is applied. The criteria are expressed in the form of SQL predicates (inside
|
scaledTableName |
the name of the Aster table with results of scaling. This argument is ignored if |
centroidTableName |
the name of the Aster table with centroids found by kmeans. |
clusteredTableName |
the name of the Aster table in which to store the clustered output. If omitted
and argument |
tempTableName |
name of the temporary Aster table used to store intermediate results. This table is always dropped when the function executes successfully. |
schema |
name of Aster schema that tables |
test |
logical: if TRUE, only show what would be done (similar to parameter |
version |
version of Aster Analytics Foundation functions applicable when |
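The default for idAlias shown in the usage is computed with gsub, which collapses every run of non-alphanumeric characters in id into a single underscore. As a small local illustration (plain base R, no database connection involved), this is what that default evaluates to for a composite key written as a SQL expression, of the same form as the one used in the examples:

```r
# Composite key written as a SQL expression.
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Same expression as the default value of the idAlias argument.
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)

print(idAlias)
# [1] "playerid_stint_teamid_yearid"
```

When the derived alias is not suitable as a column name, idAlias can be passed explicitly instead.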
The function first scales the not-null data (if scale=TRUE) or, without scaling, just removes rows containing NULLs.
The resulting data (table tableName, optionally filtered with where) are then clustered by
k-means in Aster. Finally, all standard metrics of k-means clusters, plus any additional aggregates provided with
aggregates, are calculated, again in-database.
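The same sequence of steps can be tried out locally to get a feel for what the in-database pipeline does. The sketch below is only an analogue in base R, not the code that computeKmeans runs: the toy data frame and its values are made up, NA stands in for SQL NULL, and base R's scale() (which divides by the sample standard deviation) is assumed to be a reasonable stand-in for the in-database scaling.

```r
# Toy data with one missing value; the values are hypothetical.
set.seed(42)
df <- data.frame(g = c(10, 150, 162, NA, 20, 140),
                 r = c(1, 90, 110, 50, 3, 80),
                 h = c(2, 160, 190, 70, 5, 150))

# Step 1: drop rows containing missing values (NA stands in for SQL NULL).
complete <- df[complete.cases(df), ]

# Step 2: scale each variable to zero mean and unit standard deviation.
scaled <- scale(complete)

# Step 3: cluster the scaled data with base R k-means.
km <- kmeans(scaled, centers = 2, iter.max = 10)

print(round(colMeans(scaled), 10))  # means are zero after scaling
print(apply(scaled, 2, sd))         # standard deviations are one
print(km$size)                      # cluster sizes add up to the 5 complete rows
```

Note that base R kmeans has no direct counterpart of the threshold argument; only iter.max corresponds to iterMax here.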
computeKmeans returns an object of class "toakmeans" (compatible with class "kmeans").
It is a list with at least the following components:
cluster: A vector of integers (from 0:K-1) indicating the cluster to which each point is allocated.
  computeKmeans leaves this component empty. Use the function computeClusterSample to set this component.
centers: A matrix of cluster centres.
totss: The total sum of squares.
withinss: Vector of within-cluster sums of squares, one component per cluster.
tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
betweenss: The between-cluster sum of squares, i.e. totss-tot.withinss.
size: The number of points in each cluster. This includes all points in the specified Aster table that
  satisfy the optional where condition.
iter: The number of (outer) iterations.
ifault: integer: indicator of a possible algorithm problem (always 0).
scale: logical: indicates whether variable scaling was performed before clustering.
persist: logical: indicates whether clustered data was saved in the table.
aggregates: Vectors (data frame) of aggregates computed on each cluster.
tableName: Aster table name containing data for clustering.
columns: Vector of column names with variables used for clustering.
scaledTableName: Aster table containing scaled data for clustering.
centroidTableName: Aster table containing cluster centroids.
clusteredTableName: Aster table containing clustered output.
id: Column name or SQL expression containing unique table key.
idAlias: SQL alias for table id.
whereClause: SQL WHERE clause expression used (if any).
time: An object of class proc_time with user, system, and total elapsed times
  for the computeKmeans function call.
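The relationships among the sum-of-squares components listed above can be checked on any kmeans-compatible object. The following is a minimal sketch using base R kmeans on the built-in iris data rather than computeKmeans, on the assumption that the "toakmeans" components follow the same definitions as stated above. It also shows the difference in cluster labelling: base R numbers clusters from 1 to K, while the cluster component described here uses 0 to K-1.

```r
set.seed(1)
x <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 5)

# tot.withinss is the sum of the per-cluster within sums of squares.
print(all.equal(km$tot.withinss, sum(km$withinss)))

# betweenss is what remains of totss after removing tot.withinss.
print(all.equal(km$betweenss, km$totss - km$tot.withinss))

# Base R labels clusters 1..K; shifting by one gives 0..K-1 labels.
zeroBased <- km$cluster - 1L
print(range(zeroBased))
```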
computeClusterSample, computeSilhouette, computeCanopy
if(interactive()){
  # initialize connection to Lahman baseball database in Aster
  conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
    server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

  # cluster batting records after 2000 on games, runs and hits
  km = computeKmeans(conn, "batting", centers=5, iterMax = 25,
                     aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                     id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
                     include=c('g','r','h'), scaledTableName='kmeans_test_scaled',
                     centroidTableName='kmeans_test_centroids',
                     where="yearid > 2000")
  km
  createCentroidPlot(km)
  createClusterPlot(km)

  # persist clustered data
  kmc = computeKmeans(conn, "batting", centers=5, iterMax = 250,
                      aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                      id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
                      include=c('g','r','h'),
                      persist = TRUE,
                      scaledTableName='kmeans_test_scaled',
                      centroidTableName='kmeans_test_centroids',
                      clusteredTableName = 'kmeans_test_clustered',
                      tempTableName = 'kmeans_test_temp',
                      where="yearid > 2000")
  createCentroidPlot(kmc)
  createCentroidPlot(kmc, format="bar_dodge")
  createCentroidPlot(kmc, format="heatmap", coordFlip=TRUE)
  createClusterPlot(kmc)

  # sample clustered data (sets the cluster component), then plot it
  kmc = computeClusterSample(conn, kmc, 0.01)
  createClusterPairsPlot(kmc, title="Batters Clustered by G, H, R", ticks=FALSE)

  # compute and plot silhouette profiles
  kmc = computeSilhouette(conn, kmc)
  createSilhouetteProfile(kmc, title="Cluster Silhouette Histograms (Profiles)")
}