K-means clustering algorithm that runs in-database, returns an object compatible with kmeans, and includes arbitrary aggregate metrics computed on the resulting clusters.
computeKmeans(channel, tableName, centers, threshold = 0.0395, iterMax = 10,
  tableInfo, id, include = NULL, except = NULL,
  aggregates = "COUNT(*) cnt", scale = TRUE, persist = FALSE,
  idAlias = gsub("[^0-9a-zA-Z]+", "_", id), where = NULL,
  scaledTableName = NULL, centroidTableName = NULL,
  clusteredTableName = NULL, tempTableName = NULL, schema = NULL,
  test = FALSE, version = "6.21")
channel |
connection object as returned by |
tableName |
Aster table name. This argument is ignored if |
centers |
either the number of clusters, say |
threshold |
the convergence threshold. When the centroids move by less than this amount, the algorithm has converged. |
iterMax |
the maximum number of iterations the algorithm will run before quitting if the convergence threshold has not been met. |
tableInfo |
pre-built summary of data to use (required when |
id |
column name or SQL expression containing unique table key. This argument is ignored if |
include |
a vector of column names with variables (must be numeric). Model never contains variables other than in the list.
This argument is ignored if |
except |
a vector of column names to exclude from variables. Model never contains variables from the list.
This argument is ignored if |
aggregates |
vector of SQL aggregates that define arbitrary aggregate metrics to be computed on each cluster
after running k-means. Aggregates may have optional aliases like in |
scale |
logical: if TRUE then scale each variable in-database before clustering. Scaling results in zero mean and unit
standard deviation for each input variable. when |
persist |
logical if TRUE then function saves clustered data in the table |
idAlias |
SQL alias for table id. This is required when SQL expression is given for |
where |
specifies criteria that table rows must satisfy before the
computation is applied. The criteria are expressed in the form of SQL predicates (inside |
scaledTableName |
the name of the Aster table with results of scaling. This argument is ignored if |
centroidTableName |
the name of the Aster table with centroids found by kmeans. |
clusteredTableName |
the name of the Aster table in which to store the clustered output. If omitted
and argument |
tempTableName |
name of the temporary Aster table used to store intermediate results. This table is always dropped when the function executes successfully. |
schema |
name of Aster schema that tables |
test |
logical: if TRUE show what would be done, only (similar to parameter |
version |
version of Aster Analytics Foundation functions applicable when |
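The default for idAlias in the usage above is built with gsub, so its effect can be checked locally without a database connection. The following is a minimal sketch that applies the same expression to the composite id used in the examples on this page; it illustrates only the default alias, not how the function uses it internally.

```r
# Composite key expression, copied from the Examples section of this page
id <- "playerid || '-' || stint || '-' || teamid || '-' || yearid"

# Default idAlias: every run of non-alphanumeric characters becomes one "_"
idAlias <- gsub("[^0-9a-zA-Z]+", "_", id)
print(idAlias)
# [1] "playerid_stint_teamid_yearid"
```

When id is a plain column name such as "playerid", the same expression returns the name unchanged.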
The function first scales not-null data (if scale=TRUE) or just removes data with NULLs without scaling.
After that the given data (table tableName, optionally filtered with where) are clustered by
k-means in Aster. Next, all standard metrics of k-means clusters plus the additional aggregates provided with
aggregates are calculated, again in-database.
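The three steps described above can be approximated locally with base R, which may help when reasoning about what the in-database run produces. This is only an analogue under stated assumptions: the toy data frame is made up, stats::kmeans stands in for the Aster k-means function, and NA plays the role of SQL NULL. It is not the code that computeKmeans executes.

```r
# Toy data with one missing value (hypothetical, for illustration only)
set.seed(42)
df <- data.frame(
  g = c(rnorm(20, 50, 5), rnorm(20, 150, 5)),
  r = c(rnorm(20, 10, 2), rnorm(20, 90, 2)),
  h = c(rnorm(20, 20, 3), rnorm(20, 160, 3))
)
df$h[3] <- NA

# Step 1: keep only rows without NULLs (NA in R), then scale each column
complete <- df[complete.cases(df), ]
scaled <- scale(complete)  # zero mean, unit standard deviation per column

# Step 2: cluster the scaled data
fit <- kmeans(scaled, centers = 2, iter.max = 10)

# Step 3: per-cluster aggregates, analogous to "COUNT(*) cnt" and "AVG(g) avg_g"
aggs <- data.frame(
  cnt   = as.vector(table(fit$cluster)),
  avg_g = as.vector(tapply(complete$g, fit$cluster, mean))
)
print(aggs)
```

Note that the aggregates are computed on the original, unscaled values of the retained rows, which is why the sketch keeps complete alongside scaled.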
computeKmeans returns an object of class "toakmeans" (compatible with class "kmeans").
It is a list with at least the following components:
cluster
A vector of integers (from 0:K-1) indicating the cluster to which each point is allocated.
computeKmeans leaves this component empty. Use function computeClusterSample to set this component.
centers
A matrix of cluster centres.
totss
The total sum of squares.
withinss
Vector of within-cluster sum of squares, one component per cluster.
tot.withinss
Total within-cluster sum of squares, i.e. sum(withinss).
betweenss
The between-cluster sum of squares, i.e. totss-tot.withinss.
size
The number of points in each cluster. This includes all points in the specified Aster table that
satisfy the optional where condition.
iter
The number of (outer) iterations.
ifault
integer: indicator of a possible algorithm problem (always 0).
scale
logical: indicates if variable scaling was performed before clustering.
persist
logical: indicates if clustered data was saved in the table.
aggregates
Vectors (dataframe) of aggregates computed on each cluster.
tableName
Aster table name containing data for clustering.
columns
Vector of column names with variables used for clustering.
scaledTableName
Aster table containing scaled data for clustering.
centroidTableName
Aster table containing cluster centroids.
clusteredTableName
Aster table containing clustered output.
id
Column name or SQL expression containing unique table key.
idAlias
SQL alias for table id.
whereClause
SQL WHERE clause expression used (if any).
time
An object of class proc_time with user, system, and total elapsed times
for the computeKmeans function call.
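Because the returned object is described as compatible with class "kmeans", the relationships among the sum-of-squares components listed above can be checked on an ordinary stats::kmeans result. The sketch below uses made-up data and base R only; it assumes, rather than verifies, that a "toakmeans" object satisfies the same identities.

```r
# Two well-separated groups of made-up points
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2)

# tot.withinss is the sum of the per-cluster within sums of squares
print(all.equal(fit$tot.withinss, sum(fit$withinss)))

# betweenss is the total sum of squares minus tot.withinss
print(all.equal(fit$betweenss, fit$totss - fit$tot.withinss))

# size holds one count per cluster and adds up to the number of rows
print(sum(fit$size) == nrow(x))

# Base R labels clusters 1..K; subtracting one gives the 0..K-1
# convention documented for the cluster component on this page
zero_based <- fit$cluster - 1L
print(range(zero_based))
```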
computeClusterSample, computeSilhouette, computeCanopy
if(interactive()){
# initialize connection to Lahman baseball database in Aster
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")
km = computeKmeans(conn, "batting", centers=5, iterMax = 25,
aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
include=c('g','r','h'), scaledTableName='kmeans_test_scaled',
centroidTableName='kmeans_test_centroids',
where="yearid > 2000")
km
createCentroidPlot(km)
createClusterPlot(km)
# persist clustered data
kmc = computeKmeans(conn, "batting", centers=5, iterMax = 250,
aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
include=c('g','r','h'),
persist = TRUE,
scaledTableName='kmeans_test_scaled',
centroidTableName='kmeans_test_centroids',
clusteredTableName = 'kmeans_test_clustered',
tempTableName = 'kmeans_test_temp',
where="yearid > 2000")
createCentroidPlot(kmc)
createCentroidPlot(kmc, format="bar_dodge")
createCentroidPlot(kmc, format="heatmap", coordFlip=TRUE)
createClusterPlot(kmc)
kmc = computeClusterSample(conn, kmc, 0.01)
createClusterPairsPlot(kmc, title="Batters Clustered by G, H, R", ticks=FALSE)
kmc = computeSilhouette(conn, kmc)
createSilhouetteProfile(kmc, title="Cluster Silhouette Histograms (Profiles)")
}