The canopy clustering algorithm runs in-database, returns centroids compatible with computeKmeans, and pre-processes data for k-means and other clustering algorithms.
channel |
connection object as returned by |
tableName |
Aster table name. |
looseDistance |
specifies the maximum distance that any point can be from a canopy center to be considered part of that canopy. |
tightDistance |
specifies the minimum distance that separates two canopy centers. |
canopy |
an object of class |
tableInfo |
pre-built summary of data to use (required when |
id |
column name or SQL expression containing unique table key. |
include |
a vector of column names with variables (must be numeric). Model never contains variables other than in the list. |
except |
a vector of column names to exclude from variables. Model never contains variables from the list. |
scale |
logical: if TRUE, scale each variable in-database before clustering. Scaling results in zero mean and unit
standard deviation for each input variable. when |
idAlias |
SQL alias for table id. This is required when SQL expression is given for |
where |
specifies criteria that table rows must satisfy before the computation
is applied. The criteria are expressed in the form of SQL predicates (inside
|
scaledTableName |
the name of the Aster table with results of scaling |
schema |
name of Aster schema that tables |
test |
logical: if TRUE, only show what would be done (similar to parameter |
Canopy clustering often precedes the k-means algorithm (see computeKmeans) or other clustering algorithms. The goal is to speed up clustering by choosing initial centroids more efficiently than random or naive selection, especially for big data applications. Important notes:
The function does not let you specify the number of canopies (clusters); instead, it controls them with a pair of threshold arguments, looseDistance and tightDistance. By adjusting them, one tunes computeCanopy to produce more or fewer canopies as desired.
Individual data points may be part of several canopies, and cluster membership is not available as a result of the operation.
The resulting toacanopy object should be passed to computeKmeans with the canopy argument, effectively overriding arguments in the kmeans function.
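The way the two thresholds shape the result can be sketched locally in plain R. This is only an in-memory illustration of the general canopy idea, not the Aster implementation: the function name canopy_sketch, the use of Euclidean distance, and the rule of taking the first remaining row as the next center are all assumptions made for the example.

```r
# Illustrative in-memory canopy pass (hypothetical helper, not part of the package).
# x: numeric matrix, one row per point; loose > tight are distance thresholds.
canopy_sketch <- function(x, loose, tight) {
  stopifnot(is.matrix(x), is.numeric(x), loose > tight, tight > 0)
  remaining <- seq_len(nrow(x))   # rows still eligible to become centers
  canopies <- list()
  while (length(remaining) > 0) {
    center <- remaining[1]        # assumption: take the first eligible row
    # Euclidean distance from the chosen center to every point
    d <- sqrt(rowSums(sweep(x, 2, x[center, ])^2))
    # every point within the loose distance joins this canopy (overlap allowed)
    canopies[[length(canopies) + 1]] <- list(center = x[center, ],
                                             members = which(d <= loose))
    # points within the tight distance can no longer start a new canopy
    remaining <- remaining[d[remaining] > tight]
  }
  canopies
}

# Six points on a line: three near 0, two near 5, one at 10
pts <- matrix(c(0, 0.1, 0.2, 5, 5.1, 10, rep(0, 6)), ncol = 2)

wide   <- canopy_sketch(pts, loose = 6,    tight = 5.5)
medium <- canopy_sketch(pts, loose = 2,    tight = 1)
narrow <- canopy_sketch(pts, loose = 0.15, tight = 0.05)

# Number of canopies produced by each pair of thresholds
c(wide = length(wide), medium = length(medium), narrow = length(narrow))

# Which of the narrow canopies contain the second point?
sapply(narrow, function(cn) 2 %in% cn$members)
```

Smaller thresholds leave more points eligible to become centers, so the canopy count grows, and because membership uses the loose distance while elimination uses the tight one, a single point can appear in several canopies.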
The function first scales non-null data (if scale=TRUE) or just eliminates nulls without scaling. After that, the given data (table tableName, optionally filtered with where) are clustered using the canopy algorithm in Aster. This results in a set of centroids to use as initial cluster centers in k-means and in pre-processed data persisted and ready for clustering with the k-means function computeKmeans.
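As a rough local analogue of the pre-processing step described above, the following base R sketch drops incomplete rows and then standardizes each column. It assumes that null elimination is row-wise and that scaling uses the sample standard deviation, as base R's scale() does; the in-database computation may differ in either respect. The data frame and its values are made up for illustration.

```r
# Toy data with one missing value (made-up numbers)
df <- data.frame(g = c(10, 20, NA, 40),
                 r = c(1, 2, 3, 5),
                 h = c(2, 4, 6, 9))

# Assumption: rows containing any NA are dropped before clustering
complete <- df[complete.cases(df), ]

# Center each column to mean 0 and scale it to standard deviation 1
scaled <- scale(complete)

colMeans(scaled)        # approximately 0 for every column
apply(scaled, 2, sd)    # approximately 1 for every column
```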
computeClusterSample, computeSilhouette, computeCanopy
if(interactive()){
  # computeCanopy and computeKmeans come from toaster; odbcDriverConnect from RODBC
  library(toaster)
  library(RODBC)

  # initialize connection to Lahman baseball database in Aster
  conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                           server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

  # first pass: scale the data, persist it, and compute canopies
  can = computeCanopy(conn, "batting", looseDistance = 1, tightDistance = 0.5,
                      id="playerid || '-' || stint || '-' || teamid || '-' || yearid",
                      include=c('g','r','h'),
                      scaledTableName='test_canopy_scaled',
                      where="yearid > 2000")
  createCentroidPlot(can)

  # re-run on the existing canopy object with wider thresholds
  can = computeCanopy(conn, canopy = can, looseDistance = 2, tightDistance = 0.5)
  createCentroidPlot(can)

  can = computeCanopy(conn, canopy = can, looseDistance = 4, tightDistance = 1)
  createCentroidPlot(can)

  # use the canopy centroids as initial centers for k-means
  km = computeKmeans(conn, centers=can, iterMax = 1000, persist = TRUE,
                     aggregates = c("COUNT(*) cnt", "AVG(g) avg_g", "AVG(r) avg_r", "AVG(h) avg_h"),
                     centroidTableName = "kmeans_test_centroids",
                     tempTableName = "kmeans_test_temp",
                     clusteredTableName = "kmeans_test_clustered")
  createCentroidPlot(km)
}