ml_bisectingKmeans: Spark ML - Bisecting K-Means Clustering
In danzafar/tidyspark: A Tidy Interface to Spark


A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark. The algorithm starts from a single cluster that contains all points. It then iteratively finds divisible clusters on the bottom level and bisects each of them using k-means, until there are k leaf clusters in total or no leaf clusters are divisible. The bisecting steps of clusters on the same level are grouped together to increase parallelism. If bisecting all divisible clusters on the bottom level would result in more than k leaf clusters, larger clusters get higher priority.
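
The splitting loop described above can be sketched in base R. This is a simplified illustration, not the Spark implementation: the function name `bisecting_kmeans_sketch` and its arguments are hypothetical, it uses `stats::kmeans` for each bisection, and it splits one leaf cluster per step (the largest divisible one) instead of bisecting a whole level in parallel.

```r
# Simplified bisecting k-means in base R (NOT Spark's implementation).
# Assumption: one leaf is split per step, largest divisible leaf first,
# whereas the Spark version groups the bisections of a whole level.
bisecting_kmeans_sketch <- function(x, k, min_size = 2, max_iter = 20) {
  x <- as.matrix(x)
  cluster <- rep(1L, nrow(x))          # start with one cluster holding all points
  repeat {
    sizes <- table(cluster)
    if (length(sizes) >= k) break      # reached k leaf clusters
    ids <- as.integer(names(sizes))
    # Here a leaf counts as divisible if it has at least min_size points
    # and at least two distinct rows, so that 2-means is well defined.
    divisible <- Filter(function(id) {
      rows <- x[cluster == id, , drop = FALSE]
      nrow(rows) >= min_size && nrow(unique(rows)) >= 2
    }, ids)
    if (length(divisible) == 0) break  # no leaf can be split further
    # Larger clusters get higher priority.
    target <- divisible[which.max(sizes[as.character(divisible)])]
    idx <- which(cluster == target)
    fit <- kmeans(x[idx, , drop = FALSE], centers = 2, iter.max = max_iter)
    # Points in the second half of the split get a new cluster label.
    cluster[idx[fit$cluster == 2]] <- max(cluster) + 1L
  }
  cluster
}

set.seed(42)
labels <- bisecting_kmeans_sketch(iris[, 1:4], k = 4)
table(labels)
```

The function returns one integer label per row. Because each step adds exactly one leaf, the loop stops either at k leaves or when no leaf is divisible, which mirrors the two stopping conditions in the description.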

Get fitted result from a bisecting k-means model. Note: A saved-loaded model does not support this method.

ml_kmeans_bisecting(
  data,
  formula,
  k = 4,
  maxIter = 20,
  seed = NULL,
  minDivisibleClusterSize = 1
)

## S4 method for signature 'BisectingKMeansModel'
summary(object)

## S4 method for signature 'BisectingKMeansModel'
fitted(object, method = c("centers", "classes"))

## S4 method for signature 'BisectingKMeansModel,character'
write_ml(object, path, overwrite = FALSE)

`data`	a spark_tbl for training.
`formula`	a symbolic description of the model to be fitted. Currently only a few formula operators are supported, including '~', '.', ':', '+', and '-'. Note that the response variable of formula is empty in ml_bisectingKmeans.
`k`	the desired number of leaf clusters. Must be > 1. The actual number could be smaller if there are no divisible leaf clusters.
`maxIter`	the maximum number of iterations.
`seed`	the random seed.
`minDivisibleClusterSize`	the minimum number of points (if greater than or equal to 1.0) or the minimum proportion of points (if less than 1.0) that a cluster must contain to be divisible. This is an expert parameter; the default value should be good enough for most cases.
`object`	a fitted bisecting k-means model.
`method`	type of fitted results, `"centers"` for cluster centers or `"classes"` for assigned classes.
`path`	the directory where the model is saved.
`overwrite`	whether to overwrite the output path if it already exists. The default is FALSE, which throws an exception if the output path exists.
`...`	additional argument(s) passed to the method.
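
The two readings of `minDivisibleClusterSize` described above (a point count when the value is at least 1.0, a proportion otherwise) can be illustrated with a small helper. The helper `min_divisible_points` is hypothetical and not part of tidyspark; treating the proportion as a share of the total number of training points, and rounding up with `ceiling`, are assumptions made for this sketch.

```r
# Hypothetical helper, not part of tidyspark: converts a
# minDivisibleClusterSize value into a minimum point count.
# Assumptions: proportions are relative to the total number of points,
# and fractional results are rounded up.
min_divisible_points <- function(min_divisible_cluster_size, n_total) {
  if (min_divisible_cluster_size >= 1) {
    ceiling(min_divisible_cluster_size)            # read as a point count
  } else {
    ceiling(min_divisible_cluster_size * n_total)  # read as a proportion
  }
}

min_divisible_points(1, 200)     # the default: any non-empty cluster qualifies
min_divisible_points(25, 200)    # an absolute count of 25 points
min_divisible_points(0.25, 200)  # a quarter of the 200 points
```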

Fits a bisecting k-means clustering model against a spark_tbl. Users can call summary to print a summary of the fitted model, predict to make predictions on new data, and write_ml/read_ml to save and load fitted models.

ml_bisectingKmeans returns a fitted bisecting k-means model.

summary returns summary information of the fitted model, which is a list. The list includes the model's k (number of cluster centers), coefficients (model cluster centers), size (number of data points in each cluster), cluster (cluster centers of the transformed data; cluster is NULL if is.loaded is TRUE), and is.loaded (whether the model is loaded from a saved file).

fitted returns a spark_tbl containing fitted values.

summary(BisectingKMeansModel) since 2.2.0

write_ml(BisectingKMeansModel, character) since 2.2.0

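## Not run: 
library(tidyspark)
library(dplyr)

spark_session()

# Replace the dot in each column name (e.g. Sepal.Length -> Sepal_Length)
# and convert Species from a factor to a character column
iris_fix <- iris %>%
  setNames(sub(".", "_", names(iris), fixed = TRUE)) %>%
  mutate(Species = levels(Species)[Species])

# Use the renamed data so the formula columns exist in the Spark table
iris_spk <- spark_tbl(iris_fix)
model <- ml_bisectingKmeans(iris_spk, Sepal_Width ~ Sepal_Length, k = 4)
summary(model)

## End(Not run)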