dssKmeans: Kmeans on datashield nodes
In sib-swiss/dsSwissKnifeClient: DataSHIELD Tools and Utilities - client side

Description

Runs kmeans on the remote data and returns a kmeans object holding the cluster centers, either one set per node ('split') or a single global set ('combine').

Usage

dssKmeans(
  what,
  centers,
  iter.max = 10,
  nstart = 1,
  type = "combine",
  algorithm = "Forgy",
  membership_suffix = NULL,
  async = TRUE,
  datasources = NULL
)

Arguments

`what`	a character, name of the dataframe (it can contain non-numerics in which case only the numeric columns will be used)
`centers`	either a number (k, the number of clusters) or a matrix of initial distinct cluster centers (same as for kmeans)
`iter.max`	same as kmeans, maximum number of iterations
`nstart`	same as kmeans, if centers is a number, how many random sets should be chosen
`type`	a character, 'split' or 'combine'. With 'combine' the function finds one global set of cluster centers; with 'split' it finds one set for each node. Default 'combine'.
`algorithm`	same as kmeans; it defaults to "Forgy" because it is the only algorithm that doesn't error out in the case of empty clusters
`membership_suffix`	a character. A factor with the cluster membership will be created on each node. Its name will be the name of the dataframe followed by this suffix. If NULL (the default), the factor is named <dataframe name>_km_clust<number of clusters>.
`async`	same as in datashield.assign
`datasources`	same as in datashield.assign
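
A call using the arguments above might look like the following sketch. It relies only on the signature shown under Usage. The wrapper name `run_dss_kmeans`, the `conns` object (assumed to hold already-opened DataSHIELD connections) and the remote dataframe name "iris" are hypothetical and serve only as illustration.

```r
# Sketch only: `conns` and the remote dataframe "iris" are assumptions,
# not objects provided by the package.
run_dss_kmeans <- function(conns) {
  dsSwissKnifeClient::dssKmeans(
    what = "iris",        # name of a dataframe that exists on every node
    centers = 3,          # k, the number of clusters
    iter.max = 10,
    nstart = 10,          # several random starts, as Details recommends nstart > 1
    type = "combine",     # one global set of centers
    datasources = conns
  )
}
```

Wrapping the call in a function keeps the sketch loadable without a live connection; the function only contacts the nodes when it is called with real connections.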

Details

If type = 'split', the function simply executes kmeans with the provided arguments and returns one set of cluster centers for each node.

If type = 'combine' and centers is provided as a number, the function first chooses a set of random initial centers from the ranges of the combined dataset, then executes exactly one iteration of kmeans (with these initial centers) on each node. The results are retrieved and averaged, and the newly obtained centers are sent to the nodes for a new iteration. The process continues until iter.max is reached. If nstart > 1 (recommended for any meaningful results), the whole procedure is repeated with a new random set of initial centers until nstart runs have been completed. The 'best' cluster centers are then the ones with the lowest within-cluster sum of squared distances.

In both cases ('split' and 'combine') a factor representing the cluster membership of each point is created on the nodes. By default the name of the factor is derived from the dataframe name: <dataframe name>_km_clust<number of clusters>.

If iter.max is 0 and centers is a matrix, the function simply creates the cluster membership factor (as above) using the given centers.
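
The 'combine' procedure can be illustrated with a purely local simulation in base R. This is a minimal sketch and not the package's implementation: each element of the list `nodes` stands in for one DataSHIELD node, and the helper names (`nearest_center`, `node_step`, `combine_kmeans`) are hypothetical. The source does not say how the per-node results are averaged, so the sketch assumes an average weighted by cluster size, and it assumes that an empty cluster keeps its previous center.

```r
# Local simulation only: each list element plays the role of one node.
set.seed(42)
x <- as.matrix(iris[, 1:4])
node_id <- rep(1:3, length.out = nrow(x))
nodes <- lapply(split(seq_len(nrow(x)), node_id), function(i) x[i, , drop = FALSE])

# Index of the closest center (squared Euclidean distance) for every row of m.
nearest_center <- function(m, centers) {
  d <- vapply(seq_len(nrow(centers)),
              function(j) rowSums(sweep(m, 2, centers[j, ])^2),
              numeric(nrow(m)))
  max.col(-matrix(d, nrow = nrow(m)), ties.method = "first")
}

# One iteration on one node: only aggregates leave the node.
node_step <- function(m, centers) {
  k <- nrow(centers)
  cl <- nearest_center(m, centers)
  sums <- t(vapply(seq_len(k),
                   function(j) colSums(m[cl == j, , drop = FALSE]),
                   numeric(ncol(m))))
  list(sums = sums,
       counts = tabulate(cl, nbins = k),
       wss = sum((m - centers[cl, , drop = FALSE])^2))
}

combine_kmeans <- function(nodes, k, iter.max = 10, nstart = 1) {
  # Ranges of the combined dataset, built from per-node ranges.
  rng <- apply(do.call(rbind, lapply(nodes, function(m) apply(m, 2, range))), 2, range)
  p <- ncol(rng)
  best <- NULL
  for (s in seq_len(nstart)) {
    # Random initial centers drawn within the combined ranges.
    centers <- matrix(runif(k * p, min = rep(rng[1, ], each = k),
                            max = rep(rng[2, ], each = k)), nrow = k)
    for (it in seq_len(iter.max)) {
      steps <- lapply(nodes, node_step, centers = centers)
      counts <- Reduce(`+`, lapply(steps, `[[`, "counts"))
      sums <- Reduce(`+`, lapply(steps, `[[`, "sums"))
      keep <- counts > 0  # assumption: empty clusters keep their old center
      centers[keep, ] <- sums[keep, , drop = FALSE] / counts[keep]
    }
    wss <- sum(vapply(nodes, function(m) node_step(m, centers)$wss, numeric(1)))
    # Keep the run with the lowest within-cluster sum of squares.
    if (is.null(best) || wss < best$tot.withinss) {
      best <- list(centers = centers, tot.withinss = wss)
    }
  }
  best
}

fit <- combine_kmeans(nodes, k = 3, iter.max = 10, nstart = 5)

# Membership factor per node, analogous to the factor created remotely.
membership <- lapply(nodes, function(m) {
  factor(nearest_center(m, fit$centers), levels = 1:3)
})
```

The last step also mirrors the iter.max = 0 case: with a fixed matrix of centers, only the nearest-center assignment is computed and no centers are updated.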