dssKmeans: Kmeans on datashield nodes

dssKmeans    R Documentation

Description

Runs kmeans on the remote data and returns stripped-down kmeans objects holding the cluster centers, computed either separately for each node ('split') or globally across all nodes ('combine').

Usage

dssKmeans(
  what,
  centers,
  iter.max = 10,
  nstart = 1,
  type = "combine",
  algorithm = "Forgy",
  membership_suffix = NULL,
  async = TRUE,
  datasources = NULL
)

Arguments

what

a character, the name of the dataframe. It may contain non-numeric columns, in which case only the numeric columns are used.

centers

either a number (k, the number of clusters) or a matrix of initial distinct cluster centers, one per row (same as for kmeans)

iter.max

same as kmeans, maximum number of iterations

nstart

same as kmeans: if centers is a number, how many random sets of initial centers should be tried

type

a character, 'split' or 'combine': should the function find one set of cluster centers for each node ('split') or the global cluster centers ('combine')? Default 'combine'.

algorithm

same as kmeans. Defaults to "Forgy" because it is the only one that doesn't error out in the case of empty clusters.

membership_suffix

a character. A factor with the cluster membership will be created on each node. Its name will be the name of the dataframe followed by this suffix. If NULL (the default), the suffix will be 'km_clust<number of clusters>'.

async

same as in datashield.assign

datasources

same as in datashield.assign
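
Several arguments are described as "same as kmeans". The following local sketch uses only stats::kmeans from base R (no DataSHIELD involved) to illustrate the two forms of centers together with iter.max, nstart and algorithm. The iris data and the chosen starting rows are assumptions made for illustration.

```r
# Local illustration with base R's kmeans; nothing here touches remote nodes.
x <- as.matrix(iris[, 1:4])   # numeric columns only
set.seed(1)

# centers as a number: k starting centers are drawn at random,
# and the procedure is repeated nstart times, keeping the best run.
fit_k <- kmeans(x, centers = 3, iter.max = 10, nstart = 5,
                algorithm = "Forgy")

# centers as a matrix: each row is one distinct starting center,
# so nstart is not needed.
init  <- x[c(1, 51, 101), ]
fit_m <- kmeans(x, centers = init, iter.max = 10, algorithm = "Forgy")

fit_k$centers          # one row per cluster, one column per variable
table(fit_m$cluster)   # cluster sizes
```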

Details

If type = 'split', the function simply executes kmeans with the provided arguments on each node and returns one set of cluster centers per node.

If type = 'combine' and centers is provided as a number, it first chooses a set of random initial centers from the ranges of the combined dataset, then executes exactly one iteration of kmeans (with these initial centers) on each node. The results are retrieved and averaged, and the newly obtained centers are sent back to the nodes for a new iteration. The process continues until iter.max is reached. If nstart > 1 (recommended for any meaningful results), the procedure is repeated with a new random set of initial centers until nstart runs have been completed. The 'best' cluster centers are then chosen as those with the lowest within-cluster sum of squared distances.

In both cases ('split' and 'combine') a factor representing the cluster membership of each point is created on the nodes. By default the name of the factor is derived from the dataframe name: <dataframe name>_km_clust<number of clusters> (see membership_suffix).

If iter.max is 0 and centers is a matrix, the function simply creates the cluster membership factor (as above) using the given centers.
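
The 'combine' scheme can be imitated locally to see how it behaves. The sketch below is not the package's implementation: it runs entirely in one R session, the two "nodes" are plain matrices, and the helper names node_step and combine_kmeans are hypothetical. Two further assumptions are made where the text above is not specific: per-node results are averaged with weights equal to the cluster sizes, and an empty cluster keeps its previous center.

```r
set.seed(42)

# Two hypothetical "nodes", each holding part of a numeric dataset.
nodes <- list(
  node1 = as.matrix(iris[1:75, 1:4]),
  node2 = as.matrix(iris[76:150, 1:4])
)

# One k-means step on a single node: assign each point to its nearest
# center, then return per-cluster sums, sizes and the within-cluster
# sum of squares measured against the centers that were sent in.
node_step <- function(x, centers) {
  k  <- nrow(centers)
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cl <- max.col(-d2, ties.method = "first")   # index of the smallest distance
  sums <- t(sapply(seq_len(k),
                   function(j) colSums(x[cl == j, , drop = FALSE])))
  list(sums  = sums,
       sizes = tabulate(cl, nbins = k),
       wss   = sum(d2[cbind(seq_len(nrow(x)), cl)]))
}

combine_kmeans <- function(nodes, k, iter.max = 10) {
  # Random initial centers drawn from the column ranges of the combined
  # data (pooled here only because this is a local simulation).
  rng <- apply(do.call(rbind, nodes), 2, range)
  centers <- matrix(apply(rng, 2, function(r) runif(k, r[1], r[2])), nrow = k)
  for (i in seq_len(iter.max)) {
    steps <- lapply(nodes, node_step, centers = centers)
    sums  <- Reduce(`+`, lapply(steps, `[[`, "sums"))
    sizes <- Reduce(`+`, lapply(steps, `[[`, "sizes"))
    keep  <- sizes > 0                        # assumption: empty clusters stay put
    centers[keep, ] <- sums[keep, , drop = FALSE] / sizes[keep]
  }
  final <- lapply(nodes, node_step, centers = centers)
  list(centers = centers,
       tot.withinss = sum(sapply(final, `[[`, "wss")))
}

# nstart = 5: repeat from fresh random centers and keep the run with the
# lowest total within-cluster sum of squares.
runs <- lapply(seq_len(5), function(i) combine_kmeans(nodes, k = 3))
best <- runs[[which.min(sapply(runs, `[[`, "tot.withinss"))]]
best$centers
```

With size-weighted averaging, each round gives the same centers as a single k-means step on the pooled data, although only per-cluster sums and counts leave each node. Because the initial centers are random points in the data range rather than actual observations, individual runs can end in poor solutions, which is consistent with the recommendation to use nstart > 1.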

Value

A list containing one (in the case of 'combine') or more (in the case of 'split') stripped-down kmeans objects.
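
A call might look like the sketch below. The wrapper run_dss_kmeans is hypothetical, as is the dataframe name "iris"; conns is assumed to be the same kind of connection object that datashield.assign accepts, and the dataframe is assumed to be already loaded on every node. The wrapper only defines the call, so nothing is sent to any server until it is invoked with live connections.

```r
# Hypothetical wrapper around dssKmeans; requires the dsSwissKnifeClient
# package and live DataSHIELD connections when it is actually called.
run_dss_kmeans <- function(conns, df_name = "iris", k = 3) {
  dsSwissKnifeClient::dssKmeans(
    what        = df_name,    # assumed name of a dataframe on the nodes
    centers     = k,          # number of clusters
    iter.max    = 30,
    nstart      = 10,         # several random starts, as recommended above
    type        = "combine",  # one global set of centers
    algorithm   = "Forgy",
    async       = TRUE,
    datasources = conns
  )
}

# Not run here:
# res <- run_dss_kmeans(conns)
# str(res)   # a list of stripped-down kmeans objects
```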


sib-swiss/dsSwissKnifeClient documentation built on July 16, 2025, 6:25 p.m.