ds.kmeans: K-Means clustering of distributed table
In isglobal-brge/dsMLClient: DataSHIELD client site machine learning functions

ds.kmeans

R Documentation

K-Means clustering of distributed table

Description

Performs k-means clustering on a distributed table using Euclidean distance.

Usage

ds.kmeans(
  x,
  k = NULL,
  convergence = 0.001,
  max.iter = 100,
  centroids = NULL,
  assign = TRUE,
  name = NULL,
  datasources = NULL
)

Arguments

`x`	`character` Name of the data frame on the study server with the data to train the k-means
`k`	`numeric` (default `NULL`) Integer number of clusters to find
`convergence`	`numeric` (default `0.001`) Error threshold below which the iterations stop
`max.iter`	`numeric` (default `100`) Maximum number of iterations before the algorithm stops
`centroids`	`data frame` (default `NULL`) Starting centroids. If `NULL`, random starting centroids are calculated using the 10/90 inter-quartile range. If a value is supplied, those centroids are used to start the algorithm. The data frame must be structured as follows: each column corresponds to a centroid, so 3 columns correspond to a k-means with k = 3; each row corresponds to the value of one variable, and the rows must match the variables of the data frame 'x' on the server in both number and order.
`assign`	`bool` (default `TRUE`) If `TRUE`, the clustering results are added to the data frame on the server side
`name`	`character` (default `NULL`) If `NULL` and `assign = TRUE`, the original table 'x' is overwritten on the server side with an additional column named 'kmeans.cluster' that contains the results of the k-means. If a value is provided for this argument, a new object is created on the server side with the values from the original table 'x' plus the new 'kmeans.cluster' column.
`datasources`	a list of `DSConnection-class` (default `NULL`) objects obtained after login
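
The layout expected for `centroids` is easy to get transposed, so here is a minimal sketch of how such a data frame could be built. The variable names, the numeric values, the symbol "D", the object name "D_clustered" and the connection object `conns` are all hypothetical. The `ds.kmeans` call is left commented out because it needs an active DataSHIELD session; `k` is passed explicitly only to mirror the number of columns.

```r
# Hypothetical variables; they must match the variables of the server-side
# table 'x' in both number and order.
vars <- c("age", "bmi", "glucose")

# One column per starting centroid (k = 3), one row per variable.
centroids <- data.frame(
  c1 = c(35, 22.0,  85),
  c2 = c(50, 27.5, 100),
  c3 = c(65, 31.0, 130),
  row.names = vars
)

k <- ncol(centroids)  # number of clusters implied by the data frame
stopifnot(nrow(centroids) == length(vars))

# Assuming a logged-in session stored in 'conns' and the table assigned to
# the symbol "D" (both hypothetical), the call could look like:
# ds.kmeans("D", k = k, centroids = centroids, assign = TRUE,
#           name = "D_clustered", datasources = conns)
```

Supplying `name` in this way would leave the original table untouched and create a new server-side object carrying the extra 'kmeans.cluster' column.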

Details

This implementation of k-means is essentially a parallel k-means in which each server acts as a thread. It can be applied because the results passed to the master (client) are not disclosive: they are aggregated values that cannot be traced back to the original data. The assignment vector is not disclosive either, since all the information that can be extracted from it is the same as that given by the ds.summary function. For more information on the implementation, please refer to 'Parallel K-Means Clustering Algorithm on DNA Dataset' by Fazilah Othman, Rosni Abdullah, Nur’Aini Abdul Rashid and Rosalina Abdul Salam.
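
The idea of servers acting as threads that return only aggregates can be illustrated with a small local simulation in base R. This is a sketch of the general scheme, not the code used by the package: the functions `make_slice`, `server_step` and `client_kmeans` are hypothetical, the two "servers" are just in-memory matrices, and the stopping rule (largest absolute centroid shift below `convergence`) is an assumption about how the threshold might be applied.

```r
set.seed(1)

# Two hypothetical "servers", each holding a private slice of the same table.
make_slice <- function(n) {
  rbind(
    matrix(rnorm(n * 2, mean = 0), ncol = 2),
    matrix(rnorm(n * 2, mean = 5), ncol = 2)
  )
}
servers <- list(study1 = make_slice(50), study2 = make_slice(80))

# Server side: assign each row to its nearest centroid (Euclidean distance)
# and return only per-cluster sums and counts, never the rows themselves.
# 'centroids' uses the layout of the `centroids` argument:
# one column per centroid, one row per variable.
server_step <- function(data, centroids) {
  k <- ncol(centroids)
  # Squared distances: rows = observations, columns = centroids
  d2 <- sapply(seq_len(k), function(j) {
    rowSums(sweep(data, 2, centroids[, j])^2)
  })
  cluster <- max.col(-d2, ties.method = "first")
  sums <- sapply(seq_len(k), function(j) {
    colSums(data[cluster == j, , drop = FALSE])
  })
  list(sums = sums, counts = tabulate(cluster, nbins = k))
}

# Client side: pool the aggregates and update the centroids.
client_kmeans <- function(servers, centroids,
                          convergence = 0.001, max.iter = 100) {
  for (iter in seq_len(max.iter)) {
    parts  <- lapply(servers, server_step, centroids = centroids)
    sums   <- Reduce(`+`, lapply(parts, `[[`, "sums"))
    counts <- Reduce(`+`, lapply(parts, `[[`, "counts"))
    # New centroid = pooled sum / pooled count; keep the old one if empty
    updated <- centroids
    filled  <- counts > 0
    updated[, filled] <- sweep(sums[, filled, drop = FALSE], 2,
                               counts[filled], "/")
    # Assumed stopping rule: largest absolute centroid shift
    shift <- max(abs(updated - centroids))
    centroids <- updated
    if (shift < convergence) break
  }
  list(centroids = centroids, counts = counts, iterations = iter)
}

start <- cbind(c(1, 1), c(4, 4))  # column = centroid, row = variable
fit <- client_kmeans(servers, start)
fit$centroids
```

In this sketch the client only ever sees a sum vector and a count per cluster from each server, which is what allows the centroid update to be computed without moving individual rows.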