ds.kmeans: K-Means clustering of distributed table

View source: R/ds.kmeans.R

ds.kmeans    R Documentation

K-Means clustering of distributed table

Description

Performs k-means clustering on a distributed table using Euclidean distance.

Usage

ds.kmeans(
  x,
  k = NULL,
  convergence = 0.001,
  max.iter = 100,
  centroids = NULL,
  assign = TRUE,
  name = NULL,
  datasources = NULL
)

Arguments

x

character Name of the data frame on the study server that holds the data used to fit the k-means

k

numeric Integer giving the number of clusters to find

convergence

numeric (default 0.001) Error threshold used to decide when the iterations have converged

max.iter

numeric (default 100) Maximum number of iterations before the algorithm stops

centroids

data frame (default NULL) If NULL, random starting centroids are calculated using the 10th to 90th percentile range. If a value is supplied, those centroids are used to start the algorithm. The supplied data frame must have the following structure:

  • Each column corresponds to a centroid, so 3 columns correspond to a k-means with k = 3

  • Each row corresponds to one variable. The rows must match the columns of the data frame 'x' on the server in both number and order.
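
As a minimal sketch of the layout described above, the following base R snippet builds a starting-centroids data frame for k = 3 and a table with two variables. The variable names (age, bmi), the centroid values, and the object name start_centroids are hypothetical; only the column-per-centroid, row-per-variable shape comes from the description. The row names are added for readability and are an assumption, not a documented requirement.

```r
# Hypothetical: the server table 'x' has two columns, in this order: age, bmi.
vars <- c("age", "bmi")

# One column per centroid (k = 3), one row per variable of 'x'.
start_centroids <- data.frame(
  c1 = c(30, 22),
  c2 = c(50, 27),
  c3 = c(70, 31),
  row.names = vars
)

# Sanity checks: k columns, and as many rows as 'x' has variables.
stopifnot(ncol(start_centroids) == 3, nrow(start_centroids) == length(vars))
print(start_centroids)
```

An object shaped like this would be passed as the centroids argument, together with k = 3.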

assign

bool (default TRUE) If TRUE, the cluster assignments are added to the data frame on the server side

name

character (default NULL) If NULL and assign = TRUE, the original table 'x' is overwritten on the server side with an additional column named 'kmeans.cluster' that contains the results of the k-means. If a value is provided, a new object with that name is created on the server side, holding the values of the original table 'x' plus the new 'kmeans.cluster' column.

datasources

list of DSConnection-class objects (default NULL) obtained after login

Details

This implementation is a parallel k-means in which each server acts as a thread. It can be applied because the results passed to the master (client) are non-disclosive: they are aggregated values that cannot be traced back to individual records. The cluster assignment vector is also non-disclosive, since the only information that can be extracted from it is what the ds.summary function already provides. For more information on the implementation, refer to 'Parallel K-Means Clustering Algorithm on DNA Dataset' by Fazilah Othman, Rosni Abdullah, Nur’Aini Abdul Rashid and Rosalina Abdul Salam.
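
The idea of servers acting as threads can be illustrated with a purely local simulation in base R. This is a sketch under stated assumptions, not the package's code: the helper names server_summary and client_kmeans are hypothetical, the "servers" are plain in-memory data frames, the stopping rule (largest change in any centroid coordinate below the convergence value) is assumed, and empty clusters simply keep their previous centroid. What it shows is that only per-cluster sums and counts need to reach the client.

```r
# Local simulation: each list element stands in for one server's private table.
set.seed(1)
servers <- list(
  s1 = data.frame(a = c(rnorm(20, 0), rnorm(20, 5)),
                  b = c(rnorm(20, 0), rnorm(20, 5))),
  s2 = data.frame(a = c(rnorm(15, 0), rnorm(25, 5)),
                  b = c(rnorm(15, 0), rnorm(25, 5)))
)

# "Server side" (hypothetical helper): assign rows to the nearest centroid by
# Euclidean distance and return only per-cluster sums and counts.
server_summary <- function(df, centroids) {
  x <- as.matrix(df)
  k <- ncol(centroids)
  # Squared distance from every row to every centroid (n x k matrix).
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[, j])^2))
  cl <- max.col(-d, ties.method = "first")
  sums <- sapply(seq_len(k), function(j) colSums(x[cl == j, , drop = FALSE]))
  list(sums = sums, counts = tabulate(cl, nbins = k))
}

# "Client side" (hypothetical helper): pool the aggregates and update centroids.
client_kmeans <- function(servers, centroids, convergence = 0.001, max.iter = 100) {
  for (iter in seq_len(max.iter)) {
    parts <- lapply(servers, server_summary, centroids = centroids)
    sums <- Reduce(`+`, lapply(parts, `[[`, "sums"))
    counts <- Reduce(`+`, lapply(parts, `[[`, "counts"))
    new <- centroids
    keep <- counts > 0
    # Pooled mean per cluster; empty clusters keep their previous centroid.
    new[, keep] <- sweep(sums[, keep, drop = FALSE], 2, counts[keep], "/")
    shift <- max(abs(new - centroids))  # assumed convergence measure
    centroids <- new
    if (shift < convergence) break
  }
  centroids
}

# Starting centroids: columns are centroids, rows are variables.
start <- matrix(c(1, 1, 4, 4), nrow = 2,
                dimnames = list(c("a", "b"), c("c1", "c2")))
fit <- client_kmeans(servers, start)
print(round(fit, 2))
```

Because the pooled sums and counts are identical to what a single machine would compute on the combined data, each iteration gives the same centroid update as ordinary k-means on the full table.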

Value

data frame Where:

  • Each column corresponds to a centroid (1:k)

  • Each row corresponds to a variable of the server data frame


isglobal-brge/dsMLClient documentation built on March 14, 2023, 1:59 p.m.