kmeans: k-means clustering

Description Usage Arguments Details Value Algorithm AUTO Author(s)

Description

KMeans is used to perform k-means clustering, with the k-kmeans++ algorithm supported for picking initial clusters.

Usage

1
2
3
4
5
6
7
8
kmeans.data(x, ...)

KMeans(data, ..., inputs = AUTO, outputs = resul)

KMeansMake(centers, p = 2, m = 1, max.iteration = 25,
           algorithm = "kMeansPP", epsilon = .0001,
           sample.size = 10000, convergence = "relative",
           normalized = FALSE, debug = FALSE)

Arguments

x, data

an object of class "data".

...

arguments passed onto KMeansMake

inputs

which attributes of data to perform the GLA on.

outputs

the name of the result. If not length 1, an error is thrown.

centers

either an integer specifying k or a matrix specifying the initial centers. In the latter case, each row should specify a point with one column per input.

p

the parameter used for the p-Minkowski distance metric. The default is 2, the Euclidean metric. The only other commonly used value is 1, which specifies the Manhattan distance.

m

the parameter used for fuzzy clustering. In the default case of 1, no fuzziness is used. Otherwise, the value should be strictly larger than 1.

max.iteration

the maximum iteration allowed, after which the algorithm stops, regardless of whether convergence conditions have not been reached.

algorithm

which algorithm should be used to select the initial centers, either "k-means++" or "standard".

sample.size

the size of the sample to be used for k-means++.

convergence

the type of convergence to be tested for, either "relative" or "absolute".

epsilon

a numeric value greater than 0 used to test for convergence.

Details

For the purposes of consistency with the stats library implementation of kmeans, an S3 method version of kmeans was created, with the default simply referring to the built-in version; kmeans.data merely passes the call to KMeans and should not be used within other GLAs such as GroupBy.

centers should be a matrix regardless of k or the number of attributes used; i.e., even though a vector could provide the necessary information if either was 1, a matrix should still be used as some decisions within the function are based on class.

Value

An object of class "data" with a single attribute.

Algorithm

First, the initial centers for the clusters are set-up via the following:

Initialization

In the case that the initial cluster centers are not provided by the user, there are 2 methods by which they are randomly generated.

The original algorithm, specified by "standard", is to simply randomly pick k points from the data, performed using reservoir sampling. This option tends to need more iterations for convergence and also has a higher likelihood for sub-optimal results.

The other option, "k-means++", also uses random sampling. It differs by picking centers sequentially with the probability of being chosen as the next center being weighted by the squared distance to the nearest center among those already picked. The running time of this process would then be O(kn) which is infeasible for large data. Instead, a sample of sample.size many points is intially picked and used in place of the whole set of data.

After the initial centers are picked, the iterative step is applied until either convergence has been reached or max.iteration many iterations have been performed. If the initial centers were not supplied, then one iteration is spent on initialization, which is also counted against max.iteration.

Lloyd's algorithm is employed. During each pass over the data, the sum of the points in each cluster is aggregated, as well as the total number of points per cluster. Afterwards, the new cluster ceneter is simply set to the average of the points within that cluster for the current iteration.

Lastly, convergence is checked. Originally, Lloyd's Algorithm was run until the cluster centers were constant. However, this necessitates recording the cluster for each point, requiring O(n) space. Instead, there is merely a check that the cluster centers have not moved more than epsilon from their position after the previous iteration. If relative convergence was specified, the distance for a given center is first divided by its previous positive before taking the modulus to compute the distance.

AUTO

In the case of inputs = AUTO, all attributes of the data are used.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC


tera-insights/gtStats documentation built on May 31, 2019, 8:36 a.m.