Home

/

GitHub

/

kmeans: k-means clustering
In tera-insights/gtStats: Higher Level Statistical Functions for Grokit

Description Usage Arguments Details Value Algorithm AUTO Author(s)

KMeans is used to perform k-means clustering, with the k-kmeans++ algorithm supported for picking initial clusters.

kmeans.data(x, ...)

KMeans(data, ..., inputs = AUTO, outputs = resul)

KMeansMake(centers, p = 2, m = 1, max.iteration = 25,
           algorithm = "kMeansPP", epsilon = .0001,
           sample.size = 10000, convergence = "relative",
           normalized = FALSE, debug = FALSE)

`x, data`	an object of class `"data"`.
`...`	arguments passed onto `KMeansMake`
`inputs`	which attributes of `data` to perform the GLA on.
`outputs`	the name of the result. If not length 1, an error is thrown.
`centers`	either an integer specifying `k` or a matrix specifying the initial centers. In the latter case, each row should specify a point with one column per input.
`p`	the parameter used for the `p`-Minkowski distance metric. The default is 2, the Euclidean metric. The only other commonly used value is 1, which specifies the Manhattan distance.
`m`	the parameter used for fuzzy clustering. In the default case of 1, no fuzziness is used. Otherwise, the value should be strictly larger than 1.
`max.iteration`	the maximum iteration allowed, after which the algorithm stops, regardless of whether convergence conditions have not been reached.
`algorithm`	which algorithm should be used to select the initial centers, either `"k-means++"` or `"standard"`.
`sample.size`	the size of the sample to be used for k-means++.
`convergence`	the type of convergence to be tested for, either `"relative"` or `"absolute"`.
`epsilon`	a numeric value greater than 0 used to test for convergence.

For the purposes of consistency with the stats library implementation of kmeans, an S3 method version of kmeans was created, with the default simply referring to the built-in version; kmeans.data merely passes the call to KMeans and should not be used within other GLAs such as GroupBy.

centers should be a matrix regardless of k or the number of attributes used; i.e., even though a vector could provide the necessary information if either was 1, a matrix should still be used as some decisions within the function are based on class.

An object of class "data" with a single attribute.

First, the initial centers for the clusters are set-up via the following:

Initialization

In the case that the initial cluster centers are not provided by the user, there are 2 methods by which they are randomly generated.

The original algorithm, specified by "standard", is to simply randomly pick k points from the data, performed using reservoir sampling. This option tends to need more iterations for convergence and also has a higher likelihood for sub-optimal results.

The other option, "k-means++", also uses random sampling. It differs by picking centers sequentially with the probability of being chosen as the next center being weighted by the squared distance to the nearest center among those already picked. The running time of this process would then be O(kn) which is infeasible for large data. Instead, a sample of sample.size many points is intially picked and used in place of the whole set of data.

After the initial centers are picked, the iterative step is applied until either convergence has been reached or max.iteration many iterations have been performed. If the initial centers were not supplied, then one iteration is spent on initialization, which is also counted against max.iteration.

Lloyd's algorithm is employed. During each pass over the data, the sum of the points in each cluster is aggregated, as well as the total number of points per cluster. Afterwards, the new cluster ceneter is simply set to the average of the points within that cluster for the current iteration.

Lastly, convergence is checked. Originally, Lloyd's Algorithm was run until the cluster centers were constant. However, this necessitates recording the cluster for each point, requiring O(n) space. Instead, there is merely a check that the cluster centers have not moved more than epsilon from their position after the previous iteration. If relative convergence was specified, the distance for a given center is first divided by its previous positive before taking the modulus to compute the distance.

In the case of inputs = AUTO, all attributes of the data are used.

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC

tera-insights/gtStats documentation built on May 31, 2019, 8:36 a.m.

tera-insights/gtStats index

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

tera-insights/gtStats
Higher Level Statistical Functions for Grokit

kmeans: k-means clustering
In tera-insights/gtStats: Higher Level Statistical Functions for Grokit

Description

Usage

Arguments

Details

Value

Algorithm

Initialization

AUTO

Author(s)

Related to kmeans in tera-insights/gtStats...

R Package Documentation

Browse R Packages

We want your feedback!

tera-insights/gtStats Higher Level Statistical Functions for Grokit

kmeans: k-means clustering In tera-insights/gtStats: Higher Level Statistical Functions for Grokit

Description

Usage

Arguments

Details

Value

Algorithm

Initialization

AUTO

Author(s)

Related to kmeans in tera-insights/gtStats...

R Package Documentation

Browse R Packages

We want your feedback!

tera-insights/gtStats
Higher Level Statistical Functions for Grokit

kmeans: k-means clustering
In tera-insights/gtStats: Higher Level Statistical Functions for Grokit