Description Usage Arguments Details Value Algorithm AUTO Author(s)
KMeans
is used to perform k-means clustering, with the
k-kmeans++ algorithm supported for picking initial clusters.
1 2 3 4 5 6 7 8 |
x, data |
an object of class |
... |
arguments passed onto |
inputs |
which attributes of |
outputs |
the name of the result. If not length 1, an error is thrown. |
centers |
either an integer specifying k or a matrix specifying the initial centers. In the latter case, each row should specify a point with one column per input. |
p |
the parameter used for the p-Minkowski distance metric. The default is 2, the Euclidean metric. The only other commonly used value is 1, which specifies the Manhattan distance. |
m |
the parameter used for fuzzy clustering. In the default case of 1, no fuzziness is used. Otherwise, the value should be strictly larger than 1. |
max.iteration |
the maximum iteration allowed, after which the algorithm stops, regardless of whether convergence conditions have not been reached. |
algorithm |
which algorithm should be used to select the initial centers, either
|
sample.size |
the size of the sample to be used for k-means++. |
convergence |
the type of convergence to be tested for, either |
epsilon |
a numeric value greater than 0 used to test for convergence. |
For the purposes of consistency with the stats
library
implementation of kmeans
, an S3 method version of kmeans
was created, with the default simply referring to the built-in
version; kmeans.data
merely passes the call to KMeans
and should not be used within other GLAs such as
GroupBy
.
centers
should be a matrix regardless of k or the number
of attributes used; i.e., even though a vector could provide the
necessary information if either was 1, a matrix should still be used
as some decisions within the function are based on class.
An object of class "data"
with a single attribute.
First, the initial centers for the clusters are set-up via the following:
In the case that the initial cluster centers are not provided by the user, there are 2 methods by which they are randomly generated.
The original algorithm, specified by "standard"
, is to simply
randomly pick k points from the data, performed using reservoir
sampling. This option tends to need more iterations for convergence
and also has a higher likelihood for sub-optimal results.
The other option, "k-means++"
, also uses random sampling. It
differs by picking centers sequentially with the probability of
being chosen as the next center being weighted by the squared
distance to the nearest center among those already picked. The
running time of this process would then be O(kn) which is
infeasible for large data. Instead, a sample of sample.size
many points is intially picked and used in place of the whole
set of data.
After the initial centers are picked, the iterative step is applied
until either convergence has been reached or max.iteration
many iterations have been performed. If the initial centers were not
supplied, then one iteration is spent on initialization, which is
also counted against max.iteration
.
Lloyd's algorithm is employed. During each pass over the data, the sum of the points in each cluster is aggregated, as well as the total number of points per cluster. Afterwards, the new cluster ceneter is simply set to the average of the points within that cluster for the current iteration.
Lastly, convergence is checked. Originally, Lloyd's Algorithm was run
until the cluster centers were constant. However, this necessitates
recording the cluster for each point, requiring O(n) space.
Instead, there is merely a check that the cluster centers have
not moved more than epsilon
from their position after the
previous iteration. If relative convergence was specified, the
distance for a given center is first divided by its previous positive
before taking the modulus to compute the distance.
In the case of inputs = AUTO
, all attributes of the data are
used.
Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.