kmeansClustering: K-Means Clustering
In FCPS: Fundamental Clustering Problems Suite

Description

Perform k-means clustering on a data matrix.

Usage

kmeansClustering(DataOrDistances, ClusterNo,
  Type = 'LBG', RandomNo = 5000, CategoricalData,
  PlotIt = FALSE, Verbose = FALSE, ...)

Arguments

`DataOrDistances`	Either a nonsymmetric [1:n,1:d] data matrix of n cases and d numerical features, or a symmetric [1:n,1:n] distance matrix
`ClusterNo`	A number k which defines k different clusters to be built by the algorithm.
`Type`	Choice of k-means algorithm, currently one of "`Hartigan`" [Hartigan/Wong, 1979], "`LBG`" [Linde et al., 1980], "`Sparse`" (sparse k-means proposed in [Witten/Tibshirani, 2010]), "`Steinley`" (best method of [Steinley/Brusco, 2007], proposed in Steinley 2003), "`Lloyd`" [Lloyd, 1982], "`Forgy`" [Forgy, 1965], "`MacQueen`" [MacQueen, 1967], "`kcentroids`" [Leisch, 2006], "`kprototypes`" [Szepannek, 2018], "`Pelleg-moore`" [Pelleg & Moores,2000], "`Elkan`" [Elkan, 2003], "`kmeans++`" [Arthur & Vassilvitskii], "`Hamerly`" [Hamerly, 2010], "`Dualtree`" or "`Dualtree-covertree`" [Curtin, 2017]. Default: "`LBG`".
`RandomNo`	Only for "`Steinley`" or in case of a distance matrix: number of random initializations that are searched for the minimal SSE, see [Steinley/Brusco, 2007]. Default: 5000.
`CategoricalData`	Only for "`kprototypes`": [1:n,1:m] matrix of categorical features
`PlotIt`	Default: FALSE. If TRUE, plots the first three dimensions of the dataset with colored three-dimensional data points defined by the clustering stored in `Cls`
`Verbose`	Default: FALSE. If TRUE, prints details
`...`	Further arguments such as `iter.max` or `nstart`; for "`kcentroids`" please see the `kcca` function of the flexclust package, and for "`Sparse`" see `KMeansSparseCluster`
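Several of the `Type` choices name algorithms that `kmeans` in the stats package also offers, and there `iter.max` and `nstart` are ordinary arguments. The sketch below calls `stats::kmeans` directly on the built-in `iris` measurements, which serve only as stand-in data; it illustrates what these two arguments control and makes no claim about how `kmeansClustering` forwards `...` internally.

```r
# Illustration with stats::kmeans only; this is not the FCPS wrapper itself.
set.seed(1)
x <- as.matrix(iris[, 1:4])   # built-in data set, used here as a stand-in

# iter.max caps the number of iterations of a single run;
# nstart repeats the run from several random initializations
# and keeps the solution with the smallest total SSE.
fit_lloyd <- kmeans(x, centers = 3, algorithm = "Lloyd",
                    iter.max = 100, nstart = 10)
fit_hw    <- kmeans(x, centers = 3, algorithm = "Hartigan-Wong",
                    iter.max = 100, nstart = 10)

cls_lloyd <- fit_lloyd$cluster   # one cluster label per row of x
cls_hw    <- fit_hw$cluster
```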

Details

Uses either the stats package function 'kmeans', the cclust package implementation, the flexclust package implementation, or own code. In case of a distance matrix, RandomNo should be significantly lower than 5000; otherwise a long computation time is to be expected.
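The idea behind `RandomNo`, repeating the clustering from many random initializations and keeping the solution with the minimal SSE, can be sketched in base R. The helper `best_of_random_starts` and its argument `random_no` are hypothetical names introduced for this illustration; the code is not the FCPS implementation and uses `stats::kmeans` on the built-in `iris` data as a stand-in.

```r
# Hypothetical helper, not part of FCPS: run k-means from several random
# initializations and keep the run with the smallest total within-cluster SSE.
best_of_random_starts <- function(x, k, random_no = 10) {
  ux   <- unique(x)   # initial centers must be distinct rows
  best <- NULL
  for (i in seq_len(random_no)) {
    # Draw k distinct cases as initial centers.
    centers <- ux[sample(nrow(ux), k), , drop = FALSE]
    fit <- kmeans(x, centers = centers, iter.max = 100)
    if (is.null(best) || fit$tot.withinss < best$tot.withinss) {
      best <- fit
    }
  }
  best
}

set.seed(42)
x    <- as.matrix(iris[, 1:4])
best <- best_of_random_starts(x, k = 3, random_no = 20)
```

Each initialization costs one full clustering run, so the running time grows roughly linearly with the number of initializations.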

Value

List V of

Cls

[1:n] numerical vector with n numbers defining the classification as the main output of the clustering algorithm. It has k unique numbers representing the arbitrary labels of the clustering.

Object

Object of the clustering algorithm used, if existent; otherwise:

SumDistsToCentroids: Vector of within-cluster sums of squares, one component per cluster

Centroids

The final cluster centers.
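To make the relation between the label vector, the cluster centers and the within-cluster sums of squares concrete, the following sketch recomputes both quantities from a label vector in base R. It uses `stats::kmeans` on the built-in `iris` data as a stand-in; the variable names are chosen for illustration and are not taken from the FCPS source.

```r
set.seed(7)
x   <- as.matrix(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 5)
cls <- fit$cluster            # [1:n] vector of cluster labels

# Centroids: mean of every feature within every cluster (k x d matrix,
# rows named by cluster label).
centroids <- apply(x, 2, function(col) tapply(col, cls, mean))

# Within-cluster sum of squares: squared Euclidean distance of each case
# to the centroid of its own cluster, summed per cluster.
sq_dist <- rowSums((x - centroids[as.character(cls), , drop = FALSE])^2)
wss     <- as.vector(tapply(sq_dist, cls, sum))
```

For a converged run, `centroids` and `wss` agree with `fit$centers` and `fit$withinss` up to numerical tolerance.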

Note

The version using a distance matrix is still in the test phase and not yet verified.

Author(s)

Michael Thrun

References

[Hartigan/Wong, 1979] Hartigan, J. A., & Wong, M. A.: Algorithm AS 136: A k-means clustering algorithm, Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 28(1), pp. 100-108. 1979.

[Linde et al., 1980] Linde, Y., Buzo, A., & Gray, R.: An algorithm for vector quantizer design, IEEE Transactions on communications, Vol. 28(1), pp. 84-95. 1980.

[Steinley/Brusco, 2007] Steinley, D., & Brusco, M. J.: Initializing k-means batch clustering: A critical evaluation of several techniques, Journal of Classification, Vol. 24(1), pp. 99-121. 2007.

[Forgy, 1965] Forgy, E. W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications, Biometrics, Vol. 21, pp. 768-769. 1965.

[MacQueen, 1967] MacQueen, J.: Some methods for classification and analysis of multivariate observations, Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1, pp. 281-297, Oakland, CA, USA, 1967.

[Pelleg & Moores,2000] Pelleg, Dan, and Andrew W. Moore. X-means: Extending k-means with efficient estimation of the number of clusters, ICML. Vol. 1. 2000.

[Elkan, 2003] Elkan, Charles: Using the triangle inequality to accelerate k-means, In Tom Fawcett and Nina Mishra, editors, ICML, Vol. 3, pp. 147-153. AAAI Press, 2003.

[Lloyd, 1982] Lloyd, S.: Least squares quantization in PCM, IEEE transactions on information theory, Vol. 28(2), pp. 129-137. 1982.

[Leisch, 2006] Leisch, F.: A toolbox for k-centroids cluster analysis, Computational Statistics & Data Analysis, Vol. 51(2), pp. 526-544. 2006.

[Arthur & Vassilvitskii] Arthur, David, and Vassilvitskii, Sergei: k-means++: The advantages of careful seeding, Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. 2007.

[Witten/Tibshirani, 2010] Witten, D. and Tibshirani, R.: A Framework for Feature Selection in Clustering. Journal of the American Statistical Association, Vol. 105(490), pp. 713-726, 2010.

[Hamerly, 2010] Hamerly, Greg: Making k-means even faster, Proceedings of the 2010 SIAM international conference on data mining, Society for Industrial and Applied Mathematics, pp. 130-140, 2010.

[Szepannek, 2018] Szepannek, G.: clustMixType: User-Friendly Clustering of Mixed-Type Data in R, The R Journal, Vol. 10/2, pp. 200-208, doi:10.32614/RJ-2018-048, 2018.

[Curtin, 2017] Curtin, Ryan R: A dual-tree algorithm for fast k-means clustering with large k, Proceedings of the 2017 SIAM International Conference on Data Mining, Society for Industrial and Applied Mathematics, 2017.

Examples

library(FCPS)

# Data matrix input with the default Type = 'LBG'
data('Hepta')
out = kmeansClustering(Hepta$Data, ClusterNo = 7, PlotIt = FALSE)

# Distance matrix input; as expected, k-means does not perform well
# for non-spherical cluster structures
data('Leukemia')
out = kmeansClustering(Leukemia$DistanceMatrix, ClusterNo = 6, RandomNo = 10, PlotIt = TRUE)

# Steinley initialization
out = kmeansClustering(Hepta$Data, ClusterNo = 7, PlotIt = FALSE, Type = "Steinley")

# k-prototypes with the class labels passed as a categorical feature
out = kmeansClustering(Hepta$Data, ClusterNo = 7,
  Type = "kprototypes", CategoricalData = as.matrix(Hepta$Cls))
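Because the cluster labels are arbitrary, a result is usually compared with a known classification through a contingency table rather than by direct equality of labels. The sketch below shows this in base R, with `stats::kmeans` and the built-in `iris` data standing in for an FCPS result and its prior classification; `purity` is an illustrative summary chosen here, not something the package returns.

```r
set.seed(3)
x     <- as.matrix(iris[, 1:4])
truth <- as.integer(iris$Species)   # known classes, stand-in for a prior classification
cls   <- kmeans(x, centers = 3, nstart = 10)$cluster

# Rows: cluster labels; columns: known classes. The label numbers need not
# agree, so look for one dominant cell per row rather than for a diagonal.
tab <- table(Cluster = cls, Class = truth)
print(tab)

# Share of cases that fall into the dominant class of their cluster.
purity <- sum(apply(tab, 1, max)) / sum(tab)
```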

FCPS documentation built on Oct. 19, 2023, 5:06 p.m.