```r
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 7,
  fig.height = 4,
  comment = "#>"
)
```
This vignette introduces the package kMeans, an implementation of the k-means algorithm in the statistical programming language R. The algorithm belongs to the field of unsupervised learning since the processed data do not come with any labels; in other words, it investigates the underlying pattern of the data without an outcome variable. The main application of the k-means algorithm is cluster analysis, which is characterized by grouping similar data points into clusters.
Due to its popularity, there already exists an implementation of the algorithm in R's stats package, namely the function `kmeans()`. Taking a closer look at the function's body (accessed via `body(kmeans)`) reveals that it makes several calls to compiled code written in C or Fortran that has been loaded into R. While this improves the function's runtime, it is not possible for a user to investigate the details of R's k-means implementation.
The package kMeans presented here addresses this shortcoming by implementing the k-means algorithm without calling compiled code. Furthermore, integrated S3 object orientation expands the scope of usable methods and thus leads to more functionality compared to the existing implementation. Notably, the installation of additional packages is not required.
The underlying methodology of the k-means algorithm is outlined in section 2 of this vignette. In section 3 the implementation of the method is discussed, and in section 4 the creation of S3 methods is explained. Afterwards, an example is shown and its results are compared to the existing function `kmeans()` in section 5. Finally, the advantages and disadvantages of the package are discussed in section 6.
The main application of the k-means algorithm is cluster analysis, which aims to measure the similarity of data points and group the observations into clusters. Lloyd [-@Lloyd] and Forgy [-@Forgy] proposed the same algorithm to achieve this goal, Lloyd considering a discrete and Forgy a continuous data distribution. While the methodological findings are equally attributable to both of them, for simplicity only Lloyd will be referred to in the following.
Formally, the algorithm groups the multidimensional data set $X_{n \times p} = (\mathbf{x}_{1}, \dots, \mathbf{x}_{n})^{T}$, where each observation $\mathbf{x}_{i} \in \mathbb{R}^{p}$, $i = 1, \dots, n$, into $K$ different clusters. Each cluster is associated with a centroid $\mathbf{c}^{(t)}_{k}$, where $\mathbf{c}^{(t)}_{k} \in \mathbb{R}^{p}$ for all $k = 1, \dots, K$. The matrix $C^{(t)}_{K \times p}$ combines the centroids row-wise. Since the algorithm has an iterative nature, the superscript is determined by the number of iterations $t$. Assuming that the starting centroids matrix $C^{(1)}$ is given, the algorithm can be partitioned into two steps.
The degree of dissimilarity is measured by computing the squared Euclidean distance between each data point and each centroid, $$d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k}) = \sum_{l=1}^{p} (x_{il} - c^{(t)}_{kl})^2,$$ where $\mathbf{x}_{i}$ is the $i$-th observation and $\mathbf{c}^{(t)}_{k}$ is the $k$-th centroid at iteration $t$. The repeated computation results in a distance matrix $D^{(t)}_{n \times K}$; for example, the element with the indices $(4,2)$ is the squared Euclidean distance between observation $i = 4$ and the centroid $k = 2$ at iteration $t$, given by $d(\mathbf{x}_{4}, \mathbf{c}^{(t)}_{2})$.
Then, each observation is assigned to the cluster whose centroid has the smallest squared Euclidean distance. The cluster $k$ at iteration $t$ is described by $G^{(t)}_{k}$: $$G^{(t)}_{k} = \left\{ \mathbf{x}_{i} \,\middle|\, d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k}) \leq d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k^{*}}), \quad \forall k^{*} = 1, \dots, K \right\}$$ The newly generated clusters are needed for the next update step.
After (re-)assigning the data points, the means of the newly created clusters are computed as $$\mathbf{c}^{(t+1)}_{k} = \frac{1}{|G^{(t)}_{k}|} \sum_{\mathbf{x}_{i} \in G^{(t)}_{k}} \mathbf{x}_{i},$$ where $|G^{(t)}_{k}|$ is the cardinality of cluster $k$ at iteration $t$. If $\mathbf{c}^{(t)}_{k} = \mathbf{c}^{(t+1)}_{k}$ for all $k = 1, \dots, K$, the algorithm has converged. At this point, another iteration would not lead to changes, neither in the cluster assignments nor in the centroids. In case of non-convergence, a further iteration is run, in which the distance matrix $D^{(t+1)}$ is based on $C^{(t+1)}$.
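To make the two steps concrete, the following minimal R example runs a single assignment and update step by hand. The toy data and starting centroids are made up purely for illustration.

```r
# Toy data: 5 observations in 2 dimensions and K = 2 starting centroids.
X <- matrix(c(0, 0,  0, 1,  1, 0,  5, 5,  6, 5), ncol = 2, byrow = TRUE)
C <- X[c(1, 4), ]  # starting centroids C^(1), drawn from the data

# Assignment step: squared Euclidean distance of every point to each
# centroid, then assign each point to the closer one.
D <- sapply(1:2, function(k) rowSums(t(t(X) - C[k, ])^2))
G <- apply(D, 1, which.min)  # cluster labels, here 1 1 1 2 2

# Update step: the new centroid of each cluster is the mean of its members.
Cnew <- t(sapply(split(as.data.frame(X), G), colMeans))
```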
Because the entire data set is used at once to update the centroids, the presented algorithm is sometimes called the "batch" k-means algorithm.
For flexible usage of the k-means algorithm, the package kMeans contains two refinements: an option to randomly draw starting centroids and an iteration counter.
While the starting centroids matrix $C^{(1)}$ may be provided, it does not have to be. It is also possible to define only the number of groups that are to be generated; if so, the starting centroids are chosen at random from the data matrix $X$. The starting centroids therefore correspond to data points before any iteration is run. The advantage of the random draw predominantly lies in its programming implication: it is impossible to generate empty clusters.
However, using random starting centroids does not come without shortcomings. Different starting centroids can lead to different stable (i.e. converged) solutions. For robustness, the algorithm can be run several times, each time using different randomly drawn starting centroids. This leads to several different results, of which the best one should be selected. How this selection works is discussed in section 3 (cp. `kMeansLloyd()`).
In case of circular data, the algorithm may need many iterations to achieve convergence. Since this behavior may not always be desired, an iteration counter is added to the algorithm: if the algorithm exceeds the maximum number of iterations, it is forced to stop.
The implementation of the k-means algorithm consists of three core functions, namely `kMeansLloyd()`, `lloyd()` and `distances()`. The functions build on each other in the order named: while the function `kMeansLloyd()` may repeatedly call the function `lloyd()`, the latter in turn calls the function `distances()`. To simplify a cluster analysis, only the first one is exported. The function `distances()` for obtaining the distance matrix $D$ is given by:
```r
getFromNamespace("distances", "kMeans")
```
Since the function is not exported, it can be printed using `getFromNamespace("distances", "kMeans")`. The correct type of the inputs is checked in the function `kMeansLloyd()`. It guarantees that `x` is a numeric data matrix $X$ and `y` is a matrix containing the centroids $C$. While the computation of the squared Euclidean distances could be achieved with loops over both the observations and the clusters, the loop over the observations is avoided by using matrix algebra (cp. `rowSums(t(t(x) - y[i, ])^2)`). By exploiting R's recycling behavior and suitable pre-allocation, the runtime of the computation is reduced. The return value `distMatrix` corresponds to the distance matrix $D$ described in section 2.
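Because the chunk above only prints the function when the vignette is built, the following sketch illustrates what such a helper could look like. The names `x`, `y` and `distMatrix` follow the vignette, but the body is an illustrative reconstruction, not the package source.

```r
# Illustrative sketch of a distances-style helper (not the package source).
# x: numeric data matrix (n x p), y: centroid matrix (K x p).
distancesSketch <- function(x, y) {
  distMatrix <- matrix(0, nrow = nrow(x), ncol = nrow(y))  # pre-allocation
  for (i in seq_len(nrow(y))) {
    # t(t(x) - y[i, ]) subtracts the i-th centroid from every row of x
    # via R's recycling; rowSums() then yields the squared distances.
    distMatrix[, i] <- rowSums(t(t(x) - y[i, ])^2)
  }
  distMatrix
}
```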
In the function `lloyd()`, the function `distances()` and its return value are used to assign the data points to the clusters. So apart from the distance computation, the k-means algorithm is performed by the function `lloyd()`, which is implemented as:
```r
getFromNamespace("lloyd", "kMeans")
```
Again, there is no need to check the correctness of the inputs because this is done in the outer function `kMeansLloyd()`. The function `lloyd()` processes the numeric data matrix `z`, the initial centroids `centers` and the maximum number of iterations `maxIterations`. After initialization, the algorithm runs as long as the expression `!conv && (iter <= maxIterations)` returns the value `TRUE`. Once the distances are computed, the call `apply(distanceMatrix, 1, which.min)` gives, for each observation, the index of the column in which the smallest distance is located. As stated in section 2.1, this index corresponds to the assigned cluster. If the user defines a centroid matrix (i.e. the starting points are not randomly drawn from $X$), a cluster can end up empty. In that case, subsequent operations would issue an undifferentiated error; to make the problem more traceable, a more explicit error message is implemented.
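A small self-contained illustration of this assignment step (the toy values are made up):

```r
# Toy distance matrix for 4 observations and 2 clusters.
distanceMatrix <- matrix(c(0.1, 2.0, 3.5, 0.4,
                           1.2, 0.3, 0.2, 5.0), ncol = 2)
# For each row (observation), the column index of the smallest distance
# is the assigned cluster: here 1, 2, 2, 1.
clusters <- apply(distanceMatrix, 1, which.min)
```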
After cluster allocation, the updated centroids have to be computed. To avoid a `for`-loop, the data matrix `z` is converted into a data frame and split (using `split()`) into the clusters, and the updated centroids are computed by calling `sapply()`. Subsequently, the convergence of the algorithm is checked (`isTRUE(all.equal(centers, centroidsNew))`). If convergence is reached, the variable `conv` is set to `TRUE`; if not, `iter` is incremented by one and the centroids are set to the updated centroids (`centers <- centroidsNew`). In case the `while`-loop runs until the maximum number of iterations is reached, a warning message is printed.
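The described `split()`/`sapply()` idiom for the centroid update can be sketched as follows; the object names mirror the vignette, but the snippet is only an illustration.

```r
# Example data: 6 observations in 2 dimensions, assigned to 2 clusters.
z <- matrix(rnorm(12), ncol = 2)
clusters <- c(1, 1, 2, 2, 1, 2)
# Split the rows by cluster and compute the column means per group;
# sapply() returns a p x K matrix, so t() yields the K x p centroids.
centroidsNew <- t(sapply(split(as.data.frame(z), clusters), colMeans))
```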
Next, the within sum of squares `withinSS` and their sum `withinTot` are computed. The latter is needed for choosing the best result if the algorithm is performed with several random starts, i.e. if the starting centroids are specified by an integer.
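Continuing the small example above, these quantities could be computed as follows; this is an assumption about the implementation, shown only to make the definitions concrete.

```r
# Within sum of squares per cluster: squared distances of each point
# to its own centroid, summed within each group.
withinSS <- sapply(seq_len(nrow(centroidsNew)), function(k) {
  sum(t(t(z[clusters == k, , drop = FALSE]) - centroidsNew[k, ])^2)
})
withinTot <- sum(withinSS)  # total within sum of squares
```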
The function `lloyd()` returns a list with the elements `cluster`, `centroids`, `iterations`, `withinSS` and `withinTot`. These elements are also returned by the function `kMeansLloyd()`, which is given as:
```r
library(kMeans)
kMeansLloyd
```
The exported function `kMeansLloyd()` processes the four arguments `x`, `centroids`, `maxIter` and `nStart`. The argument `x` denotes the data matrix and `centroids` the number of clusters or the starting centroids. The argument `maxIter` determines the maximum number of iterations and `nStart` the number of random sets that are drawn. In the case where `centroids` is a number, several calls of `kMeansLloyd()` can lead to distinct stable results due to the randomness of the starting points (cp. Randomly Drawn Starting Centroids). To get a robust solution, it is recommended to randomly draw several sets of starting centroids and choose the best returned solution. The term "best" refers to the total within sum of squares `withinTot` calculated in the function `lloyd()`: since the cluster analysis aims to group similar data points, the best solution is given by the minimal total within sum of squares.
The body of the function begins with the retrieval of the dimensions of `x` and several checks of the correctness of the inputs. (Note: Great effort is put into increasing the usability of the function `kMeansLloyd()`. For this purpose, a comprehensive error/warning system is implemented, albeit this may slightly reduce the readability of the function.)
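For instance, the checks at the top of the body might resemble the following sketch; the exact conditions and messages in the package may differ.

```r
# Illustrative input checks (the package's actual messages may differ).
if (!is.matrix(x) || !is.numeric(x)) {
  stop("'x' must be a numeric matrix.")
}
if (!is.numeric(maxIter) || length(maxIter) != 1L || maxIter < 1) {
  stop("'maxIter' must be a single positive number.")
}
```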
The subsequent code section determines the centroids matrix $C$ that has to be passed to the algorithm `lloyd()`. There are two possibilities:
Integer case: If a number is matched to the argument `centroids`, it is stored in the variable `k`. Then, the centroid matrix $C$ is defined by `k` random data points from the set of unique rows of $X$ (cp. `indic <- unique(x)` and `centroids <- indic[sample.int(numUniquePoints, k), , drop = FALSE]`). The argument `drop` of the function `[` is set to `FALSE` in order to allow the case $k = 1$. It should be noted that the variable `indic` (short for indicator) is different from `NULL` exactly when the actual argument of `centroids` is of length 1.
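A small stand-alone illustration of this sampling step, using made-up data:

```r
x <- matrix(rnorm(40), ncol = 4)  # hypothetical data matrix
k <- 3                            # number of clusters
indic <- unique(x)                # unique rows of X
numUniquePoints <- nrow(indic)
# drop = FALSE keeps the matrix structure even for k = 1.
centroids <- indic[sample.int(numUniquePoints, k), , drop = FALSE]
```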
Matrix case: If a matrix is matched to the argument `centroids`, only the correctness of the inputs needs to be checked (dimensions and uniqueness). In the matrix case the variable `indic` is set to `NULL`.
After receiving the starting centroids $C$, the algorithm can be run and the result is stored in `res` (cp. `res <- lloyd(x, centroids, maxIter)`). As stated above, it is recommended to use multiple sets when using randomly drawn centroids. This is implemented by the `if`-clause, which checks the Boolean expression `!is.null(indic) && nStart >= 2L`. Again, the variable `indic` is non-`NULL` if and only if a number is provided for the argument `centroids`. Assuming that `nStart` is bigger than 1, the algorithm draws `nStart` different sets used as starting centroids, and the best result is kept. For this reason the computation of `withinSS` and `withinTot` is part of the function `lloyd()`. Both the integer case (with or without random starts) and the matrix case return a list `res` determined by the `lloyd()` function. Based on this list, the cluster sizes are computed, and column and row names are added to the element `res$centroids`.
The return value of `kMeansLloyd()` is given by the variable `out`. It contains all elements of `res` and, in addition, the cluster sizes and the data. The variable `out` is of class `list` (cp. `out <- res`). In order to implement S3 object orientation, the class of the list is changed to `kMeans`, for which different S3 methods can be defined. The subsequent section deals with the implemented methods.
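In R, this class change amounts to a one-line assignment. A minimal sketch of this step (the extra elements are illustrative):

```r
out <- res                               # all elements returned by lloyd()
out$groupSizes <- tabulate(out$cluster)  # cluster sizes (illustrative)
out$data <- x                            # keep the data, e.g. for plotting
class(out) <- "kMeans"                   # enables S3 dispatch, e.g. print(out)
```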
First, a `print` method is defined:
```r
getFromNamespace("print.kMeans", "kMeans")
```
It produces a more readable output when (auto-)printing the result of `kMeansLloyd()`. Furthermore, a `summary` method is implemented:
```r
getFromNamespace("summary.kMeans", "kMeans")
```
It returns a list containing the elements of an object of class `kMeans`, the total sum of squares `totalSS` and the between sum of squares `betweenSS`. The returned list is of class `kMeansSummary` to allow its own `print` method:
```r
getFromNamespace("print.kMeansSummary", "kMeans")
```
When (auto-)printing an object of class `kMeansSummary`, all important cluster statistics can be seen at a glance. Moreover, a `fitted` method is included. Its implementation is based on the existing `fitted.kmeans()` method.
```r
getFromNamespace("fitted.kMeans", "kMeans")
```
It can be used to calculate the residuals, as exemplified in section 5. The function `match.arg()` matches the `method` argument against the vector `c("centroids", "cluster")`. If `method` is not specified, it takes the first element of the vector.
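A hedged sketch of what such a `fitted` method might look like, modeled on `stats::fitted.kmeans()`; the body is illustrative, not the package source.

```r
# Illustrative fitted method for class "kMeans" (sketch only).
fittedSketch <- function(object, method = c("centroids", "cluster"), ...) {
  method <- match.arg(method)  # defaults to "centroids"
  if (method == "centroids") {
    # one fitted value (the assigned centroid) per observation
    object$centroids[object$cluster, , drop = FALSE]
  } else {
    # the cluster assignment of each observation
    object$cluster
  }
}
```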
Lastly, a `plot` function is implemented as:
```r
getFromNamespace("plot.kMeans", "kMeans")
```
While the package kMeans can be installed and run without any additional packages, the `plot` method requires the package `ggplot2`. If it is not installed (i.e. `!requireNamespace("ggplot2", quietly = TRUE)` is `TRUE`), the `plot()` function issues an error. Otherwise, the dimensions of the data set are retrieved. The function returns a 2-dimensional scatterplot. If the dimension is bigger than 2, the user is required to provide the names of the two variables that shall be plotted. This is implemented using the interactive function `readline()`.
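A minimal sketch of the `ggplot2` guard and the scatterplot call for two-dimensional data; the package's actual aesthetics and the interactive branch are omitted.

```r
# Illustrative plotting sketch for a kMeans object with 2-column data.
plotSketch <- function(object) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("Package 'ggplot2' is required for plotting.")
  }
  df <- data.frame(x = object$data[, 1], y = object$data[, 2],
                   cluster = factor(object$cluster))
  ggplot2::ggplot(df, ggplot2::aes(x = x, y = y, colour = cluster)) +
    ggplot2::geom_point()
}
```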
After outlining the features and the most important expressions of the package `kMeans`, the next section deals with an example analyzing the `iris` data set.
## `iris` Data Set and Validation of the Results

This section illustrates how the package kMeans is to be used, and it compares the results to the function `kmeans()` with `algorithm` set to `"Lloyd"`. (Note: Automated tests are implemented in the scripts `testInput.R` and `testOutput.R`.)
The underlying data set is `iris`. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of three species of iris: Iris setosa, versicolor, and virginica. Calling `head(iris)` reveals that the last column of the data set provides the label, which is nominal. Thus, the data set is restricted to the first four columns. (Note: The data set is standardized before performing the cluster analysis using the `scale()` function.)
```r
scIris <- scale(iris[, -5])
```
The cluster analysis can be performed and the results printed with the following calls. Moreover, a check for equal results is conducted:
```r
# Example 1: integer case (starting centroids have to be drawn randomly)
set.seed(2) # fix randomness
ex1 <- kMeansLloyd(x = scIris, centroids = 3, maxIter = 8, nStart = 5)
set.seed(2)
ref1 <- kmeans(x = scIris, centers = 3, iter.max = 8, nstart = 5,
               algorithm = "Lloyd")

# check equivalence, example 1
check1 <- c(all.equal(ex1$cluster, ref1$cluster),
            all.equal(ex1$centroids, ref1$centers),
            all.equal(ex1$iterations, ref1$iter),
            all.equal(ex1$withinSS, ref1$withinss),
            all.equal(ex1$withinTot, ref1$tot.withinss),
            all.equal(ex1$groupSizes, ref1$size))

# Example 2: matrix case
# define starting centroids by hand
y <- matrix(c(-1, .1, .9, .8, -.6, -.2, -1.3, .3, 1, -1.25, .15, 1),
            ncol = 4)
ex2 <- kMeansLloyd(x = scIris, centroids = y, maxIter = 10)
ref2 <- kmeans(x = scIris, centers = y, iter.max = 10, algorithm = "Lloyd")

# check equivalence, example 2
check2 <- c(all.equal(ex2$cluster, ref2$cluster),
            all.equal(ex2$centroids, ref2$centers),
            all.equal(ex2$iterations, ref2$iter),
            all.equal(ex2$withinSS, ref2$withinss),
            all.equal(ex2$withinTot, ref2$tot.withinss),
            all.equal(ex2$groupSizes, ref2$size))

# print object of class kMeans
print(ex1)
# result check 1
print(all(check1))
# result check 2
print(all(check2))
```
Furthermore, the summary of `ex1` can be computed and the result printed, and a check for equal results is provided:
```r
sumEx1 <- summary(ex1)

# check equivalence of the two added elements
check3 <- c(all.equal(sumEx1$totalSS, ref1$totss),
            all.equal(sumEx1$betweenSS, ref1$betweenss))

print(sumEx1)
print(all(check3))
```
The `fitted` method can be applied as follows:
```r
fitScIris <- fitted(ex1)
# the result can be used to compute the residuals residScIris
residScIris <- scIris - fitScIris
```
Lastly, the `plot` method is illustrated. Since it is interactive when dealing with data sets of more than two dimensions, it cannot be illustrated for `scIris` in this vignette (Markdown cannot handle interactivity). Therefore, a second example is performed, restricting the iris data set to two dimensions.
```r
# call plot(ex1) in the console to see the interactive functioning
# in the multidimensional case
# reduced example (since Markdown cannot handle the interactive character):
scIris2 <- scale(iris[, 3:4])
ex3 <- kMeansLloyd(x = scIris2, centroids = 3, nStart = 5)
plot(ex3)
```
After demonstrating the functioning of the package `kMeans`, the following section discusses the advantages and disadvantages of the implementation.
The advantages of the package `kMeans` lie in its transparency and its implemented S3 methods. All functions are written purely in R, so users without experience in a programming language other than R can comprehend and reconstruct the code. Additionally, the featured interactivity of the `plot` method provides a convenient and easy way to draw a 2-dimensional scatterplot in case of multidimensional data.
A shortcoming of `kMeans` is the limited number of implemented algorithms; the existing function `kmeans()` provides two more. A look at the formal arguments of the existing function (via `formals(kmeans)`) reveals that it can handle the three different algorithms proposed by Lloyd [-@Lloyd] and Forgy [-@Forgy], by MacQueen [-@MacQueen], and by Hartigan and Wong [-@Hartigan]. Additionally, even though great effort has been put into avoiding runtime-increasing operations (e.g. `for`-loops) in `kMeans`, the runtime is slightly higher compared to the existing `kmeans()` function. This can be validated even with a low-dimensional data set ($n = 150,\, p = 4$) grouped into two clusters using the package `microbenchmark`.
```r
library(microbenchmark)
u <- matrix(c(-1, .1, .8, -.6, -1.3, .3, -1.25, .15), ncol = 4)
microbenchmark(kMeansLloyd(scIris, u), unit = "ms")
# vs.
microbenchmark(kmeans(scIris, u, algorithm = "Lloyd"), unit = "ms")
```