knitr::opts_chunk$set(
  collapse = TRUE,
  fig.width = 7,
  fig.height = 4,
  comment = "#>"
)

1. Introduction

This vignette introduces the package kMeans, an implementation of the k-means algorithm in the statistical programming language R. The algorithm belongs to the field of unsupervised learning, since the processed data do not come with any labels; in other words, it investigates the underlying pattern of the data without an outcome variable. The main application of the k-means algorithm is cluster analysis, i.e. grouping similar data points into clusters.

Due to its popularity, there already exists an implementation of the algorithm in R's stats package, namely the function kmeans(). Taking a closer look at the function's body (accessed via body(kmeans)) reveals that it makes several calls to compiled code written in C or Fortran that has been loaded into R. While this improves the function's runtime, it prevents a user from investigating the details of R's k-means implementation.

The package kMeans presented here addresses this shortcoming by implementing the k-means algorithm without calling compiled code. Furthermore, integrated S3 object orientation expands the scope of usable methods and thus provides more functionality than the existing implementation. Notably, the installation of additional packages is not required.

The underlying methodology of the k-means algorithm is outlined in section 2 of this vignette. In section 3 the implementation of the method is discussed and in section 4 the creation of S3 methods is explained. Afterwards, an example is shown and its results are compared to the existing function kmeans() in section 5. Finally, the advantages and disadvantages of the package are discussed in section 6.


2. K-Means Algorithm

As noted above, the main application of the k-means algorithm is cluster analysis: it aims to measure the similarity of data points and to group the observations into clusters. Lloyd [-@Lloyd] and Forgy [-@Forgy] proposed the same algorithm to achieve this goal, Lloyd considering the discrete and Forgy the continuous data distribution. While the methodological findings are equally attributable to both of them, for simplicity only Lloyd will be referred to in the following.

Formally, the algorithm groups the multidimensional data set $X_{n \times p} = \{\mathbf{x}_{1}, \dots, \mathbf{x}_{n}\}^{T}$, where each observation $\mathbf{x}_{i} \in \mathbb{R}^{p}$, $i = 1, \dots, n$, into $K$ different clusters. Each cluster is associated with a centroid $\mathbf{c}^{(t)}_{k}$, where $\mathbf{c}^{(t)}_{k} \in \mathbb{R}^{p}$ for all $k = 1,\dots,K$. The matrix $C^{(t)}_{K \times p}$ combines the centroids row-wise. Since the algorithm is iterative, the superscript denotes the iteration number $t$. Assuming that the starting centroid matrix $C^{(1)}$ is given, the algorithm can be partitioned into two steps.

2.1 The Assignment Step

The degree of dissimilarity is measured by computing the squared Euclidean distance between each data point and each centroid, $$d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k}) = \sum_{l=1}^{p} (x_{il} - c^{(t)}_{kl})^2,$$ where $\mathbf{x}_{i}$ is the $i$-th observation and $\mathbf{c}^{(t)}_{k}$ is the $k$-th centroid at iteration $t$. The repeated computation results in a distance matrix $D_{n \times K}^{(t)}$, where, for example, the element with the indices $(4,2)$ is the squared Euclidean distance between observation $i=4$ and centroid $k = 2$ at iteration $t$, given by $d(\mathbf{x}_{4}, \mathbf{c}_{2}^{(t)})$.

Then, each observation is assigned to the cluster whose centroid has the smallest squared Euclidean distance. The cluster $k$ at iteration $t$ is described by $G^{(t)}_{k}$: $$G^{(t)}_{k} = \{\mathbf{x}_{i} \mid d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k}) \leq d(\mathbf{x}_{i}, \mathbf{c}^{(t)}_{k^{*}}),\quad \forall k^{*}=1,\dots,K\}.$$ The newly generated clusters are needed for the subsequent update step.

2.2 The Update Step

After (re-)assigning the data points, the new means of the newly created clusters are computed as $$\mathbf{c}^{(t+1)}_{k} = \frac{1}{|G^{(t)}_{k}|}\sum_{\mathbf{x}_{i} \in G^{(t)}_{k}}\mathbf{x}_{i},$$ where $|G^{(t)}_{k}|$ is the cardinality of cluster $k$ at iteration $t$. If $\mathbf{c}^{(t)}_{k} = \mathbf{c}^{(t+1)}_{k}$ for all $k = 1,\dots,K$, the algorithm has converged. At this point, another iteration would not lead to changes, neither in the cluster assignments nor in the centroids. In case of non-convergence, a further iteration is run; the distance matrix $D^{(t+1)}$ in this next iteration is based on $C^{(t+1)}$.

Because the entire data set is used at once to update the centroids, the presented algorithm is sometimes called the “batch” k-means algorithm.
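To make the two steps concrete, the following small sketch (purely illustrative and not part of the package code; all object names are chosen here for illustration) performs one assignment and one update step on a toy data set:

# toy data: n = 6 observations, p = 2 variables, grouped into K = 2 clusters
X <- matrix(c(1.0, 1.2, 0.8, 5.0, 5.3, 4.9,
              2.0, 1.9, 2.1, 6.0, 6.2, 5.8), ncol = 2)
C <- X[c(1, 4), ]  # starting centroids taken from the data

# assignment step: squared Euclidean distance of every point to every centroid
D <- cbind(rowSums((X - matrix(C[1, ], 6, 2, byrow = TRUE))^2),
           rowSums((X - matrix(C[2, ], 6, 2, byrow = TRUE))^2))
cluster <- apply(D, 1, which.min)  # index of the closest centroid

# update step: the new centroids are the column means within each cluster
Cnew <- rbind(colMeans(X[cluster == 1, , drop = FALSE]),
              colMeans(X[cluster == 2, , drop = FALSE]))

Here the first three and the last three observations end up in separate clusters after a single pass of the two steps.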

2.3 Further Refinements

For flexible usage of the k-means algorithm, the package kMeans contains two refinements: an option to randomly draw the starting centroids and an iteration counter.

Randomly Drawn Starting Centroids

While the starting centroid matrix $C^{(1)}$ may be provided, it does not have to be. It is also possible to specify only the number of groups to be generated. In that case, the starting centroids are chosen at random from the data matrix $X$, so they correspond to actual data points before running through any iteration. The advantage of the random draw lies predominantly in its programming implication: it is impossible to generate empty clusters.

However, using random starting centroids does not come without shortcomings. Different starting centroids can lead to different stable (i.e. converged) solutions. For robustness, the algorithm can be run several times, each time with different randomly drawn starting centroids. This leads to several different results, of which the best one should be selected. How this selection works is discussed in section 3 (cp. kMeansLloyd()).

Iteration Counter

For unfavorable data configurations (e.g. circular data), the algorithm may need many iterations to achieve convergence. Since this behavior may not be desired, it is recommended to add an iteration counter to the algorithm: if the maximum number of iterations is exceeded, the algorithm is forced to stop.


3. kMeans

The implementation of the k-means algorithm consists of three core functions, namely kMeansLloyd(), lloyd() and distances(). The functions build on each other in the order named: the function kMeansLloyd() may repeatedly call the function lloyd(), which in turn calls the function distances(). To simplify a cluster analysis, only the first of them, kMeansLloyd(), is exported. The function distances() for obtaining the distance matrix $D$ is given by:

getFromNamespace("distances", "kMeans")
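For readers viewing this text without the rendered function body, the following is a minimal sketch of such a pairwise squared-distance computation; it only illustrates the idea and is not necessarily identical to the package's actual body:

distancesSketch <- function(x, y) {
  # x: numeric data matrix (n x p), y: centroid matrix (K x p)
  distMatrix <- matrix(0, nrow = nrow(x), ncol = nrow(y))  # pre-allocate D
  for (i in seq_len(nrow(y))) {
    # t(t(x) - y[i, ]) subtracts the i-th centroid from every observation
    # by exploiting R's recycling along the columns of t(x)
    distMatrix[, i] <- rowSums(t(t(x) - y[i, ])^2)
  }
  distMatrix
}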

Since the function is not exported, it can be printed by using getFromNamespace("distances", "kMeans"). The correct type of the inputs is checked in the function kMeansLloyd(), which guarantees that x is a numeric data matrix $X$ and y is a matrix containing the centroids $C$. While the computation of the squared Euclidean distances could be achieved by looping over both the observations and the clusters, the loop over the observations is avoided by matrix algebra (cp. rowSums(t(t(x) - y[i, ])^2)). By exploiting R's recycling behavior and by suitable pre-allocation, the runtime of the computation is reduced. The return value distMatrix corresponds to the distance matrix $D$ described in section 2.

In the function lloyd() the function distances() and its return value are used to assign the data points to the clusters. So apart from the distance computation, the k-means algorithm is performed by the function lloyd(), which is implemented as:

getFromNamespace("lloyd", "kMeans")

Again, there is no need to check the correctness of the inputs because this is done in the outer function kMeansLloyd(). The function lloyd() processes the numeric data matrix z, the initial centroids centers and the maximum number of iterations maxIterations. After initialization, the algorithm runs as long as the expression !conv && (iter <= maxIterations) returns TRUE. Once the distances are computed, the call apply(distanceMatrix, 1, which.min) gives the index of the column in which the smallest distance is located; as stated in section 2.1, this index corresponds to the cluster. If the user defines a centroid matrix (i.e. the starting points are not randomly drawn from $X$), it can happen that a cluster is empty. In that case, subsequent operations would issue an uninformative error; to make the problem more traceable, a more explicit error message is implemented.

After the cluster allocation, the updated centroids have to be computed. To avoid a for-loop, the data matrix z is converted into a data frame and split (using split()) into the clusters, and the updated centroids are computed by calling sapply(). Subsequently, the convergence of the algorithm is checked (isTRUE(all.equal(centers, centroidsNew))). If convergence is achieved, the variable conv is set to TRUE; otherwise, iter is incremented by one and the centroids are set to the updated centroids (centers <- centroidsNew). If the while-loop runs until the maximum number of iterations is reached without convergence, a warning message is issued.
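The control flow just described can be condensed into the following hedged sketch (not the verbatim package code); it reuses the helper distancesSketch() from above in place of the internal distances():

lloydSketch <- function(z, centers, maxIterations) {
  conv <- FALSE
  iter <- 1L
  while (!conv && (iter <= maxIterations)) {
    distanceMatrix <- distancesSketch(z, centers)    # n x K distance matrix D
    cluster <- apply(distanceMatrix, 1, which.min)   # nearest centroid per observation
    if (length(unique(cluster)) < nrow(centers))
      stop("empty cluster: please provide different starting centroids")
    # updated centroids: column means within each cluster (split() + sapply() instead of a for-loop)
    centroidsNew <- t(sapply(split(as.data.frame(z), cluster), colMeans))
    dimnames(centroidsNew) <- dimnames(centers)      # align names so all.equal() compares values only
    if (isTRUE(all.equal(centers, centroidsNew))) {
      conv <- TRUE
    } else {
      iter <- iter + 1L
      centers <- centroidsNew
    }
  }
  if (!conv) warning("algorithm did not converge within maxIterations")
  list(cluster = cluster, centroids = centers, iterations = iter)
}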

Next, the within-cluster sums of squares withinSS and their sum withinTot are computed. The latter is needed for choosing the best result when the algorithm is performed with several random starts, i.e. when the starting centroids are specified by an integer.
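As a hedged illustration, these two quantities could be obtained along the following lines; the snippet is meant to sit at the end of a lloyd()-style function such as the sketch above, where z, centers and cluster are in scope:

# squared distance of each observation to its own (final) centroid
ownDist   <- distancesSketch(z, centers)[cbind(seq_len(nrow(z)), cluster)]
withinSS  <- as.numeric(tapply(ownDist, cluster, sum))  # one value per cluster
withinTot <- sum(withinSS)                              # criterion for choosing the best random start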

The function lloyd() returns a list with the elements cluster, centroids, iterations, withinSS and withinTot. These elements are also returned by the function kMeansLloyd(), which is given by:

library(kMeans)
kMeansLloyd

The exported function kMeansLloyd() processes the four arguments x, centroids, maxIter and nStart. The argument x denotes the data matrix and centroids the number of clusters or the starting centroids. The argument maxIter determines the maximum number of iterations and nStart the number of random sets that are drawn. In the case where centroids is a number, several calls of kMeansLloyd() can lead to distinct stable results due to the randomness of the starting points (cp. Randomly Drawn Starting Centroids). To get a robust solution, it is recommended to randomly draw several sets of starting centroids and to choose the best returned solution. The term “best” refers to the total within sum of squares withinTot calculated in the function lloyd(): since the cluster analysis aims to group similar data points, the best solution is given by the minimal total within sum of squares.

The body of the function begins with the retrieval of the dimensions of x and several checks of the correctness of the inputs. (Note: Great effort is put into increasing the usability of the function kMeansLloyd(). For this purpose, a comprehensive error/warning system is implemented, albeit this may reduce the readability of the function slightly.)

The subsequent code section determines the centroids matrix $C$ that has to be passed when using the algorithm lloyd(). There are two possibilities:

  1. Integer case: If a number is matched to the argument centroids, it is stored in the variable k. Then, the centroid matrix $C$ is defined by k random data points of the set of unique rows of $X$ (cp. indic <- unique(x) and centroids <- indic[sample.int(numUniquePoints, k), , drop = FALSE]). The argument drop of the function [ is set to FALSE in order to allow the case of $k=1$. It should be noted that the variable indic (for an indicator) is different from NULL when the actual argument of centroids is of length 1.

  2. Matrix case: If a matrix is matched to the argument centroids, only the correctness of the inputs needs to be checked (dimensions and uniqueness). In the matrix case the variable indic is set to NULL.

After receiving the starting centroids $C$, the algorithm can be run and the result is stored in res (cp. res <- lloyd(x, centroids, maxIter)). As stated above, it is recommended to use multiple sets when using randomly drawn centroids. This is implemented by the if-clause, which checks the Boolean expression !is.null(indic) && nStart >= 2L. Again, the variable indic is different from NULL if and only if a number is provided for the argument centroids. Assuming that nStart is bigger than 1, the algorithm draws nStart different sets used as starting centroids, and the best result is retained. For this reason the computation of withinSS and withinTot is part of the function lloyd(). Both the integer case (with or without random starts) and the matrix case return a list res determined by the function lloyd(). Based on this list, the cluster sizes are computed, and column and row names are added to the element res$centroids.
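A hedged sketch of this selection logic (illustrative, not the verbatim package code) could look as follows; the objects x, k, maxIter and nStart are assumed to be available from the function's context, and lloyd() denotes the internal function discussed above:

# integer case: draw the starting centroids from the unique rows of x
indic <- unique(x)
numUniquePoints <- nrow(indic)
centroids <- indic[sample.int(numUniquePoints, k), , drop = FALSE]
res <- lloyd(x, centroids, maxIter)

# several random starts: keep the run with the smallest total within sum of squares
if (!is.null(indic) && nStart >= 2L) {   # in the matrix case indic is NULL, so this branch is skipped
  for (s in seq_len(nStart - 1L)) {
    centroidsS <- indic[sample.int(numUniquePoints, k), , drop = FALSE]
    resS <- lloyd(x, centroidsS, maxIter)
    if (resS$withinTot < res$withinTot) res <- resS  # retain the best solution
  }
}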

The return value of kMeansLloyd() is given by the variable out. It contains all elements of res and, furthermore, the cluster sizes and the data. The variable out is of class list (cp. out <- res). In order to implement S3 object orientation, the class of the list is changed to kMeans, so that dedicated S3 methods can be defined for this class. The subsequent section deals with the implemented methods.


4. S3 Methods

First, a print method is defined:

getFromNamespace("print.kMeans", "kMeans")

It produces a more readable output when (auto-)printing the result of kMeansLloyd(). Furthermore, a summary method is implemented:

getFromNamespace("summary.kMeans", "kMeans")
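For readers not viewing the rendered body, the two additional statistics could be computed roughly as follows, assuming obj is an object of class kMeans with the elements described in section 3:

# total sum of squares: squared deviations of the (column-centered) data from the overall means
totalSS   <- sum(scale(obj$data, center = TRUE, scale = FALSE)^2)
# between sum of squares: total minus total within sum of squares
betweenSS <- totalSS - obj$withinTot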

The summary method returns a list containing the elements of an object of class kMeans, the total sum of squares totalSS and the between sum of squares betweenSS. The returned list is of class kMeansSummary to allow for its own print method:

getFromNamespace("print.kMeansSummary", "kMeans")

When (auto-)printing an object of class kMeansSummary, all important cluster statistics can be seen at a glance. Moreover, a fitted method is included. Its implementation is based on the existing fitted.kmeans() method:

getFromNamespace("fitted.kMeans", "kMeans")
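A hedged sketch of such a method, modelled on stats::fitted.kmeans() and not necessarily identical to the package's body, is:

fittedSketch <- function(object, method = c("centroids", "cluster"), ...) {
  method <- match.arg(method)  # defaults to "centroids"
  if (method == "centroids") {
    # one row of fitted centroid coordinates per observation
    object$centroids[object$cluster, , drop = FALSE]
  } else {
    # the cluster assignment itself
    object$cluster
  }
}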

The fitted method can be used to calculate the residuals, as exemplified in section 5. The function match.arg() matches the argument method against the vector c("centroids", "cluster"); if method is not specified, it takes the first element of the vector. Lastly, a plot method is implemented as:

getFromNamespace("plot.kMeans", "kMeans")
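A hedged skeleton of such a plot method for the 2-dimensional case is shown below; the interactive variable selection via readline() is omitted, and the element name x$data is an assumption based on the description in section 3:

plotSketch <- function(x, ...) {
  if (!requireNamespace("ggplot2", quietly = TRUE))
    stop("the plot method requires the package ggplot2")
  df <- data.frame(v1 = x$data[, 1], v2 = x$data[, 2], cluster = factor(x$cluster))
  ggplot2::ggplot(df, ggplot2::aes(x = v1, y = v2, colour = cluster)) +
    ggplot2::geom_point() +
    ggplot2::labs(x = colnames(x$data)[1], y = colnames(x$data)[2])
}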

While the package kMeans can be installed and run without any additional packages, the plot method requires the package ggplot2. If it is not installed (i.e. !requireNamespace("ggplot2", quietly = TRUE) returns TRUE), the plot() method issues an error. Otherwise, the dimensions of the data set are retrieved and the method returns a picturesque 2-dimensional scatterplot. If the dimension is bigger than 2, the user is required to provide the names of the two variables that shall be plotted; this is implemented by using the interactive function readline(). After outlining the features and the most important expressions of the package kMeans, the next section deals with an example analyzing the iris data set.


5. Cluster Analysis of the iris Data Set and Validation of the Results

This section illustrates how the package kMeans is to be used and compares the results to the function kmeans() with the argument algorithm set to “Lloyd”. (Note: Automated tests are implemented in the scripts testInput.R and testOutput.R.)

The underlying data set is iris. It contains the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of three species of iris. The species are Iris setosa, versicolor, and virginica. Calling head(iris) reveals that the last column of the data set provides the species label, which is nominal. Thus, the data set is restricted to the first four columns. (Note: The data set is standardized before performing the cluster analysis by using the scale() function.)

scIris <- scale(iris[, -5])

The cluster analysis can be performed and the results printed with the following calls; moreover, a check for equal results is conducted:

# Example 1: integer case (starting centroids have to be drawn randomly)
set.seed(2)  # fix randomness
ex1 <- kMeansLloyd(x = scIris, centroids = 3, maxIter = 8, nStart = 5)
set.seed(2)
ref1 <- kmeans(x = scIris, centers = 3, iter.max = 8, nstart = 5, algorithm = "Lloyd")

# check equivalence, example 1
check1 <- c(all.equal(ex1$cluster, ref1$cluster),
            all.equal(ex1$centroids, ref1$centers),
            all.equal(ex1$iterations, ref1$iter),
            all.equal(ex1$withinSS, ref1$withinss),
            all.equal(ex1$withinTot, ref1$tot.withinss),
            all.equal(ex1$groupSizes, ref1$size))

# Example 2: matrix case
# define starting centroids by hand
y <- matrix(c(-1, .1, .9, .8, -.6, -.2, -1.3, .3, 1, -1.25, .15, 1), ncol = 4)
ex2 <- kMeansLloyd(x = scIris, centroids = y, maxIter = 10)
ref2 <- kmeans(x = scIris, centers = y, iter.max = 10, algorithm = "Lloyd")

# check equivalence, example 2
check2 <- c(all.equal(ex2$cluster, ref2$cluster),
            all.equal(ex2$centroids, ref2$centers),
            all.equal(ex2$iterations, ref2$iter),
            all.equal(ex2$withinSS, ref2$withinss),
            all.equal(ex2$withinTot, ref2$tot.withinss),
            all.equal(ex2$groupSizes, ref2$size))

# print object of class kMeans
print(ex1)
# result check 1
print(all(check1))
# result check 2
print(all(check2))

Furthermore, the summary of ex1 can be computed, the result printed, and a check for equal results provided:

sumEx1 <- summary(ex1)

# check equivalence of the two added elements
check3 <- c(all.equal(sumEx1$totalSS, ref1$totss),
            all.equal(sumEx1$betweenSS, ref1$betweenss))

print(sumEx1)
print(all(check3))

The fitted method can be applied as follows:

fitScIris <- fitted(ex1)
# the result can be used to compute the residuals residScIris
residScIris <- scIris - fitScIris

Lastly, the plot method is illustrated. Since it is interactive when dealing with data sets of dimensions bigger than two, it cannot be illustrated for scIris in this vignette (Markdown cannot handle interactivity). Therefore, a second example is performed restricting the iris data set to two dimensions.

# call plot(ex1) in the console to see the interactive functioning in the multidimensional case

# reduced example (since Markdown cannot handle the interactive character):
scIris2 <- scale(iris[,3:4])
ex3 <- kMeansLloyd(x = scIris2, centroids = 3, nStart = 5)
plot(ex3)

After demonstrating the functioning of the package kMeans, the following section discusses the advantages and disadvantages of the implementation.


6. Conclusion

The advantage of the package kMeans lies in its transparency and its implemented S3 methods. All functions are written purely in R, so users without experience in any programming language other than R can comprehend and reconstruct the code. Additionally, the featured interactivity of the plot method provides a convenient and easy way to draw a 2-dimensional scatterplot in case of multidimensional data.

A shortcoming of kMeans is the limited number of implemented algorithms; the existing function kmeans() provides two more. A look at the formal arguments of the existing function (via formals(kmeans)) reveals that it can handle the three different algorithms proposed by Lloyd [-@Lloyd]/Forgy [-@Forgy], MacQueen [-@MacQueen], and Hartigan and Wong [-@Hartigan]. Additionally, even though great effort has been put into avoiding runtime-increasing operations (e.g. for-loops) in kMeans, the runtime is slightly higher compared to the existing kmeans() function. This can be validated even with a low-dimensional data set ($n = 150,\,p = 4$) grouped into two clusters using the package microbenchmark.

library(microbenchmark)
u <- matrix(c(-1, .1, .8, -.6, -1.3, .3, -1.25, .15), ncol = 4)
microbenchmark(kMeansLloyd(scIris, u), unit = "ms")
# vs.
microbenchmark(kmeans(scIris, u, algorithm = "Lloyd"), unit = "ms")

References


