Get the List of Classes From A Clustering Algorithm

Description

Unsupervised clustering algorithms, such as partitioning around medoids (pam), K-means (kmeans), or hierarchical clustering (hclust) after cutting the tree, return class assignments bundled with other output. To simplify the interface for the BootstrapClusterTest and PerturbationClusterTest, these routines extract only the cluster assignments.

Usage

cutHclust(data, k, method = "average", metric = "pearson")
cutPam(data, k)
cutKmeans(data, k)
cutRepeatedKmeans(data, k, nTimes)

repeatedKmeans(data, k, nTimes)

Arguments

data

A numerical data matrix

k

The number of classes desired from the algorithm

method

Any valid linkage method that can be passed to the hclust function

metric

Any valid distance metric that can be passed to the distanceMatrix function

nTimes

An integer; the number of times to repeat the K-means algorithm with a different random starting point

Details

Each of the clustering routines used here stores cluster assignments differently. The kmeans function returns them in the ‘cluster’ component of its result, and the pam function returns them in the ‘clustering’ component. For hclust, the assignments are produced by a call to the cutree function.
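The following is a minimal sketch of where each underlying routine keeps its assignments. It calls kmeans, pam, and hclust directly rather than the cut... wrappers, and it uses Euclidean distance from dist as a stand-in for the distanceMatrix function, so the data, the seed, and the variable names are illustrative assumptions, not the package's own code.

```r
library(cluster)  # provides pam(); kmeans, hclust, cutree come from stats

set.seed(1)
# toy data (assumption): 60 objects in rows, 4 measurements in columns
x <- matrix(rnorm(60 * 4), nrow = 60, ncol = 4)

# K-means: assignments are in the 'cluster' component
km <- kmeans(x, centers = 3)
km_classes <- km$cluster

# PAM: assignments are in the 'clustering' component
pm <- pam(x, k = 3)
pm_classes <- pm$clustering

# Hierarchical clustering: the tree must be cut to obtain assignments
hc <- hclust(dist(x), method = "average")
hc_classes <- cutree(hc, k = 3)
```

In all three cases the result is an integer vector with one entry per clustered object, which is the common form the cut... functions are documented to return.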

The K-means algorithm can converge to different solutions depending on the starting values of the group centers. We therefore also include a routine (repeatedKmeans) that runs the K-means algorithm repeatedly, using different randomly generated starting points each time, and saves the best result.
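The idea of repeating K-means and keeping the best fit can be sketched as follows. The function name repeated_kmeans_sketch is hypothetical and this is not the package's implementation of repeatedKmeans; it assumes that "best" means the smallest total within-group sum of squares, and it stores the fitted centers of the winning run.

```r
# Hypothetical helper, for illustration only (not the package's repeatedKmeans)
repeated_kmeans_sketch <- function(data, k, nTimes) {
  best <- NULL
  for (i in seq_len(nTimes)) {
    # each call to kmeans picks k random rows as its starting centers
    fit <- kmeans(data, centers = k)
    # keep the run with the smallest total within-group sum of squares
    if (is.null(best) || fit$tot.withinss < best$tot.withinss) {
      best <- fit
    }
  }
  # assumption: mirror the three documented components of the return value
  list(kmeans = best, centers = best$centers, withinss = best$tot.withinss)
}

set.seed(2)
x <- matrix(rnorm(60 * 4), nrow = 60, ncol = 4)
res <- repeated_kmeans_sketch(x, k = 3, nTimes = 5)
classes <- res$kmeans$cluster
```

The kmeans function in the stats package also accepts an nstart argument, which performs a similar multi-start search internally.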

Value

Each of the cut... functions returns a vector of integer values representing the cluster assignments found by the algorithm.

The repeatedKmeans function returns a list x with three components. The component x$kmeans is the result of the call to the kmeans function that produced the best fit to the data. The component x$centers is a matrix containing the list of group centers that were used in the best call to kmeans. The component x$withinss contains the sum of the within-group sums of squares, which is used as the measure of fitness.

Author(s)

Kevin R. Coombes krc@silicovore.com

See Also

cutree, hclust, kmeans, pam

Examples

# simulate data from three different groups
d1 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d2 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d3 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
dd <- cbind(d1, d2, d3)

cutKmeans(dd, k=3)
cutKmeans(dd, k=4)

cutHclust(dd, k=3)
cutHclust(dd, k=4)

cutPam(dd, k=3)
cutPam(dd, k=4)

cutRepeatedKmeans(dd, k=3, nTimes=10)
cutRepeatedKmeans(dd, k=4, nTimes=10)

# cleanup
rm(d1, d2, d3, dd)