Cluster Stability - Similarity Index and Pattern-wise Stability Approaches with User Defined Cluster Algorithms

Description

cls.stab.sim.ind.usr and cls.stab.opt.assign.usr reports validation measures for clustering results. Both functions return lists of cluster stability results computed for user defined cluster algorithms according to similarity index and pattern-wise stability approaches.

Usage

1
2
3
cls.stab.sim.ind.usr( data, cl.num, clust.alg, sim.ind.type, rep.num, subset.ratio )
cls.stab.opt.assign.usr( data, cl.num, clust.alg, rep.num, subset.ratio )
cls.alg( clust.method, clust.wrap, fast )

Arguments

data

numeric matrix or data.frame where columns correspond to variables and rows to observations.

cl.num

integer vector with information about numbers of cluster to which data will be partitioned. If vector is not an integer type, it will be coerced with warning.

clust.alg

there are two possible types of input:

1. clustering function that takes two arguments: "data" to be partitioned described in data section and "clust.num" that represents number of cluster to which data will be partitioned. Function represents partitioning algorithm.

2. an object of type "cls.alg" returned by cls.alg function (see "Details" for explanation). Object represents hierarchical algorithm.

clust.method

hierarchical clustering function that takes only one argument named "data" described in data section. Function should return hierarchical structure that might be applied as parameter to clust.wrap function.

clust.wrap

cluster function that takes exactly two arguments: "clust.res" that represents the result of clust.method function and "clust.num" which is the number of clusters to which "clust.res" is going to be cut. Function should return integer vector that represents object id (comming from data set) to cluster id (integer between 1 and clust.num) association.

sim.ind.type

string vector with information useful only for cls.stab.sim.ind.usr function. User is able to choose which similarity indicies (external measures) to use to compare two partitionings. Available are: "dot.pr", "sim.ind", "rand", "jaccard" (for more details see similarity.index, dot.product, std.ext). Combinations are also possible. By default c("dot.pr","sim.ind") vector is applied.

rep.num

integer number which tells how many pairs of data subsets will be partitioned for particular number of clusters. The results of partitioning for given pair of subsets is used to compute similarity indices (in case of cls.stab.sim.ind.usr) or pattern-wise stability (in case of cls.stab.opt.assign.usr, for more details see references). By default rep.num value is 10. If wrong argument is applied it will be repaced with default value.

subset.ratio

a number comming from (0,1) section which tells how big data subsets should be. 0 means empty subset, 1 means all data. By default subset.ratio is set to 0.75. If wrong argument is applied it will be repaced with default value.

fast

logical argument which sets the way of computing cluster stability for hierarchical algorithms. By default it is set to TRUE, which means that each result produced by hierarchical algorithm is partitioned for the number of clusters chosen in cl.num argument and given clustering results are put for further computation. In this way computation of cluster stability is faster. If wrong argument is applied it will be repaced with default value.

Details

Both functions realize cluster stability approaches described in Detecting stable clusters using principal component analysis chapters 3.1 and 3.2 (see references).

The cls.stab.sim.ind.usr as well as cls.stab.opt.assign.usr do the same thing as cls.stab.sim.ind and cls.stab.opt.assign functions. Main difference is that using this functions user is able to define and apply its own cluster algorithm to measure its cluster stability. For that reason clust.alg argument is introduced. This argument may represent partitioning algorithm (by passing it directly as a function) or hierarchical algorithm (by passing an object of "cls.alg" type produced by cls.alg function).

If a partitioning algorithm is going to be used the decalration of this function that represents this algorithm should always look like this: function(data, clust.num) { ... return(integer.vector)} . As an output function should always return integer vector that represents single clustering result on data.

If a hierarchical algorithm is going to be used user has to use helper cls.alg function that produces an object of "cls.alg" type. This object encapsulates a pair of methods that are used in hierarchical version (which is faster if the fast argument is not FALSE) of cluster stability approach. These methods are:
1. clust.method - which builds hierarchical structure that might be cut. The declaration of this function should always look like this one: function(data) { ... return(hierarchical.struct) } ,
2. clust.wrap - which cuts this hierarchical structure to clust.num clusters. This function definition should always look like this one: function(clust.res, clust.num) { ... return(integer.vector)} . As an output function should always return integer vector that represents single clustering result on clust.res.

cls.alg function has also third argument that indicates if fast computation should be taken (when TRUE) or if these two methods should be converted to one partitioning algorithm and to be run as a normal partitioning algorithm.

Well defined cluster functions "f" should always follow this rules (size(data) means number of object to be partitioned, res - integer vector with cluster ids):
1. when data is empty or cl.num is less than 2 or more than size(data) then f(data, cl.num) returns error. 2. if f(data, cl.num) -> res then length(res) == size(data),
3. if f(data, cl.num) -> res then for all "elem" in "res" the folowing condition is true: 0 < elem <= cl.num.

It often happens that clustering algorithms can't produce amount of clusters that user wants. In this situation only the warning is produced and cluster stability is computed for partitionings with unequal number of clusters.

The cluster stability will not be calculated for all cluster numbers that are bigger than the subset size. For example if data contains about 20 objects and the subset.ratio equals 0.5 then the highest cluster number to calculate is 10. In that case all elements above 10 will be removed from cl.num vector.

Value

cls.stab.sim.ind.usr returns a lists of matrices. Each matrix consists of the set of external similarity indices (which one similarity index see below) where number of columns is equal to cl.num vector length and row number is equal to rep.num value what means that each column contain a set of similarity indices computed for fixed number of clusters. The order of the matrices depends on sim.ind.type argument. Each element of this list correspond to one of similarity index type chosen thanks to sim.ind.type argument. The order of the names exactly match to the order given in those arguments description.

cls.stab.opt.assign.usr returns a vector. The vector consists of the set of cluster stability indices described in Detecting stable clusters using principal component analysis chapter 3.2 (see references). Vector length is equal to cl.num vector length what means that each position in vector is assigned to proper clusters' number given in cl.num argument.

Author(s)

Lukasz Nieweglowski

References

A. Ben-Hur and I. Guyon Detecting stable clusters using principal component analysis, http://citeseerx.ist.psu.edu/

C. D. Giurcaneanu, I. Tabus, I. Shmulevich, W. Zhang Stability-Based Cluster Analysis Applied To Microarray Data, http://citeseerx.ist.psu.edu/.

T. Lange, V. Roth, M. L. Braun and J. M. Buhmann Stability-Based Validation of Clustering Solutions, ml-pub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf

See Also

Other cluster stability methods: cls.stab.sim.ind, cls.stab.opt.assign.

Functions that compare two different partitionings: clv.Rand, dot.product,similarity.index.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]

# example of wrapper for partitioning algorithm 
pam.clust <- function(data, clust.num) pam(data, clust.num, cluster.only=TRUE)

# example of wrapper for hierarchical algorithm
cutree.wrap <- function(clust.res, clust.num)  cutree(clust.res, clust.num)
agnes.single <- function(data) agnes(data, method="single") 

# converting hierarchical algorithm to partitioning one
agnes.part1 <- function(data, clust.num) cutree.wrap( agnes.single(data), clust.num )
# the same using "cls.alg"
agnes.part2 <- cls.alg(agnes.single, cutree.wrap, fast=FALSE)

# fix arguments for cls.stab.* function
iter = c(2,4,5,7,9,12,15)

res1 = cls.stab.sim.ind.usr( iris.data, iter, pam.clust, 
    sim.ind.type=c("rand","dot.pr","sim.ind"), rep.num=5, subset.ratio=0.7 )
res2 = cls.stab.opt.assign.usr( iris.data, iter, clust.alg=cls.alg(agnes.single, cutree.wrap) )

res3 = cls.stab.sim.ind.usr( iris.data, iter, agnes.part1,
     sim.ind.type=c("rand","dot.pr","sim.ind"), rep.num=5, subset.ratio=0.7 )
res4 = cls.stab.opt.assign.usr( iris.data, iter, clust.alg=agnes.part2 )

print(res1)
boxplot(res1$sim.ind)
plot(res2)