Cluster Stability - Similarity Index and Patternwise Stability Approaches
Description
cls.stab.sim.ind and cls.stab.opt.assign report validation measures for clustering results. Both functions return lists of
cluster stability results computed according to the similarity index and patternwise stability approaches.
Usage
cls.stab.sim.ind( data, cl.num, rep.num, subset.ratio, clust.method,
                  method.type, sim.ind.type, fast, ... )
cls.stab.opt.assign( data, cl.num, rep.num, subset.ratio, clust.method,
                     method.type, fast, ... )

Arguments
data 

cl.num 
integer 
rep.num 
integer giving the number of pairs of data subsets that will be partitioned for each number of clusters.
The results of partitioning a given pair of subsets are used to compute similarity indices (in case of 
subset.ratio 
a number from the interval (0,1) that sets how big the data subsets should be: 0 means an empty subset, 1 means all data.
By default 
clust.method 
string vector with the names of the clustering algorithms to be used. Available are:
"agnes", "diana", "hclust", "kmeans", "pam", "clara". Combinations are also possible.
By default 
method.type 
string vector with information useful only in the context of the "agnes" and "hclust" algorithms. Available are:
"single", "average", "complete", "ward" and "weighted" (for more details see 
sim.ind.type 
string vector with information useful only for 
fast 
logical argument which sets the way cluster stability is computed for hierarchical algorithms. By default it is set to
TRUE, which means that each result produced by a hierarchical algorithm is partitioned for the number of clusters chosen in

... 
additional parameters for clustering algorithms. Note: use with caution! Different clustering methods chosen in 
Details
Both functions implement the cluster stability approaches described in Detecting stable clusters using principal component analysis (see References).
The cls.stab.sim.ind function implements the algorithm given in chapter 3.1, where only the cosine similarity index (see dot.product)
is introduced as a similarity index between two different partitionings. This function also applies the same cluster stability approach to other
similarity indices such as similarity.index, clv.Rand and clv.Jaccard.
Note that the similarity index (if chosen) produced by this function is not exactly the same as the index produced by the
similarity.index function. The value of similarity.index depends on the number of clusters:
e.g. if two n-cluster partitionings are compared, the value always lies in the interval [1/n, 1].
This means that the results produced by this similarity index are not comparable across different numbers of clusters.
For this reason each result is rescaled with the linear function f: [1/n, 1] -> [0, 1], where "n" is the number of clusters.
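The rescaling can be sketched as follows. This is a minimal illustration, not the package's code: the helper name rescale.sim.ind is hypothetical, and the formula assumes that f is the increasing linear map sending 1/n to 0 and 1 to 1.

```r
# Hypothetical helper (not part of clv): rescale a similarity value
# from the interval [1/n, 1] to [0, 1], where n is the number of clusters.
# Assumes f is the increasing linear map with f(1/n) = 0 and f(1) = 1.
rescale.sim.ind <- function(x, n) {
  stopifnot(n > 1)          # for n == 1 the interval collapses to a point
  lower <- 1 / n
  (x - lower) / (1 - lower)
}

# With n = 4 the raw values lie in [0.25, 1]
rescale.sim.ind(c(0.25, 0.625, 1), n = 4)
# [1] 0.0 0.5 1.0
```

After rescaling, the lowest attainable value is 0 and the highest is 1 for every n, which is what makes values for different numbers of clusters comparable.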
The layout of the results is described in the Value section.
The cls.stab.opt.assign function implements the algorithm given in chapter 3.2, where patternwise agreement and
patternwise stability are introduced. The function returns the lowest patternwise stability value for a given number of
clusters. The layout of the results is described in the Value section.
Clustering algorithms often cannot produce the number of clusters that the user requests. In this situation only a warning is issued and cluster stability is computed for partitionings with unequal numbers of clusters.
Cluster stability is not calculated for cluster numbers that are bigger than the subset size.
For example, if data contains about 20 objects and subset.ratio equals 0.5, then the highest cluster number to
calculate is 10. In that case all elements above 10 are removed from the cl.num vector.
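The trimming of cl.num can be pictured with the short sketch below. The helper trim.cl.num is hypothetical, and rounding the subset size down with floor() is an assumption; the package may compute the subset size differently.

```r
# Hypothetical helper (not part of clv): drop cluster numbers that exceed
# the subset size. Rounding down with floor() is an assumption.
trim.cl.num <- function(cl.num, n.objects, subset.ratio) {
  subset.size <- floor(n.objects * subset.ratio)
  cl.num[cl.num <= subset.size]
}

# 20 objects and subset.ratio = 0.5 give a subset of 10 objects,
# so 12 and 15 are removed
trim.cl.num(c(2, 5, 10, 12, 15), n.objects = 20, subset.ratio = 0.5)
# [1]  2  5 10
```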
Value
cls.stab.sim.ind returns a list of lists of matrices. Each matrix holds a set of external similarity indices (which similarity
index, see below). The number of columns equals the length of the cl.num vector and the number of rows equals the rep.num
value, which means that each column contains the set of similarity indices computed for a fixed number of clusters.
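Because each column corresponds to one entry of cl.num, per-cluster-number summaries are column-wise operations. The sketch below uses a made-up matrix of random values standing in for one result matrix; the column names are set by hand here and are not claimed to match what the package returns.

```r
# Made-up stand-in for one result matrix: rep.num rows, length(cl.num) columns.
# The values are random and purely illustrative.
set.seed(1)
cl.num  <- c(2, 3, 4)
rep.num <- 5
sim <- matrix(runif(rep.num * length(cl.num)), nrow = rep.num,
              dimnames = list(NULL, as.character(cl.num)))

# One summary value per number of clusters (one per column)
col.medians <- apply(sim, 2, median)
col.medians
```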
The order of the matrices depends on three input arguments: clust.method, method.type, and sim.ind.type.
The combination of clust.method and method.type gives the names of the elements of the first list. Each element of this list is itself a
list whose element names correspond to the similarity index types chosen with the sim.ind.type argument.
The order of the names exactly matches the order given in the descriptions of those arguments. The following example makes this clear.
Suppose cls.stab.sim.ind is run with default arguments; the results are then given in the following order:
$agnes.single$dot.pr, $agnes.single$sim.ind, $agnes.average$dot.pr, $agnes.average$sim.ind, $pam$dot.pr, $pam$sim.ind.
cls.stab.opt.assign returns a list of vectors. Each vector holds the set of cluster stability indices described in
Detecting stable clusters using principal component analysis (see References). The vector length equals the length of the cl.num
vector, which means that each position in the vector corresponds to one number of clusters given in the cl.num argument.
The order of the vectors depends on two input arguments: clust.method and method.type. The order of the names exactly matches the order
given in the argument descriptions. The following example makes this clear.
Suppose cls.stab.opt.assign is run with c("pam", "kmeans", "hclust", "agnes") as clust.method and c("ward","average") as method.type;
the results are then given in the following order:
$hclust.average, $hclust.ward, $agnes.average, $agnes.ward, $kmeans, $pam.
Author(s)
Lukasz Nieweglowski
References
A. Ben-Hur and I. Guyon, Detecting stable clusters using principal component analysis, http://citeseerx.ist.psu.edu/
C. D. Giurcaneanu, I. Tabus, I. Shmulevich, W. Zhang, Stability-Based Cluster Analysis Applied To Microarray Data, http://citeseerx.ist.psu.edu/
T. Lange, V. Roth, M. L. Braun and J. M. Buhmann, Stability-Based Validation of Clustering Solutions, mlpub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf
See Also
Advanced cluster stability functions: cls.stab.sim.ind.usr, cls.stab.opt.assign.usr.
Functions that compare two different partitionings: clv.Rand, dot.product, similarity.index.
Examples
# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]

# fix arguments for cls.stab.* function
iter = c(2,3,4,5,6,7,9,12,15)
smp.num = 5
ratio = 0.8

# similarity index approach, three index types
res1 = cls.stab.sim.ind( iris.data, iter, rep.num=smp.num, subset.ratio=0.7,
                         sim.ind.type=c("rand","dot.pr","sim.ind") )
# patternwise stability approach
res2 = cls.stab.opt.assign( iris.data, iter, clust.method=c("hclust","kmeans"),
                            method.type=c("single","average") )

# inspect and plot the results
print(res1)
boxplot(res1$agnes.average$sim.ind)
plot(res2$hclust.single)
