cls.stab.sim.ind.usr | R Documentation |
cls.stab.sim.ind.usr
and cls.stab.opt.assign.usr
reports validation measures for clustering results. Both functions return lists of
cluster stability results computed for user defined cluster algorithms according to similarity index and pattern-wise stability approaches.
cls.stab.sim.ind.usr( data, cl.num, clust.alg, sim.ind.type, rep.num, subset.ratio )
cls.stab.opt.assign.usr( data, cl.num, clust.alg, rep.num, subset.ratio )
cls.alg( clust.method, clust.wrap, fast )
data |
|
cl.num |
integer |
clust.alg |
there are two possible types of input: 1. clustering function that takes two arguments: "data" to be partitioned described in 2. an object of type "cls.alg" returned by |
clust.method |
hierarchical clustering function that takes only one argument named "data" described in |
clust.wrap |
cluster function that takes exactly two arguments: "clust.res" that represents the result of |
sim.ind.type |
string vector with information useful only for |
rep.num |
integer number which tells how many pairs of data subsets will be partitioned for particular number of clusters.
The results of partitioning for given pair of subsets is used to compute similarity indices (in case of |
subset.ratio |
a number comming from (0,1) section which tells how big data subsets should be. 0 means empty subset, 1 means all data.
By default |
fast |
logical argument which sets the way of computing cluster stability for hierarchical algorithms. By default it is set to
TRUE, which means that each result produced by hierarchical algorithm is partitioned for the number of clusters chosen in
|
Both functions realize cluster stability approaches described in Detecting stable clusters using principal component analysis chapters 3.1 and 3.2 (see references).
The cls.stab.sim.ind.usr
as well as cls.stab.opt.assign.usr
do the same thing as cls.stab.sim.ind
and
cls.stab.opt.assign
functions. Main difference is that using this functions user is able to define and apply its own cluster
algorithm to measure its cluster stability. For that reason clust.alg
argument is introduced. This argument may represent partitioning
algorithm (by passing it directly as a function) or hierarchical algorithm (by passing an object of "cls.alg" type produced by cls.alg
function).
If a partitioning algorithm is going to be used the decalration of this function that represents this algorithm should always look
like this: function(data, clust.num) { ... return(integer.vector)}
.
As an output function should always return integer vector that represents single clustering result on data
.
If a hierarchical algorithm is going to be used user has to use helper cls.alg
function that produces an object of "cls.alg" type.
This object encapsulates a pair of methods that are used in hierarchical version (which is faster if the fast
argument is not FALSE)
of cluster stability approach. These methods are:
1. clust.method - which builds hierarchical structure that might be cut. The declaration of this function should always look like
this one: function(data) { ... return(hierarchical.struct) } ,
2. clust.wrap - which cuts this hierarchical structure to clust.num
clusters. This function definition should always look
like this one: function(clust.res, clust.num) { ... return(integer.vector)} . As an output function should
always return integer vector that represents single clustering result on clust.res
.
cls.alg
function has also third argument that indicates if fast computation should be taken (when TRUE
) or if these two
methods should be converted to one partitioning algorithm and to be run as a normal partitioning algorithm.
Well defined cluster functions "f" should always follow this rules (size(data) means number of object to be partitioned,
res - integer vector with cluster ids):
1. when data
is empty or cl.num
is less than 2 or more than size(data)
then f(data, cl.num)
returns error.
2. if f(data, cl.num) -> res
then length(res) == size(data),
3. if f(data, cl.num) -> res
then for all "elem" in "res" the folowing condition is true: 0 < elem <= cl.num
.
It often happens that clustering algorithms can't produce amount of clusters that user wants. In this situation only the warning is produced and cluster stability is computed for partitionings with unequal number of clusters.
The cluster stability will not be calculated for all cluster numbers that are bigger than the subset size.
For example if data
contains about 20 objects and the subset.ratio
equals 0.5 then the highest cluster number to
calculate is 10. In that case all elements above 10 will be removed from cl.num
vector.
cls.stab.sim.ind.usr
returns a lists of matrices. Each matrix consists of the set of external similarity indices (which one similarity
index see below) where number of columns is equal to cl.num
vector length and row number is equal to rep.num
value what means
that each column contain a set of similarity indices computed for fixed number of clusters. The order of the matrices depends on
sim.ind.type
argument. Each element of this list correspond to one of similarity index type chosen thanks to sim.ind.type
argument.
The order of the names exactly match to the order given in those arguments description.
cls.stab.opt.assign.usr
returns a vector. The vector consists of the set of cluster stability indices described in
Detecting stable clusters using principal component analysis chapter 3.2 (see references). Vector length is equal to cl.num
vector length what
means that each position in vector is assigned to proper clusters' number given in cl.num
argument.
Lukasz Nieweglowski
A. Ben-Hur and I. Guyon Detecting stable clusters using principal component analysis, http://citeseerx.ist.psu.edu/
C. D. Giurcaneanu, I. Tabus, I. Shmulevich, W. Zhang Stability-Based Cluster Analysis Applied To Microarray Data, http://citeseerx.ist.psu.edu/.
T. Lange, V. Roth, M. L. Braun and J. M. Buhmann Stability-Based Validation of Clustering Solutions, ml-pub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf
Other cluster stability methods:
cls.stab.sim.ind
, cls.stab.opt.assign
.
Functions that compare two different partitionings:
clv.Rand
, dot.product
,similarity.index
.
# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]
# example of wrapper for partitioning algorithm
pam.clust <- function(data, clust.num) pam(data, clust.num, cluster.only=TRUE)
# example of wrapper for hierarchical algorithm
cutree.wrap <- function(clust.res, clust.num) cutree(clust.res, clust.num)
agnes.single <- function(data) agnes(data, method="single")
# converting hierarchical algorithm to partitioning one
agnes.part1 <- function(data, clust.num) cutree.wrap( agnes.single(data), clust.num )
# the same using "cls.alg"
agnes.part2 <- cls.alg(agnes.single, cutree.wrap, fast=FALSE)
# fix arguments for cls.stab.* function
iter = c(2,4,5,7,9,12,15)
res1 = cls.stab.sim.ind.usr( iris.data, iter, pam.clust,
sim.ind.type=c("rand","dot.pr","sim.ind"), rep.num=5, subset.ratio=0.7 )
res2 = cls.stab.opt.assign.usr( iris.data, iter, clust.alg=cls.alg(agnes.single, cutree.wrap) )
res3 = cls.stab.sim.ind.usr( iris.data, iter, agnes.part1,
sim.ind.type=c("rand","dot.pr","sim.ind"), rep.num=5, subset.ratio=0.7 )
res4 = cls.stab.opt.assign.usr( iris.data, iter, clust.alg=agnes.part2 )
print(res1)
boxplot(res1$sim.ind)
plot(res2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.