Similarity index based on confusion matrix - External Measure, Cluster Stability

Description

The similarity index based on a confusion matrix measures how similar two different partitionings of the same dataset are. The index is computed from the matrix returned by the confusion.matrix function.

Usage

similarity.index(cnf.mx)

Arguments

cnf.mx

non-negative integer matrix or data.frame representing the object returned by the confusion.matrix function.
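To illustrate the kind of input this argument expects, the sketch below builds a small non-negative integer matrix from two label vectors using base R's table function. It is only a stand-in, on the assumption that the confusion matrix is a contingency table of cluster labels; it is not the clv implementation of confusion.matrix. The vectors cls1 and cls2 and the name cnf are made up for the example.

```r
# Two hypothetical partitionings of the same 8 objects (illustrative labels).
cls1 <- c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L)
cls2 <- c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 1L)

# Cross-tabulate the labels: entry [i, j] counts objects placed in
# cluster i by the first partitioning and cluster j by the second.
# unclass() drops the "table" class and leaves a plain integer matrix.
cnf <- unclass(table(cls1, cls2))

print(cnf)
print(sum(cnf))  # total number of partitioned objects
```

With the clv package loaded, the same label vectors would normally be passed to confusion.matrix instead.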

Details

Let M be an n x m (n <= m) confusion matrix for partitionings P and P'. Any one-to-one function sigma: {1,2,...,n} -> {1,2,...,m} is called an assignment (or association). Over the set of all assignments, the index A(P,P') is defined as:

A(P,P') = max{ sum_{i=1..n} M[i,sigma(i)] : sigma is an assignment }

An assignment that attains this maximum is called an optimal assignment. From this value the similarity index is computed as S(P,P') = (A(P,P') - 1)/(N - 1), where N is the number of partitioned objects (here equal to sum(M)).
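The definition can be checked by hand on small matrices. The following is a minimal brute-force sketch in base R, not the algorithm used inside clv: it tries every one-to-one assignment, so it is only practical for a handful of clusters. The names best_assignment and similarity_index_sketch are hypothetical, and transposing the matrix when it has more rows than columns is an assumption made here so that n <= m holds.

```r
# Maximum of sum_i M[i, sigma(i)] over all one-to-one assignments sigma.
# 'used' holds the columns already assigned to rows 1..length(used).
best_assignment <- function(M, used = integer(0)) {
  i <- length(used) + 1
  if (i > nrow(M)) return(0)          # every row has been assigned
  best <- -Inf
  for (j in setdiff(seq_len(ncol(M)), used)) {
    best <- max(best, M[i, j] + best_assignment(M, c(used, j)))
  }
  best
}

# S(P,P') = (A(P,P') - 1) / (N - 1), with N = sum(M).
# Undefined for N == 1 (division by zero); not handled in this sketch.
similarity_index_sketch <- function(M) {
  M <- as.matrix(M)
  if (nrow(M) > ncol(M)) M <- t(M)    # assumption: enforce n <= m
  A <- best_assignment(M)
  N <- sum(M)
  (A - 1) / (N - 1)
}

# 2 x 3 example: the optimal assignment picks 5 and 4, so A = 9, N = 12.
M <- matrix(c(5, 1, 0,
              0, 4, 2), nrow = 2, byrow = TRUE)
print(similarity_index_sketch(M))                  # (9 - 1) / (12 - 1) = 8/11

# Identical partitionings give a diagonal matrix and an index of 1.
print(similarity_index_sketch(diag(c(4, 3, 5))))
```

For larger matrices an exhaustive search becomes infeasible, and a polynomial-time assignment method such as the Hungarian algorithm would be the usual choice.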

Value

similarity.index returns a value in the interval [0,1] that measures the similarity between two partitionings. A value of 1 means the two partitionings are identical.

Author(s)

Lukasz Nieweglowski

References

C. D. Giurcaneanu, I. Tabus, I. Shmulevich, W. Zhang. Stability-Based Cluster Analysis Applied To Microarray Data. http://citeseer.ist.psu.edu/577114.html

T. Lange, V. Roth, M. L. Braun, J. M. Buhmann. Stability-Based Validation of Clustering Solutions. ml-pub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf

See Also

confusion.matrix for the matrix representation of two partitionings. Other functions for comparing two partitionings: std.ext, dot.product.

Examples

# similarity.index (and also dot.product) is used to compute
# cluster stability. A stability function is defined below; it
# takes some additional wrapper functions as arguments.

# define wrappers
pam.wrapp <-function(data)
{
	return( as.integer(data$clustering) )
}

identity <- function(data) { return( as.integer(data) ) }

agnes.average <- function(data, clust.num)
{
	return( cutree( agnes(data,method="average"), clust.num ) )
}

# define cluster stability function - cls.stabb

# cls.stabb arguments:
# data - data to be clustered
# clust.num - number of clusters into which data will be partitioned
# sample.num - number of pairs of data subsets to be clustered;
#              each clustered pair is passed to the
#              dot.product and similarity.index functions
# ratio - value from the interval (0,1):
#		  0 - means sample an empty subset,
#		  1 - means choose all "data" objects
# method - clustering method (see wrapper functions)
# wrapp - function which extracts the cluster id assigned
#         to each clustered object

# the result is the mean of the similarity.index (and dot.product)
# values computed over the subsampled pairs of subsets
cls.stabb <- function( data, clust.num, sample.num , ratio, method, wrapp  )
{
	dot.pr  = 0
	sim.ind = 0
	obj.num = dim(data)[1]

	for( j in 1:sample.num )
	{
		smp1 = sort( sample( 1:obj.num, ratio*obj.num ) )
		smp2 = sort( sample( 1:obj.num, ratio*obj.num ) )

		d1 = data[smp1,]
		cls1 = wrapp( method(d1,clust.num) )

		d2 = data[smp2,]
		cls2 = wrapp( method(d2,clust.num) )

		clsm1 = t(rbind(smp1,cls1))
		clsm2 = t(rbind(smp2,cls2))

		m = cls.set.section(clsm1, clsm2)
		cls1 = as.integer(m[,2])
		cls2 = as.integer(m[,3])
		cnf.mx = confusion.matrix(cls1,cls2)
		std.ms = std.ext(cls1,cls2)
		
		# external measures - compare partitioning
		dt = dot.product(cls1,cls2)
		si = similarity.index(cnf.mx)

		if( !is.nan(dt) ) dot.pr = dot.pr + dt/sample.num 
		sim.ind = sim.ind + si/sample.num 
	}
	return( c(dot.pr, sim.ind) )
}

# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]

# fix arguments for cls.stabb function
iter = c(2,3,4,5,6,7,9,12,15)
smp.num = 5
sub.smp.ratio = 0.8

# cluster stability for PAM
print("PAM method:")
for( i in iter )
{
	result = cls.stabb(iris.data, clust.num=i, sample.num=smp.num,
            ratio=sub.smp.ratio, method=pam, wrapp=pam.wrapp)
	print(result)
}

# cluster stability for Agnes (average-link)
print("Agnes (average) method:")
for( i in iter )
{
	result = cls.stabb(iris.data, clust.num=i, sample.num=smp.num,
            ratio=sub.smp.ratio, method=agnes.average, wrapp=identity)
	print(result)
}