# similarity_index: Similarity index based on confusion matrix - External... In clv: Cluster Validation Techniques

## Description

The similarity index based on the confusion matrix measures how similar two different partitionings of the same dataset are to each other. Given the `matrix` returned by the `confusion.matrix` function, the similarity index is computed.

## Usage

```
similarity.index(cnf.mx)
```

## Arguments

 `cnf.mx` non-negative integer `matrix` or `data.frame` representing the object returned by the `confusion.matrix` function.
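
To illustrate the shape of input this argument expects, the sketch below builds a contingency table from two label vectors with base R's `table`. It is only a stand-in: it assumes that entry `[i, j]` of the confusion matrix counts the objects placed in cluster `i` of the first partitioning and cluster `j` of the second, and the names `labels.a`, `labels.b` and `cnf.sketch` are hypothetical. In practice the matrix should come from `confusion.matrix`.

```r
# Two hypothetical partitionings of the same six objects.
labels.a <- c(1, 1, 1, 2, 2, 2)   # 2 clusters
labels.b <- c(1, 1, 2, 2, 3, 3)   # 3 clusters

# Cross-tabulate the labels; unclass() turns the table into a plain integer matrix.
# Assumption: this mirrors the layout of the matrix produced by confusion.matrix.
cnf.sketch <- unclass(table(labels.a, labels.b))

print(cnf.sketch)
```

The result is a 2 x 3 non-negative integer matrix whose entries sum to the number of objects.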

## Details

Let M be an n x m (n <= m) confusion matrix for partitionings P and P'. Any one-to-one function sigma: {1,2,...,n} -> {1,2,...,m} is called an assignment (or association). Over the set of all assignments, the index A(P,P') is defined as:

A(P,P') = max{ sum( for i in 1:n ) M[i,sigma(i)] : sigma is an assignment }

(An assignment that attains this maximum is called an optimal assignment.) Using this value, the similarity index is computed as S(P,P') = (A(P,P') - 1)/(N - 1), where N is the number of partitioned objects (here equal to `sum(M)`).
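
The definition can be checked by hand on small matrices with a brute-force sketch in base R. This is not the package's implementation: the helper names `permutations` and `similarity_index_sketch` are hypothetical, the search enumerates every assignment (so it is only practical for a handful of clusters), and transposing the matrix when it has more rows than columns is an assumption made here to satisfy n <= m.

```r
# Enumerate all permutations of a vector (base R only; fine for small inputs).
permutations <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in permutations(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], p)
    }
  }
  out
}

# Brute-force S(P,P') = (A(P,P') - 1) / (N - 1) for a small confusion matrix.
similarity_index_sketch <- function(cnf.mx) {
  M <- as.matrix(cnf.mx)
  if (nrow(M) > ncol(M)) M <- t(M)   # assumption: transpose so that n <= m
  n <- nrow(M)
  N <- sum(M)                        # number of partitioned objects
  best <- 0
  for (sigma in permutations(seq_len(ncol(M)))) {
    # The first n entries of a column permutation form a one-to-one
    # assignment of rows to columns.
    score <- sum(M[cbind(seq_len(n), sigma[seq_len(n)])])
    best <- max(best, score)
  }
  (best - 1) / (N - 1)               # NaN when N == 1
}

# Rows (0, 5) and (4, 1): the optimal assignment pairs row 1 with column 2
# and row 2 with column 1, so A = 5 + 4 = 9 and N = 10.
M <- matrix(c(0, 5,
              4, 1), nrow = 2, byrow = TRUE)
s <- similarity_index_sketch(M)      # (9 - 1) / (10 - 1) = 8/9
```

For a matrix with all objects on the diagonal, such as `diag(c(3, 4))`, the same function gives 1, matching the case of two identical partitionings.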

## Value

`similarity.index` returns a value in the interval [0,1] that measures the similarity between two partitionings. A value of 1 means that the two partitionings are identical.

## Author(s)

Lukasz Nieweglowski

## References

C. D. Giurcaneanu, I. Tabus, I. Shmulevich, W. Zhang, Stability-Based Cluster Analysis Applied To Microarray Data, http://citeseer.ist.psu.edu/577114.html

T. Lange, V. Roth, M. L. Braun and J. M. Buhmann, Stability-Based Validation of Clustering Solutions, ml-pub.inf.ethz.ch/publications/papers/2004/lange.neco_stab.03.pdf

## See Also

`confusion.matrix` for the matrix representation of two partitionings. Other functions for comparing two partitionings: `std.ext`, `dot.product`.

## Examples

```
# similarity.index (and also dot.product) is used here to compute
# cluster stability. The stability function defined below takes
# additional helper functions (wrappers) as arguments.

# define wrappers
pam.wrapp <- function(data)
{
    return( as.integer(data$clustering) )
}

identity <- function(data) { return( as.integer(data) ) }

agnes.average <- function(data, clust.num)
{
    return( cutree( agnes(data, method="average"), clust.num ) )
}

# define cluster stability function - cls.stabb

# cls.stabb arguments description:
# data       - data to be clustered
# clust.num  - number of clusters into which data will be clustered
# sample.num - number of pairs of data subsets to be clustered,
#              each clustered pair is passed to the
#              dot.product and similarity.index functions
# ratio      - value from the (0,1) interval:
#              0 - means sample an empty subset,
#              1 - means choose all "data" objects
# method     - clustering method (see wrapper functions)
# wrapp      - function which extracts the cluster id assigned
#              to each clustered object

# the result is the mean of the similarity.index (and dot.product)
# values computed for the subsampled pairs of subsets
cls.stabb <- function( data, clust.num, sample.num, ratio, method, wrapp )
{
    dot.pr  = 0
    sim.ind = 0
    obj.num = dim(data)[1]

    for( j in 1:sample.num )
    {
        smp1 = sort( sample( 1:obj.num, ratio*obj.num ) )
        smp2 = sort( sample( 1:obj.num, ratio*obj.num ) )

        d1 = data[smp1,]
        cls1 = wrapp( method(d1, clust.num) )

        d2 = data[smp2,]
        cls2 = wrapp( method(d2, clust.num) )

        clsm1 = t(rbind(smp1, cls1))
        clsm2 = t(rbind(smp2, cls2))

        # keep only the objects present in both subsamples
        m = cls.set.section(clsm1, clsm2)
        cls1 = as.integer(m[,2])
        cls2 = as.integer(m[,3])
        cnf.mx = confusion.matrix(cls1, cls2)
        std.ms = std.ext(cls1, cls2)

        # external measures - compare partitionings
        dt = dot.product(cls1, cls2)
        si = similarity.index(cnf.mx)

        if( !is.nan(dt) ) dot.pr = dot.pr + dt/sample.num
        sim.ind = sim.ind + si/sample.num
    }
    return( c(dot.pr, sim.ind) )
}

# load and prepare data
library(clv)
data(iris)
iris.data <- iris[,1:4]

# fix arguments for cls.stabb function
iter = c(2,3,4,5,6,7,9,12,15)
smp.num = 5
sub.smp.ratio = 0.8

# cluster stability for PAM
print("PAM method:")
for( i in iter )
{
    result = cls.stabb(iris.data, clust.num=i, sample.num=smp.num,
            ratio=sub.smp.ratio, method=pam, wrapp=pam.wrapp)
    print(result)
}

# cluster stability for Agnes (average-link)
print("Agnes (average) method:")
for( i in iter )
{
    result = cls.stabb(iris.data, clust.num=i, sample.num=smp.num,
            ratio=sub.smp.ratio, method=agnes.average, wrapp=identity)
    print(result)
}
```