variation_information: Computes the Variation of Information distance between two...


Description

This function calculates Meila's (2007) Variation of Information (VI) metric between two clusterings of the same data set. VI is an information-theoretic criterion that measures the amount of information lost and gained between two clusterings.

Usage

variation_information(labels1, labels2)

Arguments

labels1

a vector of n clustering labels

labels2

a vector of n clustering labels for the same observations, in the same order as labels1

Details

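If n is the number of observations in the data set, VI is bounded between 0 and log(n). Furthermore, VI == 0 if and only if the two clusterings are the same, that is, they define the same partition of the observations regardless of how the clusters are labeled.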

The definition of VI, more properties, and connections to other criteria are given in the Meila (2007) paper, which has open access: http://www.sciencedirect.com/science/article/pii/S0047259X06002016

NOTE: We define 0 log 0 = 0.
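
As a rough illustration of the properties above, VI can be written as H(C) + H(C') - 2 I(C, C'), where H is the entropy of a clustering and I is the mutual information between the two clusterings (Meila, 2007). The sketch below computes this from the contingency table of the two label vectors. The helper name vi_sketch is hypothetical, natural logarithms are assumed, and this is not necessarily how variation_information is implemented in the package.

```r
# Hypothetical helper, not the package's implementation: computes VI from
# the joint distribution of two label vectors (natural logarithms assumed).
vi_sketch <- function(labels1, labels2) {
  stopifnot(length(labels1) == length(labels2))
  n <- length(labels1)

  # Joint and marginal cluster-membership probabilities.
  joint <- table(labels1, labels2) / n
  p1 <- rowSums(joint)
  p2 <- colSums(joint)

  # x * log(x), using the convention 0 log 0 = 0.
  xlogx <- function(x) ifelse(x > 0, x * log(x), 0)

  entropy1 <- -sum(xlogx(p1))
  entropy2 <- -sum(xlogx(p2))

  # Mutual information; empty cells of the table contribute 0.
  indep <- outer(p1, p2)
  mutual_info <- sum(ifelse(joint > 0, joint * log(joint / indep), 0))

  entropy1 + entropy2 - 2 * mutual_info
}

a <- c(1, 1, 2, 2, 3, 3)
b <- c(3, 3, 1, 1, 2, 2)    # same partition as a, with the labels permuted
vi_sketch(a, b)             # 0, up to floating-point error
vi_sketch(rep(1, 4), 1:4)   # log(4): one cluster versus four singletons
```

The two calls at the end exercise the stated bounds: relabeling the clusters leaves VI at 0, and comparing a single cluster against all-singleton clusters attains the upper bound log(n).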

Value

the VI distance between labels1 and labels2

References

Meila, M. (2007). "Comparing clusterings - an information based distance," Journal of Multivariate Analysis, 98, 5, 873-895. http://www.sciencedirect.com/science/article/pii/S0047259X06002016

Examples

# We generate K = 3 labels for each of n = 30 observations and compute the
# Variation of Information (VI) between the two clusterings.
set.seed(42)
K <- 3
n <- 30
labels1 <- sample.int(K, n, replace=TRUE)
labels2 <- sample.int(K, n, replace=TRUE)
variation_information(labels1, labels2)

# Here, we cluster the iris data set with the K-means and
# hierarchical algorithms using the true number of clusters, K = 3.
# Then, we compute the VI between the two clusterings.
iris_kmeans <- kmeans(iris[, -5], centers = 3)$cluster
iris_hclust <- cutree(hclust(dist(iris[, -5])), k = 3)
variation_information(iris_kmeans, iris_hclust)

ramhiser/clusteval documentation built on May 26, 2019, 10:07 p.m.