purity: Purity and Entropy of a Clustering

Description

The functions purity and entropy respectively compute the purity and the entropy of a clustering given a priori known classes.

Usage

purity(x, y, ...)

## S4 method for signature 'table,missing'
purity(x, y)

## S4 method for signature 'factor,ANY'
purity(x, y, ...)

## S4 method for signature 'ANY,ANY'
purity(x, y, ...)

entropy(x, y, ...)

## S4 method for signature 'table,missing'
entropy(x, y, ...)

## S4 method for signature 'factor,ANY'
entropy(x, y, ...)

## S4 method for signature 'ANY,ANY'
entropy(x, y, ...)

Arguments

x

an object that can be interpreted as a factor, or that can generate one (e.g. via a suitable predict method), giving the cluster membership of each sample.

y

a factor or an object coerced into a factor that gives the true class labels for each sample. It may be missing if x is a contingency table.

...

extra arguments to allow extension, and usually passed to the next method.

Details

Purity and entropy measure the ability of a clustering method to recover known classes (e.g. when the true class label of each sample is known). Both are applicable even when the number of clusters differs from the number of known classes. Kim and Park (2007) used these measures to evaluate the performance of their alternating least-squares NMF algorithm.

Suppose we are given l categories, while the clustering method generates k clusters.

The purity of the clustering with respect to the known categories is given by:

Purity = \frac{1}{n} \sum_{q=1}^{k} \max_{1 \leq j \leq l} n_q^j

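,

where:

n is the total number of samples;

n_q^j is the number of samples in cluster q that belong to known class j (1 \leq j \leq l).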

The purity is therefore a real number in [0,1]. The larger the purity, the better the clustering performance.
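
As a minimal sketch of the purity formula, the base R snippet below computes it directly from a contingency table. The helper name purity_sketch and the toy labels are illustrative assumptions, not part of the NMF package; the sketch assumes that rows index clusters and columns index known classes.

```r
# Hypothetical helper (not the package implementation): purity from a
# contingency table with clusters in rows and known classes in columns.
purity_sketch <- function(tab) {
  # largest class count within each cluster, summed over all clusters,
  # then divided by the total number of samples
  sum(apply(tab, 1, max)) / sum(tab)
}

# toy data (assumed for illustration): 10 samples, 3 clusters, 3 classes
clusters <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
classes  <- factor(c("a", "a", "b", "b", "b", "b", "a", "c", "c", "a"))
tab <- table(clusters, classes)

purity_sketch(tab)  # (2 + 3 + 2) / 10 = 0.7
```

Each cluster contributes the size of its dominant class, so a clustering in which every cluster contains a single class reaches the maximum value of 1.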

The entropy of the clustering with respect to the known categories is given by:

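Entropy = - \frac{1}{n \log_2 l} \sum_{q=1}^{k} \sum_{j=1}^{l} n_q^j \log_2 \frac{n_q^j}{n_q}

,

where:

n is the total number of samples;

n_q is the total number of samples in cluster q (1 \leq q \leq k);

n_q^j is the number of samples in cluster q that belong to known class j (1 \leq j \leq l).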

The smaller the entropy, the better the clustering performance.
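
The entropy formula can be sketched in base R in the same way. The helper name entropy_sketch and the toy tables are illustrative assumptions, not part of the NMF package; the sketch assumes clusters in rows, known classes in columns, at least two classes, and the usual convention that a zero count contributes nothing to the sum.

```r
# Hypothetical helper (not the package implementation): entropy from a
# contingency table with clusters in rows and known classes in columns.
entropy_sketch <- function(tab) {
  m  <- unclass(tab)   # plain matrix of counts n_q^j
  n  <- sum(m)         # total number of samples
  l  <- ncol(m)        # number of known classes (assumed > 1)
  nq <- rowSums(m)     # cluster sizes n_q
  p  <- m / nq         # n_q^j / n_q, computed row by row
  # assumed convention: cells with a zero count contribute 0
  terms <- ifelse(m > 0, m * log2(p), 0)
  -sum(terms) / (n * log2(l))
}

# toy tables (assumed for illustration)
perfect <- table(c(1, 1, 2, 2), c("a", "a", "b", "b"))
mixed   <- table(c(1, 1, 2, 2), c("a", "b", "a", "b"))

entropy_sketch(perfect)  # 0: every cluster holds a single class
entropy_sketch(mixed)    # 1: every cluster is an even mix of both classes
```

The two toy cases show the extremes of the normalised measure: 0 when clusters match the classes exactly, and 1 when every cluster mixes the classes evenly.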

Value

the purity (i.e. a single numeric value)

the entropy (i.e. a single numeric value)

Methods (by generic)

entropy:

purity:

References

Kim H, Park H (2007). "Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis." Bioinformatics (Oxford, England), 23(12), 1495-1502. ISSN 1460-2059, doi: 10.1093/bioinformatics/btm134 (URL: https://doi.org/10.1093/bioinformatics/btm134).

See Also

Other assess: sparseness()

Examples

library(NMF)

# generate a synthetic dataset with known classes: 50 features, 18 samples (5+5+8)
n <- 50; counts <- c(5, 5, 8)
V <- syntheticNMF(n, counts)
# true class label (1, 2 or 3) of each of the 18 samples
cl <- unlist(mapply(rep, 1:3, counts))

# perform default NMF with rank=2
x2 <- nmf(V, 2)
purity(x2, cl)
entropy(x2, cl)

# perform default NMF with rank=3
x3 <- nmf(V, 3)
purity(x3, cl)
entropy(x3, cl)

renozao/NMF documentation built on June 14, 2020, 9:35 p.m.