purity: Clustering Purity

View source: R/purity.R

Description

Calculate the purity of clustering results. For example, see Schaeffer et al. (2016).

Usage

purity(classes, clusters)

Arguments

classes

a vector with labels of true classes.

clusters

a vector with labels of assigned clusters for which purity is to be tested. Should be of the same length as classes.

Details

Following Manning et al. (2008), each cluster is assigned to the class that is most frequent in the cluster, then

Purity(Ω,C) = \frac{1}{N}∑_{k}\max_{j}|ω_k\cap c_j|,

where Ω=\{ω_1,…,ω_K\} is the set of identified clusters and C=\{c_1,…,c_J\} is the set of classes. That is, for each class j=1,…,J, find the size of the most populous cluster among the clusters that have not yet been assigned. Then sum the \min(K,J) sizes found and divide by N, where N = length(classes) = length(clusters).
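As a minimal sketch, the displayed formula can be evaluated directly from the contingency table of classes and clusters. The labels below are written out by hand to reproduce the table shown under 'Examples', and the name purity_formula is introduced here for illustration only. This direct evaluation lets several clusters map to the same class, so it can exceed the value from the one-to-one matching described above (0.4666667 in the sample output for the same table).

```r
# Hand-written labels that reproduce the 3 x 5 table under 'Examples'
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",   # class A
              "a", "d", "d", "e", "e",   # class B
              "a", "b", "b", "d", "d")   # class C

tab <- table(classes, clusters)  # rows: classes c_j, columns: clusters omega_k
N   <- length(classes)

# (1/N) * sum over clusters k of max over classes j of |omega_k intersect c_j|
purity_formula <- sum(apply(tab, 2, max)) / N
purity_formula  # 9/15 = 0.6
```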

If \max_{j}|ω_k\cap c_j| is not unique for some j, that is, several clusters tie as the most populous for a class, the class is assigned the tied cluster whose second maximum is the smallest, to maximize the purity (see ‘Examples’).
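The one-to-one matching and the tie rule can be sketched as follows. This is an illustration of the procedure as described in this section, not the source code of purity(). The helper name purity_sketch is hypothetical, and two points are assumptions of the sketch: classes are processed in table order, and a cluster's "second maximum" is read as its largest overlap with any other class.

```r
# Hypothetical helper; an illustration only, not funtimes::purity() itself
purity_sketch <- function(classes, clusters) {
  stopifnot(length(classes) == length(clusters))
  tab  <- table(classes, clusters)
  free <- colnames(tab)          # clusters not yet assigned to a class
  rows <- list()
  for (cl in rownames(tab)) {    # assumption: classes taken in table order
    if (length(free) == 0) break # more classes than clusters: stop early
    counts <- setNames(as.vector(tab[cl, free]), free)
    best <- names(counts)[counts == max(counts)]
    if (length(best) > 1) {
      # Tie: prefer the cluster that overlaps least with any other class
      rival <- sapply(best, function(k) max(c(0, tab[rownames(tab) != cl, k])))
      best <- best[which.min(rival)]
    }
    rows[[cl]] <- data.frame(ClassLabels = cl, ClusterLabels = best,
                             ClusterSize = counts[[best]],
                             stringsAsFactors = FALSE)
    free <- setdiff(free, best)
  }
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  list(pur = sum(out$ClusterSize) / length(classes), out = out)
}

# Hand-written labels that reproduce the table under 'Examples'
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- c("b", "b", "b", "c", "e",
              "a", "d", "d", "e", "e",
              "a", "b", "b", "d", "d")
res <- purity_sketch(classes, clusters)
res$pur  # 7/15
res$out  # A-b (3), B-e (2), C-d (2)
```

For class B, clusters 'd' and 'e' tie with two members each. Cluster 'd' also shares two members with class C, while 'e' shares at most one with another class, so the sketch picks 'e' and leaves 'd' free for class C.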

The number of unique elements in classes and clusters may differ.

Value

A list with two elements:

pur

purity value.

out

table with \min(K,J) = min(length(unique(classes)), length(unique(clusters))) rows and the following columns: ClassLabels, ClusterLabels, and ClusterSize.
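A short sketch of how the two elements of the returned list might be accessed. It assumes the funtimes package is installed; the labels are made up for illustration, and no output is shown because it is not reproduced here.

```r
# Illustrative labels only (not taken from the package examples)
classes  <- rep(c("A", "B", "C"), each = 5)
clusters <- rep(c("x", "y", "z"), times = 5)

if (requireNamespace("funtimes", quietly = TRUE)) {
  res <- funtimes::purity(classes, clusters)
  print(res$pur)  # the purity value
  print(res$out)  # columns ClassLabels, ClusterLabels, ClusterSize
  # By the definition in 'Details', the matched sizes divided by N
  # should agree with res$pur:
  print(sum(res$out$ClusterSize) / length(classes))
}
```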

Author(s)

Vyacheslav Lyubchich

References

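Manning CD, Raghavan P, Schütze H (2008). Introduction to Information Retrieval. Cambridge University Press.

Schaeffer ED, Testa JM, Gel YR, Lyubchich V (2016). On information criteria for dynamic spatio-temporal clustering. In: Proceedings of the 6th International Workshop on Climate Informatics: CI2016.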

Examples

# Fix seed for reproducible simulations:
# RNGkind(sample.kind = "Rounding") #run this line to have same seed across R versions > R 3.6.0
set.seed(1)

##### Example 1
#Create some classes and cluster labels:
classes <- rep(LETTERS[1:3], each = 5)
clusters <- sample(letters[1:5], length(classes), replace = TRUE)

#From the table below:
# - cluster 'b' corresponds to class A;
# - either of the clusters 'd' and 'e' can correspond to class B,
#   however, 'e' should be chosen, because cluster 'd' also highly
#   intersects with class C. Thus,
# - cluster 'd' corresponds to class C.
table(classes, clusters)
##       clusters
##classes a b c d e
##      A 0 3 1 0 1
##      B 1 0 0 2 2
##      C 1 2 0 2 0

#The function does this choice automatically:
purity(classes, clusters)

#Sample output:
##$pur
##[1] 0.4666667
##
##$out
##  ClassLabels ClusterLabels ClusterSize
##1           A             b           3
##2           B             e           2
##3           C             d           2


##### Example 2
#The labels can also be numeric:
classes <- rep(1:5, each = 3)
clusters <- sample(1:3, length(classes), replace = TRUE)
purity(classes, clusters)

funtimes documentation built on Nov. 28, 2020, 1:06 a.m.