get_clustering_stats: Get clustering statistics


View source: R/utilities.R

Description

get_clustering_stats calculates size and within-cluster distance statistics for a clustering.

Usage

get_clustering_stats(distances, clustering)

Arguments

distances

a distances object describing the distances between the data points in clustering.

clustering

a scclust object containing a non-empty clustering.

Details

The function reports the following measures:

num_data_points: total number of data points
num_assigned: number of points assigned to a cluster
num_clusters: number of clusters
min_cluster_size: size of the smallest cluster
max_cluster_size: size of the largest cluster
avg_cluster_size: average cluster size
sum_dists: sum of all within-cluster distances
min_dist: smallest within-cluster distance
max_dist: largest within-cluster distance
avg_min_dist: average of the clusters' smallest distances
avg_max_dist: average of the clusters' largest distances
avg_dist_weighted: average of the clusters' average distances, weighted by cluster size
avg_dist_unweighted: average of the clusters' average distances (unweighted)

Let d(i,j) denote the distance between data points i and j. Let c be a cluster, given as the set of indices of the points assigned to it. Let

D(c) = { d(i,j) : i,j in c and i > j }

be the collection of all within-cluster distances in c, with each pair of points counted once. Let C be the set of all clusters.
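
As a minimal sketch of this notation in base R (not scclust's own implementation), the hypothetical helper within_cluster_dists below collects D(c) for one cluster from a full distance matrix. The helper name, the use of stats::dist and the choice of Euclidean distances are assumptions made for illustration only.

```r
# Hypothetical helper: returns D(c), the distances d(i, j) with i > j,
# for the points whose indices are listed in `cluster`.
within_cluster_dists <- function(dist_matrix, cluster) {
  sub <- dist_matrix[cluster, cluster, drop = FALSE]
  # The strict lower triangle holds each pair of points exactly once.
  sub[lower.tri(sub)]
}

# Three points on a line, so the distances are easy to check by hand.
pts <- data.frame(x = c(0, 3, 7))
d <- as.matrix(dist(pts))  # Euclidean distances (an assumption here)

# D(c) for the cluster containing all three points
within_cluster_dists(d, c(1, 2, 3))
```

A cluster with n points yields n(n-1)/2 distances, so a cluster with a single point has an empty D(c).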

sum_dists is defined as:

∑_[c in C] sum(D(c))

min_dist is defined as:

min_[c in C] min(D(c))

max_dist is defined as:

max_[c in C] max(D(c))

avg_min_dist is defined as:

∑_[c in C] min(D(c)) / count(C)

avg_max_dist is defined as:

∑_[c in C] max(D(c)) / count(C)

Let:

AD(c) = sum(D(c)) / count(D(c))

be the average within-cluster distance in cluster c.

avg_dist_weighted is defined as:

∑_[c in C] count(c) * AD(c) / num_assigned

where num_assigned is the number of assigned data points (see above).

avg_dist_unweighted is defined as:

∑_[c in C] AD(c) / count(C)
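
The definitions above can be checked by hand with a short base R sketch. This is not how scclust computes the statistics internally; it assumes Euclidean distances from stats::dist, and the variable names (cluster_labels, manual_stats and so on) are made up for illustration. The data and cluster labels are the ones used in the Examples section, so the result should agree with the distance statistics printed there.

```r
# Data and cluster labels taken from the Examples section.
my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                             y = c(10, 9, 8, 7, 6,
                                   10, 9, 8, 7, 6))
cluster_labels <- c("A", "A", "B", "C", "B",
                    "C", "C", "A", "B", "B")

d <- as.matrix(dist(my_data_points))  # Euclidean distances (assumption)

# C: a list with the point indices of each cluster.
clusters <- split(seq_along(cluster_labels), cluster_labels)

# D(c) for every cluster: strict lower triangle of each sub-matrix.
D <- lapply(clusters, function(idx) {
  sub <- d[idx, idx, drop = FALSE]
  sub[lower.tri(sub)]
})

AD <- sapply(D, mean)        # AD(c): average within-cluster distance
sizes <- lengths(clusters)   # count(c)
num_assigned <- sum(sizes)

manual_stats <- c(
  sum_dists           = sum(sapply(D, sum)),
  min_dist            = min(sapply(D, min)),
  max_dist            = max(sapply(D, max)),
  avg_min_dist        = mean(sapply(D, min)),
  avg_max_dist        = mean(sapply(D, max)),
  avg_dist_weighted   = sum(sizes * AD) / num_assigned,
  avg_dist_unweighted = mean(AD)
)
round(manual_stats, 7)
```

The sketch assumes every cluster contains at least two points. For a single-point cluster D(c) is empty, so min, max and mean would not give meaningful values.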

Value

Returns a list of class clustering_stats containing the statistics.

Examples

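# distances() is provided by the distances package, scclust() by scclust
library(distances)
library(scclust)

my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,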
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                             y = c(10, 9, 8, 7, 6,
                                   10, 9, 8, 7, 6))

my_distances <- distances(my_data_points)

my_scclust <- scclust(c("A", "A", "B", "C", "B",
                        "C", "C", "A", "B", "B"))

get_clustering_stats(my_distances, my_scclust)

# >                     Value
# > num_data_points     10.0000000
# > num_assigned        10.0000000
# > num_clusters         3.0000000
# > min_cluster_size     3.0000000
# > max_cluster_size     4.0000000
# > avg_cluster_size     3.3333333
# > sum_dists           18.2013097
# > min_dist             0.5000000
# > max_dist             3.0066593
# > avg_min_dist         0.8366584
# > avg_max_dist         2.4148611
# > avg_dist_weighted    1.5575594
# > avg_dist_unweighted  1.5847484

scclust documentation built on May 2, 2019, 4:04 p.m.