get_clustering_stats: Get clustering statistics
In scclust: Size-Constrained Clustering

Description

get_clustering_stats calculates statistics of a clustering.

Usage

get_clustering_stats(distances, clustering)

Arguments

`distances`	a `distances` object describing the distances between the data points in `clustering`.
`clustering`	a `scclust` object containing a non-empty clustering.

Details

The function reports the following measures:

`num_data_points`	total number of data points
`num_assigned`	number of points assigned to a cluster
`num_clusters`	number of clusters
`min_cluster_size`	size of the smallest cluster
`max_cluster_size`	size of the largest cluster
`avg_cluster_size`	average cluster size
`sum_dists`	sum of all within-cluster distances
`min_dist`	smallest within-cluster distance
`max_dist`	largest within-cluster distance
`avg_min_dist`	average of the clusters' smallest distances
`avg_max_dist`	average of the clusters' largest distances
`avg_dist_weighted`	average of the clusters' average distances, weighted by cluster size
`avg_dist_unweighted`	average of the clusters' average distances (unweighted)
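
The size-related measures in the list above can be reproduced from a plain label vector. The following is a minimal base R sketch, not the package's implementation; the names `labels`, `sizes` and `size_stats` are hypothetical, and it assumes that unassigned points are marked with `NA`.

```r
# Cluster label for each data point (NA would mark an unassigned point;
# that convention is an assumption of this sketch).
labels <- c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")

# Number of points in each cluster; table() drops NA by default.
sizes <- as.vector(table(labels))

size_stats <- c(
  num_data_points  = length(labels),
  num_assigned     = sum(!is.na(labels)),
  num_clusters     = length(sizes),
  min_cluster_size = min(sizes),
  max_cluster_size = max(sizes),
  avg_cluster_size = mean(sizes)
)
print(size_stats)
```

For these labels the sketch gives three clusters with sizes 3, 4 and 3, in agreement with the output shown in the Examples section.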

Let d(i,j) denote the distance between data points i and j. Let a cluster c be the set of indices of the data points assigned to it. Let

D(c) = \{d(i,j): i,j \in c \wedge i>j\}

be the collection of all within-cluster distances in c, with each pair of points counted once. Let C be the set of all clusters.

sum_dists is defined as:

\sum_{c\in C} sum(D(c))

min_dist is defined as:

\min_{c\in C} \min(D(c))

max_dist is defined as:

\max_{c\in C} \max(D(c))

avg_min_dist is defined as:

\sum_{c\in C} \frac{\min(D(c))}{|C|}

avg_max_dist is defined as:

\sum_{c\in C} \frac{\max(D(c))}{|C|}
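
The five definitions above can be checked by hand with base R. This is an illustrative sketch only, not the package's implementation: it assumes Euclidean distances computed with `stats::dist()`, and the names `pts`, `labels`, `dmat`, `within_dists`, `D` and `dist_stats` are hypothetical. Clusters with a single point have an empty D(c) and are not handled here.

```r
# Data points and cluster labels taken from the Examples section.
pts <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                  y = c(10, 9, 8, 7, 6, 10, 9, 8, 7, 6))
labels <- c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")

# Full Euclidean distance matrix (assumption: Euclidean, no normalization).
dmat <- as.matrix(dist(pts))

# D(c): distances d(i,j) for all pairs i > j within one cluster.
within_dists <- function(idx) {
  m <- dmat[idx, idx, drop = FALSE]
  m[lower.tri(m)]
}

# One vector of within-cluster distances per cluster.
D <- lapply(split(seq_along(labels), labels), within_dists)

dist_stats <- c(
  sum_dists    = sum(sapply(D, sum)),
  min_dist     = min(sapply(D, min)),
  max_dist     = max(sapply(D, max)),
  avg_min_dist = mean(sapply(D, min)),
  avg_max_dist = mean(sapply(D, max))
)
print(dist_stats)
```

Using `lower.tri()` keeps exactly the pairs with i > j, so each within-cluster distance is counted once, as in the definition of D(c).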

Let:

AD(c) = \frac{sum(D(c))}{|D(c)|}

be the average within-cluster distance in cluster c.

avg_dist_weighted is defined as:

\sum_{c\in C} \frac{|c| AD(c)}{num_assigned}

where num_assigned is the number of assigned data points (see above).

avg_dist_unweighted is defined as:

\sum_{c\in C} \frac{AD(c)}{|C|}
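
The difference between the weighted and unweighted averages is easiest to see in code. The base R sketch below follows the definitions of AD(c), avg_dist_weighted and avg_dist_unweighted directly; it is not the package's implementation, it assumes Euclidean distances from `stats::dist()`, and the names `pts`, `labels`, `members`, `avg_within`, `AD` and `sizes` are hypothetical.

```r
# Data points and cluster labels taken from the Examples section.
pts <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0),
                  y = c(10, 9, 8, 7, 6, 10, 9, 8, 7, 6))
labels <- c("A", "A", "B", "C", "B", "C", "C", "A", "B", "B")

# Full Euclidean distance matrix (assumption: Euclidean, no normalization).
dmat <- as.matrix(dist(pts))

# Indices of the points in each cluster.
members <- split(seq_along(labels), labels)

# AD(c): mean of the distances over all pairs i > j within one cluster.
avg_within <- function(idx) {
  m <- dmat[idx, idx, drop = FALSE]
  mean(m[lower.tri(m)])
}
AD <- sapply(members, avg_within)

sizes <- lengths(members)   # |c| for each cluster
num_assigned <- sum(sizes)  # every point is assigned in this example

# Each cluster's average is weighted by its share of the assigned points.
avg_dist_weighted <- sum(sizes * AD) / num_assigned
# Each cluster's average counts equally, regardless of cluster size.
avg_dist_unweighted <- mean(AD)

print(c(avg_dist_weighted = avg_dist_weighted,
        avg_dist_unweighted = avg_dist_unweighted))
```

In this data the four-point cluster "B" has the smallest average distance, so giving it more weight pulls the weighted average slightly below the unweighted one.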

Value

Returns a list of class clustering_stats containing the statistics.

Examples

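library(scclust)

# Ten two-dimensional data points
my_data_points <- data.frame(x = c(0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0),
                             y = c(10, 9, 8, 7, 6,
                                   10, 9, 8, 7, 6))

# Distances between all pairs of data points
my_distances <- distances(my_data_points)

# Clustering that assigns each point to cluster "A", "B" or "C"
my_scclust <- scclust(c("A", "A", "B", "C", "B",
                        "C", "C", "A", "B", "B"))

# Compute the statistics of the clustering
get_clustering_stats(my_distances, my_scclust)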

# >                     Value
# > num_data_points     10.0000000
# > num_assigned        10.0000000
# > num_clusters         3.0000000
# > min_cluster_size     3.0000000
# > max_cluster_size     4.0000000
# > avg_cluster_size     3.3333333
# > sum_dists           18.2013097
# > min_dist             0.5000000
# > max_dist             3.0066593
# > avg_min_dist         0.8366584
# > avg_max_dist         2.4148611
# > avg_dist_weighted    1.5575594
# > avg_dist_unweighted  1.5847484

scclust documentation built on Sept. 11, 2024, 6:38 p.m.