R/RcppExports.R

Defines functions devergottini_index bonferroni_index gini_index .gclust .genie .mst.dist .mst.default wcnn_index silhouette_w_index silhouette_index negated_wcss_index negated_davies_bouldin_index negated_ball_hall_index generalised_dunn_index dunnowa_index calinski_harabasz_index normalizing_permutation normalized_confusion_matrix adjusted_mi_score normalized_mi_score mi_score fm_score adjusted_fm_score rand_score adjusted_rand_score pair_sets_index normalized_pivoted_accuracy normalized_clustering_accuracy

Documented in adjusted_fm_score adjusted_mi_score adjusted_rand_score bonferroni_index calinski_harabasz_index devergottini_index dunnowa_index fm_score generalised_dunn_index gini_index mi_score negated_ball_hall_index negated_davies_bouldin_index negated_wcss_index normalized_clustering_accuracy normalized_confusion_matrix normalized_mi_score normalized_pivoted_accuracy normalizing_permutation pair_sets_index rand_score silhouette_index silhouette_w_index wcnn_index

# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' @title External Cluster Validity Measures and Pairwise Partition Similarity Scores
#'
#' @description
#' The functions described in this section quantify the similarity between
#' two label vectors \code{x} and \code{y} which represent two partitions
#' of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L}
#' nonempty and pairwise disjoint subsets.
#'
#' For instance, \code{x} and \code{y} can represent two clusterings
#' of a dataset with \eqn{n} observations specified by two vectors
#' of labels. The functions described here can be used as external cluster
#' validity measures, where we assume that \code{x} is
#' a reference (ground-truth) partition whilst \code{y} is the vector
#' of predicted cluster memberships.
#'
#' All indices except \code{normalized_clustering_accuracy()}
#' can act as a pairwise partition similarity score: they are symmetric,
#' i.e., \code{index(x, y) == index(y, x)}.
#'
#' Each index except \code{mi_score()} (which computes the mutual
#' information score) outputs 1 given two identical partitions.
#' Note that partitions are always defined up to a permutation (bijection)
#' of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
#' represent the same 2-partition.
#'
#' @details
#' \code{normalized_clustering_accuracy()} (Gagolewski, 2023)
#' is an asymmetric external cluster validity measure
#' which assumes that the label vector \code{x} (or rows in the confusion
#' matrix) represents the reference (ground truth) partition.
#' It is an average proportion of correctly classified points in each cluster
#' above the worst case scenario of uniform membership assignment,
#' with cluster ID matching based on the solution to the maximal linear
#' sum assignment problem; see \code{\link{normalized_confusion_matrix}}).
#' It is given by:
#' \eqn{\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}},
#' where \eqn{C} is a confusion matrix with \eqn{K} rows and \eqn{L} columns,
#' \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and
#' \eqn{c_{i, \cdot}=c_{i, 1}+...+c_{i, L}} is the i-th row sum,
#' under the assumption that \eqn{c_{i,j}=0} for \eqn{i>K} or \eqn{j>L}
#' and \eqn{0/0=0}.
#'
#' \code{normalized_pivoted_accuracy()} is defined as
#' \eqn{(\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))},
#' where \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}},
#' and \eqn{n} is the sum of all elements in \eqn{C}.
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{pair_sets_index()} (PSI) was introduced in (Rezaei, Franti, 2016).
#' The simplified PSI assumes E=1 in the definition of the index,
#' i.e., uses Eq. (20) in the said paper instead of Eq. (18).
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{rand_score()} gives the Rand score (the "probability" of agreement
#' between the two partitions) and
#' \code{adjusted_rand_score()} is its version corrected for chance,
#' see (Hubert, Arabie, 1985): its expected value is 0 given two independent
#' partitions. Due to the adjustment, the resulting index may be negative
#' for some inputs.
#'
#' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score
#' and \code{adjusted_fm_score()} is its adjusted-for-chance version;
#' see (Hubert, Arabie, 1985).
#'
#' \code{mi_score()}, \code{adjusted_mi_score()} and
#' \code{normalized_mi_score()} are information-theoretic
#' scores, based on mutual information,
#' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}}
#' in (Vinh et al., 2010).
#'
#'
#' \code{normalized_confusion_matrix()} computes the confusion matrix
#' and permutes its rows and columns so that the sum of the elements
#' of the main diagonal is the largest possible (by solving
#' the maximal assignment problem).
#' The function only accepts \eqn{K \leq L}.
#' The reordering of the columns of a confusion matrix can be determined
#' by calling \code{normalizing_permutation()}.
#'
#' Also note that the built-in
#' \code{\link{table}()} determines the standard confusion matrix.
#'
#'
#' @references
#' Gagolewski M., \emph{A Framework for Benchmarking Clustering Algorithms},
#' 2022, \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Gagolewski M., Normalised clustering accuracy: An asymmetric external
#' cluster validity measure, 2023, under review (preprint),
#' \doi{10.48550/arXiv.2209.02935}.
#'
#' Hubert L., Arabie P., Comparing partitions,
#' \emph{Journal of Classification} 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
#'
#' Meila M., Heckerman D., An experimental comparison of model-based clustering
#' methods, \emph{Machine Learning} 42, 2001, pp. 9-29,
#' \doi{10.1023/A:1007648401407}.
#'
#' Rezaei M., Franti P., Set matching measures for external cluster validity,
#' \emph{IEEE Transactions on Knowledge and Data Mining} 28(8), 2016,
#' 2173-2186.
#'
#' Steinley D., Properties of the Hubert-Arabie adjusted Rand index,
#' \emph{Psychological Methods} 9(3), 2004, pp. 386-396,
#' \doi{10.1037/1082-989X.9.3.386}.
#'
#' Vinh N.X., Epps J., Bailey J.,
#' Information theoretic measures for clusterings comparison:
#' Variants, properties, normalization and correction for chance,
#' \emph{Journal of Machine Learning Research} 11, 2010, 2837-2854.
#'
#'
#'
#' @param x an integer vector of length n (or an object coercible to)
#' representing a K-partition of an n-set (e.g., a reference partition),
#' or a confusion matrix with K rows and L columns
#' (see \code{\link{table}(x, y)})
#'
#' @param y an integer vector of length n (or an object coercible to)
#' representing an L-partition of the same set (e.g., the output of a
#' clustering algorithm we wish to compare with \code{x}),
#' or NULL (if x is an K*L confusion matrix)
#'
#' @param simplified whether to assume E=1 in the definition of the pair sets index index,
#'     i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq. (18)
#'
#' @param clipped whether the result should be clipped to the unit interval, i.e., [0, 1]
#'
#'
#' @return Each cluster validity measure is a single numeric value.
#'
#' \code{normalized_confusion_matrix()} returns a numeric matrix.
#'
#' \code{normalizing_permutation()} returns a vector of indexes.
#'
#'
#' @examples
#' y_true <- iris[[5]]
#' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster
#' normalized_clustering_accuracy(y_true, y_pred)
#' normalized_pivoted_accuracy(y_true, y_pred)
#' pair_sets_index(y_true, y_pred)
#' pair_sets_index(y_true, y_pred, simplified=TRUE)
#' adjusted_rand_score(y_true, y_pred)
#' rand_score(table(y_true, y_pred)) # the same
#' adjusted_fm_score(y_true, y_pred)
#' fm_score(y_true, y_pred)
#' mi_score(y_true, y_pred)
#' normalized_mi_score(y_true, y_pred)
#' adjusted_mi_score(y_true, y_pred)
#' normalized_confusion_matrix(y_true, y_pred)
#' normalizing_permutation(y_true, y_pred)
#'
#' @rdname compare_partitions
#' @name compare_partitions
#' @export
normalized_clustering_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_clustering_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_pivoted_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_pivoted_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
pair_sets_index <- function(x, y = NULL, simplified = FALSE, clipped = TRUE) {
    .Call(`_genieclust_pair_sets_index`, x, y, simplified, clipped)
}

#' @rdname compare_partitions
#' @export
adjusted_rand_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_rand_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
rand_score <- function(x, y = NULL) {
    .Call(`_genieclust_rand_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_fm_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_fm_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
fm_score <- function(x, y = NULL) {
    .Call(`_genieclust_fm_score`, x, y)
}

#' @rdname compare_partitions
#' @export
mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_mi_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_mi_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
normalized_confusion_matrix <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_confusion_matrix`, x, y)
}

#' @rdname compare_partitions
#' @export
normalizing_permutation <- function(x, y = NULL) {
    .Call(`_genieclust_normalizing_permutation`, x, y)
}

#' @title Internal Cluster Validity Measures
#'
#' @description
#' Implementation of a number of so-called cluster validity indices critically
#' reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2
#' therein and (Gagolewski, 2022) for the respective definitions.
#'
#' The greater the index value, the more \emph{valid} (whatever that means)
#' the assessed partition. For consistency, the Ball-Hall and
#' Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS)
#' take negative values.
#'
#'
#' @param X numeric matrix with \code{n} rows and \code{d} columns,
#'     representing \code{n} points in a \code{d}-dimensional space
#'
#' @param y vector of \code{n} integer labels,
#'     representing a partition whose \emph{quality} is to be
#'     assessed; \code{y[i]} is the cluster ID of the \code{i}-th point,
#'     \code{X[i, ]}; \code{1 <= y[i] <= K}, where \code{K} is the number
#'     or clusters
#'
#' @param M number of nearest neighbours
#'
#' @param lowercase_d an integer between 1 and 5, denoting
#'     \eqn{d_1}, ..., \eqn{d_5} in the definition
#'     of the generalised Dunn (Bezdek-Pal) index (numerator:
#'     min, max, and mean pairwise intracluster distance,
#'     distance between cluster centroids,
#'     weighted point-centroid distance, respectively)
#'
#' @param uppercase_d an integer between 1 and 3, denoting
#'     \eqn{D_1}, ..., \eqn{D_3} in the definition
#'     of the generalised Dunn (Bezdek-Pal) index (denominator:
#'       max and min pairwise intracluster distance, average point-centroid
#'       distance, respectively)
#'
#' @param owa_numerator,owa_denominator single string specifying
#'     the OWA operators to use in the definition of the DuNN index;
#'     one of: \code{"Mean"}, \code{"Min"}, \code{"Max"}, \code{"Const"},
#'     \code{"SMin:D"}, \code{"SMax:D"}, where \code{D} is an integer
#'     defining the degree of smoothness
#'
#'
#' @return
#' A single numeric value (the more, the \emph{better}).
#'
#' @references
#' Ball G.H., Hall D.J.,
#' \emph{ISODATA: A novel method of data analysis and pattern classification},
#' Technical report No. AD699616, Stanford Research Institute, 1965.
#'
#' Bezdek J., Pal N., Some new indexes of cluster validity,
#' \emph{IEEE Transactions on Systems, Man, and Cybernetics, Part B} 28,
#' 1998, 301-315, \doi{10.1109/3477.678624}.
#'
#' Calinski T., Harabasz J., A dendrite method for cluster analysis,
#' \emph{Communications in Statistics} 3(1), 1974, 1-27,
#' \doi{10.1080/03610927408827101}.
#'
#' Davies D.L., Bouldin D.W.,
#' A Cluster Separation Measure,
#' \emph{IEEE Transactions on Pattern Analysis and Machine Intelligence}
#' PAMI-1 (2), 1979, 224-227, \doi{10.1109/TPAMI.1979.4766909}.
#'
#' Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting
#' Compact Well-Separated Clusters, \emph{Journal of Cybernetics} 3(3), 1973,
#' 32-57, \doi{10.1080/01969727308546046}.
#'
#' Gagolewski M., Bartoszuk M., Cena A.,
#' Are cluster validity measures (in)valid?, \emph{Information Sciences} 581,
#' 620-636, 2021, \doi{10.1016/j.ins.2021.10.004};
#' preprint: \url{https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf}.
#'
#' Gagolewski M., \emph{A Framework for Benchmarking Clustering Algorithms},
#' 2022, \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and
#' Validation of Cluster Analysis, \emph{Computational and Applied Mathematics}
#' 20, 1987, 53-65, \doi{10.1016/0377-0427(87)90125-7}.
#'
#'
#'
#' @examples
#' X <- as.matrix(iris[,1:4])
#' X[,] <- jitter(X)  # otherwise we get a non-unique solution
#' y <- as.integer(iris[[5]])
#' calinski_harabasz_index(X, y)  # good
#' calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE))  # bad
#'
#' @name cluster_validity
#' @rdname cluster_validity
#' @export
calinski_harabasz_index <- function(X, y) {
    .Call(`_genieclust_calinski_harabasz_index`, X, y)
}

#' @rdname cluster_validity
#' @export
dunnowa_index <- function(X, y, M = 25L, owa_numerator = "SMin:5", owa_denominator = "Const") {
    .Call(`_genieclust_dunnowa_index`, X, y, M, owa_numerator, owa_denominator)
}

#' @rdname cluster_validity
#' @export
generalised_dunn_index <- function(X, y, lowercase_d, uppercase_d) {
    .Call(`_genieclust_generalised_dunn_index`, X, y, lowercase_d, uppercase_d)
}

#' @rdname cluster_validity
#' @export
negated_ball_hall_index <- function(X, y) {
    .Call(`_genieclust_negated_ball_hall_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_davies_bouldin_index <- function(X, y) {
    .Call(`_genieclust_negated_davies_bouldin_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_wcss_index <- function(X, y) {
    .Call(`_genieclust_negated_wcss_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_index <- function(X, y) {
    .Call(`_genieclust_silhouette_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_w_index <- function(X, y) {
    .Call(`_genieclust_silhouette_w_index`, X, y)
}

#' @rdname cluster_validity
#' @export
wcnn_index <- function(X, y, M = 25L) {
    .Call(`_genieclust_wcnn_index`, X, y, M)
}

.mst.default <- function(X, distance = "euclidean", M = 1L, cast_float32 = TRUE, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_default`, X, distance, M, cast_float32, verbose)
}

.mst.dist <- function(d, M = 1L, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_dist`, d, M, verbose)
}

.genie <- function(mst, k, gini_threshold, postprocess, detect_noise, verbose) {
    .Call(`_genieclust_dot_genie`, mst, k, gini_threshold, postprocess, detect_noise, verbose)
}

.gclust <- function(mst, gini_threshold, verbose) {
    .Call(`_genieclust_dot_gclust`, mst, gini_threshold, verbose)
}

#' @title Inequality Measures
#'
#' @description
#' \code{gini_index()} gives the normalised Gini index,
#' \code{bonferroni_index()} implements the Bonferroni index, and
#' \code{devergottini_index()} implements the De Vergottini index.
#'
#' @details
#' These indices can be used to quantify the "inequality" of a numeric sample.
#' They can be conceived as normalised measures of data dispersion.
#' For constant vectors (perfect equity), the indices yield values of 0.
#' Vectors with all elements but one equal to 0 (perfect inequality),
#' are assigned scores of 1.
#' They follow the Pigou-Dalton principle (are Schur-convex):
#' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0}
#' and \eqn{x_i - h \geq  x_j + h} (taking from the "rich" and
#' giving to the "poor") decreases the inequality
#'
#' These indices have applications in economics, amongst others.
#' The Genie clustering algorithm uses the Gini index as a measure
#' of the inequality of cluster sizes.
#'
#'
#' The normalised Gini index is given by:
#' \deqn{
#'     G(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     },
#' }
#'
#' The normalised Bonferroni index is given by:
#' \deqn{
#'     B(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n}  (n-\sum_{j=1}^i \frac{n}{n-j+1})
#'          x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     }.
#' }
#'
#' The normalised De Vergottini index is given by:
#' \deqn{
#'     V(x_1,\dots,x_n) =
#'     \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left(
#'        \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right)
#'        x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1
#'     \right).
#' }
#'
#' Here, \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}.
#'
#' Time complexity: \eqn{O(n)} for sorted (increasingly) data.
#' Otherwise, the vector will be sorted.
#'
#'
#' @references
#' Bonferroni C., \emph{Elementi di Statistica Generale}, Libreria Seber,
#' Firenze, 1930.
#'
#' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and
#' outlier-resistant hierarchical clustering algorithm,
#' \emph{Information Sciences} 363, 2016, pp. 8-23.
#' \doi{10.1016/j.ins.2016.05.003}
#'
#' Gini C., \emph{Variabilita e Mutabilita},
#' Tipografia di Paolo Cuppini, Bologna, 1912.
#'
#'
#' @param x numeric vector of non-negative values
#'
#' @return The value of the inequality index, a number in \eqn{[0, 1]}.
#'
#' @examples
#' gini_index(c(2, 2, 2, 2, 2))   # no inequality
#' gini_index(c(0, 0, 10, 0, 0))  # one has it all
#' gini_index(c(7, 0, 3, 0, 0))   # give to the poor, take away from the rich
#' gini_index(c(6, 0, 3, 1, 0))   # (a.k.a. Pigou-Dalton principle)
#' bonferroni_index(c(2, 2, 2, 2, 2))
#' bonferroni_index(c(0, 0, 10, 0, 0))
#' bonferroni_index(c(7, 0, 3, 0, 0))
#' bonferroni_index(c(6, 0, 3, 1, 0))
#' devergottini_index(c(2, 2, 2, 2, 2))
#' devergottini_index(c(0, 0, 10, 0, 0))
#' devergottini_index(c(7, 0, 3, 0, 0))
#' devergottini_index(c(6, 0, 3, 1, 0))
#'
#' @name inequality
#' @rdname inequality
#' @export
gini_index <- function(x) {
    .Call(`_genieclust_gini_index`, x)
}

#' @rdname inequality
#' @export
bonferroni_index <- function(x) {
    .Call(`_genieclust_bonferroni_index`, x)
}

#' @rdname inequality
#' @export
devergottini_index <- function(x) {
    .Call(`_genieclust_devergottini_index`, x)
}

Try the genieclust package in your browser

Any scripts or data that you put into this service are public.

genieclust documentation built on Oct. 18, 2023, 5:08 p.m.