R/RcppExports.R
In genieclust: Fast and Robust Hierarchical Clustering with Noise Points Detection

Documented in adjusted_fm_score adjusted_mi_score adjusted_rand_score bonferroni_index calinski_harabasz_index devergottini_index dunnowa_index fm_score generalised_dunn_index gini_index mi_score negated_ball_hall_index negated_davies_bouldin_index negated_wcss_index normalized_clustering_accuracy normalized_confusion_matrix normalized_mi_score normalized_pivoted_accuracy normalizing_permutation pair_sets_index rand_score silhouette_index silhouette_w_index wcnn_index

# Generated by using Rcpp::compileAttributes() -> do not edit by hand
# Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393

#' @title External Cluster Validity Measures and Pairwise Partition Similarity Scores
#'
#' @description
#' The functions described in this section quantify the similarity between
#' two label vectors \code{x} and \code{y} which represent two partitions
#' of a set of \eqn{n} elements into, respectively, \eqn{K} and \eqn{L}
#' nonempty and pairwise disjoint subsets.
#'
#' For instance, \code{x} and \code{y} can represent two clusterings
#' of a dataset with \eqn{n} observations specified by two vectors
#' of labels. The functions described here can be used as external cluster
#' validity measures, where we assume that \code{x} is
#' a reference (ground-truth) partition whilst \code{y} is the vector
#' of predicted cluster memberships.
#'
#' All indices except \code{normalized_clustering_accuracy()}
#' can act as a pairwise partition similarity score: they are symmetric,
#' i.e., \code{index(x, y) == index(y, x)}.
#'
#' Each index except \code{mi_score()} (which computes the mutual
#' information score) outputs 1 given two identical partitions.
#' Note that partitions are always defined up to a permutation (bijection)
#' of the set of possible labels, e.g., (1, 1, 2, 1) and (4, 4, 2, 4)
#' represent the same 2-partition.
#'
#' @details
#' \code{normalized_clustering_accuracy()} (Gagolewski, 2023)
#' is an asymmetric external cluster validity measure
#' which assumes that the label vector \code{x} (or rows in the confusion
#' matrix) represents the reference (ground truth) partition.
#' It is an average proportion of correctly classified points in each cluster
#' above the worst case scenario of uniform membership assignment,
#' with cluster ID matching based on the solution to the maximal linear
#' sum assignment problem; see \code{\link{normalized_confusion_matrix}}).
#' It is given by:
#' \eqn{\max_\sigma \frac{1}{K} \sum_{j=1}^K \frac{c_{\sigma(j), j}-c_{\sigma(j),\cdot}/K}{c_{\sigma(j),\cdot}-c_{\sigma(j),\cdot}/K}},
#' where \eqn{C} is a confusion matrix with \eqn{K} rows and \eqn{L} columns,
#' \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}}, and
#' \eqn{c_{i, \cdot}=c_{i, 1}+...+c_{i, L}} is the i-th row sum,
#' under the assumption that \eqn{c_{i,j}=0} for \eqn{i>K} or \eqn{j>L}
#' and \eqn{0/0=0}.
#'
#' \code{normalized_pivoted_accuracy()} is defined as
#' \eqn{(\max_\sigma \sum_{j=1}^{\max(K,L)} c_{\sigma(j),j}/n-1/\max(K,L))/(1-1/\max(K,L))},
#' where \eqn{\sigma} is a permutation of the set \eqn{\{1,\dots,\max(K,L)\}},
#' and \eqn{n} is the sum of all elements in \eqn{C}.
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{pair_sets_index()} (PSI) was introduced in (Rezaei, Franti, 2016).
#' The simplified PSI assumes E=1 in the definition of the index,
#' i.e., uses Eq. (20) in the said paper instead of Eq. (18).
#' For non-square matrices, missing rows/columns are assumed
#' to be filled with 0s.
#'
#' \code{rand_score()} gives the Rand score (the "probability" of agreement
#' between the two partitions) and
#' \code{adjusted_rand_score()} is its version corrected for chance,
#' see (Hubert, Arabie, 1985): its expected value is 0 given two independent
#' partitions. Due to the adjustment, the resulting index may be negative
#' for some inputs.
#'
#' Similarly, \code{fm_score()} gives the Fowlkes-Mallows (FM) score
#' and \code{adjusted_fm_score()} is its adjusted-for-chance version;
#' see (Hubert, Arabie, 1985).
#'
#' \code{mi_score()}, \code{adjusted_mi_score()} and
#' \code{normalized_mi_score()} are information-theoretic
#' scores, based on mutual information,
#' see the definition of \eqn{AMI_{sum}} and \eqn{NMI_{sum}}
#' in (Vinh et al., 2010).
#'
#'
#' \code{normalized_confusion_matrix()} computes the confusion matrix
#' and permutes its rows and columns so that the sum of the elements
#' of the main diagonal is the largest possible (by solving
#' the maximal assignment problem).
#' The function only accepts \eqn{K \leq L}.
#' The reordering of the columns of a confusion matrix can be determined
#' by calling \code{normalizing_permutation()}.
#'
#' Also note that the built-in
#' \code{\link{table}()} determines the standard confusion matrix.
#'
#'
#' @references
#' Gagolewski M., A framework for benchmarking clustering algorithms,
#' \emph{SoftwareX} 20, 2022, 101270,
#' \doi{10.1016/j.softx.2022.101270},
#' \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Gagolewski M., Normalised clustering accuracy: An asymmetric external
#' cluster validity measure, \emph{Journal of Classification}, 2024, in press,
#' \doi{10.1007/s00357-024-09482-2}.
#'
#' Hubert L., Arabie P., Comparing partitions,
#' \emph{Journal of Classification} 2(1), 1985, 193-218, esp. Eqs. (2) and (4).
#'
#' Meila M., Heckerman D., An experimental comparison of model-based clustering
#' methods, \emph{Machine Learning} 42, 2001, pp. 9-29,
#' \doi{10.1023/A:1007648401407}.
#'
#' Rezaei M., Franti P., Set matching measures for external cluster validity,
#' \emph{IEEE Transactions on Knowledge and Data Mining} 28(8), 2016,
#' 2173-2186.
#'
#' Steinley D., Properties of the Hubert-Arabie adjusted Rand index,
#' \emph{Psychological Methods} 9(3), 2004, pp. 386-396,
#' \doi{10.1037/1082-989X.9.3.386}.
#'
#' Vinh N.X., Epps J., Bailey J.,
#' Information theoretic measures for clusterings comparison:
#' Variants, properties, normalization and correction for chance,
#' \emph{Journal of Machine Learning Research} 11, 2010, 2837-2854.
#'
#'
#'
#' @param x an integer vector of length n (or an object coercible to)
#' representing a K-partition of an n-set (e.g., a reference partition),
#' or a confusion matrix with K rows and L columns
#' (see \code{\link{table}(x, y)})
#'
#' @param y an integer vector of length n (or an object coercible to)
#' representing an L-partition of the same set (e.g., the output of a
#' clustering algorithm we wish to compare with \code{x}),
#' or NULL (if x is an K*L confusion matrix)
#'
#' @param simplified whether to assume E=1 in the definition of the pair sets index index,
#'     i.e., use Eq. (20) in (Rezaei, Franti, 2016) instead of Eq. (18)
#'
#' @param clipped whether the result should be clipped to the unit interval, i.e., [0, 1]
#'
#'
#' @return Each cluster validity measure is a single numeric value.
#'
#' \code{normalized_confusion_matrix()} returns a numeric matrix.
#'
#' \code{normalizing_permutation()} returns a vector of indexes.
#'
#'
#' @examples
#' y_true <- iris[[5]]
#' y_pred <- kmeans(as.matrix(iris[1:4]), 3)$cluster
#' normalized_clustering_accuracy(y_true, y_pred)
#' normalized_pivoted_accuracy(y_true, y_pred)
#' pair_sets_index(y_true, y_pred)
#' pair_sets_index(y_true, y_pred, simplified=TRUE)
#' adjusted_rand_score(y_true, y_pred)
#' rand_score(table(y_true, y_pred)) # the same
#' adjusted_fm_score(y_true, y_pred)
#' fm_score(y_true, y_pred)
#' mi_score(y_true, y_pred)
#' normalized_mi_score(y_true, y_pred)
#' adjusted_mi_score(y_true, y_pred)
#' normalized_confusion_matrix(y_true, y_pred)
#' normalizing_permutation(y_true, y_pred)
#'
#' @rdname compare_partitions
#' @name compare_partitions
#' @export
normalized_clustering_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_clustering_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_pivoted_accuracy <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_pivoted_accuracy`, x, y)
}

#' @rdname compare_partitions
#' @export
pair_sets_index <- function(x, y = NULL, simplified = FALSE, clipped = TRUE) {
    .Call(`_genieclust_pair_sets_index`, x, y, simplified, clipped)
}

#' @rdname compare_partitions
#' @export
adjusted_rand_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_rand_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
rand_score <- function(x, y = NULL) {
    .Call(`_genieclust_rand_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_fm_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_fm_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
fm_score <- function(x, y = NULL) {
    .Call(`_genieclust_fm_score`, x, y)
}

#' @rdname compare_partitions
#' @export
mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
normalized_mi_score <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_mi_score`, x, y)
}

#' @rdname compare_partitions
#' @export
adjusted_mi_score <- function(x, y = NULL, clipped = FALSE) {
    .Call(`_genieclust_adjusted_mi_score`, x, y, clipped)
}

#' @rdname compare_partitions
#' @export
normalized_confusion_matrix <- function(x, y = NULL) {
    .Call(`_genieclust_normalized_confusion_matrix`, x, y)
}

#' @rdname compare_partitions
#' @export
normalizing_permutation <- function(x, y = NULL) {
    .Call(`_genieclust_normalizing_permutation`, x, y)
}

#' @title Internal Cluster Validity Measures
#'
#' @description
#' Implementation of a number of so-called cluster validity indices critically
#' reviewed in (Gagolewski, Bartoszuk, Cena, 2021). See Section 2
#' therein and (Gagolewski, 2022) for the respective definitions.
#'
#' The greater the index value, the more \emph{valid} (whatever that means)
#' the assessed partition. For consistency, the Ball-Hall and
#' Davies-Bouldin indexes as well as the within-cluster sum of squares (WCSS)
#' take negative values.
#'
#'
#' @param X numeric matrix with \code{n} rows and \code{d} columns,
#'     representing \code{n} points in a \code{d}-dimensional space
#'
#' @param y vector of \code{n} integer labels,
#'     representing a partition whose \emph{quality} is to be
#'     assessed; \code{y[i]} is the cluster ID of the \code{i}-th point,
#'     \code{X[i, ]}; \code{1 <= y[i] <= K}, where \code{K} is the number
#'     or clusters
#'
#' @param M number of nearest neighbours
#'
#' @param lowercase_d an integer between 1 and 5, denoting
#'     \eqn{d_1}, ..., \eqn{d_5} in the definition
#'     of the generalised Dunn (Bezdek-Pal) index (numerator:
#'     min, max, and mean pairwise intracluster distance,
#'     distance between cluster centroids,
#'     weighted point-centroid distance, respectively)
#'
#' @param uppercase_d an integer between 1 and 3, denoting
#'     \eqn{D_1}, ..., \eqn{D_3} in the definition
#'     of the generalised Dunn (Bezdek-Pal) index (denominator:
#'       max and min pairwise intracluster distance, average point-centroid
#'       distance, respectively)
#'
#' @param owa_numerator,owa_denominator single string specifying
#'     the OWA operators to use in the definition of the DuNN index;
#'     one of: \code{"Mean"}, \code{"Min"}, \code{"Max"}, \code{"Const"},
#'     \code{"SMin:D"}, \code{"SMax:D"}, where \code{D} is an integer
#'     defining the degree of smoothness
#'
#'
#' @return
#' A single numeric value (the more, the \emph{better}).
#'
#' @references
#' Ball G.H., Hall D.J.,
#' \emph{ISODATA: A novel method of data analysis and pattern classification},
#' Technical report No. AD699616, Stanford Research Institute, 1965.
#'
#' Bezdek J., Pal N., Some new indexes of cluster validity,
#' \emph{IEEE Transactions on Systems, Man, and Cybernetics, Part B} 28,
#' 1998, 301-315, \doi{10.1109/3477.678624}.
#'
#' Calinski T., Harabasz J., A dendrite method for cluster analysis,
#' \emph{Communications in Statistics} 3(1), 1974, 1-27,
#' \doi{10.1080/03610927408827101}.
#'
#' Davies D.L., Bouldin D.W.,
#' A Cluster Separation Measure,
#' \emph{IEEE Transactions on Pattern Analysis and Machine Intelligence}
#' PAMI-1 (2), 1979, 224-227, \doi{10.1109/TPAMI.1979.4766909}.
#'
#' Dunn J.C., A Fuzzy Relative of the ISODATA Process and Its Use in Detecting
#' Compact Well-Separated Clusters, \emph{Journal of Cybernetics} 3(3), 1973,
#' 32-57, \doi{10.1080/01969727308546046}.
#'
#' Gagolewski M., Bartoszuk M., Cena A.,
#' Are cluster validity measures (in)valid?, \emph{Information Sciences} 581,
#' 620-636, 2021, \doi{10.1016/j.ins.2021.10.004};
#' preprint: \url{https://raw.githubusercontent.com/gagolews/bibliography/master/preprints/2021cvi.pdf}.
#'
#' Gagolewski M., A Framework for Benchmarking Clustering Algorithms,
#' \emph{SoftwareX} 20, 2022, 101270,
#' \doi{10.1016/j.softx.2022.101270},
#' \url{https://clustering-benchmarks.gagolewski.com}.
#'
#' Rousseeuw P.J., Silhouettes: A Graphical Aid to the Interpretation and
#' Validation of Cluster Analysis, \emph{Computational and Applied Mathematics}
#' 20, 1987, 53-65, \doi{10.1016/0377-0427(87)90125-7}.
#'
#'
#'
#' @examples
#' X <- as.matrix(iris[,1:4])
#' X[,] <- jitter(X)  # otherwise we get a non-unique solution
#' y <- as.integer(iris[[5]])
#' calinski_harabasz_index(X, y)  # good
#' calinski_harabasz_index(X, sample(1:3, nrow(X), replace=TRUE))  # bad
#'
#' @name cluster_validity
#' @rdname cluster_validity
#' @export
calinski_harabasz_index <- function(X, y) {
    .Call(`_genieclust_calinski_harabasz_index`, X, y)
}

#' @rdname cluster_validity
#' @export
dunnowa_index <- function(X, y, M = 25L, owa_numerator = "SMin:5", owa_denominator = "Const") {
    .Call(`_genieclust_dunnowa_index`, X, y, M, owa_numerator, owa_denominator)
}

#' @rdname cluster_validity
#' @export
generalised_dunn_index <- function(X, y, lowercase_d, uppercase_d) {
    .Call(`_genieclust_generalised_dunn_index`, X, y, lowercase_d, uppercase_d)
}

#' @rdname cluster_validity
#' @export
negated_ball_hall_index <- function(X, y) {
    .Call(`_genieclust_negated_ball_hall_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_davies_bouldin_index <- function(X, y) {
    .Call(`_genieclust_negated_davies_bouldin_index`, X, y)
}

#' @rdname cluster_validity
#' @export
negated_wcss_index <- function(X, y) {
    .Call(`_genieclust_negated_wcss_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_index <- function(X, y) {
    .Call(`_genieclust_silhouette_index`, X, y)
}

#' @rdname cluster_validity
#' @export
silhouette_w_index <- function(X, y) {
    .Call(`_genieclust_silhouette_w_index`, X, y)
}

#' @rdname cluster_validity
#' @export
wcnn_index <- function(X, y, M = 25L) {
    .Call(`_genieclust_wcnn_index`, X, y, M)
}

.mst.default <- function(X, distance = "euclidean", M = 1L, cast_float32 = TRUE, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_default`, X, distance, M, cast_float32, verbose)
}

.mst.dist <- function(d, M = 1L, verbose = FALSE) {
    .Call(`_genieclust_dot_mst_dist`, d, M, verbose)
}

.genie <- function(mst, k, gini_threshold, postprocess, detect_noise, verbose) {
    .Call(`_genieclust_dot_genie`, mst, k, gini_threshold, postprocess, detect_noise, verbose)
}

.gclust <- function(mst, gini_threshold, verbose) {
    .Call(`_genieclust_dot_gclust`, mst, gini_threshold, verbose)
}

#' @title Inequality Measures
#'
#' @description
#' \code{gini_index()} gives the normalised Gini index,
#' \code{bonferroni_index()} implements the Bonferroni index, and
#' \code{devergottini_index()} implements the De Vergottini index.
#'
#' @details
#' These indices can be used to quantify the "inequality" of a numeric sample.
#' They can be conceived as normalised measures of data dispersion.
#' For constant vectors (perfect equity), the indices yield values of 0.
#' Vectors with all elements but one equal to 0 (perfect inequality),
#' are assigned scores of 1.
#' They follow the Pigou-Dalton principle (are Schur-convex):
#' setting \eqn{x_i = x_i - h} and \eqn{x_j = x_j + h} with \eqn{h > 0}
#' and \eqn{x_i - h \geq  x_j + h} (taking from the "rich" and
#' giving to the "poor") decreases the inequality
#'
#' These indices have applications in economics, amongst others.
#' The Genie clustering algorithm uses the Gini index as a measure
#' of the inequality of cluster sizes.
#'
#'
#' The normalised Gini index is given by:
#' \deqn{
#'     G(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n} (n-2i+1) x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     },
#' }
#'
#' The normalised Bonferroni index is given by:
#' \deqn{
#'     B(x_1,\dots,x_n) = \frac{
#'     \sum_{i=1}^{n}  (n-\sum_{j=1}^i \frac{n}{n-j+1})
#'          x_{\sigma(n-i+1)}
#'     }{
#'     (n-1) \sum_{i=1}^n x_i
#'     }.
#' }
#'
#' The normalised De Vergottini index is given by:
#' \deqn{
#'     V(x_1,\dots,x_n) =
#'     \frac{1}{\sum_{i=2}^n \frac{1}{i}} \left(
#'        \frac{ \sum_{i=1}^n \left( \sum_{j=i}^{n} \frac{1}{j}\right)
#'        x_{\sigma(n-i+1)} }{\sum_{i=1}^{n} x_i} - 1
#'     \right).
#' }
#'
#' Here, \eqn{\sigma} is an ordering permutation of \eqn{(x_1,\dots,x_n)}.
#'
#' Time complexity: \eqn{O(n)} for sorted (increasingly) data.
#' Otherwise, the vector will be sorted.
#'
#'
#' @references
#' Bonferroni C., \emph{Elementi di Statistica Generale}, Libreria Seber,
#' Firenze, 1930.
#'
#' Gagolewski M., Bartoszuk M., Cena A., Genie: A new, fast, and
#' outlier-resistant hierarchical clustering algorithm,
#' \emph{Information Sciences} 363, 2016, pp. 8-23.
#' \doi{10.1016/j.ins.2016.05.003}
#'
#' Gini C., \emph{Variabilita e Mutabilita},
#' Tipografia di Paolo Cuppini, Bologna, 1912.
#'
#'
#' @param x numeric vector of non-negative values
#'
#' @return The value of the inequality index, a number in \eqn{[0, 1]}.
#'
#' @examples
#' gini_index(c(2, 2, 2, 2, 2))   # no inequality
#' gini_index(c(0, 0, 10, 0, 0))  # one has it all
#' gini_index(c(7, 0, 3, 0, 0))   # give to the poor, take away from the rich
#' gini_index(c(6, 0, 3, 1, 0))   # (a.k.a. Pigou-Dalton principle)
#' bonferroni_index(c(2, 2, 2, 2, 2))
#' bonferroni_index(c(0, 0, 10, 0, 0))
#' bonferroni_index(c(7, 0, 3, 0, 0))
#' bonferroni_index(c(6, 0, 3, 1, 0))
#' devergottini_index(c(2, 2, 2, 2, 2))
#' devergottini_index(c(0, 0, 10, 0, 0))
#' devergottini_index(c(7, 0, 3, 0, 0))
#' devergottini_index(c(6, 0, 3, 1, 0))
#'
#' @name inequality
#' @rdname inequality
#' @export
gini_index <- function(x) {
    .Call(`_genieclust_gini_index`, x)
}

#' @rdname inequality
#' @export
bonferroni_index <- function(x) {
    .Call(`_genieclust_bonferroni_index`, x)
}

#' @rdname inequality
#' @export
devergottini_index <- function(x) {
    .Call(`_genieclust_devergottini_index`, x)
}
Any scripts or data that you put into this service are public.
genieclust documentation built on Sept. 11, 2024, 8:01 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
genieclust
Fast and Robust Hierarchical Clustering with Noise Points Detection

R/RcppExports.R
In genieclust: Fast and Robust Hierarchical Clustering with Noise Points Detection

Try the genieclust package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

genieclust Fast and Robust Hierarchical Clustering with Noise Points Detection

R/RcppExports.R In genieclust: Fast and Robust Hierarchical Clustering with Noise Points Detection

Try the genieclust package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

genieclust
Fast and Robust Hierarchical Clustering with Noise Points Detection

R/RcppExports.R
In genieclust: Fast and Robust Hierarchical Clustering with Noise Points Detection