Nothing
#' Cluster validity indices
#'
#' Compute different cluster validity indices (CVIs) of a given cluster partition, using the
#' clustering distance measure and centroid function if applicable.
#'
#' @export
#' @exportMethod cvi
#' @importFrom flexclust comPart
#' @importFrom methods setGeneric
#'
#' @param a An object returned by [tsclust()], for crisp partitions a vector that can be coerced to
#' integers which indicate the cluster memberships, or the membership matrix for soft clustering.
#' @param b If needed, a vector that can be coerced to integers which indicate the cluster
#' memberships. The ground truth (if known) should be provided here.
#' @param type Character vector indicating which indices are to be computed. See supported values
#' below.
#' @param ... Arguments to pass to and from other methods.
#' @param log.base Base of the logarithm to be used in the calculation of VI (see details).
#'
#' @details
#'
#' Clustering is commonly considered to be an unsupervised procedure, so evaluating its performance
#' can be rather subjective. However, a great amount of effort has been invested in trying to
#' standardize cluster evaluation metrics by using cluster validity indices (CVIs).
#'
#' In general, CVIs can be either tailored to crisp or fuzzy partitions. CVIs can be classified as
#' internal, external or relative depending on how they are computed. Focusing on the first two, the
#' crucial difference is that internal CVIs only consider the partitioned data and try to define a
#' measure of cluster purity, whereas external CVIs compare the obtained partition to the correct
#' one. Thus, external CVIs can only be used if the ground truth is known.
#'
#' Note that even though a fuzzy partition can be changed into a crisp one, making it compatible
#' with many of the existing crisp CVIs, there are also fuzzy CVIs tailored specifically to fuzzy
#' clustering, and these may be more suitable in those situations. Fuzzy partitions usually have no
#' ground truth associated with them, but there are exceptions depending on the task's goal.
#'
#' Each index defines their range of values and whether they are to be minimized or maximized. In
#' many cases, these CVIs can be used to evaluate the result of a clustering algorithm regardless of
#' how the clustering works internally, or how the partition came to be.
#'
#' Knowing which CVI will work best cannot be determined a priori, so they should be tested for each
#' specific application. Usually, many CVIs are utilized and compared to each other, maybe using a
#' majority vote to decide on a final result. Furthermore, it should be noted that many CVIs perform
#' additional distance calculations when being computed, which can be very considerable if using DTW
#' or GAK.
#'
#' @return The chosen CVIs.
#'
#' @section External CVIs:
#'
#' - Crisp partitions (the first 4 are calculated via [flexclust::comPart()])
#' + `"RI"`: Rand Index (to be maximized).
#' + `"ARI"`: Adjusted Rand Index (to be maximized).
#' + `"J"`: Jaccard Index (to be maximized).
#' + `"FM"`: Fowlkes-Mallows (to be maximized).
#' + `"VI"`: Variation of Information (Meila (2003); to be minimized).
#' - Fuzzy partitions (based on Lei et al. (2017))
#' + `"RI"`: Soft Rand Index (to be maximized).
#' + `"ARI"`: Soft Adjusted Rand Index (to be maximized).
#' + `"VI"`: Soft Variation of Information (to be minimized).
#' + `"NMIM"`: Soft Normalized Mutual Information based on Max entropy (to be maximized).
#'
#' @section Internal CVIs:
#'
#' The indices marked with an exclamation mark (!) calculate (or re-use if already available) the
#' whole distance matrix between the series in the data. If you were trying to avoid this in the
#' first place, then these CVIs might not be suitable for your application.
#'
#' The indices marked with a question mark (?) depend on the extracted centroids, so bear that in
#' mind if a hierarchical procedure was used and/or the centroid function has associated
#' randomness (such as [shape_extraction()] with series of different length).
#'
#' The indices marked with a tilde (~) require the calculation of a global centroid. Since [DBA()]
#' and [shape_extraction()] (for series of different length) have some randomness associated,
#' these indices might not be appropriate for those centroids.
#'
#' - Crisp partitions
#' + `"Sil"` (!): Silhouette index (Rousseeuw (1987); to be maximized).
#' + `"D"` (!): Dunn index (Arbelaitz et al. (2013); to be maximized).
#' + `"COP"` (!): COP index (Arbelaitz et al. (2013); to be minimized).
#' + `"DB"` (?): Davies-Bouldin index (Arbelaitz et al. (2013); to be minimized).
#' + `"DBstar"` (?): Modified Davies-Bouldin index (DB*) (Kim and Ramakrishna (2005); to be
#' minimized).
#' + `"CH"` (~): Calinski-Harabasz index (Arbelaitz et al. (2013); to be maximized).
#' + `"SF"` (~): Score Function (Saitta et al. (2007); to be maximized; see notes).
#' - Fuzzy partitions (using the nomenclature from Wang and Zhang (2007))
#' + `"MPC"`: to be maximized.
#' + `"K"` (~): to be minimized.
#' + `"T"`: to be minimized.
#' + `"SC"` (~): to be maximized.
#' + `"PBMF"` (~): to be maximized (see notes).
#'
#' @section Additionally:
#'
#' - `"valid"`: Returns all valid indices depending on the type of `a` and whether `b` was
#' provided or not.
#' - `"internal"`: Returns all internal CVIs. Only supported for [TSClusters-class] objects.
#' - `"external"`: Returns all external CVIs. Requires `b` to be provided.
#'
#' @note
#'
#' In the original definition of many internal and fuzzy CVIs, the Euclidean distance and a mean
#' centroid was used. **The implementations here change this, making use of whatever
#' distance/centroid was chosen during clustering**. However, some of the CVIs assume that the
#' distances are symmetric, since cross-distance matrices are calculated and only the upper/lower
#' triangulars are considered. A warning will be given if the matrices are not symmetric and the CVI
#' assumes so.
#'
#' Because of the above, calculating CVIs for clusterings made with [TADPole()] is a special case.
#' Since TADPole uses 3 distances during its execution (DTW, LB_Keogh and Euclidean), it is not
#' obvious which one should be used for the calculation of CVIs. Nevertheless, [dtw_basic()] is used
#' by default.
#'
#' The formula for the SF index in Saitta et al. (2007) does not correspond to the one in Arbelaitz
#' et al. (2013). The one specified in the former is used here.
#'
#' The formulas for the Silhouette index are not entirely correct in Arbelaitz et al. (2013), refer
#' to Rousseeuw (1987) for the correct ones.
#'
#' The formulas for the PBMF index are not entirely unambiguous in the literature, the ones given in
#' Lin (2013) are used here.
#'
#' @references
#'
#' Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Perez, J. M., & Perona, I. (2013). An extensive
#' comparative study of cluster validity indices. Pattern Recognition, 46(1), 243-256.
#'
#' Kim, M., & Ramakrishna, R. S. (2005). New indices for cluster validity assessment. Pattern
#' Recognition Letters, 26(15), 2353-2363.
#'
#' Lei, Y., Bezdek, J. C., Chan, J., Vinh, N. X., Romano, S., & Bailey, J. (2017). Extending
#' information-theoretic validity indices for fuzzy clustering. IEEE Transactions on Fuzzy Systems,
#' 25(4), 1013-1018.
#'
#' Lin, H. Y. (2013). Effective Feature Selection for Multi-class Classification Models. In
#' Proceedings of the World Congress on Engineering (Vol. 3).
#'
#' Meila, M. (2003). Comparing clusterings by the variation of information. In Learning theory and
#' kernel machines (pp. 173-187). Springer Berlin Heidelberg.
#'
#' Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of
#' cluster analysis. Journal of computational and applied mathematics, 20, 53-65.
#'
#' Saitta, S., Raphael, B., & Smith, I. F. (2007). A bounded index for cluster validity. In
#' International Workshop on Machine Learning and Data Mining in Pattern Recognition (pp. 174-187).
#' Springer Berlin Heidelberg.
#'
#' Wang, W., & Zhang, Y. (2007). On fuzzy cluster validity indices. Fuzzy sets and systems, 158(19),
#' 2095-2117.
#'
#' @examples
#'
#' cvi(CharTrajLabels, sample(CharTrajLabels), type = c("ARI", "VI"))
#'
setGeneric("cvi", def = function(a, b = NULL, type = "valid", ..., log.base = 10) {
# Only external CVIs for default, S4-TSClusters-methods.R has the internal ones
if (is.null(b))
stop("A second set of cluster membership indices is required in 'b' for this/these CVI(s).")
a <- as.integer(a)
b <- as.integer(b)
if (length(a) != length(b))
stop("External CVIs: the length of 'a' and 'b' must match.") # nocov
type <- match.arg(type, several.ok = TRUE,
choices = c("RI", "ARI", "J", "FM", "VI", "valid", "external"))
if (any(type %in% c("valid", "external"))) type <- c("RI", "ARI", "J", "FM", "VI")
which_flexclust <- type %in% c("RI", "ARI", "J", "FM")
if (any(which_flexclust))
CVIs <- flexclust::comPart(x = a, y = b, type = type[which_flexclust])
else
CVIs <- numeric()
if (any(type == "VI")) {
# Variation of information
# taken from https://github.com/cran/mcclust/blob/master/R/vi.dist.R
# entropy
ent <- function(cl) {
p <- table(cl) / length(cl)
-sum(p * log(p, base = log.base))
}
# mutual information
mi <- function(cl1, cl2) {
p12 <- table(cl1, cl2) / length(cl1)
p1p2 <- outer(table(cl1) / length(cl1), table(cl2) / length(cl2))
sum(p12[p12 > 0] * log(p12[p12 > 0] / p1p2[p12 > 0], base = log.base))
}
VI <- ent(a) + ent(b) - 2 * mi(a, b)
CVIs <- c(CVIs, VI = VI)
}
# return
CVIs
})
# For external fuzzy CVIs
#' @rdname cvi
#' @export
#' @exportMethod cvi
#' @importFrom methods setMethod
#' @importFrom methods signature
#'
setMethod(
"cvi", signature = methods::signature(a = "matrix"),
function(a, b = NULL, type = "valid", ..., log.base = 10) {
if (is.null(b))
stop("A second set of cluster membership indices is required in 'b' for this/these CVI(s).")
dim_b <- dim(b)
b <- as.integer(b)
dim(b) <- dim_b
if (is.null(dim_b)) {
if (nrow(a) != length(b))
stop("External CVIs: 'a'-rows and 'b'-length must match.") # nocov
temp <- matrix(0L, nrow(a), max(b))
temp[cbind(1L:nrow(temp), b)] <- 1L
b <- temp
}
type <- match.arg(type, several.ok = TRUE,
choices = c("ARI", "RI", "VI", "NMIM", "valid", "external"))
if (any(type %in% c("valid", "external"))) type <- c("ARI", "RI", "VI", "NMIM")
num_objects <- nrow(a)
contingency_table <- t(a) %*% b
if (any(c("ARI", "RI") %in% type)) {
total_sum <- sum(contingency_table)
squared_sum <- sum(contingency_table ^ 2L)
row_sum <- sum(base::rowSums(contingency_table) ^ 2L)
col_sum <- sum(base::colSums(contingency_table) ^ 2L)
pairs_in_both <- (squared_sum - total_sum) / 2
pairs_in_neither <- (num_objects ^ 2L + squared_sum - row_sum - col_sum) / 2
just_in_a <- (col_sum - squared_sum) / 2
just_in_b <- (row_sum - squared_sum) / 2
}
if (any(c("VI", "NMIM") %in% type)) {
joint_distribution <- contingency_table / num_objects
joint_entropy <- -sum(joint_distribution * log(joint_distribution + .Machine$double.eps,
base = log.base))
jdx <- base::rowSums(joint_distribution)
jdy <- base::colSums(joint_distribution)
entropy_x <- -sum(jdx * log(jdx + .Machine$double.eps, base = log.base))
entropy_y <- -sum(jdy * log(jdy + .Machine$double.eps, base = log.base))
mutual_information <- entropy_x + entropy_y - joint_entropy
}
CVIs <- sapply(type, function(CVI) {
switch(EXPR = CVI,
# -------------------------------------------------------------------------------
"ARI" = {
sum_all <- pairs_in_both + pairs_in_neither + just_in_a + just_in_b
both_and_a <- pairs_in_both + just_in_a
both_and_b <- pairs_in_both + just_in_b
temp_sum <- both_and_a + both_and_b
temp_mul <- both_and_a * both_and_b
(pairs_in_both - temp_mul / sum_all) / (0.5 * temp_sum - temp_mul / sum_all)
},
# -------------------------------------------------------------------------------
"RI" = {
(pairs_in_both + pairs_in_neither) /
(pairs_in_both + pairs_in_neither + just_in_a + just_in_b)
},
# -------------------------------------------------------------------------------
"VI" = {
entropy_x + entropy_y - 2 * mutual_information
},
# -------------------------------------------------------------------------------
"NMIM" = {
mutual_information / max(entropy_x, entropy_y)
})
})
# return
CVIs
}
)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.