Nothing
#' Compare cluster memberships among multiple bioregionalizations
#'
#' This function computes pairwise comparisons for several
#' bioregionalizations, usually outputs from `netclu_`, `hclu_`, or `nhclu_`
#' functions. It also provides the confusion matrix from pairwise comparisons,
#' enabling the user to compute additional comparison metrics.
#'
#' @param bioregionalizations A `data.frame` object where each row corresponds to
#' a site, and each column to a bioregionalization.
#'
#' @param indices `NULL` or `character`. Indices to compute for the pairwise
#' comparison of bioregionalizations. Currently available metrics are `"rand"`
#' and `"jaccard"`.
#'
#' @param cor_frequency A `boolean`. If `TRUE`, computes the correlation
#' between each bioregionalization and the total frequency of co-membership of
#' items across all bioregionalizations. This is useful for identifying which
#' bioregionalization(s) is(are) most representative of all computed
#' bioregionalizations.
#'
#' @param store_pairwise_membership A `boolean`. If `TRUE`, stores the pairwise
#' membership of items in the output object.
#'
#' @param store_confusion_matrix A `boolean`. If `TRUE`, stores the confusion
#' matrices of pairwise bioregionalization comparisons in the output object.
#'
#' @details
#' This function operates in two main steps:
#'
#' 1. Within each bioregionalization, the function compares all pairs of items
#' and documents whether they are clustered together (`TRUE`) or separately
#' (`FALSE`). For example, if site 1 and site 2 are clustered in the same
#' cluster in bioregionalization 1, their pairwise membership `site1_site2`
#' will be `TRUE`. This output is stored in the `pairwise_membership` slot if
#' `store_pairwise_membership = TRUE`.
#'
#' 2. Across all bioregionalizations, the function compares their pairwise
#' memberships to determine similarity. For each pair of bioregionalizations,
#' it computes a confusion matrix with the following elements:
#' * `a`: Number of item pairs grouped in both bioregionalizations.
#' * `b`: Number of item pairs grouped in the first but not in the second
#' bioregionalization.
#' * `c`: Number of item pairs grouped in the second but not in the first
#' bioregionalization.
#' * `d`: Number of item pairs not grouped in either bioregionalization.
#'
#' The confusion matrix is stored in `confusion_matrix` if
#' `store_confusion_matrix = TRUE`.
#'
#' Based on these confusion matrices, various indices can be computed to
#' measure agreement among bioregionalizations. The currently implemented
#' indices are:
#'
#' * **Rand index**: `(a + d) / (a + b + c + d)`
#' Measures agreement by considering both grouped and ungrouped item pairs.
#'
#' * **Jaccard index**: `a / (a + b + c)`
#' Measures agreement based only on grouped item pairs.
#'
#' These indices are complementary: the Jaccard index evaluates clustering
#' similarity, while the Rand index considers both clustering and separation.
#' For example, if two bioregionalizations never group the same pairs, their
#' Jaccard index will be 0, but their Rand index may be > 0 due to ungrouped
#' pairs.
#'
#' Users can compute additional indices manually using the list of confusion
#' matrices.
#'
#' To identify which bioregionalization is most representative of the others,
#' the function can compute the correlation between the pairwise membership of
#' each bioregionalization and the total frequency of pairwise membership across
#' all bioregionalizations. This is enabled by setting `cor_frequency = TRUE`.
#'
#' @return
#' A `list` containing 4 to 7 elements:
#'
#' \enumerate{
#' \item{**args**: A `list` of user-provided arguments.}
#' \item{**inputs**: A `list` containing information on the input
#' bioregionalizations, such as the number of items clustered.}
#' \item{**pairwise_membership** (optional): If `store_pairwise_membership = TRUE`,
#' a `boolean matrix` where `TRUE` indicates two items are in the same cluster,
#' and `FALSE` indicates they are not.}
#' \item{**freq_item_pw_membership**: A `numeric vector` containing the number of
#' times each item pair is clustered together, corresponding to the sum of rows
#' in `pairwise_membership`.}
#' \item{**bioregionalization_freq_cor** (optional): If `cor_frequency = TRUE`,
#' a `numeric vector` of correlations between individual bioregionalizations
#' and the total frequency of pairwise membership.}
#' \item{**confusion_matrix** (optional): If `store_confusion_matrix = TRUE`,
#' a `list` of confusion matrices for each pair of bioregionalizations.}
#' \item{**bioregionalization_comparison**: A `data.frame` containing comparison
#' results, where the first column indicates the bioregionalizations compared,
#' and the remaining columns contain the requested `indices`.}}
#'
#' @seealso
#' For more details illustrated with a practical example,
#' see the vignette:
#' \url{https://biorgeo.github.io/bioregion/articles/a5_2_compare_bioregionalizations.html}.
#'
#' Associated functions:
#' [bioregionalization_metrics]
#'
#' @author
#' Boris Leroy (\email{leroy.boris@gmail.com}) \cr
#' Maxime Lenormand (\email{maxime.lenormand@inrae.fr}) \cr
#' Pierre Denelle (\email{pierre.denelle@gmail.com})
#'
#' @examples
#' # We here compare three different bioregionalizations
#' comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
#' 20, 25)
#' rownames(comat) <- paste0("Site",1:20)
#' colnames(comat) <- paste0("Species",1:25)
#'
#' dissim <- dissimilarity(comat, metric = "Simpson")
#' bioregion1 <- nhclu_kmeans(dissim, n_clust = 3, index = "Simpson")
#'
#' net <- similarity(comat, metric = "Simpson")
#' bioregion2 <- netclu_greedy(net)
#' bioregion3 <- netclu_walktrap(net)
#'
#' # Make one single data.frame with the bioregionalizations to compare
#' compare_df <- merge(bioregion1$clusters, bioregion2$clusters, by = "ID")
#' compare_df <- merge(compare_df, bioregion3$clusters, by = "ID")
#' colnames(compare_df) <- c("Site", "Hclu", "Greedy", "Walktrap")
#' rownames(compare_df) <- compare_df$Site
#' compare_df <- compare_df[, c("Hclu", "Greedy", "Walktrap")]
#'
#' # Running the function
#' compare_bioregionalizations(compare_df)
#'
#' # Find out which bioregionalizations are most representative
#' compare_bioregionalizations(compare_df,
#' cor_frequency = TRUE)
#'
#' @export
compare_bioregionalizations <- function(bioregionalizations,
indices = c("rand", "jaccard"),
cor_frequency = FALSE,
store_pairwise_membership = TRUE,
store_confusion_matrix = TRUE){
# input can be of format bioregion.clusters
if (inherits(bioregionalizations, "bioregion.clusters")) {
if (inherits(bioregionalizations$clusters, "data.frame")) {
has.clusters <- TRUE
clusters <- bioregionalizations$clusters
} else {
if (bioregionalizations$name == "hclu_hierarclust") {
stop(paste0("No clusters have been generated for your hierarchical ",
"tree, please extract clusters from the tree before using ",
"bioregionalization_metrics().\n",
"See ?hclu_hierarclust or ?cut_tree"),
call. = FALSE)
} else {
stop(paste0("bioregionalizations does not have the expected type of ",
"'clusters' slot"),
call. = FALSE)
}
}
# or data.frame (next row)
} else if (inherits(bioregionalizations, "data.frame")) {
controls(data = bioregionalizations, type = "input_data_frame")
clusters <- bioregionalizations
# or list of data.frames
} else if (inherits(bioregionalizations, "list")) {
# test that all elements of the list are data.frames
test <- lapply(bioregionalizations, FUN = function(x) {
try(controls(data = x, type = "input_data_frame"), silent = TRUE)})
if (any(vapply(test, function(x) inherits(x, "try-error"),
FUN.VALUE = logical(1)))) {
stop(paste0("All elements in bioregionalizations should be ",
"of data.frame format"),
call. = FALSE)
# ensure that the number of rows are identical among df of the list
} else if (length(unique(vapply(bioregionalizations,
nrow,
FUN.VALUE = numeric(1)))) == 1){
clusters <- data.frame(bioregionalizations)
}
# if none of the above then stop
} else {
stop(paste0("This function is designed to work either on ",
"bioregion.clusters objects, on a data.frame or on a list of ",
"data.frames"),
call. = FALSE)
}
if(ncol(clusters) == 1) {
stop(paste0("This function is designed to be applied on multiple ",
"bioregionalizations."),
call. = FALSE)
}
if(!is.null(indices)) {
controls(args = indices, type = "character_vector")
if(!isTRUE(unique(indices %in% c("rand", "jaccard")))) {
stop(paste0("Please choose indices from the following:\n",
"rand or jaccard."),
call. = FALSE)
}
}
controls(args = cor_frequency, type = "boolean")
controls(args = store_pairwise_membership, type = "boolean")
controls(args = store_confusion_matrix, type = "boolean")
message(Sys.time(),
" - Computing pairwise membership comparisons for each ",
"bioregionalization...\n")
# if(!is.null(sample_items)) {
# if(ncol(clusters) * (nrow(clusters) * (nrow(clusters) - 1)) / 2 > 10e6) {
# message(" /!\\\ NOTE: Very high number of comparisons ",
# "(", nrow(clusters), " rows in clusters --> ",
# (nrow(clusters) * (nrow(clusters) - 1)) / 2,
# " pairwise comparisons per bioregionalization --> ",
# ncol(clusters) * (nrow(clusters) * (nrow(clusters) - 1)) / 2,
# " comparisons in total across all bioregionalizations).\n ",
# "If the computation ",
# "time is too long, consider using argument sample_items and",
# " set it to a reasonable number (e.g. 5000).")
# }
# }
#
# WITHIN bioregionalization - Pairwise membership comparison ----------------
# For each bioregionalization, compare all pairs of items to detect if they
# are in the same cluster or not
item_pw_mb <- get_pairwise_membership(
clusters)
# BETWEEN bioregionalizationS - Pairwise bioregionalization comparison ------
message(Sys.time(),
" - Comparing memberships among bioregionalizations...\n")
# Prepare the pairwise bioregionalization comparisons
partnames <- pw_cluster_comps <- t(utils::combn(seq_len(ncol(clusters)), 2))
# Compute confusion matrices for all pairs of bioregionalizations
all_conf_matrices <- lapply(seq_len(nrow(pw_cluster_comps)), function(x) {
get_confusion_matrix(item_pw_mb[, pw_cluster_comps[x, 1]],
item_pw_mb[, pw_cluster_comps[x, 2]])
})
# partnames gets the names of each bioregionalization
partnames[] <- colnames(clusters)[pw_cluster_comps]
names(all_conf_matrices) <- apply(partnames, 1, paste, collapse = "%")
# Item membership frequency -------------------------------------------------
item_pw_mb_freq <- rowSums(item_pw_mb)
# Prepare data.frame comparing bioregionalizations --------------------------
partcomp_indices <- data.frame(
bioregionalization_comparison = names(all_conf_matrices))
# Rand index ----------------------------------------------------------------
if("rand" %in% indices) {
message(Sys.time(),
" - Computing Rand index...\n")
partcomp_indices$rand <- vapply(all_conf_matrices,
rand_index,
numeric(1))
}
# Jaccard index -------------------------------------------------------------
if("jaccard" %in% indices) {
message(Sys.time(),
" - Computing Jaccard index...\n")
partcomp_indices$jaccard <- vapply(all_conf_matrices,
jaccard_index,
numeric(1))
}
# Point biserial correlation -----------------------------------------------
if(cor_frequency) {
message(
Sys.time(),
" - Computing the correlation between each bioregionalization and the",
" vector of frequency of pairwise membership...\n")
bioregionalization_freq_cor <- as.vector(suppressWarnings(
cor(item_pw_mb, item_pw_mb_freq)))
bioregionalization_freq_cor[is.na(bioregionalization_freq_cor)] <- 0
names(bioregionalization_freq_cor) <- colnames(clusters)
}
# Store outputs -----------------------------------------------------------
outputs <- list(
args = list(indices = indices,
cor_frequency = cor_frequency,
store_pairwise_membership = store_pairwise_membership,
store_confusion_matrix = store_confusion_matrix))
outputs$inputs <- c(number_items = nrow(clusters),
number_bioregionalizations = ncol(clusters))
if(store_pairwise_membership) {
outputs$pairwise_membership <- item_pw_mb
}
outputs$freq_item_pw_membership <- item_pw_mb_freq
if(cor_frequency) {
outputs$bioregionalization_freq_cor <- bioregionalization_freq_cor
}
if(store_confusion_matrix) {
outputs$confusion_matrix <- all_conf_matrices
}
outputs$bioregionalization_comparison <- partcomp_indices
class(outputs) <- append("bioregion.bioregionalization.comparison", class(outputs))
return(outputs)
}
# Get pairwise membership -----------------------------------------------------
get_pairwise_membership <- function(input_clusters) {
# Number of rows (items) and columns (iterations or clustering solutions)
n_items <- nrow(input_clusters)
n_bioregionalizations <- ncol(input_clusters)
# Initialize matrix to store pairwise membership results
pw_membership <- matrix(FALSE, nrow = choose(n_items, 2),
ncol = n_bioregionalizations)
# Get all unique pairwise comparisons indices
pairwise_comps <- t(utils::combn(seq_len(n_items), 2))
# Compute pairwise membership
for (col in seq_len(n_bioregionalizations)) {
# Extract cluster memberships for this bioregionalization
cluster_col <- input_clusters[, col]
# Compare memberships directly for each pair of items
pw_membership[, col] <- cluster_col[pairwise_comps[, 1]] ==
cluster_col[pairwise_comps[, 2]]
}
# Set row names based on the pairwise comparisons (optional)
rownames(pw_membership) <- apply(pairwise_comps, 1, paste, collapse = "_")
colnames(pw_membership) <- colnames(input_clusters)
return(pw_membership)
}
# Confusion matrix ------------------------------------------------------------
get_confusion_matrix <- function(clust1, clust2) {
return(c(a = length(which(clust1 & clust2)), # TP
b = length(which(clust1 & !clust2)),
c = length(which(!clust1 & clust2)),
d = length(which(!clust1 & !clust2)))) # TN
}
# Rand index ------------------------------------------------------------------
rand_index <- function(confusion_matrix) {
RI <- (confusion_matrix["a"] +
confusion_matrix["d"]) /
(confusion_matrix["a"] + confusion_matrix["b"] + confusion_matrix["c"] +
confusion_matrix["d"])
names(RI) <- "rand"
return(RI)
}
# Jaccard index ---------------------------------------------------------------
jaccard_index <- function(confusion_matrix) {
JI <- confusion_matrix["a"] /
(confusion_matrix["a"] + confusion_matrix["b"] + confusion_matrix["c"])
names(JI) <- "jaccard"
return(JI)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.