View source: R/compare_bioregionalizations.R
compare_bioregionalizations | R Documentation |
This function computes pairwise comparisons for several
bioregionalizations, usually outputs from the netclu_, hclu_, or nhclu_
functions. It also provides the confusion matrix from pairwise comparisons,
enabling the user to compute additional comparison metrics.
compare_bioregionalizations(
  bioregionalizations,
  indices = c("rand", "jaccard"),
  cor_frequency = FALSE,
  store_pairwise_membership = TRUE,
  store_confusion_matrix = TRUE
)
bioregionalizations |
A |
indices |
|
cor_frequency |
A |
store_pairwise_membership |
A |
store_confusion_matrix |
A |
This function operates in two main steps:
Within each bioregionalization, the function compares all pairs of items
and documents whether they are clustered together (TRUE) or separately
(FALSE). For example, if site 1 and site 2 are in the same cluster in
bioregionalization 1, their pairwise membership site1_site2 will be TRUE.
This output is stored in the pairwise_membership slot if
store_pairwise_membership = TRUE.
Across all bioregionalizations, the function compares their pairwise memberships to determine similarity. For each pair of bioregionalizations, it computes a confusion matrix with the following elements:
a: Number of item pairs grouped in both bioregionalizations.
b: Number of item pairs grouped in the first but not in the second
bioregionalization.
c: Number of item pairs grouped in the second but not in the first
bioregionalization.
d: Number of item pairs not grouped in either bioregionalization.
The confusion matrix is stored in confusion_matrix if
store_confusion_matrix = TRUE.
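The two steps described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the cluster labels, the helper pairwise_membership() and the object names bioreg1, bioreg2, pw1, pw2 and confusion are all hypothetical.

```r
# Two hypothetical bioregionalizations of the same five sites
# (cluster labels are made up for illustration)
bioreg1 <- c(site1 = 1, site2 = 1, site3 = 2, site4 = 2, site5 = 3)
bioreg2 <- c(site1 = "a", site2 = "a", site3 = "a", site4 = "b", site5 = "b")

# Step 1: for every pair of items, record whether both items share a cluster
# (hypothetical helper, not part of the package)
pairwise_membership <- function(clusters) {
  pairs <- combn(names(clusters), 2)
  together <- clusters[pairs[1, ]] == clusters[pairs[2, ]]
  names(together) <- paste(pairs[1, ], pairs[2, ], sep = "_")
  together
}

pw1 <- pairwise_membership(bioreg1)
pw2 <- pairwise_membership(bioreg2)

# Step 2: count the four possible outcomes over all item pairs
confusion <- c(
  a = sum(pw1 & pw2),    # grouped in both
  b = sum(pw1 & !pw2),   # grouped in the first only
  c = sum(!pw1 & pw2),   # grouped in the second only
  d = sum(!pw1 & !pw2)   # grouped in neither
)
confusion
```

With five sites there are choose(5, 2) = 10 item pairs, so the four counts always sum to 10 in this toy case.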
Based on these confusion matrices, various indices can be computed to measure agreement among bioregionalizations. The currently implemented indices are:
Rand index: (a + d) / (a + b + c + d)
Measures agreement by considering both grouped and ungrouped item pairs.
Jaccard index: a / (a + b + c)
Measures agreement based only on grouped item pairs.
These indices are complementary: the Jaccard index evaluates clustering similarity, while the Rand index considers both clustering and separation. For example, if two bioregionalizations never group the same pairs, their Jaccard index will be 0, but their Rand index may be > 0 due to ungrouped pairs.
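As a small worked illustration of the two formulas, the sketch below applies them to a hand-written confusion matrix in which no pair is grouped in both bioregionalizations. The helper names rand_index() and jaccard_index() and the counts are hypothetical; they are not functions exported by the package.

```r
# Hypothetical helpers implementing the two formulas given above
rand_index <- function(cm) {
  (cm[["a"]] + cm[["d"]]) / (cm[["a"]] + cm[["b"]] + cm[["c"]] + cm[["d"]])
}
jaccard_index <- function(cm) {
  cm[["a"]] / (cm[["a"]] + cm[["b"]] + cm[["c"]])
}

# Made-up counts: the two bioregionalizations never group the same pair (a = 0)
cm <- c(a = 0, b = 2, c = 3, d = 5)

jaccard_index(cm)  # 0 / (0 + 2 + 3) = 0
rand_index(cm)     # (0 + 5) / 10 = 0.5
```

Note that the Jaccard formula is undefined (0 / 0) when a, b and c are all zero, that is, when no pair is grouped in either bioregionalization.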
Users can compute additional indices manually using the list of confusion matrices.
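For instance, the Fowlkes-Mallows index, a / sqrt((a + b) * (a + c)), could be derived from the same counts. The sketch below assumes each confusion matrix can be read as a named vector with elements a, b, c and d; the list confusion_list, its element names and the helper fowlkes_mallows() are hypothetical, and the exact structure of the confusion_matrix slot should be checked on a real result.

```r
# Hypothetical list of confusion matrices, one per pair of bioregionalizations
# (assumption: the real confusion_matrix slot may be structured differently)
confusion_list <- list(
  bioreg1_vs_bioreg2 = c(a = 1, b = 1, c = 3, d = 5),
  bioreg1_vs_bioreg3 = c(a = 1, b = 1, c = 2, d = 6)
)

# Fowlkes-Mallows index from pair counts (hypothetical helper)
fowlkes_mallows <- function(cm) {
  cm[["a"]] / sqrt((cm[["a"]] + cm[["b"]]) * (cm[["a"]] + cm[["c"]]))
}

# One value per pairwise comparison
fm <- sapply(confusion_list, fowlkes_mallows)
fm
```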
To identify which bioregionalization is most representative of the others,
the function can compute the correlation between the pairwise membership of
each bioregionalization and the total frequency of pairwise membership across
all bioregionalizations. This is enabled by setting cor_frequency = TRUE.
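The idea behind this correlation can be sketched in base R as follows. This is an illustration only: the toy data.frame clusters and the object names are hypothetical, and because this page does not state which correlation coefficient the function uses, the default Pearson coefficient of cor() is an assumption.

```r
# Hypothetical bioregionalizations: rows are sites, columns are bioregionalizations
clusters <- data.frame(
  bioreg1 = c(1, 1, 2, 2, 3),
  bioreg2 = c(1, 1, 1, 2, 2),
  bioreg3 = c(1, 2, 2, 2, 3),
  row.names = paste0("site", 1:5)
)

# Pairwise membership: one row per item pair, one column per bioregionalization
pairs <- combn(nrow(clusters), 2)
pw <- sapply(clusters, function(cl) cl[pairs[1, ]] == cl[pairs[2, ]])
rownames(pw) <- paste(rownames(clusters)[pairs[1, ]],
                      rownames(clusters)[pairs[2, ]], sep = "_")

# Number of bioregionalizations in which each pair is grouped together
freq <- rowSums(pw)

# Correlation of each bioregionalization with the overall frequency
# (assumption: Pearson; the package may use another coefficient)
freq_cor <- apply(pw, 2, function(x) cor(as.numeric(x), freq))
freq_cor

# The bioregionalization with the highest correlation is, in this sense,
# the most representative of the set
names(which.max(freq_cor))
```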
A list containing 4 to 7 elements:
args: A list of user-provided arguments.
inputs: A list containing information on the input bioregionalizations,
such as the number of items clustered.
pairwise_membership (optional): If store_pairwise_membership = TRUE,
a boolean matrix where TRUE indicates two items are in the same cluster,
and FALSE indicates they are not.
freq_item_pw_membership: A numeric vector containing the number of
times each item pair is clustered together, corresponding to the sum of rows
in pairwise_membership.
bioregionalization_freq_cor (optional): If cor_frequency = TRUE,
a numeric vector of correlations between individual bioregionalizations
and the total frequency of pairwise membership.
confusion_matrix (optional): If store_confusion_matrix = TRUE,
a list of confusion matrices for each pair of bioregionalizations.
bioregionalization_comparison: A data.frame containing comparison
results, where the first column indicates the bioregionalizations compared,
and the remaining columns contain the requested indices.
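The elements listed above might be inspected as in the following sketch. It assumes the bioregion package is installed and that a plain data.frame of cluster labels (rows as items, columns as bioregionalizations) is accepted as input, as in the example further down this page; the toy data.frame and its labels are hypothetical.

```r
# Runs only if the bioregion package is available
if (requireNamespace("bioregion", quietly = TRUE)) {
  # Hypothetical cluster assignments for six sites
  toy <- data.frame(
    Kmeans   = c("1", "1", "2", "2", "3", "3"),
    Greedy   = c("1", "1", "1", "2", "2", "2"),
    Walktrap = c("1", "2", "2", "2", "3", "3"),
    row.names = paste0("Site", 1:6)
  )

  res <- bioregion::compare_bioregionalizations(
    toy,
    indices = c("rand", "jaccard"),
    cor_frequency = TRUE
  )

  names(res)                          # elements actually returned
  res$bioregionalization_comparison   # one row per pair compared
  res$bioregionalization_freq_cor     # present because cor_frequency = TRUE
}
```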
Boris Leroy (leroy.boris@gmail.com)
Maxime Lenormand (maxime.lenormand@inrae.fr)
Pierre Denelle (pierre.denelle@gmail.com)
For more details illustrated with a practical example, see the vignette: https://biorgeo.github.io/bioregion/articles/a5_2_compare_bioregionalizations.html.
Associated functions: bioregionalization_metrics
# We here compare three different bioregionalizations
comat <- matrix(sample(0:1000, size = 500, replace = TRUE, prob = 1/1:1001),
                20, 25)
rownames(comat) <- paste0("Site",1:20)
colnames(comat) <- paste0("Species",1:25)
dissim <- dissimilarity(comat, metric = "Simpson")
bioregion1 <- nhclu_kmeans(dissim, n_clust = 3, index = "Simpson")
net <- similarity(comat, metric = "Simpson")
bioregion2 <- netclu_greedy(net)
bioregion3 <- netclu_walktrap(net)
# Make one single data.frame with the bioregionalizations to compare
compare_df <- merge(bioregion1$clusters, bioregion2$clusters, by = "ID")
compare_df <- merge(compare_df, bioregion3$clusters, by = "ID")
# bioregion1 comes from nhclu_kmeans(), so its column is labelled "Kmeans"
colnames(compare_df) <- c("Site", "Kmeans", "Greedy", "Walktrap")
rownames(compare_df) <- compare_df$Site
compare_df <- compare_df[, c("Kmeans", "Greedy", "Walktrap")]
# Running the function
compare_bioregionalizations(compare_df)
# Find out which bioregionalizations are most representative
compare_bioregionalizations(compare_df,
                            cor_frequency = TRUE)