'CluReAL' is a port of the Python implementation of the algorithm
'CluReAL.v2', which is designed to improve an existing clustering solution by
splitting multimodal clusters, merging akin clusters, and marking tiny
or low-density clusters as outliers (noise). Additionally, symbolic key
ideograms can be created to interpret clusters in high-dimensional space.
The approach is described in detail in the article by Iglesias et al. (2021):
https://doi.org/10.1007/s41060-021-00275-z.
You can install the development version from GitHub with:
    # install.packages("remotes")
    remotes::install_github("unmnn/CluReAL")
A typical CluReAL workflow:
1. Create the cluster context by calling cluster_context(x, y), where the matrix x represents the dataset with m observations and n dimensions, and y is the integer vector of length m containing the cluster membership indices (-1 represents noise). The output is a list with the following elements:
   - k: number of clusters (integer)
   - centroids: centroid coordinates (k x n matrix)
   - mass: cluster sizes (integer vector of length k)
   - mn_da: mean Euclidean distances of the cluster members to their centroid (double vector of length k)
   - md_da: median Euclidean distances of the cluster members to their centroid (double vector of length k)
   - sd_da: standard deviation of the Euclidean distances of the cluster members to their centroid (double vector of length k)
   - de: Euclidean distance matrix for the centroids (k x k matrix)
   - outliers: number of outliers (integer)
2. Compute the cluster validity measures by calling gval(cc) using the output object from step 1. The output is a list with the following elements:
   - g_str: strict G-index (double)
   - g_rex: relaxed G-index (double)
   - g_min: min G-index (double)
   - oi_st: cluster-individual strict overlap indices (double vector of length k)
   - oi_rx: cluster-individual relaxed overlap indices (double vector of length k)
   - oi_mn: cluster-individual min overlap indices (double vector of length k)
   - ext_r: extended cluster radii (double vector of length k)
   - str_r: strict cluster radii (double vector of length k)
   - vol_r: extended-to-core ratio
3. Compute the refinement context by calling refinement_context(x, y, cc, gv). The output is a list with the following elements:
   - mm: cluster multimodality flag (logical vector of length k)
   - k_dens: cluster-individual relative density (double vector of length k)
   - global_c_dens: global density (double)
   - kinship: cluster kinship matrix (k x k matrix) with the codes 0 = itself, 1 = parent and child, 2 = relatives, 3 = close friends, 4 = acquaintances, 5 = unrelated
4. Refine the clustering by calling refine(x, y, cc, gv, rc). The output is a list with the following elements:
   - y: vector of the refined cluster membership indices
   - cc: refined cluster context

Load all required packages:
    library(CluReAL)
    # install.packages("dplyr")
    # install.packages("tidyr")
    # install.packages("ggplot2")
    # install.packages("palmerpenguins")
    # install.packages("patchwork")
    library(dplyr, warn.conflicts = FALSE)
    library(ggplot2)
    library(patchwork)
We perform kmeans clustering on the Palmer penguins dataset using the variables flipper_length_mm and bill_length_mm. We min-max normalize the variables to unify their range.
    peng <- palmerpenguins::penguins %>%
      tidyr::drop_na() %>%
      mutate(across(
        c(flipper_length_mm, bill_length_mm),
        ~ (.x - min(.x)) / (max(.x) - min(.x))
      )) %>%
      select(flipper = flipper_length_mm, bill = bill_length_mm, species)

    ggplot(peng, aes(x = flipper, y = bill, color = species)) +
      geom_point()
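The min-max normalization used above can also be written as a small base R helper. This is only a minimal sketch: `min_max` is a hypothetical name (not part of CluReAL), and the measurements are made-up toy values rather than rows from the penguins data.

```r
# Min-max normalization: the smallest value maps to 0, the largest to 1.
# `min_max` is a hypothetical helper, not a CluReAL function.
min_max <- function(v) {
  rng <- range(v, na.rm = TRUE)
  (v - rng[1]) / (rng[2] - rng[1])
}

# Toy measurements on very different scales (illustrative values only)
flipper <- c(181, 195, 210, 231)
bill <- c(32.1, 39.5, 46.0, 59.6)

flipper_scaled <- min_max(flipper)
bill_scaled <- min_max(bill)

# Both variables now span the same range
range(flipper_scaled)
range(bill_scaled)
```

Without this step, the variable with the larger numeric range would dominate the Euclidean distances that kmeans relies on. Note that the helper divides by zero for a constant vector, so such a column would need separate handling.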
Here, we deliberately call kmeans with a cluster count that is too high.
    set.seed(1)
    clustering <- kmeans(peng[c("flipper", "bill")], centers = 6)
    peng <- peng %>%
      mutate(cluster = as.factor(clustering$cluster))

    p1 <- ggplot(peng, aes(x = flipper, y = bill, color = cluster)) +
      geom_point() +
      labs(title = "Before refining")
    p1
We perform the four steps of CluReAL as described above and compare the clustering solution before and after refining.
    # Step 1: compute the cluster context
    x <- as.matrix(peng[c("flipper", "bill")])
    y <- clustering$cluster
    cc <- cluster_context(x, y)
    cc

    # Step 2: compute the cluster validity indices
    gv <- gval(cc)
    gv

    # Step 3: compute the refinement context
    rc <- refinement_context(x, y, cc, gv)

    # Step 4: refine the clustering
    rf <- refine(x, y, cc, gv, rc)
    rf

    peng <- peng %>%
      mutate(c_refined = as.factor(rf$y))

    p2 <- ggplot(peng, aes(x = flipper, y = bill, color = c_refined)) +
      geom_point() +
      labs(title = "After refining")
    p1 + p2
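To make some of the cluster context elements from step 1 more concrete, the following base R sketch computes comparable summaries by hand on a tiny toy dataset. It is an illustration based only on the element descriptions in this README; it is not the package's implementation, and cluster_context() may differ in details. The toy matrix and labels are invented for the example.

```r
# Toy data: 6 points in 2 dimensions, two clusters plus one noise point (-1)
x <- matrix(
  c( 0,  0,
     0,  2,
     2,  0,
    10, 10,
    10, 12,
     5, 50),
  ncol = 2, byrow = TRUE
)
y <- c(1L, 1L, 1L, 2L, 2L, -1L)

# Cluster labels without the noise label
labels <- sort(unique(y[y != -1]))
k <- length(labels)

# Centroids: column means of each cluster's members (k x n matrix)
centroids <- t(sapply(labels, function(l) colMeans(x[y == l, , drop = FALSE])))

# Mass: number of members per cluster
mass <- sapply(labels, function(l) sum(y == l))

# Mean Euclidean distance of the members to their own centroid
mn_da <- sapply(seq_len(k), function(i) {
  members <- x[y == labels[i], , drop = FALSE]
  mean(sqrt(rowSums(sweep(members, 2, centroids[i, ])^2)))
})

# Euclidean distances between the centroids (k x k matrix)
de <- as.matrix(dist(centroids))

# Number of observations labeled as noise
outliers <- sum(y == -1)
```

Comparing these hand-computed values with the corresponding elements of `cc` on the same input is a quick way to check your reading of the output.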
Draw the clustering solution ideogram before and after refining.
    ideo_before <- draw_symbol(cc, gv, rc)

    y_r <- rf$y
    cc_r <- rf$cc
    gv_r <- gval(cc_r)
    rc_r <- refinement_context(x, y_r, cc_r, gv_r)
    ideo_after <- draw_symbol(cc_r, gv_r, rc_r)

    ideo_before + ideo_after
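Besides the plots, a cross-tabulation of the labels before and after refining shows which clusters were merged or marked as noise. The sketch below uses hypothetical label vectors in place of `y` and `rf$y`, and it assumes that the refined labels use the same -1 convention for noise as the input.

```r
# Hypothetical label vectors standing in for `y` and `rf$y`
y_before <- c(1, 1, 2, 2, 3, 3, 4, 4)
y_after  <- c(1, 1, 1, 1, 2, 2, 2, -1)

# Rows: original clusters, columns: refined clusters
cross <- table(before = y_before, after = y_after)
cross

# Number of clusters, excluding the noise label -1 (assumed convention)
n_before <- length(setdiff(unique(y_before), -1))
n_after  <- length(setdiff(unique(y_after), -1))

# Number of observations marked as noise after refining
n_noise <- sum(y_after == -1)
```

In this toy case, rows 1 and 2 both fall into refined cluster 1, which indicates a merge, while one member of original cluster 4 ends up in the -1 column.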
Interpretation:
- g_str < 0, g_rex < 0
- g_str < 0, g_rex > 1