Icluster

View source: R/Icluster.R

Description

This function performs agglomerative hierarchical clustering of the columns (variables) of a dataset, using estimated multivariate similarities (dependence coefficients) between random vectors.

Usage

Icluster(
  data,
  est_method,
  max_dim = Inf,
  norm = NULL,
  link = "average",
  trace = 1
)

Arguments

data

The dataset (n \times q matrix with observations in rows, variables in columns) whose columns need to be clustered.

est_method

A list specifying the method used to estimate the similarity between two clusters of variables (see Details for all options).

max_dim

The maximum total dimension d_{1} + d_{2} up to which the similarity between two clusters is computed directly; beyond this dimension, the link function is used instead (default = Inf).

norm

An optional normalization function applied to the dependence measure (default = NULL, meaning no normalization).

link

The link function used when the total dimension of two clusters exceeds max_dim; one of "average", "single", or "complete" (default = "average").

trace

Controls the verbosity of the output (default = 1, which prints progress).

Details

Suppose that the q variables (of which we have n observations in data) are \mathcal{S} = \{X_{1}, \dots, X_{q}\}. The key step in hierarchical variable clustering is computing the similarity

\mathcal{D}(\mathbb{X},\mathbb{Y})

between two disjoint subsets of variables \mathbb{X}, \mathbb{Y} \subset \mathcal{S}. In particular, the main algorithm is as follows:

  • Initially, each variable \{X_{i}\} forms its own cluster, i.e., \aleph_{1} = \{\{X_{1}\},\dots,\{X_{q}\}\} is the starting partition.

Then, for i = 1,2,\dots,q-1:

  • For each pair of disjoint clusters \mathbb{X},\mathbb{Y} \in \aleph_{i}, compute the similarity \mathcal{D}(\mathbb{X},\mathbb{Y}).

  • Define \aleph_{i+1} = (\aleph_{i} \setminus \{\widetilde{\mathbb{X}},\widetilde{\mathbb{Y}}\}) \cup \{\widetilde{\mathbb{X}} \cup \widetilde{\mathbb{Y}}\}, where \widetilde{\mathbb{X}},\widetilde{\mathbb{Y}} are the two clusters having maximal similarity according to the previous step.

The algorithm stops with the trivial partition \aleph_{q} = \{\{X_{1},\dots,X_{q}\}\}.
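
The following minimal R sketch (an illustration under simplifying assumptions, not the package's actual implementation) mirrors this agglomerative loop; the helper simil(X, Y), returning the estimated similarity between two clusters of column indices, is assumed to be supplied by the user:

# Minimal sketch of the agglomerative loop (illustration only); "simil" is an
# assumed user-supplied function returning the similarity of two clusters,
# each given as a vector of column indices
hclust_vars = function(q, simil) {
  hierarchy = list(as.list(seq_len(q)))  # Aleph_1: every variable on its own
  for (i in seq_len(q - 1)) {
    part = hierarchy[[i]]
    pairs = combn(length(part), 2)       # all pairs of clusters in Aleph_i
    sims = apply(pairs, 2, function(p) simil(part[[p[1]]], part[[p[2]]]))
    best = pairs[, which.max(sims)]      # pair with maximal similarity
    merged = c(part[[best[1]]], part[[best[2]]])
    hierarchy[[i + 1]] = c(part[-best], list(merged))
  }
  hierarchy                              # the hierarchy Aleph_1, ..., Aleph_q
}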

We call \{\aleph_{1}, \dots, \aleph_{q}\} the hierarchy constructed throughout the algorithm, and define, for i \in \{1, \dots, q\}, \text{Adiam}(\aleph_{i}) = |\aleph_{i}|^{-1} \sum_{\mathbb{X} \in \aleph_{i}} \text{diam}(\mathbb{X}), with

\text{diam}(\mathbb{X}) = \begin{cases} \underset{\{X,Y\} \subset \mathbb{X} }{\min} \mathcal{D}(X,Y) & \mbox{if } |\mathbb{X}| > 1 \\ 1 & \mbox{if } |\mathbb{X}| = 1, \end{cases}

and \text{Msplit}(\aleph_{i}) = \max_{\mathbb{X} \in \aleph_{i}} \text{split}(\mathbb{X}), with

\text{split}(\mathbb{X}) = \underset{Y \in \mathcal{S} \setminus \mathbb{X}}{\underset{X \in \mathbb{X}}{\max}} \mathcal{D}(X,Y) \hspace{0.2cm} \text{for} \hspace{0.2cm} \{\mathbb{X}\} \neq \aleph_{i}.

Adiam stands for the average diameter of a partition (measuring the homogeneity, which we want to be large), while Msplit stands for the maximum split of a partition (measuring the non-separation, which we want to be small).
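
To make these two quality measures concrete, here is a minimal R sketch (not part of the package API) that computes them for a non-trivial partition, assuming a symmetric q x q matrix D of pairwise similarities \mathcal{D}(X_{i},X_{j}) and a partition given as a list of column-index vectors:

# Minimal sketch (assumptions: D is a symmetric q x q matrix of pairwise
# similarities, partition is a non-trivial list of column-index vectors)
adiam_msplit = function(D, partition) {
  diams = sapply(partition, function(cl) {
    if (length(cl) == 1) return(1)          # singletons have diameter 1
    sub = D[cl, cl]
    min(sub[upper.tri(sub)])                # smallest within-cluster similarity
  })
  splits = sapply(partition, function(cl) {
    others = setdiff(seq_len(nrow(D)), cl)  # variables outside the cluster
    max(D[cl, others])                      # largest between-cluster similarity
  })
  c(Adiam = mean(diams), Msplit = max(splits))
}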

For measuring the similarity \mathcal{D}(\mathbb{X},\mathbb{Y}), we regard \mathbb{X} and \mathbb{Y} as two random vectors (of dimensions d_{1} and d_{2}, say). For \mathcal{D}, we take an estimated dependence measure between (two) random vectors. The following options are possible:

  • list("phi", "mi", "Gauss", omegas = omegas) for the estimated Gaussian copula mutual information. Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation, see also the functions minormal, estR, and cvomega,

  • list("phi", "Hel", "Gauss", omegas = omegas) for the estimated Gaussian copula Hellinger distance. Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation, see also the functions Helnormal, estR, and cvomega,

  • list("phi", phi(t), "hac", type = type, M = M) for general \Phi-dependence with specified function phi(t) = \Phi(t), estimated by fitting (via pseudo maximum likelihood estimation) a hierarchical Archimedean copula of given type = type, and computed based on a Monte Carlo sample of size M in order to approximate the expectation, see also the functions mlehac, phihac and estphi,

  • list("phi", phi(t), "nphac", estimator = estimator, type = type) for general \Phi-dependence with specified function phi(t) = \Phi(t), estimated via non-parametric beta kernel estimation or Gaussian transformation kernel estimation, and local bandwidth selection, by using a fitted (via pseudo maximum likelihood) hierarchical Archimedean copula as reference copula, see also the functions phinp and estphi,

  • list("phi", phi(t), "np", estimator = estimator, bw_method = bw_method) for general
    \Phi-dependence with specified function phi(t) = \Phi(t), estimated via non-parametric beta kernel estimation or Gaussian transformation kernel estimation, and local bandwidth selection, either by using a non-parametric kernel estimator as reference copula if bw_method = 1, or by using a big O bandwidth rule if bw_method = 2, see also the functions phinp and estphi,

  • list("phi", phi(t), "ellip", grid = grid) for general \Phi-dependence with specified function phi(t) = \Phi(t), estimated via the improved MECIP procedure on the specified grid, and parameter selection done via the function elliptselect using the Gaussian generator as reference generator, see also the functions phiellip and estphi,

  • list("ot", coef = coef, omegas = omegas) for Gaussian copula Bures-Wasserstein dependence measures, either coefficient \mathcal{D}_{1} (coef = 1) or coefficient \mathcal{D}_{2} (coef = 2). Use omegas = 1 for no penalty, or a sequence of omegas for a ridge penalty tuned via 5-fold cross-validation, see also the functions bwd1, bwd2, estR, and cvomega.

When d_{1} + d_{2} > max_dim, the specified link function (say L) is used for computing the similarity between \mathbb{X} and \mathbb{Y}, i.e.,

\mathcal{D} \left ( \mathbb{X}, \mathbb{Y} \right ) = L \left (\left \{\mathcal{D}(X,Y) : X \in \mathbb{X}, Y \in \mathbb{Y} \right \} \right ),

which by default is the average of all inter-pairwise similarities. Other options are "single" for the minimum, and "complete" for the maximum.
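
As an illustration, a hypothetical helper implementing this link step (a sketch, not the package's internal code) could look as follows, given a matrix D of pairwise similarities:

# Sketch of the link step (hypothetical helper, not the package's internals)
link_similarity = function(D, clX, clY, link = "average") {
  pairwise = D[clX, clY, drop = FALSE]  # all D(X,Y) with X in clX, Y in clY
  switch(link,
         average  = mean(pairwise),     # default
         single   = min(pairwise),
         complete = max(pairwise))
}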

The function norm (say N) is a possible normalization applied to the similarity measure, i.e., instead of computing \mathcal{D} (using the method specified by est_method), the similarity becomes N \circ \mathcal{D}. The default is N(t) = t, meaning that no normalization is applied.
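
For instance, assuming a dataset data, squared similarities (an illustrative choice only) could be requested as:

# Illustrative only: apply the normalization N(t) = t^2 to every similarity
Clustering_norm = Icluster(data = data,
                           est_method = list("phi", "mi", "Gauss", omegas = 1),
                           norm = function(t) {t^2})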

Value

A list with the following elements:

  • "hierarchy": the hierarchy constructed throughout the algorithm (a hash object),

  • "all": all similarities that were computed throughout the algorithm (a hash object),

  • "diam": the average diameters of all partitions created throughout the algorithm (a vector),

  • "split": the maximum splits of all partitions created throughout the algorithm (a vector).

References

De Keyser, S. & Gijbels, I. (2024). Hierarchical variable clustering via copula-based divergence measures between random vectors. International Journal of Approximate Reasoning 165:109090. doi: https://doi.org/10.1016/j.ijar.2023.109090.

De Keyser, S. & Gijbels, I. (2024). Parametric dependence between random vectors via copula-based divergence measures. Journal of Multivariate Analysis 203:105336.
doi: https://doi.org/10.1016/j.jmva.2024.105336.

De Keyser, S. & Gijbels, I. (2024). High-dimensional copula-based Wasserstein dependence. doi: https://doi.org/10.48550/arXiv.2404.07141.

See Also

minormal for the computation of the Gaussian copula mutual information, Helnormal for the computation of the Gaussian copula Hellinger distance, estphi for several approaches to estimating the \Phi-dependence between k random vectors, bwd1 for the computation of the first Bures-Wasserstein dependence coefficient \mathcal{D}_{1}, and bwd2 for the computation of the second Bures-Wasserstein dependence coefficient \mathcal{D}_{2}.

Examples


q = 20

# We will impose a clustering
# {{X1,X2},{X3,X4,X5},{X6,X7,X8},{X9,X10,X11,X12,X13},{X14,X15,X16,X17,X18,X19,X20}}
dim = c(2,3,3,5,7)

# Sample size
n = 200

# Twenty dimensional hierarchical Gumbel copula with parameters
# (theta_0,theta_1,theta_2,theta_3,theta_4,theta_5) = (2,3,4,5,6,7)
hac = gethac(dim,c(2,3,4,5,6,7),type = 1)

# So, strongest cluster is {X14,X15,X16,X17,X18,X19,X20}, then {X9,X10,X11,X12,X13},
# then {X6,X7,X8}, then {X3,X4,X5}, and finally {X1,X2}

# Sample
sample = suppressWarnings(HAC::rHAC(n, hac))

# Cluster using different methods

# Gaussian copula based methods

Clustering1 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss", omegas = 1))

# 5-cluster partition
Clustering1$hierarchy$Aleph_16

Clustering2 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss",
                                          omegas = seq(0.01,0.999,len = 50)))

# 5-cluster partition
Clustering2$hierarchy$Aleph_16

Clustering3 =  Icluster(data = sample,
                        est_method = list("phi", "mi", "Gauss", omegas = 1),
                        max_dim = 2)

# 5-cluster partition
Clustering3$hierarchy$Aleph_16

Clustering4 =  Icluster(data = sample,
                        est_method = list("phi", "Hel", "Gauss", omegas = 1))

# 5-cluster partition
Clustering4$hierarchy$Aleph_16

Clustering5 =  Icluster(data = sample,
                        est_method = list("ot", coef = 1, omegas = 1))

# 5-cluster partition
Clustering5$hierarchy$Aleph_16

Clustering6 =  Icluster(data = sample,
                        est_method = list("ot", coef = 2, omegas = 1))

# 5-cluster partition
Clustering6$hierarchy$Aleph_16

Clustering7 =  Icluster(data = sample,
                        est_method = list("ot", coef = 2, omegas = 1), max_dim = 4)

# 5-cluster partition
Clustering7$hierarchy$Aleph_16

# Parametric hierarchical Archimedean copula approach

Clustering8 = Icluster(data = sample,
                       est_method = list("phi", function(t){t * log(t)}, "hac",
                                         type = 1, M = 1000), max_dim = 4)

# 5-cluster partition
Clustering8$hierarchy$Aleph_16

Clustering9 = Icluster(data = sample,
                       est_method = list("phi", function(t){(sqrt(t)-1)^2}, "hac",
                                         type = 1, M = 1000), max_dim = 2)

# 5-cluster partition
Clustering9$hierarchy$Aleph_16

# Non-parametric approaches

Clustering10 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "nphac",
                                       estimator = "beta", type = 1), max_dim = 3)

# 5-cluster partition
Clustering10$hierarchy$Aleph_16

Clustering11 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "nphac",
                                     estimator = "trans", type = 1), max_dim = 3)

# 5-cluster partition
Clustering11$hierarchy$Aleph_16

Clustering12 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "np",
                                     estimator = "beta", bw_method = 1), max_dim = 3)

# 5-cluster partition
Clustering12$hierarchy$Aleph_16

Clustering13 = Icluster(data = sample,
                        est_method = list("phi", function(t){t * log(t)}, "np",
                                     estimator = "trans", bw_method = 2), max_dim = 3)

# 5-cluster partition
Clustering13$hierarchy$Aleph_16

Clustering14 = Icluster(data = sample,
                        est_method = list("phi", function(t){(sqrt(t)-1)^2}, "np",
                                      estimator = "trans", bw_method = 1), max_dim = 2)

# 5-cluster partition
Clustering14$hierarchy$Aleph_16

# Semi-parametric meta-elliptical copula approach
# Uncomment to run (takes a long time)

# Clustering15 = Icluster(data = sample,
#                        est_method = list("phi", function(t){t * log(t)}, "ellip",
#                                     grid = seq(0.005,100,by = 0.005)), max_dim = 2)

# 5-cluster partition
# Clustering15$hierarchy$Aleph_16
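
# Illustration (not in the original examples): the returned "diam" and "split"
# vectors can be used to compare the partitions in the hierarchy; a good
# partition combines a large average diameter with a small maximum split

plot(Clustering1$diam, type = "b", xlab = "Partition index", ylab = "Adiam")
plot(Clustering1$split, type = "b", xlab = "Partition index", ylab = "Msplit")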



