DIBcont | R Documentation |
The DIBcont function implements the Deterministic Information Bottleneck (DIB) algorithm
for clustering continuous data. This method optimizes an information-theoretic objective to
preserve relevant information while forming concise and interpretable cluster representations
(Costa et al., 2025).
DIBcont(X, ncl, randinit = NULL, s = -1, scale = TRUE,
maxiter = 100, nstart = 100, verbose = FALSE)
X |
A numeric matrix or data frame containing the continuous data to be clustered. All variables should be of type numeric. |
ncl |
An integer specifying the number of clusters to form. |
randinit |
Optional. A vector specifying initial cluster assignments. If |
s |
A numeric value or vector specifying the bandwidth parameter(s) for continuous variables. The values must be greater than 0. The default, -1, requests automatic bandwidth selection. |
scale |
A logical value indicating whether the continuous variables should be scaled to have unit variance before clustering. Defaults to TRUE. |
maxiter |
The maximum number of iterations allowed for the clustering algorithm. Defaults to 100. |
nstart |
The number of random initializations to run. The best clustering result (based on the information-theoretic criterion) is returned. Defaults to 100. |
verbose |
Logical. Defaults to FALSE. |
The DIBcont function applies the Deterministic Information Bottleneck algorithm to cluster datasets comprising only continuous variables. This method uses an information-theoretic objective to optimize the trade-off between data compression and the preservation of relevant information about the underlying data distribution.
The function uses the Gaussian kernel (Silverman, 1998) to estimate the probability densities of continuous features. The kernel is defined as:
K_c\left(\frac{x - x'}{s}\right) = \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{\left(x - x'\right)^2}{2s^2}\right\}, \quad s > 0.
The bandwidth parameter s, which controls the smoothness of the density estimate, is automatically determined by the algorithm if not provided by the user.
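To make the role of the bandwidth concrete, the kernel formula above can be written out in base R. This is a minimal sketch and not the package's internal code: gauss_kernel and kde_at are hypothetical helper names, and the 1/s normalization in kde_at is the textbook kernel density estimate, which may differ from how DIBcont normalizes its kernel weights.

```r
# Gaussian kernel from the formula above, with u = (x - x') / s.
# gauss_kernel is a hypothetical helper, not part of IBclust.
gauss_kernel <- function(x, x_prime, s) {
  stopifnot(all(s > 0))  # the bandwidth must be strictly positive
  exp(-(x - x_prime)^2 / (2 * s^2)) / sqrt(2 * pi)
}

# Textbook kernel density estimate at a single point x0 (assumed
# normalization: average kernel value divided by the bandwidth).
kde_at <- function(x0, x, s) {
  mean(gauss_kernel(x0, x, s)) / s
}

set.seed(1)
x <- rnorm(200)

# A small bandwidth follows the sample closely; a large one smooths it out.
kde_at(0, x, s = 0.1)
kde_at(0, x, s = 1.0)
```

Under these assumptions, kde_at(x0, x, s) agrees with mean(dnorm(x, mean = x0, sd = s)), which is a quick way to check the formula against base R.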
A list containing the following elements:
Cluster |
An integer vector indicating the cluster assignment for each observation. |
Entropy |
A numeric value representing the entropy of the cluster assignments at convergence. |
MutualInfo |
A numeric value representing the mutual information, |
beta |
A numeric vector of the final beta values used during the iterative optimization. |
s |
A numeric value or vector of bandwidth parameters used for the continuous variables. Typically, this will be a single value if all continuous variables share the same bandwidth. |
ents |
A numeric vector tracking the entropy values over the iterations, providing insight into the convergence process. |
mis |
A numeric vector tracking the mutual information values over the iterations. |
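For readers who want to sanity-check the Entropy element, the entropy of a hard cluster assignment can be computed from the cluster proportions. The sketch below is an illustration under stated assumptions: cluster_entropy is a hypothetical helper, not a function exported by IBclust, and the logarithm base (2 here) is an assumption, so values may differ from the package's output by a constant factor.

```r
# Entropy of a hard cluster assignment, H(T) = -sum_t p(t) * log p(t).
# cluster_entropy is a hypothetical helper; the log base is an assumption.
cluster_entropy <- function(cluster, base = 2) {
  p <- as.numeric(table(cluster)) / length(cluster)  # cluster proportions
  p <- p[p > 0]                                      # skip empty factor levels
  -sum(p * log(p, base = base))
}

# Two equally sized clusters carry one bit of entropy.
cluster_entropy(c(1, 1, 2, 2))

# A single cluster carries none.
cluster_entropy(c(1, 1, 1, 1))
```

The same helper could be applied to the Cluster element of a DIBcont result to compare against the reported Entropy, keeping the log-base caveat in mind.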
Efthymios Costa, Ioanna Papatsouma, Angelos Markos
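Costa et al. (2025). Bibliography key costa_dib_2025 in the IBclust package.
Silverman, B. W. (1998). Density Estimation for Statistics and Data Analysis. Routledge.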
DIBmix, DIBcat
# Generate simulated continuous data
set.seed(123)
X <- matrix(rnorm(1000), ncol = 5) # 200 observations, 5 features
# Run DIBcont with automatic bandwidth selection and multiple initializations
result <- DIBcont(X = X, ncl = 3, s = -1, nstart = 50)
# Print clustering results
print(result$Cluster) # Cluster assignments
print(result$Entropy) # Final entropy
print(result$MutualInfo) # Mutual information