| gmix | R Documentation |
Fast implementation of the EM algorithm for ML estimation and clustering of Gaussian mixture models with covariance matrix regularization based on eigenvalue ratio constraints.
gmix(
data,
K = NA,
erc = 50,
iter.max = 1000,
tol = 1e-08,
init = "kmed",
init.nstart = 25,
init.iter.max = 30,
init.tol = tol,
save_cluster = TRUE,
save_params = TRUE,
save_taus = FALSE
)
data |
a numeric vector, matrix, or data frame of observations. Rows
correspond to observations and columns correspond to variables/features.
Let |
K |
the number of mixture components or clusters. It can be left
|
erc |
a numeric value |
iter.max |
maximum number of iterations for the EM algorithm. |
tol |
tolerance for the convergence of the EM algorithm. |
init |
a character in the set |
init.nstart |
number of initial partitions (see Details). |
init.iter.max |
maximum number of iterations for each run of the
k |
init.tol |
tolerance for the convergence of each run of the
k |
save_cluster |
logical, if |
save_params |
logical, if |
save_taus |
logical, if |
The function implements the constrained ML estimator studied in
Coretto and Hennig (2023). The covariance matrix constraints are
computed according to the CM1-step of Algorithm 2 of Coretto
and Hennig (2017). This function uses highly optimized C code for fast
execution. The constrained M-step extensively uses low-level common
linear algebra matrix operations (BLAS/LAPACK routines). Consequently,
to maximize computational efficiency, it is recommended that the best
available shared libraries, such as OpenBLAS, Intel Math Kernel
Library (MKL), etc., be set up.
Initialization.
The default method, set with init="kmed", uses a fast C
implementation of the k-medians algorithm with random initial
centers drawn uniformly over the data rows init.iter.max
times. Depending on the computer power available it is suggested to set
init.iter.max as large as possible particularly in cases where the
data set dimensionality is large in terms of both sample size and number
of features.
Setting init="kmeans" replaces the k-medians with the
k-means. With init="pam" initial clusters are determined
using the PAM algorithm based on Euclidean distances. The latter does not
perform multiple starts.
The user can also set init = x where x is a vector of
integers of length N=nrow(data) representing an initial hard
assignment of data points to the mixture components or clusters (see
Examples).
Another possibility is to set init = W where W is a
matrix or data frame of dimension (N x K) containing initial
posterior probabilities or initial non-negative weights. The assignment
provided via W can be hard (0-1 weights with the constraint
that only a 1 is possible in each row of W) or smooth. In the
current implementation, entries of W must be finite and
non-negative, and each cluster must receive positive total weight.
W can be seen as the initial version of the object
posterior described in the Value section above.
The last alternative is to set init = f where f is a
function with signature function(data, K) returning an N x K
matrix of initial hard/smooth assignments as described for W
above (see the example below).
Eigenvalue ratio constraint (erc).
It is the maximum allowed ratio between within-cluster covariance
matrix eigenvalues. It defines the so-called
eigenratio constraint. erc=1 enforces spherical clusters
with equal covariance matrices. A large erc allows for large
between-cluster covariance discrepancies. It is suggested to never set
erc arbitrarily large; its main role is to prevent degenerate
covariance parameters and the related emergence of spurious clusters
(see References below).
Finally, in order to facilitate the setting of erc, it is
suggested to scale the columns of data whenever measurement
units of the different variables are grossly incompatible.
An S3 object of class "mbcfit". Output components are as follows:
a list with two components named code and flag giving
information about the underlying EM algorithm.
The code objects can take the following values:
code=1: the algorithm converged within iter.max.
code=2: the algorithm reached iter.max.
code=3: the algorithm did not move from initial values.
code=-1: unexpected memory allocation issues occurred.
code=-2: unexpected LAPACK routine errors occurred.
The flag objects can take the following values:
flag=0: no flag.
flag=1: numerically degenerate posterior probabilities
could not be prevented.
flag=2: the ERC was enforced at least once.
flag=3: conditions of flag=1 and flag=2
occurred.
number of iterations performed in the underlying EM algorithm.
number of data points.
data dimension.
number of clusters.
sample expected log-likelihood.
cluster size (counts).
cluster assignment based on the maximum a posteriori rule (MAP).
Returned when save_cluster = TRUE.
a matrix of dimension (N x K) where posterior[i, k] is the
estimated posterior probability that the ith observation belongs
to the kth cluster. Returned when save_taus = TRUE.
a list containing mixture component parameters. Returned when
save_params = TRUE. The elements of the list are:
$proportion= vector of proportions;
$mean= matrix of dimension (P x K) containing mean
parameters;
$cov= array of size (P x P x K) containing covariance
matrices.
Coretto, Pietro and Christian Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. URL: https://jmlr.org/papers/v18/16-382.html
Coretto, Pietro and Christian Hennig (2023) Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. arXiv:2311.06108. URL: https://arxiv.org/abs/2311.06108
# --- load data
data("banknote")
dat <- banknote[-1]
n <- nrow(dat) # sample size
nc <- 2 # number of clusters
# fit 2 clusters using the default k-median initialization
set.seed(101)
fit1 <- gmix(dat, K = nc, init.nstart = 1)
print(fit1)
## Not run:
# plot partition (default)
plot(x = fit1, data = dat)
# plot partition onto the first 3 principal component coordinates
plot(x = fit1, data = prcomp(dat)$x, margins = c(1, 2, 3),
pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"),
main = "Principal Components")
## End(Not run)
# user-defined random initialization with hard assignment labels
set.seed(102)
i2 <- sample(1:nc, size = n, replace = TRUE)
fit2 <- gmix(dat, K = 2, init = i2)
## Not run:
plot(x = fit2, data = dat)
## End(Not run)
# user-defined smooth "toy" initialization:
# 50% of the points are assigned to cluster 1 with probability 0.9 and to
# cluster 2 with probability 0.1. The remaining data points are assigned to
# cluster 1 with probability 0.1 and to cluster 2 with probability 0.9.
set.seed(103)
idx <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
i3 <- matrix(0, nrow = n, ncol = nc)
i3[idx, ] <- c(0.9, 0.1)
i3[!idx, ] <- c(0.1, 0.9)
fit3 <- gmix(dat, K = nc, init = i3)
## Not run:
plot(x = fit3, data = dat)
## End(Not run)
# user-defined function for initialization
# this one produces a 0-1 hard posterior matrix W based on kmeans
compute_init <- function(data, K){
cl <- kmeans(data, K, nstart = 1, iter.max = 10)$cluster
W <- sapply(seq(K), function(x) as.numeric(cl == x))
return(W)
}
fit4 <- gmix(dat, K = nc, init = compute_init)
## Not run:
plot(fit4, data = dat)
## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.