gmix                                                     R Documentation

Description

Fast implementation of the EM algorithm for ML estimation and
clustering of Gaussian mixture models, with covariance matrix
regularization based on eigenvalue ratio constraints.
Usage

gmix(
  data,
  K = NA,
  erc = 50,
  iter.max = 1000,
  tol = 1e-8,
  init = "kmed",
  init.nstart = 25,
  init.iter.max = 30,
  init.tol = tol,
  save_cluster = TRUE,
  save_params = TRUE,
  save_taus = FALSE
)
Arguments

data: a numeric vector, matrix, or data frame of observations. Rows
   correspond to observations and columns correspond to
   variables/features. Let N = nrow(data) denote the sample size and
   P = ncol(data) the data dimension.

K: the number of mixture components or clusters. It can be left to
   its default (NA) when init provides an initial assignment (a
   vector of labels or a matrix of weights), in which case K is
   derived from the initialization.

erc: a numeric value >= 1 setting the maximum allowed ratio between
   within-cluster covariance matrix eigenvalues (see Details).

iter.max: maximum number of iterations for the EM algorithm.

tol: tolerance for the convergence of the EM algorithm.

init: a character string in the set c("kmed", "kmeans", "pam")
   selecting an initialization method, or a user-defined
   initialization: a vector of cluster labels, a matrix of posterior
   probabilities, or a function of the data (see Details).

init.nstart: number of initial partitions (see Details).

init.iter.max: maximum number of iterations for each run of the
   k-medians initialization.

init.tol: tolerance for the convergence of each run of the k-medians
   initialization.

save_cluster: logical; if TRUE, the MAP cluster assignment (cluster)
   is included in the output.

save_params: logical; if TRUE, the estimated mixture parameters
   (params) are included in the output.

save_taus: logical; if TRUE, the matrix of posterior probabilities
   (taus) is included in the output.
Details

The function implements the constrained ML estimator studied in
Coretto and Hennig (2023). The covariance matrix constraints are
computed according to the CM1-step of Algorithm 2 of Coretto and
Hennig (2017). The function uses highly optimized C code for fast
execution. The constrained M-step makes extensive use of low-level
linear algebra operations (BLAS/LAPACK routines). Consequently, to
maximize computational efficiency, it is recommended to link R
against optimized BLAS/LAPACK shared libraries such as OpenBLAS or
the Intel Math Kernel Library (MKL).
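For example, the following base R commands (independent of gmix
itself) report which BLAS/LAPACK shared libraries the running R
session is linked against:

si <- sessionInfo()
si$BLAS     # path of the BLAS shared library in use
si$LAPACK   # path of the LAPACK shared library in use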
Initialization.
The default method, set with init = "kmed", uses a fast C
implementation of the k-medians algorithm with random initial centers
drawn uniformly over the data rows, replicated init.nstart times.
Depending on the computing power available, it is suggested to set
init.nstart as large as possible, particularly when the data set is
large in terms of both sample size and number of features.
Setting init = "kmeans" replaces k-medians with k-means. With
init = "pam", initial clusters are determined using the PAM algorithm
based on Euclidean distances. The latter does not perform multiple
starts.
The user can also set init = x, where x is a vector of integers of
length N = nrow(data) representing an initial hard assignment of data
points to the mixture components or clusters (see Examples).
Another possibility is to set init = W, where W is a matrix of
dimension (N x K) containing the initial posterior probabilities that
the ith observation belongs to the kth cluster. The assignment
provided via W can be hard (0-1 weights with the constraint that
exactly one 1 occurs in each row of W) or smooth (each row of W must
sum up to 1). W can be seen as the initial version of the object taus
described in the Value section below.
The last alternative is to set init = f(data), where f(data) is a
function that takes data as input and returns a matrix with an
initial hard/smooth assignment in the form of the W matrix described
above (see the example below).
Eigenvalue ratio constraint (erc).
This is the maximum allowed ratio between within-cluster covariance
matrix eigenvalues, and it defines the so-called eigenratio
constraint. erc = 1 enforces spherical clusters with equal covariance
matrices. A large erc allows for large between-cluster covariance
discrepancies. It is suggested to never set erc arbitrarily large:
its main role is to prevent degenerate covariance parameters and the
related emergence of spurious clusters (see References below).
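For intuition, here is a small base R sketch of the eigenratio
quantity that erc bounds (an illustration of the definition above,
not the package's internal code):

S1 <- diag(c(4, 1))         # within-cluster covariance of cluster 1
S2 <- diag(c(2, 0.5))       # within-cluster covariance of cluster 2
ev <- c(eigen(S1)$values, eigen(S2)$values)
max(ev) / min(ev)           # eigenratio = 8: any erc >= 8 leaves these
                            # matrices untouched, erc < 8 shrinks them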
Finally, in order to facilitate the setting of erc, it is suggested
to scale the columns of data whenever the measurement units of the
different variables are grossly incompatible.
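A minimal sketch of this preprocessing step, using a base R data set
whose two variables live on very different scales:

data("faithful")                      # eruption lengths vs. waiting times
fit_s <- gmix(scale(faithful), K = 2) # eigenvalues now on comparable scales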
Value

An S3 object of class 'mbcfit'. Output components are as follows:

info: a list with two components giving information about the
   convergence of, and possible errors in, the underlying EM
   algorithm (see Notes).

iter: number of iterations performed by the underlying EM algorithm.

N: number of data points.

P: data dimension.

K: number of clusters.

eloglik: sample expected log-likelihood.

size: cluster sizes (counts).

cluster: cluster assignment based on the maximum a posteriori (MAP)
   rule (returned if save_cluster = TRUE).

taus: a matrix of dimension (N x K) containing the estimated
   posterior probabilities that the ith observation belongs to the
   kth cluster (returned if save_taus = TRUE).

params: a list containing the estimated mixture component parameters:
   mixing proportions, component mean vectors, and component
   covariance matrices (returned if save_params = TRUE).
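As a quick usage sketch (assuming the fit converges; component names
as documented above):

fit <- gmix(iris[, 1:4], K = 3)
fit$K               # number of clusters
fit$size            # cluster sizes (counts)
head(fit$cluster)   # MAP cluster assignment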
References

Coretto, Pietro and Christian Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. URL: https://jmlr.org/papers/v18/16-382.html
Coretto, Pietro and Christian Hennig (2023) Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. arXiv:2311.06108. URL: https://arxiv.org/abs/2311.06108
Examples

# --- load data
data("banknote")
dat <- banknote[-1]
n <- nrow(dat) #sample size
nc <- 2 #number of clusters
# fit 2 clusters using the default k-median initialization
# In real applications set 'init.nstart' as large as possible
set.seed(101)
fit1 <- gmix(dat, K = nc, init.nstart = 1)
print(fit1)
# plot partition (default)
plot(x = fit1, data = dat)
# plot partition onto the first 3 principal component coordinates
plot(x = fit1, data = prcomp(dat)$x, margins = c(1,2,3),
pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"),
main = "Principal Components")
# user-defined random initialization with hard assignment labels
set.seed(102)
i2 <- sample(1:nc, size = n, replace = TRUE)
fit2 <- gmix(dat, K = 2, init = i2)
plot(x=fit2, data = dat)
# user-defined smooth "toy" initialization:
# 50% of the points are assigned to cluster 1 with probability 0.9 and
# to cluster 2 with probability 0.1; the remaining data points are
# assigned to cluster 1 with probability 0.1 and to cluster 2 with
# probability 0.9
#
set.seed(103)
idx <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
i3 <- matrix(0, nrow = n, ncol = nc)
i3[idx, 1]  <- 0.9   # row-wise (0.9, 0.1) for the selected points
i3[idx, 2]  <- 0.1
i3[!idx, 1] <- 0.1   # row-wise (0.1, 0.9) for the remaining points
i3[!idx, 2] <- 0.9
# fit
fit3 <- gmix(dat, K = nc, init = i3)
plot(x=fit3, data = dat)
# user-defined function for initialization
# this one produces a 0-1 hard posterior matrix W based on kmeans
#
compute_init <- function(data, K){
cl <- kmeans(data, K, nstart=1, iter.max=10)$cluster
W <- sapply(seq(K), function(x) as.numeric(cl==x))
return(W)
}
fit4 <- gmix(dat, K = nc, init = compute_init)
plot(fit4, data = dat)
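# --- a further sketch (using only arguments documented above):
# posterior probabilities are stored only when 'save_taus = TRUE';
# 'taus' is then an (n x nc) matrix of posterior membership probabilities
fit5 <- gmix(dat, K = nc, save_taus = TRUE)
head(fit5$taus)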