gmix: Gaussian Mixture Modelling

View source: R/gmix.R

gmixR Documentation

Gaussian Mixture Modelling

Description

Fast implementation of the EM algorithm for ML estimation and clustering of Gaussian mixture models with covariance matrix regularization based on eigenvalue ratio constraints.

Usage

gmix(
  data,
  K = NA,
  erc = 50,
  iter.max = 1000,
  tol = 1e-08,
  init = "kmed",
  init.nstart = 25,
  init.iter.max = 30,
  init.tol = tol,
  save_cluster = TRUE,
  save_params = TRUE,
  save_taus = FALSE
)

Arguments

data

a numeric vector, matrix, or data frame of observations. Rows correspond to observations and columns correspond to variables/features. Let N=nrow(data) and P=ncol(data). Categorical variables and NA values are not allowed.

K

the number of mixture components or clusters. It can be left NA only when init is a vector of initial labels, in which case the number of clusters is retrieved from the initial partition. For character, matrix/data frame, and function initializations, K must be supplied.

erc

a numeric value >=1 specifying the eigenvalue ratio constraint (see Details).

iter.max

maximum number of iterations for the EM algorithm.

tol

tolerance for the convergence of the EM algorithm.

init

a character in the set c("kmed", "kmeans", "pam"), a vector, a matrix, a data frame, or a callable giving the initial assignment of data points (see Details). The default choice is "kmed".

init.nstart

number of initial partitions (see Details).

init.iter.max

maximum number of iterations for each run of the k-median initialization.

init.tol

tolerance for the convergence of each run of the k-median initialization.

save_cluster

logical, if TRUE the point-to-cluster assignment based on the maximum a posteriori probability (MAP) rule is returned.

save_params

logical, if TRUE the estimated mixture parameters are returned.

save_taus

logical, if TRUE the posterior class probabilities are returned (these are also known as posterior weights or fuzzy weights).

Details

The function implements the constrained ML estimator studied in Coretto and Hennig (2023). The covariance matrix constraints are computed according to the CM1-step of Algorithm 2 of Coretto and Hennig (2017). This function uses highly optimized C code for fast execution. The constrained M-step extensively uses low-level common linear algebra matrix operations (BLAS/LAPACK routines). Consequently, to maximize computational efficiency, it is recommended that the best available shared libraries, such as OpenBLAS, Intel Math Kernel Library (MKL), etc., be set up.

Initialization. The default method, set with init="kmed", uses a fast C implementation of the k-medians algorithm with random initial centers drawn uniformly over the data rows init.iter.max times. Depending on the computer power available it is suggested to set init.iter.max as large as possible particularly in cases where the data set dimensionality is large in terms of both sample size and number of features. Setting init="kmeans" replaces the k-medians with the k-means. With init="pam" initial clusters are determined using the PAM algorithm based on Euclidean distances. The latter does not perform multiple starts. The user can also set init = x where x is a vector of integers of length N=nrow(data) representing an initial hard assignment of data points to the mixture components or clusters (see Examples). Another possibility is to set init = W where W is a matrix or data frame of dimension (N x K) containing initial posterior probabilities or initial non-negative weights. The assignment provided via W can be hard (0-1 weights with the constraint that only a 1 is possible in each row of W) or smooth. In the current implementation, entries of W must be finite and non-negative, and each cluster must receive positive total weight. W can be seen as the initial version of the object posterior described in the Value section above. The last alternative is to set init = f where f is a function with signature function(data, K) returning an N x K matrix of initial hard/smooth assignments as described for W above (see the example below).

Eigenvalue ratio constraint (erc). It is the maximum allowed ratio between within-cluster covariance matrix eigenvalues. It defines the so-called eigenratio constraint. erc=1 enforces spherical clusters with equal covariance matrices. A large erc allows for large between-cluster covariance discrepancies. It is suggested to never set erc arbitrarily large; its main role is to prevent degenerate covariance parameters and the related emergence of spurious clusters (see References below). Finally, in order to facilitate the setting of erc, it is suggested to scale the columns of data whenever measurement units of the different variables are grossly incompatible.

Value

An S3 object of class "mbcfit". Output components are as follows:

info

a list with two components named code and flag giving information about the underlying EM algorithm. The code objects can take the following values:

  • code=1: the algorithm converged within iter.max.

  • code=2: the algorithm reached iter.max.

  • code=3: the algorithm did not move from initial values.

  • code=-1: unexpected memory allocation issues occurred.

  • code=-2: unexpected LAPACK routine errors occurred.

The flag objects can take the following values:

  • flag=0: no flag.

  • flag=1: numerically degenerate posterior probabilities could not be prevented.

  • flag=2: the ERC was enforced at least once.

  • flag=3: conditions of flag=1 and flag=2 occurred.

iter

number of iterations performed in the underlying EM algorithm.

N

number of data points.

P

data dimension.

K

number of clusters.

loglik

sample expected log-likelihood.

size

cluster size (counts).

cluster

cluster assignment based on the maximum a posteriori rule (MAP). Returned when save_cluster = TRUE.

posterior

a matrix of dimension (N x K) where posterior[i, k] is the estimated posterior probability that the ith observation belongs to the kth cluster. Returned when save_taus = TRUE.

params

a list containing mixture component parameters. Returned when save_params = TRUE. The elements of the list are: $proportion= vector of proportions; $mean= matrix of dimension (P x K) containing mean parameters; $cov= array of size (P x P x K) containing covariance matrices.

References

Coretto, Pietro and Christian Hennig (2017). Consistency, breakdown robustness, and algorithms for robust improper maximum likelihood clustering. Journal of Machine Learning Research, Vol. 18(142), pp. 1-39. URL: https://jmlr.org/papers/v18/16-382.html

Coretto, Pietro and Christian Hennig (2023) Nonparametric consistency for maximum likelihood estimation and clustering based on mixtures of elliptically-symmetric distributions. arXiv:2311.06108. URL: https://arxiv.org/abs/2311.06108

Examples

# --- load data
data("banknote")
dat <- banknote[-1]
n   <- nrow(dat) # sample size
nc  <- 2         # number of clusters

# fit 2 clusters using the default k-median initialization
set.seed(101)
fit1 <- gmix(dat, K = nc, init.nstart = 1)
print(fit1)

## Not run: 
# plot partition (default)
plot(x = fit1, data = dat)

# plot partition onto the first 3 principal component coordinates
plot(x = fit1, data = prcomp(dat)$x, margins = c(1, 2, 3),
     pch_cl = c("A", "B"), col_cl = c("#4285F4", "#0F9D58"),
     main = "Principal Components")

## End(Not run)

# user-defined random initialization with hard assignment labels
set.seed(102)
i2   <- sample(1:nc, size = n, replace = TRUE)
fit2 <- gmix(dat, K = 2, init = i2)
## Not run: 
plot(x = fit2, data = dat)

## End(Not run)

# user-defined smooth "toy" initialization:
# 50% of the points are assigned to cluster 1 with probability 0.9 and to
# cluster 2 with probability 0.1. The remaining data points are assigned to
# cluster 1 with probability 0.1 and to cluster 2 with probability 0.9.
set.seed(103)
idx        <- sample(c(TRUE, FALSE), size = n, replace = TRUE)
i3         <- matrix(0, nrow = n, ncol = nc)
i3[idx, ]  <- c(0.9, 0.1)
i3[!idx, ] <- c(0.1, 0.9)
fit3 <- gmix(dat, K = nc, init = i3)
## Not run: 
plot(x = fit3, data = dat)

## End(Not run)

# user-defined function for initialization
# this one produces a 0-1 hard posterior matrix W based on kmeans
compute_init <- function(data, K){
  cl <- kmeans(data, K, nstart = 1, iter.max = 10)$cluster
  W  <- sapply(seq(K), function(x) as.numeric(cl == x))
  return(W)
}
fit4 <- gmix(dat, K = nc, init = compute_init)
## Not run: 
plot(fit4, data = dat)

## End(Not run)


qcluster documentation built on June 5, 2026, 5:07 p.m.

Related to gmix in qcluster...