MixtClust: Robust Clustering for Complete and Incomplete Data

View source: R/main.R

MixtClust    R Documentation

Robust Clustering for Complete and Incomplete Data

Description

Robust model-based clustering via the EM algorithm for finite mixtures of multivariate t distributions, including handling of incomplete data.

Usage

MixtClust(
  x,
  initial.values = "emEM",
  nclusters = NULL,
  max.iter = 1000,
  tol = 0.001,
  convergence = "aitkens",
  sigma.constr = FALSE,
  df.constr = FALSE,
  approx.df = TRUE,
  method = "marginalization",
  verbose = TRUE,
  scaled = TRUE,
  emEM.args = list(nstarts = nclusters * 10 * prod(dim(x)), em.iter = 5, nbest = 4)
)

Arguments

x

A matrix with n observations (rows), p columns (dimensions), and missing entries set to NA.

initial.values

Either "emEM" specifying the emEM initialization strategy (see emEM.args for additional arguments), "kmeans" specifying use of k-means to generate an initial partition, a vector of integers specifying an initial partition, or a named list of initial parameter values (see details).

nclusters

Positive integer. The assumed number of clusters if initial values are not provided.

max.iter

Positive integer. The maximum number of EM iterations allowed.

tol

Positive scalar. The desired stopping criterion value.

convergence

Either "lop", specifying use of the relative change in log-likelihood as the convergence criterion, or "aitkens" (default), specifying Aitken's acceleration.

sigma.constr

Logical. Should the dispersion matrices Σ_k be constrained to be equal across all clusters k = 1,…,K?

df.constr

Logical. Should the degrees of freedom ν_k be constrained to be equal across all clusters k = 1,…,K?

approx.df

Logical. If approx.df = TRUE, a numerical approximation of the objective function is used to estimate the degrees of freedom ν_k for k = 1,…,K.

method

How should missing entries be handled? Must be either "fullEM" to include missing entries in the EM algorithm following Lin et al. (2009), "marginalization" to integrate out missing entries, or "deletion" to analyze complete cases only.

verbose

Logical. Should progress be periodically reported to the screen?

scaled

Logical. Should computations for multi-dimensional datasets be performed on a scaled version of the data? The resulting parameter estimates are transformed back to the original scale, so scaling should have little effect on performance beyond potentially improving numerical stability.

emEM.args

A named list of options utilized if initial.values = "emEM" (see details).

Details

Model-based clustering using finite mixtures of t distributions, with handling of incomplete data using either marginalization or the EM algorithm. If supplying initial values, format as a named list with elements:

  • "pi" Mixing proportions. A vector of length K that sums to one.

  • "nu" Degrees of freedom. A vector of length K with entries of at least three (thus ensuring the existence of the first two moments).

  • "mu" Locations. A K \times p matrix, where the k-th row is the location μ_k \in R^p for cluster k.

  • "Sigma" Dispersions. A p \times p \times K array, where the k-th slice is the p \times p positive-definite dispersion matrix Σ_k for cluster k.
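
For illustration, a named list of initial values for a hypothetical fit with K = 2 clusters in p = 4 dimensions could be assembled as follows; the values here (equal proportions, ν = 4, zero locations, identity dispersions) are placeholders, not estimates:

```r
# Illustrative (hypothetical) initial values for K = 2 clusters, p = 4 dimensions.
K <- 2; p <- 4
init <- list(
  pi    = rep(1 / K, K),                    # mixing proportions; must sum to one
  nu    = rep(4, K),                        # degrees of freedom; each at least 3
  mu    = matrix(0, nrow = K, ncol = p),    # K x p matrix of cluster locations
  Sigma = array(diag(p), dim = c(p, p, K))  # p x p x K array of dispersion matrices
)
# Supplied in place of a character initialization strategy:
# fit <- MixtClust(x, initial.values = init)
```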

The arguments for emEM specified in the list emEM.args are:

  • "nstarts" Positive integer. The number of randomly generated initial starting parameter values under consideration.

  • "em.iter" Positive integer. The number of short EM iterations to be performed on each set of initial starting parameter values.

  • "nbest" Positive integer. After em.iter EM iterations are performed in each of the nstarts initial values, the number of top ranking (according to loglikelihood) parameter values on which to run the long EM either to convergence (specified by tol) or maximum number of iterations (specified by max.iter). If nbest is greater than one, the long EM run achieving the largest loglikelihood will be returned.
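
Putting these options together, a call overriding the defaults might look like the sketch below (the argument values are illustrative, not recommendations, and x is assumed to be a numeric data matrix):

```r
# Illustrative emEM settings: 50 random starts, 10 short-EM iterations each,
# then run the long EM on the 2 best-ranked starts and keep the better fit.
fit <- MixtClust(x, nclusters = 3,
                 emEM.args = list(nstarts = 50, em.iter = 10, nbest = 2))
```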

Value

A list containing:

  • "estimates" A list of the final estimates "pi", "nu", "mu", and "Sigma" containing the MLEs for the mixing proportions, degrees of freedom, locations, and dispersions, respectively.

  • "iterations" Number of EM iterations performed (long EM run only; if emEM initialization was used, this excludes the short-EM iterations specified by emEM.args$em.iter).

  • "Zs" An n \times K matrix where the i-th row contains the posterior probabilities of membership in clusters 1, …, K for the i-th observation (i = 1,…,n).

  • "class" A vector of length n with the predicted class memberships for each observation.

  • "loglik" The log-likelihood at each (long EM run) iteration.

  • "loglik" The log-likelihood at the last iteration, computed for all cases (including those with missing values when method = "deletion").

  • "bic" The BIC for the final fitted model.

  • "EM.time" Runtime for the long EM run(s).

  • "em.time" Runtime for the short em run(s) when initial.values = "emEM".

  • "total.time" Runtime for the entire function call.

  • "call" Supplied function call.

  • "npar" The number of model parameters.
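
For instance, the fitted components can be inspected directly from the returned list (here `fit` is assumed to hold the result of a MixtClust() call):

```r
# Assuming `fit` is the list returned by MixtClust():
fit$estimates$pi     # MLEs of the mixing proportions
fit$estimates$nu     # MLEs of the degrees of freedom
table(fit$class)     # cluster sizes from the predicted memberships
fit$bic              # BIC of the fitted model
```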

Author(s)

Emily Goren, emily.goren@gmail.com

References

Emily M. Goren & Ranjan Maitra, 2022. "Fast model-based clustering of partial records," Stat, 11(1), e416. https://doi.org/10.1002/sta4.416

Tsung-I Lin, Hsiu J. Ho & Pao S. Shen, 2009. "Computationally efficient learning of multivariate t mixture models with missing information," Computational Statistics, 24(3): 375-392.

Examples

set.seed(20180626)
# Use iris data.
d <- subset(iris, select = -Species)
# Create missing data -- MCAR with 10% chance of missingness.
missing <- matrix(rbinom(n = ncol(d) * nrow(d), size = 1, prob = 0.1), ncol = ncol(d))
x <- as.matrix(d); x[missing == 1] <- NA
# Run EM with emEM initialization strategy for candidate clusters K = 2, 3, 4.
Ks <- 2:4
ans <- lapply(Ks, function(K) {
    MixtClust(x, nclusters = K, emEM.args = list(nstarts=K*10, em.iter=5, nbest=1))
})
# Get BIC for each K.
BICs <- sapply(ans, function(f) f$bic)
# Plot BIC by K.
plot(BICs ~ Ks, pch = 20, xlab = 'Number of Clusters', ylab = 'BIC')


emilygoren/MixtClust documentation built on March 19, 2022, 2 p.m.