MixtClust | R Documentation |
Clustering using finite mixture of multivariate t distributions including handling of incomplete data.
Robust clustering, including handling of incomplete data, using the EM algorithm for finite mixtures of multivariate t distributions
MixtClust(
  x,
  initial.values = "emEM",
  nclusters = NULL,
  max.iter = 1000,
  tol = 0.001,
  convergence = "aitkens",
  sigma.constr = FALSE,
  df.constr = FALSE,
  approx.df = TRUE,
  method = "marginalization",
  verbose = TRUE,
  scaled = TRUE,
  emEM.args = list(nstarts = nclusters * 10 * prod(dim(x)), em.iter = 5, nbest = 4)
)
x |
A matrix with n observations (rows), p columns
(dimensions), and missing entries set to |
initial.values |
Either |
nclusters |
Positive integer. The assumed number of clusters if initial values are not provided. |
max.iter |
Positive integer. The maximum number of EM iterations allowed. |
tol |
Positive scalar. The desired stopping criterion value. |
convergence |
Either |
sigma.constr |
Logical. Should the dispersion matrices Σ_k be held constant across all clusters k = 1,…,K? |
df.constr |
Logical. Should the degrees of freedom ν_k be held constant across all clusters k = 1,…,K? |
approx.df |
Logical. If |
method |
How should missing entries be handled? Must be either
|
verbose |
Logical. Should progress be periodically reported to the screen? |
scaled |
Logical. Should computations for multi-dimensional datasets be done after scaling the dataset? The resulting parameter estimates are scaled back, so in theory scaling should have little effect on performance beyond potentially improving the stability of numerical computations. |
emEM.args |
A named list of options utilized if |
Model-based clustering using finite mixtures of t distributions, with handling of incomplete data using either marginalization or the EM algorithm. If supplying initial values, format as a named list with elements:
"pi" Mixing proportions. A vector of length K that sums to one.
"nu" Degrees of freedom. A vector of length K with entries at least equal to three (thus requiring the existance of the first two moments.)
"mu" Locations. A K \times p matrix, where the k-th row is the location μ_k \in R^p for cluster k.
"Sigma" Dispersions. A p \times p \times K array, where the k-th slice is the p \times p positive-definite dispersion matrix Σ_k for cluster k.
The arguments for emEM specified in the list emEM.args are:
"nstarts" Positive integer. The number of randomly generated initial starting parameter values under consideration.
"em.iter" Positive integer. The number of short EM iterations to be performed on each set of initial starting parameter values.
"nbest" Positive integer. After
em.iter
EM iterations are performed in each of the nstarts
initial values, the number of top ranking (according to loglikelihood)
parameter values on which to run the long EM either to convergence (specified
by tol
) or maximum number of iterations (specified by
max.iter
). If nbest
is greater than one, the long EM run
achieving the largest loglikelihood will be returned.
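To give a sense of scale for these options, the sketch below evaluates the default nstarts expression from the usage (nclusters * 10 * prod(dim(x))) for the complete iris measurements with three clusters, and compares it with the lighter setting used in the example on this page. The variable names are hypothetical, and the iteration counts are simple products of nstarts and em.iter, not measured runtimes.

```r
x <- as.matrix(iris[, 1:4])   # 150 observations, 4 dimensions
K <- 3

# Default from the usage: nclusters * 10 * prod(dim(x)).
default_nstarts <- K * 10 * prod(dim(x))

# A lighter budget, matching the setting used in the example on this page.
emEM_args <- list(nstarts = K * 10, em.iter = 5, nbest = 1)

# Total short-run EM iterations implied by each choice (nstarts * em.iter).
short_iters_default <- default_nstarts * 5
short_iters_light   <- emEM_args$nstarts * emEM_args$em.iter

c(default = short_iters_default, light = short_iters_light)
```

Because the default grows with both the number of clusters and the size of the data, passing an explicit emEM.args list is one way to keep the initialization stage short when only a rough fit is needed.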
A list containing:
"estimates" A list of the final estimates "pi", "nu", "mu", and "Sigma" containing the MLEs for the mixing proportions, degrees of freedom, locations, and dispersions, respectively.
"iterations" Number of EM iterations performed (long EM run only; if emEM was performed, this excludes the short em run iterations specified in emEM.args$em.iter).
"Zs" A n \times K matrix where the i-th row contains the posterior probabilities of membership in cluster 1, …, K for the i-th observation (i=1,…,n).
"class" A vector of length n with the predicted class memberships for each observation.
"loglik" The log likelihood at each (long EM run) iteration.
"loglik"The
log likelihood at the last iteration, computed for all cases (including
those with missing values when method = "deletion"
)
"bic" The BIC for the final fitted model.
"EM.time" Runtime for the long EM run(s).
"em.time" Runtime for the short em run(s) when initial.values
= "emEM"
.
"total.time" Runtime for the entire function call.
"call" Supplied function call. npar
The number of model parameters.
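The relationship between a posterior membership matrix such as "Zs" and hard labels such as "class" can be sketched with a small made-up matrix. This assumes that each observation is assigned to the cluster with the largest posterior probability, which this page does not state explicitly; the numbers and the names Zs and hard are hypothetical.

```r
# Hypothetical 4 x 3 posterior membership matrix (rows sum to one).
Zs <- matrix(c(0.90, 0.05, 0.05,
               0.10, 0.70, 0.20,
               0.20, 0.30, 0.50,
               0.60, 0.30, 0.10),
             ncol = 3, byrow = TRUE)
stopifnot(isTRUE(all.equal(rowSums(Zs), rep(1, nrow(Zs)))))

# Hard assignment: index of the largest posterior probability in each row
# (assumed rule, see the note above).
hard <- max.col(Zs, ties.method = "first")
hard

# Cluster sizes implied by the hard assignment.
table(factor(hard, levels = seq_len(ncol(Zs))))
```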
Emily Goren, emily.goren@gmail.com
Emily M. Goren & Ranjan Maitra, 2022. "Fast model-based clustering of partial records," Stat, 11(1), e416. https://doi.org/10.1002/sta4.416
Tsung-I Lin & Hsiu Ho & Pao Shen, 2009. "Computationally efficient learning of multivariate t mixture models with missing information," Computational Statistics, 24(3): 375-392.
set.seed(20180626)
# Use iris data.
d <- subset(iris, select = -Species)
# Create missing data -- MCAR with 10% chance of missingness.
missing <- matrix(rbinom(n = ncol(d)*nrow(d), size = 1, prob = 0.1), ncol = ncol(d))
x <- d; x[missing == 1] <- NA
# Run EM with emEM initialization strategy for candidate clusters K = 2, 3, 4.
Ks <- 2:4
ans <- lapply(Ks, function(K) {
  MixtClust(x, nclusters = K, emEM.args = list(nstarts=K*10, em.iter=5, nbest=1))
})
# Get BIC for each K.
BICs <- sapply(ans, function(f) f$bic)
# Plot BIC by K.
plot(BICs ~ Ks, pch = 20, xlab = 'Number of Clusters', ylab = 'BIC')