select_mixture: Mixture Model Selection

select_mixture    R Documentation

Mixture Model Selection

Description

Fit mixtures based on several distributions and select the best model according to a given information criterion. The distributions include the multivariate contaminated normal, the multivariate generalized hyperbolic, and special and limiting cases of the multivariate generalized hyperbolic.

Usage

select_mixture(
  X,
  G,
  model = c("CN", "GH", "NIG", "SNIG", "SC", "C", "St", "t", "N", "SGH", "HUM", "H",
    "SH"),
  criterion = c("BIC", "AIC", "KIC", "KICc", "AIC3", "CAIC", "AICc", "ICL", "AWE", "CLC"),
  max_iter = 20,
  epsilon = 0.01,
  init_method = c("kmedoids", "kmeans", "hierarchical", "manual"),
  clusters = NULL,
  eta_min = 1.001,
  outlier_cutoff = 0.95,
  deriv_ctrl = list(eps = 1e-08, d = 1e-04, zero.tol = sqrt(.Machine$double.eps/7e-07),
    r = 6, v = 2, show.details = FALSE),
  progress = TRUE
)

Arguments

X

An n x d matrix or data frame where n is the number of observations and d is the number of variables.

G

The number of clusters, which must be at least 1. If G = 1, then both init_method and clusters are ignored.

model

A vector of character strings indicating the mixture model(s) to be fitted. By default, all distributions are considered. See the details section for a list of available distributions.

criterion

A character string indicating the information criterion for model selection. "BIC" is used by default. See the details section for a list of available information criteria.

max_iter

(optional) A numeric value giving the maximum number of iterations each EM algorithm is allowed to use; 20 by default.

epsilon

(optional) A number specifying the epsilon value for the Aitken-based stopping criterion used in the EM algorithm; 0.01 by default.
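To make the role of epsilon concrete, here is a minimal sketch of one common form of the Aitken-based stopping rule. It assumes the rule compares the Aitken estimate of the limiting log-likelihood with the current log-likelihood; the exact variant used inside the package is not stated here, and the helper name aitken_converged is hypothetical.

```r
# Hypothetical helper: Aitken-based convergence check on a vector of
# log-likelihood values, one per EM iteration. Illustrative only; this is
# not the package's internal code.
aitken_converged <- function(loglik, epsilon = 0.01) {
  k <- length(loglik)
  if (k < 3) return(FALSE)              # three values are needed for the ratio

  l_prev <- loglik[k - 2]
  l_curr <- loglik[k - 1]
  l_next <- loglik[k]

  denom <- l_curr - l_prev
  if (denom == 0) return(TRUE)          # log-likelihood has stopped changing

  a <- (l_next - l_curr) / denom        # Aitken acceleration factor
  l_inf <- l_curr + (l_next - l_curr) / (1 - a)  # estimated limiting value
  gap <- l_inf - l_curr

  # Converged when the estimated remaining gain is small and non-negative
  is.finite(gap) && gap >= 0 && gap < epsilon
}

# Toy log-likelihood sequence that converges geometrically to -100
ll <- -100 - 10 * 0.5^(1:12)
aitken_converged(ll[1:3])   # early iterations
aitken_converged(ll)        # later iterations
```

A smaller epsilon demands a tighter estimate of the limit and so typically needs more iterations, which interacts with max_iter.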

init_method

(optional) A string specifying the method used to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", and "manual". When "manual" is chosen, the vector clusters of length n must be specified. If the data set is incomplete, missing values are first filled in by mean imputation.
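The initialization described above can be sketched in base R as mean imputation followed by a clustering step. This is an illustration under assumptions: the helper mean_impute is hypothetical, the toy data are made up, and the package's internal initialization may differ in detail.

```r
# Hypothetical helper: replace each missing entry with its column mean.
mean_impute <- function(X) {
  X <- as.matrix(X)
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[, j], na.rm = TRUE)
  }
  X
}

# Made-up 6 x 2 data set with two missing entries
X <- cbind(c(1, 2, NA, 10, 11, 12),
           c(1, NA, 3, 10, 11, 12))

X_filled <- mean_impute(X)

# k-means on the imputed data gives one initial label per observation
set.seed(1)
init <- kmeans(X_filled, centers = 2, nstart = 5)$cluster
```

A vector such as init has length n, so it has the shape expected for clusters when init_method = "manual" is chosen.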

clusters

(optional) A vector of length n giving the user-specified initial cluster memberships when init_method is set to "manual". Both numeric and character vectors are acceptable. This argument is NULL by default and is ignored whenever another initialization method is chosen.

eta_min

(optional) A numeric value slightly greater than 1 specifying the minimum value of eta; 1.001 by default. This is only relevant for the CN mixture.

outlier_cutoff

(optional) A number between 0 and 1 indicating the percentile cutoff used for outlier detection; 0.95 by default. This is only relevant for the t mixture.

deriv_ctrl

(optional) A list of arguments that control the numerical procedures for calculating first and second derivatives. Suggested values are supplied by default. Refer to the functions grad and hessian in the numDeriv package for more information.
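As an illustration of what such a control list does, the sketch below passes the default values shown in the usage to numDeriv::grad and numDeriv::hessian through their method.args argument. The test function f and the point x0 are made up for the example, and how select_mixture forwards deriv_ctrl internally is an assumption.

```r
# Toy objective with known derivatives: gradient 2 * x, Hessian 2 * I
f  <- function(x) sum(x^2)
x0 <- c(1, -2, 3)

# Same values as the default deriv_ctrl in the usage section
ctrl <- list(eps = 1e-08, d = 1e-04,
             zero.tol = sqrt(.Machine$double.eps / 7e-07),
             r = 6, v = 2, show.details = FALSE)

# Run only when numDeriv is installed
if (requireNamespace("numDeriv", quietly = TRUE)) {
  g <- numDeriv::grad(f, x0, method = "Richardson", method.args = ctrl)
  H <- numDeriv::hessian(f, x0, method = "Richardson", method.args = ctrl)
}
```

In numDeriv, r is the number of Richardson improvement iterations and v is the factor by which the step size d is reduced at each one.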

progress

(optional) A logical value indicating whether the fitting progress should be displayed; TRUE by default.

Details

The function can fit mixtures via the contaminated normal distribution, generalized hyperbolic distribution, and special and limiting cases of the generalized hyperbolic distribution. Available distributions include

  • CN - Contaminated Normal

  • GH - Generalized Hyperbolic

  • NIG - Normal-Inverse Gaussian

  • SNIG - Symmetric Normal-Inverse Gaussian

  • SC - Skew-Cauchy

  • C - Cauchy

  • St - Skew-t

  • t - Student's t

  • N - Normal or Gaussian

  • SGH - Symmetric Generalized Hyperbolic

  • HUM - Hyperbolic Univariate Marginals

  • H - Hyperbolic

  • SH - Symmetric Hyperbolic

Available information criteria include

  • AIC - Akaike information criterion

  • BIC - Bayesian information criterion

  • KIC - Kullback information criterion

  • KICc - Corrected Kullback information criterion

  • AIC3 - Modified AIC

  • CAIC - Bozdogan's consistent AIC

  • AICc - Small-sample version of AIC

  • ICL - Integrated Completed Likelihood criterion

  • AWE - Approximate weight of evidence

  • CLC - Classification likelihood criterion
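For orientation, several of these criteria can be computed directly from a fitted model's log-likelihood. The sketch below uses the standard textbook definitions in the smaller-is-better form. The helper info_criteria and its argument names are hypothetical, and the package may use a different sign convention or parameter count, so treat this as an illustration and not as the package's formulas.

```r
# Hypothetical helper: standard likelihood-based criteria.
#   loglik - maximized log-likelihood
#   n_par  - number of free parameters
#   n      - number of observations
info_criteria <- function(loglik, n_par, n) {
  aic <- -2 * loglik + 2 * n_par
  c(
    AIC  = aic,
    BIC  = -2 * loglik + n_par * log(n),
    AIC3 = -2 * loglik + 3 * n_par,
    CAIC = -2 * loglik + n_par * (log(n) + 1),
    AICc = aic + 2 * n_par * (n_par + 1) / (n - n_par - 1)
  )
}

# Made-up inputs for illustration
ic <- info_criteria(loglik = -100, n_par = 5, n = 50)
```

All five share the term -2 * loglik and differ only in the complexity penalty, which is why they can rank the same set of fitted models differently.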

Value

A list with

best_mod

An object of class MixtureMissing corresponding to the best model.

all_mod

A list of objects of class MixtureMissing corresponding to all models under consideration. The list follows the order of model.

criterion

A numeric vector containing the values of the chosen information criterion for all models under consideration. The vector is ordered from the best model to the worst.

Each object of class MixtureMissing has slots that depend on the fitted model. See the values returned by MCNM and MGHM.

References

Browne, R. P. and McNicholas, P. D. (2015). A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics, 43(2):176–198.

Wei, Y., Tang, Y., and McNicholas, P. D. (2019). Mixtures of generalized hyperbolic distributions and mixtures of skew-t distributions for model-based clustering with incomplete data. Computational Statistics & Data Analysis, 130:18–41.

Examples


library(MixtureMissing)
data('bankruptcy')

#++++ With no missing values ++++#

X <- bankruptcy[, 2:3]
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)

#++++ With missing values ++++#

set.seed(1234)

X <- hide_values(bankruptcy[, 2:3], prop_cases = 0.1)
mod <- select_mixture(X, G = 2, model = c('CN', 'GH', 'St'), criterion = 'BIC', max_iter = 10)


MixtureMissing documentation built on Oct. 16, 2024, 1:09 a.m.