gmm_train: Gaussian Mixture Model (GMM) Training
In mlpack: 'Rcpp' Integration for the 'mlpack' Library

Description

An implementation of the EM algorithm for training Gaussian mixture models (GMMs). Given a dataset, this can train a GMM for future use with other tools.

Usage

gmm_train(
  gaussians,
  input,
  diagonal_covariance = FALSE,
  input_model = NA,
  kmeans_max_iterations = NA,
  max_iterations = NA,
  no_force_positive = FALSE,
  noise = NA,
  percentage = NA,
  refined_start = FALSE,
  samplings = NA,
  seed = NA,
  tolerance = NA,
  trials = NA,
  verbose = getOption("mlpack.verbose", FALSE)
)

Arguments

`gaussians`	Number of Gaussians in the GMM (integer).
`input`	The training data on which the model will be fit (numeric matrix).
`diagonal_covariance`	Force the covariance of the Gaussians to be diagonal. This can accelerate training time significantly. Default value "FALSE" (logical).
`input_model`	Initial input GMM model to start training with (GMM).
`kmeans_max_iterations`	Maximum number of iterations for the k-means algorithm (used to initialize EM). Default value "1000" (integer).
`max_iterations`	Maximum number of iterations of the EM algorithm (passing 0 will run until convergence). Default value "250" (integer).
`no_force_positive`	Do not force the covariance matrices to be positive definite. Default value "FALSE" (logical).
`noise`	Variance of zero-mean Gaussian noise to add to data. Default value "0" (numeric).
`percentage`	If using "refined_start", specify the percentage of the dataset used for each sampling (should be between 0.0 and 1.0). Default value "0.02" (numeric).
`refined_start`	During the initialization, use refined initial positions for k-means clustering (Bradley and Fayyad, 1998). Default value "FALSE" (logical).
`samplings`	If using "refined_start", specify the number of samplings used for initial points. Default value "100" (integer).
`seed`	Random seed. If 0, 'std::time(NULL)' is used. Default value "0" (integer).
`tolerance`	Tolerance for convergence of EM. Default value "1e-10" (numeric).
`trials`	Number of trials to perform in training GMM. Default value "1" (integer).
`verbose`	Display informational messages and the full list of parameters and timers at the end of execution. Default value 'getOption("mlpack.verbose", FALSE)' (logical).

Details

This program computes a parametric estimate of a Gaussian mixture model (GMM), using the EM algorithm to find the maximum likelihood estimate. The trained model may be saved and reused by other mlpack GMM tools.

The input data to train on must be specified with the "input" parameter, and the number of Gaussians in the model must be specified with the "gaussians" parameter. Optionally, several trials with different random initializations may be run, and the result with the highest log-likelihood on the training data will be kept. The number of trials to run is specified with the "trials" parameter. By default, only one trial is run.

The tolerance for convergence and maximum number of iterations of the EM algorithm are specified with the "tolerance" and "max_iterations" parameters, respectively. The GMM may be initialized for training with another model, specified with the "input_model" parameter. Otherwise, the model is initialized by running k-means on the data. The k-means clustering initialization can be controlled with the "kmeans_max_iterations", "refined_start", "samplings", and "percentage" parameters. If "refined_start" is specified, then the Bradley-Fayyad refined start initialization will be used. This can often lead to better clustering results.
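As a minimal sketch of the initialization options described above, the call below enables the Bradley-Fayyad refined start for the k-means step. The synthetic matrix `x` and the values chosen for "samplings", "percentage", and "kmeans_max_iterations" are illustrative assumptions, not recommendations, and the call is guarded so that it only runs when the mlpack package is installed.

```r
# Synthetic two-cluster data: 200 observations (rows) in 2 dimensions (columns).
# The data and the parameter values below are assumptions for illustration.
set.seed(42)
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 5), ncol = 2)
)

if (requireNamespace("mlpack", quietly = TRUE)) {
  # Initialize k-means with the refined start, using 50 samplings that each
  # draw 10% of the data, then run EM from that starting point.
  output <- mlpack::gmm_train(
    input = x,
    gaussians = 2,
    refined_start = TRUE,
    samplings = 50,
    percentage = 0.1,
    kmeans_max_iterations = 500
  )
  gmm <- output$output_model
}
```

If "refined_start" is left at its default of FALSE, the "samplings" and "percentage" values are not relevant to the initialization.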

The "diagonal_covariance" parameter, if set, will cause the learned covariances to be diagonal matrices. This significantly simplifies the model and makes training faster, but restricts the ability to fit more complex GMMs.
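To make the trade-off concrete, the sketch below counts the free covariance parameters per Gaussian for a full versus a diagonal covariance matrix, then trains both variants on the same data. The synthetic matrix `x` and the variable names are assumptions for illustration; the training calls are guarded so that they only run when the mlpack package is installed.

```r
# Synthetic data: 200 observations (rows) in 3 dimensions (columns).
set.seed(1)
x <- matrix(rnorm(600), ncol = 3)
d <- ncol(x)

# A symmetric d x d covariance matrix has d * (d + 1) / 2 free entries;
# a diagonal one has only d.
full_params <- d * (d + 1) / 2
diag_params <- d

if (requireNamespace("mlpack", quietly = TRUE)) {
  # Same data and number of Gaussians; only the covariance structure differs.
  full_gmm <- mlpack::gmm_train(input = x, gaussians = 3)$output_model
  diag_gmm <- mlpack::gmm_train(
    input = x,
    gaussians = 3,
    diagonal_covariance = TRUE
  )$output_model
}
```

The gap between the two counts grows quadratically with the number of dimensions, which is why the diagonal restriction matters most for high-dimensional data.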

If GMM training fails with an error indicating that a covariance matrix could not be inverted, make sure that the "no_force_positive" parameter is not specified. Alternatively, adding a small amount of Gaussian noise (using the "noise" parameter) to the entire dataset may help prevent Gaussians with zero variance in a particular dimension, which is usually the cause of non-invertible covariance matrices.
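The sketch below illustrates the situation described above with a dataset whose second dimension is constant, so its variance is exactly zero. The data and the noise variance of 1e-3 are assumptions chosen for illustration; an appropriate value depends on the scale of the data. The training call is guarded so that it only runs when the mlpack package is installed.

```r
# 100 observations in 2 dimensions; the second dimension is constant,
# so any Gaussian fit to it would have zero variance in that dimension.
set.seed(7)
x <- cbind(rnorm(100), rep(1, 100))
zero_var_dims <- which(apply(x, 2, var) == 0)

if (requireNamespace("mlpack", quietly = TRUE)) {
  # Add zero-mean Gaussian noise with a small variance before training,
  # and leave no_force_positive at its default of FALSE.
  output <- mlpack::gmm_train(
    input = x,
    gaussians = 2,
    noise = 1e-3
  )
  gmm <- output$output_model
}
```

Checking the per-dimension variance of the data before training, as done with `zero_var_dims` here, is a cheap way to spot this problem in advance.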

The "no_force_positive" parameter, if set, will skip the checks after each iteration of the EM algorithm that ensure the covariance matrices are positive definite. Setting it can reduce runtime, but may also produce covariance matrices that are not positive definite, which will cause the program to crash.

Value

A list with the following component:

output_model

Output for trained GMM model (GMM).
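As a hedged sketch of how the returned model might be passed on to other mlpack GMM tools, the example below extracts "output_model" and hands it to `gmm_probability()` and `gmm_generate()`. These two bindings, their argument names, and their `output` components are assumptions about the companion functions in the mlpack R package and should be checked against their own documentation. The synthetic matrix `x` is also an assumption for illustration.

```r
# Synthetic two-cluster data: 100 observations (rows) in 2 dimensions (columns).
set.seed(3)
x <- rbind(
  matrix(rnorm(100, mean = -2), ncol = 2),
  matrix(rnorm(100, mean = 2), ncol = 2)
)

if (requireNamespace("mlpack", quietly = TRUE)) {
  # Train and extract the model from the returned list.
  gmm <- mlpack::gmm_train(input = x, gaussians = 2)$output_model

  # Assumed companion bindings: evaluate the density at each training point,
  # and draw 10 new samples from the trained mixture.
  probs <- mlpack::gmm_probability(input = x, input_model = gmm)$output
  samples <- mlpack::gmm_generate(input_model = gmm, samples = 10)$output
}
```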

Author(s)

mlpack developers

Examples

# As an example, to train a 6-Gaussian GMM on the data in "data" with a
# maximum of 100 iterations of EM and 3 trials, saving the trained GMM to
# "gmm", the following command can be used:

## Not run: 
output <- gmm_train(input=data, gaussians=6, max_iterations=100, trials=3)
gmm <- output$output_model

## End(Not run)

# To re-train that GMM on another set of data "data2", the following command
# may be used: 

## Not run: 
output <- gmm_train(input_model=gmm, input=data2, gaussians=6)
new_gmm <- output$output_model

## End(Not run)

mlpack documentation built on June 8, 2025, 10:47 a.m.