# R/gmm_train.R
# Part of mlpack: 'Rcpp' Integration for the 'mlpack' Library

#' @title Gaussian Mixture Model (GMM) Training
#'
#' @description
#' An implementation of the EM algorithm for training Gaussian mixture models
#' (GMMs).  Given a dataset, this can train a GMM for future use with other
#' tools.
#'
#' @param gaussians Number of Gaussians in the GMM (integer).
#' @param input The training data on which the model will be fit (numeric
#'   matrix).
#' @param diagonal_covariance Force the covariance of the Gaussians to be
#'   diagonal.  This can accelerate training time significantly.  Default value
#'   "FALSE" (logical).
#' @param input_model Initial input GMM model to start training with
#'   (GMM).
#' @param kmeans_max_iterations Maximum number of iterations for the
#'   k-means algorithm (used to initialize EM).  Default value "1000"
#'   (integer).
#' @param max_iterations Maximum number of iterations of EM algorithm
#'   (passing 0 will run until convergence).  Default value "250" (integer).
#' @param no_force_positive Do not force the covariance matrices to be
#'   positive definite.  Default value "FALSE" (logical).
#' @param noise Variance of zero-mean Gaussian noise to add to data.
#'   Default value "0" (numeric).
#' @param percentage If using "refined_start", specify the fraction of
#'   the dataset used for each sampling (should be between 0.0 and 1.0).
#'   Default value "0.02" (numeric).
#' @param refined_start During the initialization, use refined initial
#'   positions for k-means clustering (Bradley and Fayyad, 1998).  Default value
#'   "FALSE" (logical).
#' @param samplings If using "refined_start", specify the number of
#'   samplings used for initial points.  Default value "100" (integer).
#' @param seed Random seed.  If 0, 'std::time(NULL)' is used.  Default
#'   value "0" (integer).
#' @param tolerance Tolerance for convergence of EM.  Default value "1e-10"
#'   (numeric).
#' @param trials Number of trials to perform in training GMM.  Default
#'   value "1" (integer).
#' @param verbose Display informational messages and the full list of
#'   parameters and timers at the end of execution.  Default value "FALSE"
#'   (logical).
#'
#' @return A list with one component:
#' \item{output_model}{Output for trained GMM model (GMM).}
#'
#' @details
#' This function computes a parametric estimate of a Gaussian mixture model
#' (GMM), using the EM algorithm to find the maximum likelihood estimate.  The
#' model may be saved and reused by other mlpack GMM tools.
#' 
#' The input data to train on must be specified with the "input" parameter, and
#' the number of Gaussians in the model must be specified with the "gaussians"
#' parameter.  Optionally, many trials with different random initializations may
#' be run, and the result with highest log-likelihood on the training data will
#' be taken.  The number of trials to run is specified with the "trials"
#' parameter.  By default, only one trial is run.
#' 
#' The tolerance for convergence and maximum number of iterations of the EM
#' algorithm are specified with the "tolerance" and "max_iterations" parameters,
#' respectively.  The GMM may be initialized for training with another model,
#' specified with the "input_model" parameter. Otherwise, the model is
#' initialized by running k-means on the data.  The k-means clustering
#' initialization can be controlled with the "kmeans_max_iterations",
#' "refined_start", "samplings", and "percentage" parameters.  If
#' "refined_start" is specified, then the Bradley-Fayyad refined start
#' initialization will be used.  This can often lead to better clustering
#' results.
#' 
#' The "diagonal_covariance" parameter, if set, restricts the learned
#' covariances to diagonal matrices.  This significantly simplifies the model
#' and speeds up training, but limits the ability to fit more complex GMMs.
#' 
#' If GMM training fails with an error indicating that a covariance matrix could
#' not be inverted, make sure that the "no_force_positive" parameter is not
#' set.  Alternatively, adding a small amount of Gaussian noise (using the
#' "noise" parameter) to the entire dataset may help prevent Gaussians with zero
#' variance in a particular dimension, which is usually the cause of
#' non-invertible covariance matrices.
#' 
#' The "no_force_positive" parameter, if set, skips the checks after each
#' iteration of the EM algorithm that ensure the covariance matrices are
#' positive definite.  Setting it can reduce runtime, but may also produce
#' covariance matrices that are not positive definite, which will cause the
#' program to crash.
#'
#' @author
#' mlpack developers
#'
#' @export
#' @examples
#' # As an example, to train a 6-Gaussian GMM on the data in "data" with a
#' # maximum of 100 iterations of EM and 3 trials, saving the trained GMM to
#' # "gmm", the following command can be used:
#' 
#' \dontrun{
#' output <- gmm_train(input=data, gaussians=6, max_iterations=100, trials=3)
#' gmm <- output$output_model
#' }
#' 
#' # To re-train that GMM on another set of data "data2", the following command
#' # may be used: 
#' 
#' \dontrun{
#' output <- gmm_train(input_model=gmm, input=data2, gaussians=6)
#' new_gmm <- output$output_model
#' }
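#' 
#' # A further sketch (an illustration under stated assumptions, not one of the
#' # upstream examples): fitting a 2-Gaussian GMM with diagonal covariances,
#' # refined-start k-means initialization, and a fixed seed.  The synthetic
#' # matrix "toy" is a hypothetical dataset, assumed to hold one observation
#' # per row; the values for "samplings" and "percentage" are arbitrary.
#' 
#' \dontrun{
#' set.seed(42)
#' toy <- rbind(matrix(rnorm(200, mean=0), ncol=2),
#'              matrix(rnorm(200, mean=5), ncol=2))
#' output <- gmm_train(input=toy, gaussians=2, diagonal_covariance=TRUE,
#'                     refined_start=TRUE, samplings=50, percentage=0.1,
#'                     seed=42)
#' diag_gmm <- output$output_model
#' }
#' 
#' # If training on "toy" were to fail because a covariance matrix could not be
#' # inverted, a small amount of noise could be added.  The variance 1e-3 is an
#' # arbitrary illustrative choice and should be tuned to the scale of the data:
#' 
#' \dontrun{
#' output <- gmm_train(input=toy, gaussians=2, noise=1e-3)
#' noisy_gmm <- output$output_model
#' }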
gmm_train <- function(gaussians,
                      input,
                      diagonal_covariance=FALSE,
                      input_model=NA,
                      kmeans_max_iterations=NA,
                      max_iterations=NA,
                      no_force_positive=FALSE,
                      noise=NA,
                      percentage=NA,
                      refined_start=FALSE,
                      samplings=NA,
                      seed=NA,
                      tolerance=NA,
                      trials=NA,
                      verbose=FALSE) {
  # Create parameters and timers objects.
  p <- CreateParams("gmm_train")
  t <- CreateTimers()
  # Initialize an empty list that will hold all input models the user gave us,
  # so that we don't accidentally create two XPtrs that point to the same model.
  inputModels <- vector()

  # Process each input argument before calling the binding.
  SetParamInt(p, "gaussians", gaussians)

  SetParamMat(p, "input", to_matrix(input), TRUE)

  if (!identical(diagonal_covariance, FALSE)) {
    SetParamBool(p, "diagonal_covariance", diagonal_covariance)
  }

  if (!identical(input_model, NA)) {
    SetParamGMMPtr(p, "input_model", input_model)
    # Add to the list of input models we received.
    inputModels <- append(inputModels, input_model)
  }

  if (!identical(kmeans_max_iterations, NA)) {
    SetParamInt(p, "kmeans_max_iterations", kmeans_max_iterations)
  }

  if (!identical(max_iterations, NA)) {
    SetParamInt(p, "max_iterations", max_iterations)
  }

  if (!identical(no_force_positive, FALSE)) {
    SetParamBool(p, "no_force_positive", no_force_positive)
  }

  if (!identical(noise, NA)) {
    SetParamDouble(p, "noise", noise)
  }

  if (!identical(percentage, NA)) {
    SetParamDouble(p, "percentage", percentage)
  }

  if (!identical(refined_start, FALSE)) {
    SetParamBool(p, "refined_start", refined_start)
  }

  if (!identical(samplings, NA)) {
    SetParamInt(p, "samplings", samplings)
  }

  if (!identical(seed, NA)) {
    SetParamInt(p, "seed", seed)
  }

  if (!identical(tolerance, NA)) {
    SetParamDouble(p, "tolerance", tolerance)
  }

  if (!identical(trials, NA)) {
    SetParamInt(p, "trials", trials)
  }

  if (verbose) {
    EnableVerbose()
  } else {
    DisableVerbose()
  }

  # Mark all output options as passed.
  SetPassed(p, "output_model")

  # Call the program.
  gmm_train_call(p, t)

  # Add ModelType as attribute to the model pointer, if needed.
  output_model <- GetParamGMMPtr(p, "output_model", inputModels)
  attr(output_model, "type") <- "GMM"

  # Extract the results in order.
  out <- list(
      "output_model" = output_model
  )


  return(out)
}