mdgc: Perform Model Estimation and Imputation
In mdgc: Missing Data Imputation Using Gaussian Copulas

mdgc	R Documentation

Description

A convenience function to perform model estimation and imputation in one call. The best learning rate is likely model specific, so the default value of `lr` should usually be adjusted. See mdgc_fit.

See the README at https://github.com/boennecd/mdgc for examples.

Usage

mdgc(
  dat,
  lr = 0.001,
  maxit = 25L,
  batch_size = NULL,
  rel_eps = 0.001,
  method = c("svrg", "adam", "aug_Lagran"),
  seed = 1L,
  epsilon = 1e-08,
  beta_1 = 0.9,
  beta_2 = 0.999,
  n_threads = 1L,
  do_reorder = TRUE,
  abs_eps = -1,
  maxpts = 10000L,
  minvls = 100L,
  verbose = FALSE,
  irel_eps = rel_eps,
  imaxit = maxpts,
  iabs_eps = abs_eps,
  iminvls = 1000L,
  start_val = NULL,
  decay = 0.98,
  conv_crit = 1e-05,
  use_aprx = FALSE
)
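
Several imputation arguments default to their estimation counterparts (`irel_eps = rel_eps`, `imaxit = maxpts`, `iabs_eps = abs_eps`). Because R evaluates default arguments lazily, changing `rel_eps` alone also changes `irel_eps` unless the latter is set explicitly. The sketch below uses `show_defaults`, a hypothetical stand-in that only mirrors these defaults; it is not part of mdgc and assumes the real function does not reassign the arguments before they are first used.

```r
# Hypothetical stand-in that copies only the linked defaults from the
# usage above. It is NOT part of the mdgc package.
show_defaults <- function(rel_eps = 0.001, abs_eps = -1, maxpts = 10000L,
                          irel_eps = rel_eps, imaxit = maxpts,
                          iabs_eps = abs_eps) {
  # the defaults are evaluated here, on first use, with the values the
  # caller supplied for rel_eps, maxpts and abs_eps
  list(rel_eps = rel_eps, irel_eps = irel_eps,
       maxpts = maxpts, imaxit = imaxit,
       abs_eps = abs_eps, iabs_eps = iabs_eps)
}

# only rel_eps is changed: irel_eps follows it
follows <- show_defaults(rel_eps = 1e-4)
str(follows[c("rel_eps", "irel_eps")])

# irel_eps is set explicitly: it no longer follows rel_eps
separate <- show_defaults(rel_eps = 1e-4, irel_eps = 1e-2)
str(separate[c("rel_eps", "irel_eps")])
```

If the imputation should use a different precision than the estimation, pass the `i`-prefixed arguments explicitly.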

Arguments

`dat`	`data.frame` with continuous, multinomial, ordinal, and binary variables.
`lr`	learning rate.
`maxit`	maximum number of iterations.
`batch_size`	number of observations in each batch.
`rel_eps`	relative error for each marginal likelihood factor.
`method`	estimation method to use. Can be `"svrg"`, `"adam"`, or `"aug_Lagran"`.
`seed`	fixed seed to use. Use `NULL` if the seed should not be fixed.
`epsilon`	ADAM parameters.
`beta_1`	ADAM parameters.
`beta_2`	ADAM parameters.
`n_threads`	number of threads to use.
`do_reorder`	logical for whether to use a heuristic variable reordering. `TRUE` is likely the best option.
`abs_eps`	absolute convergence threshold for each marginal likelihood factor.
`maxpts`	maximum number of samples to draw for each marginal likelihood term.
`minvls`	minimum number of samples.
`verbose`	logical for whether to print output during the estimation.
`irel_eps`	relative error for each term in the imputation.
`imaxit`	maximum number of samples to draw in the imputation.
`iabs_eps`	absolute convergence threshold for each term in the imputation.
`iminvls`	minimum number of samples in the imputation.
`start_val`	starting value for the covariance matrix. Use `NULL` if unspecified.
`decay`	the learning rate used by SVRG is given by `lr * decay^iteration_number`.
`conv_crit`	relative convergence threshold.
`use_aprx`	logical for whether to use an approximation of `pnorm` and `qnorm`. This may yield a noticeable reduction in the computation time.
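
To get a feel for how `lr` and `decay` interact under SVRG, the schedule `lr * decay^iteration_number` can be tabulated in base R. This is only a sketch: it assumes the iteration number starts at 1, which the documentation above does not state.

```r
# default values taken from the usage section
lr    <- 0.001
decay <- 0.98
maxit <- 25L

# ASSUMPTION: iterations are numbered 1, 2, ..., maxit
iteration_number <- seq_len(maxit)
lr_schedule <- lr * decay^iteration_number

# learning rate in the first three iterations and in the last one
print(head(lr_schedule, 3))
print(tail(lr_schedule, 1))

# fraction of the initial learning rate that remains at the end
print(tail(lr_schedule, 1) / lr)
```

A `decay` closer to 1 keeps the learning rate nearly constant, while smaller values shrink it quickly over the `maxit` iterations.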

Details

It is important that the input for `dat` has the appropriate types and classes. See get_mdgc.
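
As a sketch of what this preparation might look like, the toy data below is converted so that each column has a distinct class. The mapping is an assumption based on the example on this page (ordered factors for ordinal variables, logicals for binary variables) together with the usual convention of numeric vectors for continuous variables and unordered factors for multinomial variables; get_mdgc has the authoritative rules. The data set and column names are made up for illustration.

```r
set.seed(1)
n <- 8L

# made-up raw data as it might come out of a CSV file
toy <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),                      # continuous
  grade  = sample(c("low", "mid", "high"), n, replace = TRUE), # ordinal
  smoker = sample(0:1, n, replace = TRUE),                     # binary
  site   = sample(c("a", "b", "c"), n, replace = TRUE),        # multinomial
  stringsAsFactors = FALSE)

# ordinal: ordered factor with the levels in their natural order
toy$grade <- factor(toy$grade, levels = c("low", "mid", "high"),
                    ordered = TRUE)
# binary: logical
toy$smoker <- as.logical(toy$smoker)
# multinomial: unordered factor
toy$site <- factor(toy$site)

# check the first class of each column before calling the estimation
col_classes <- vapply(toy, function(x) class(x)[1], character(1))
print(col_classes)
```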

Value

A list with the following entries:

`ximp`	`data.frame` with the observed and imputed values.
`imputed`	output from `mdgc_impute`.
`vcov`	the estimated covariance matrix.
`mea`	the estimated non-zero mean terms.

Additional elements may be present depending on the chosen method. See mdgc_fit.

References

Kingma, D.P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv:1412.6980.

Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems.

Examples


# there is a bug on CRAN's check on Solaris which I have failed to reproduce.
# See https://github.com/r-hub/solarischeck/issues/8#issuecomment-796735501.
# Thus, this example is not run on Solaris
is_solaris <- tolower(Sys.info()[["sysname"]]) == "sunos"

if(!is_solaris && require(catdata)){
  data(retinopathy)

  # prepare data and save true data set
  retinopathy$RET <- as.ordered(retinopathy$RET)
  retinopathy$SM <- as.logical(retinopathy$SM)

  # randomly mask data
  set.seed(28325145)
  truth <- retinopathy
  for(i in seq_along(retinopathy))
    retinopathy[[i]][runif(NROW(retinopathy)) < .3] <- NA

  cat("\nMasked data:\n")
  print(head(retinopathy, 10))
  cat("\n")

  # impute data
  impu <- mdgc(retinopathy, lr = 1e-3, maxit = 25L, batch_size = 25L,
               rel_eps = 1e-3, maxpts = 5000L, verbose = TRUE,
               n_threads = 1L, method = "svrg")

  # show correlation matrix
  cat("\nEstimated correlation matrix\n")
  print(impu$vcov)

  # compare imputed and true values
  cat("\nObserved:\n")
  print(head(retinopathy, 10))
  cat("\nImputed values:\n")
  print(head(impu$ximp, 10))
  cat("\nTruth:\n")
  print(head(truth, 10))

  # using augmented Lagrangian method
  cat("\n")
  impu_aug <- mdgc(retinopathy, maxit = 25L, rel_eps = 1e-3,
                   maxpts = 5000L, verbose = TRUE,
                   n_threads = 1L, method = "aug_Lagran")

  # compare the log-likelihood estimate
  obj <- get_mdgc_log_ml(retinopathy)
  cat(sprintf(
    "Maximum log likelihood with SVRG vs. augmented Lagrangian:\n  %.2f vs. %.2f\n",
    mdgc_log_ml(obj, vcov = impu    $vcov, mea = impu    $mea, rel_eps = 1e-3),
    mdgc_log_ml(obj, vcov = impu_aug$vcov, mea = impu_aug$mea, rel_eps = 1e-3)))

  # show correlation matrix
  cat("\nEstimated correlation matrix (augmented Lagrangian)\n")
  print(impu_aug$vcov)

  cat("\nImputed values (augmented Lagrangian):\n")
  print(head(impu_aug$ximp, 10))
}
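
The example above compares imputed and true values by printing the first rows. A numeric summary over the masked cells can be easier to read. The helper below, `masked_error`, is hypothetical and not part of mdgc: it reports the root mean squared error for numeric columns and the misclassification rate for all other columns, using only the entries that were masked. It is shown on a small made-up data set so that it runs without the package.

```r
# Hypothetical helper (not part of mdgc): per-column error on masked cells.
#   truth  data.frame before masking
#   masked data.frame after masking (NA where a value was removed)
#   ximp   data.frame with imputed values, e.g. the ximp element
masked_error <- function(truth, masked, ximp) {
  vapply(names(truth), function(v) {
    # cells that were observed originally but removed by the masking
    was_masked <- is.na(masked[[v]]) & !is.na(truth[[v]])
    if (!any(was_masked))
      return(NA_real_)
    tr <- truth[[v]][was_masked]
    im <- ximp[[v]][was_masked]
    if (is.numeric(tr))
      sqrt(mean((im - tr)^2))                     # RMSE
    else
      mean(as.character(im) != as.character(tr))  # misclassification rate
  }, numeric(1))
}

# made-up stand-ins for truth, the masked data, and an imputed data set
truth  <- data.frame(x = c(1, 2, 3, 4), b = c(TRUE, FALSE, TRUE, TRUE))
masked <- truth
masked$x[2]       <- NA
masked$b[c(1, 4)] <- NA
ximp <- masked
ximp$x[2]       <- 4               # pretend imputed value
ximp$b[c(1, 4)] <- c(TRUE, FALSE)  # pretend imputed values

err <- masked_error(truth, masked, ximp)
print(err)
```

With the objects from the example above, the analogous call would be `masked_error(truth, retinopathy, impu$ximp)`.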

mdgc documentation built on May 31, 2023, 7:31 p.m.