Description

The softImpute algorithm is used to impute missing values. For more
details, see softImpute.
Usage

impute_soft(
  data_ref,
  data_new = NULL,
  cols = dplyr::everything(),
  rank_max_ovrl = min(dim(data_ref) - 1),
  rank_max_init = min(2, rank_max_ovrl),
  rank_stp_size = 1,
  lambda = seq(rank_max_ovrl * 0.6, 1, length.out = 10),
  grid = FALSE,
  restore_data = TRUE,
  verbose = 1,
  bs = TRUE,
  bs_maxit = 20,
  bs_thresh = 1e-09,
  bs_row.center = FALSE,
  bs_col.center = TRUE,
  bs_row.scale = FALSE,
  bs_col.scale = TRUE,
  si_type = "als",
  si_thresh = 1e-05,
  si_maxit = 100,
  si_final.svd = TRUE
)
Arguments

data_ref: a data frame.

data_new: an optional data frame. If supplied, imputed values for
data_new are created using imputation models fitted to data_ref.

cols: columns that should be imputed and/or used to impute other
columns. Supports tidy select functions (see the example after this
list).

rank_max_ovrl: an integer value that restricts the rank of the
solution for all fitted softImpute models.

rank_max_init: an integer value that restricts the rank of the
solution for the first fitted softImpute model.

rank_stp_size: an integer value that indicates how much the maximum
rank of the solution increases with each successive softImpute model.

lambda: nuclear-norm regularization parameter(s). Together with the
rank sequence and grid, the number of lambda values determines how
many sets of imputed values are returned (see Details).

grid: a logical value. If TRUE, every combination of lambda and the
rank sequence is fitted; if FALSE, lambda values are paired with
increasing maximum ranks (see Details).

restore_data: a logical value. If ...

verbose: an integer value of 0, 1, or 2. Larger values print more
information while models are being fitted.

bs: a logical value. If TRUE, the data are centered and scaled with
softImpute::biScale() before imputation.

bs_maxit: an integer indicating the maximum number of iterations for
the biScale algorithm.

bs_thresh: convergence threshold for the biScale algorithm.

bs_row.center: a logical value. If TRUE, rows are centered to have
mean zero.

bs_col.center: a logical value. If TRUE, columns are centered to have
mean zero.

bs_row.scale: a logical value. If TRUE, rows are scaled to have
variance one.

bs_col.scale: a logical value. If TRUE, columns are scaled to have
variance one.

si_type: two algorithms are implemented, si_type = "svd" or the
default si_type = "als". The "svd" algorithm repeatedly computes the
svd of the completed matrix and soft-thresholds its singular values.
Each new soft-thresholded svd is used to re-impute the missing
entries. For large matrices of class "Incomplete", the svd is
achieved by an efficient form of alternating orthogonal ridge
regression. The "als" algorithm uses this same alternating ridge
regression, but updates the imputation at each step, leading to quite
substantial speedups in some cases. The "als" approach does not
currently have the same theoretical convergence guarantees as the
"svd" approach.

si_thresh: convergence threshold for the softImpute algorithm.

si_maxit: maximum number of iterations for the softImpute algorithm.

si_final.svd: only applicable to si_type = "als". If TRUE, a final
svd step is run when the "als" algorithm finishes so that very small
singular values are thresholded to exact zeros.
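A minimal usage sketch (not taken from the package's own documented
examples): the data frame, column names, and missingness pattern below
are hypothetical, and the package providing impute_soft() is assumed
to be attached.

# Hypothetical data frame with scattered missing values
df <- data.frame(
  age  = c(31, NA, 45, 52, NA, 29, 60, 48),
  bmi  = c(22.1, 27.4, NA, 30.2, 25.0, NA, 28.8, 24.3),
  chol = c(180, 210, 195, NA, 240, 205, NA, 190)
)

# Default: impute all columns (cols = dplyr::everything())
imputes_all <- impute_soft(data_ref = df)

# Tidy selection: impute/use only the bmi and chol columns
imputes_subset <- impute_soft(data_ref = df, cols = c(bmi, chol))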
Details

Multiple imputation: The number of imputations returned depends on
rank_max_init, rank_max_ovrl, rank_stp_size, lambda, and grid. If grid
is FALSE, then there will be length(lambda) sets of imputed values in
the returned output, and they will be based on fitted softImpute
models with increasing maximum ranks. Generally, these ranks are
seq(rank_max_init, rank_max_ovrl, by = rank_stp_size), but they are
adjusted automatically as needed for consistency with (1) lambda and
(2) the maximum allowed rank for data_ref. If grid is TRUE, then every
combination of lambda and the rank sequence will be fitted, and the
output will contain one set of imputed values for each combination.
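A rough sketch of this bookkeeping; the tuning values below are
illustrative, and the counts are computed by hand rather than taken
from impute_soft() output.

# Hypothetical tuning settings
rank_max_ovrl <- 5
rank_max_init <- 2
rank_stp_size <- 1
lambda <- seq(rank_max_ovrl * 0.6, 1, length.out = 10)

# Candidate maximum ranks, before any automatic adjustment
ranks <- seq(rank_max_init, rank_max_ovrl, by = rank_stp_size)  # 2 3 4 5

# grid = FALSE: one set of imputed values per lambda value
length(lambda)                   # 10

# grid = TRUE: one set per (lambda, rank) combination
length(lambda) * length(ranks)   # 40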
Rank inputs: If the maximum rank is sufficiently large, and with
si_type = "svd", the softImpute algorithm solves the nuclear-norm
convex matrix-completion problem (see Reference 1). In this case the
number of nonzero singular values returned will be less than or equal
to the maximum rank. If smaller ranks are used, the solution is not
guaranteed to solve the convex problem, although it still yields good
local minima. The rank of a softImpute fit should not exceed
min(dim(data_ref) - 1).
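A small sketch of this bound, assuming a hypothetical 100-row,
12-column data_ref.

dims <- c(100, 12)   # dim(data_ref) for the hypothetical data frame

# Upper bound on the rank of a softImpute fit; this is also the
# default value of rank_max_ovrl
min(dims - 1)        # 11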
biScale: The softImpute::biScale() function is more flexible than the
current function indicates. Specifically, biScale allows users to
supply vectors to its row/column centering/scaling inputs that will
in turn be used to center/scale the corresponding rows/columns.
impute_soft() is more strict and does not offer this option. Also,
impute_soft() uses different default values to increase the
likelihood of the biScale algorithm converging quickly.
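A short sketch of the difference, assuming the softImpute package is
installed; the matrix and its missingness pattern are made up for
illustration.

library(softImpute)

set.seed(1)
x <- matrix(rnorm(60), nrow = 12, ncol = 5)
x[sample(length(x), 10)] <- NA

# softImpute::biScale() accepts numeric vectors as well as logicals,
# e.g. centering each column at a pre-specified value:
col_centers <- colMeans(x, na.rm = TRUE)
x_centered <- biScale(x,
                      row.center = FALSE, row.scale = FALSE,
                      col.center = col_centers, col.scale = FALSE)

# impute_soft() only exposes logical on/off switches for these steps,
# with defaults chosen to help the biScale algorithm converge quickly:
#   bs_row.center = FALSE, bs_col.center = TRUE,
#   bs_row.scale  = FALSE, bs_col.scale  = TRUE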
Value

A data frame with fitting parameters and imputed values.
References

Rahul Mazumder, Trevor Hastie, and Rob Tibshirani (2010). Spectral
Regularization Algorithms for Learning Large Incomplete Matrices.
Journal of Machine Learning Research 11, 2287-2322.
http://www.stanford.edu/~hastie/Papers/mazumder10a.pdf