impute_soft: Soft imputation

View source: R/impute_soft.R

Description

Imputes missing values using the softImpute algorithm. For more details, see softImpute::softImpute().

Usage

impute_soft(
  data_ref,
  data_new = NULL,
  cols = dplyr::everything(),
  rank_max_ovrl = min(dim(data_ref) - 1),
  rank_max_init = min(2, rank_max_ovrl),
  rank_stp_size = 1,
  lambda = seq(rank_max_ovrl * 0.6, 1, length.out = 10),
  grid = FALSE,
  restore_data = TRUE,
  verbose = 1,
  bs = TRUE,
  bs_maxit = 20,
  bs_thresh = 1e-09,
  bs_row.center = FALSE,
  bs_col.center = TRUE,
  bs_row.scale = FALSE,
  bs_col.scale = TRUE,
  si_type = "als",
  si_thresh = 1e-05,
  si_maxit = 100,
  si_final.svd = TRUE
)
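
Several defaults in the signature above are computed from data_ref. The sketch below evaluates those expressions by hand for a small made-up data frame with 10 rows and 5 numeric columns. It takes the signature at face value; it is an assumption that impute_soft() evaluates these defaults on data_ref before any internal preprocessing (column selection, one-hot encoding).

```r
# Made-up reference data: 10 rows, 5 numeric columns.
set.seed(1)
data_ref <- as.data.frame(matrix(rnorm(50), nrow = 10, ncol = 5))

# Default expressions copied from the usage block.
rank_max_ovrl <- min(dim(data_ref) - 1)                  # min(9, 4)
rank_max_init <- min(2, rank_max_ovrl)                   # min(2, 4)
lambda <- seq(rank_max_ovrl * 0.6, 1, length.out = 10)   # decreasing grid of 10 values

rank_max_ovrl
rank_max_init
range(lambda)
```

With these dimensions the overall rank limit is 4, the first fit is limited to rank 2, and lambda runs from 2.4 down to 1.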

Arguments

data_ref

a data frame.

data_new

an optional data frame. If supplied, then data_ref will be used as a reference dataset for data_new and the output will contain imputed values for data_new. If not supplied, the output will contain imputed values for data_ref.

cols

columns that should be imputed and/or used to impute other columns. Supports tidy select functions (see examples).

rank_max_ovrl

an integer value that restricts the rank of the solution for all softImpute fits.

rank_max_init

an integer value that restricts the rank of the solution for the first softImpute fit. Sequential fits may have higher rank depending upon rank_max_ovrl, rank_stp_size, and grid.

rank_stp_size

an integer value that indicates how much the maximum rank of softImpute fits should increase between iterations.

lambda

a numeric vector of nuclear-norm regularization parameters. If lambda = 0, the algorithm reverts to "hardImpute", which typically converges more slowly and only to a local minimum. Ideally, lambda should be chosen so that the solution reached has rank slightly less than the maximum rank (rank.max in softImpute). See also softImpute::lambda0() for computing the smallest lambda with a zero solution.
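
As a rough guide to choosing lambda, the nuclear-norm problem has an all-zero solution once lambda reaches the largest singular value of the observed data with missing entries set to zero. The base-R sketch below computes that quantity directly and builds a decreasing grid below it. It is an assumption that this matches what lambda0() reports, and when bs = TRUE the relevant matrix would be the scaled one rather than the raw one used here. The names x_zero, lambda_max, and lambda_grid are illustrative.

```r
# Made-up numeric matrix with 10 missing entries.
set.seed(1)
x <- matrix(rnorm(50), nrow = 10, ncol = 5)
x[sample(length(x), 10)] <- NA

# Zero-fill the missing entries and take the largest singular value.
x_zero <- x
x_zero[is.na(x_zero)] <- 0
lambda_max <- svd(x_zero)$d[1]

# A decreasing grid of candidate penalties strictly below lambda_max.
lambda_grid <- seq(lambda_max * 0.9, lambda_max * 0.1, length.out = 10)
lambda_grid
```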

grid

a logical value. If TRUE, all combinations of rank and lambda are used to fit softImpute models. If FALSE, then one fit is supplied for each value of lambda, and increasing maximum ranks are paired with decreasing values of lambda.

restore_data

a logical value. If TRUE, the variable types of the imputed values will match those of the original data. If FALSE, the imputed values are returned in a one-hot encoded format.
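
To make the one-hot format concrete, the base-R sketch below encodes a factor as indicator columns and then maps the indicators back to factor levels. It only illustrates the general idea; the column names and exact encoding that impute_soft() produces with restore_data = FALSE are not shown in this page and may differ.

```r
# Made-up data with one numeric and one categorical column.
df <- data.frame(
  size  = c(1.2, 3.4, 2.2),
  color = factor(c("red", "blue", "red"))
)

# One indicator column per factor level (no intercept column).
onehot <- model.matrix(~ color - 1, data = df)
onehot

# Map each row back to the level whose indicator is largest.
restored <- factor(levels(df$color)[max.col(onehot)], levels = levels(df$color))
restored
```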

verbose

an integer value of 0, 1, or 2. If verbose = 0, nothing is printed. If verbose = 1, messages are printed to the console showing what general steps are being taken in the imputation process. If verbose = 2, all relevant information on convergence is printed in addition to general messages.

bs

a logical value. If TRUE, softImpute::biScale() is applied to data_ref or rbind(data_ref, data_new) prior to fitting softImpute models.
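
The sketch below calls softImpute::biScale() directly on a made-up matrix, using the same values as the bs_* defaults listed in the usage block. It assumes each bs_* argument is passed to the biScale argument of the same name without the prefix; that mapping is inferred from the names and is not confirmed by this page.

```r
library(softImpute)

# Made-up numeric matrix with 10 missing entries.
set.seed(1)
x <- matrix(rnorm(50, mean = 5, sd = 2), nrow = 10, ncol = 5)
x[sample(length(x), 10)] <- NA

# Column centering and scaling only, as in the impute_soft() defaults.
x_scaled <- biScale(
  x,
  maxit = 20,
  thresh = 1e-09,
  row.center = FALSE,
  col.center = TRUE,
  row.scale = FALSE,
  col.scale = TRUE,
  trace = FALSE
)

# Column means of the observed entries should be close to zero.
round(colMeans(x_scaled, na.rm = TRUE), 6)
```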

bs_maxit

an integer indicating the maximum number of iterations for the biScale algorithm.

bs_thresh

convergence threshold for the biScale algorithm.

bs_row.center

a logical value. If TRUE, row centering will be performed. If FALSE (default), then nothing is done.

bs_col.center

a logical value. If TRUE (default), column centering will be performed. If FALSE, then nothing is done.

bs_row.scale

a logical value. If TRUE, row scaling will be performed. If FALSE (default), then nothing is done.

bs_col.scale

a logical value. If TRUE (default), column scaling will be performed. If FALSE, then nothing is done.

si_type

two algorithms are implemented: si_type = "svd" or the default si_type = "als". The "svd" algorithm repeatedly computes the SVD of the completed matrix and soft-thresholds its singular values. Each new soft-thresholded SVD is used to re-impute the missing entries. For large matrices of class "Incomplete", the SVD is computed by an efficient form of alternating orthogonal ridge regression. The "als" algorithm uses this same alternating ridge regression but updates the imputation at each step, leading to quite substantial speedups in some cases. The "als" approach does not currently have the same theoretical convergence guarantees as the "svd" approach.
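
The "svd" variant described above can be sketched in a few lines of base R. The function soft_impute_svd() below is a hypothetical helper written for illustration, not part of softImpute or midy: it uses a dense svd() on every iteration, skips biScale, and uses a simple relative-change stopping rule, so it should be read as a teaching sketch rather than the package's implementation.

```r
# Hypothetical helper: dense soft-thresholded SVD imputation.
soft_impute_svd <- function(x, lambda, rank_max = min(dim(x)) - 1,
                            thresh = 1e-05, maxit = 100) {
  miss <- is.na(x)
  z <- matrix(0, nrow(x), ncol(x))   # current low-rank estimate
  k <- seq_len(rank_max)
  for (i in seq_len(maxit)) {
    filled <- x
    filled[miss] <- z[miss]          # complete the matrix with the estimate
    s <- svd(filled)
    d <- pmax(s$d - lambda, 0)       # soft-threshold the singular values
    # Rebuild the estimate from the leading rank_max components.
    z_new <- s$u[, k, drop = FALSE] %*% (d[k] * t(s$v[, k, drop = FALSE]))
    # Relative change in squared Frobenius norm between successive estimates.
    change <- sum((z_new - z)^2) / max(sum(z^2), 1e-09)
    z <- z_new
    if (change < thresh) break
  }
  filled <- x
  filled[miss] <- z[miss]
  list(completed = filled, d = d[k], iterations = i)
}

# Made-up rank-2 matrix with 8 missing entries.
set.seed(1)
truth <- matrix(rnorm(20), 10, 2) %*% matrix(rnorm(10), 2, 5)
x <- truth
x[sample(length(x), 8)] <- NA

fit <- soft_impute_svd(x, lambda = 0.5, rank_max = 3)
fit$d            # thresholded singular values of the final estimate
fit$iterations   # number of iterations used
```

Observed entries are left untouched; only the missing cells are filled from the low-rank estimate.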

si_thresh

convergence threshold for the softImpute algorithm, measured as the relative change in the Frobenius norm between two successive estimates.

si_maxit

maximum number of iterations for the softImpute algorithm.

si_final.svd

only applicable to si_type = "als". The alternating ridge regressions do not lead to exact zeros. With the default si_final.svd = TRUE, a one-step unregularized iteration is performed at the final iteration, followed by soft-thresholding of the singular values, which yields hard zeros.

Details

Multiple imputation: The number of imputations returned depends on rank_max_init, rank_max_ovrl, rank_stp_size, lambda, and grid. If grid is FALSE, there will be length(lambda) sets of imputed values in the returned output, based on fitted softImpute models with increasing maximum ranks. Generally, these ranks are seq(rank_max_init, rank_max_ovrl, by = rank_stp_size), but they are automatically adjusted as needed to be consistent with (1) lambda and (2) the maximum allowed rank for data_ref. If grid is TRUE, every combination of lambda and the rank sequence will be fitted, and the output will contain one set of imputed values for each combination.
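
The difference between the two grid settings can be sketched by counting fits. The example below uses made-up values in which lambda already has the same length as the rank sequence, so no automatic adjustment is needed; how impute_soft() reconciles sequences of different lengths is not reproduced here. The names fits_paired and fits_grid are illustrative.

```r
# Made-up settings: three ranks and three penalties.
rank_max_init <- 2
rank_max_ovrl <- 4
rank_stp_size <- 1
lambda <- seq(rank_max_ovrl * 0.6, 1, length.out = 3)

ranks <- seq(rank_max_init, rank_max_ovrl, by = rank_stp_size)

# grid = FALSE: increasing ranks paired with decreasing lambda, one fit each.
fits_paired <- data.frame(rank_max = ranks, lambda = lambda)

# grid = TRUE: every combination of rank and lambda.
fits_grid <- expand.grid(rank_max = ranks, lambda = lambda)

nrow(fits_paired)
nrow(fits_grid)
```

Here the paired setting gives 3 fits and the full grid gives 9.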

Rank inputs: If the maximum rank is sufficiently large and si_type = "svd", the softImpute algorithm solves the nuclear-norm convex matrix-completion problem (see Reference 1). In this case, the number of nonzero singular values returned will be less than or equal to the maximum rank. If smaller ranks are used, the solution is not guaranteed to solve the problem, although it still tends to result in good local minima. The rank of a softImpute fit should not exceed min(dim(data_ref) - 1).

biScale: The softImpute::biScale() function is more flexible than impute_soft() exposes. Specifically, biScale() allows users to supply vectors to its row/column centering/scaling inputs, which are then used to center/scale the corresponding rows/columns. impute_soft() is stricter and does not offer this option. Also, impute_soft() uses different default values to increase the likelihood that the biScale algorithm converges quickly.

Value

a data frame with fitting parameters and imputed values.

References

  1. Rahul Mazumder, Trevor Hastie and Rob Tibshirani (2010). Spectral Regularization Algorithms for Learning Large Incomplete Matrices. Journal of Machine Learning Research 11, 2287-2322. http://www.stanford.edu/~hastie/Papers/mazumder10a.pdf


bcjaeger/midy documentation built on May 3, 2020, 3:55 p.m.