impute_nbrs: Nearest neighbor imputation
In bcjaeger/ipa: Imputation for Predictive Analytics


This function conducts nearest neighbor imputation, with the option of supplying a sequence of values for the number of neighbors (k) instead of a single value. One imputed dataset is created for each value of k.
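The idea of one imputed dataset per value of k can be sketched in base R. This is a toy illustration under simplifying assumptions (one fully observed numeric column, absolute difference as the distance, mean aggregation), not the implementation used by `impute_nbrs()`. The helper `impute_y_knn()` and the `toy` data are hypothetical.

```r
# Toy data: `x` is fully observed, `y` has two missing values.
toy <- data.frame(
  x = c(1, 2, 3.5, 4, 10, 11),
  y = c(10, 20, NA, 40, 100, NA)
)

# Hypothetical helper, not part of ipa: fill each missing `y` with the
# mean of `y` among the k rows closest in `x` that have an observed `y`.
impute_y_knn <- function(data, k) {
  donors <- which(!is.na(data$y))
  out <- data
  for (i in which(is.na(data$y))) {
    dists <- abs(data$x[donors] - data$x[i])
    nearest <- donors[order(dists)[seq_len(k)]]
    out$y[i] <- mean(data$y[nearest])
  }
  out
}

# One imputed dataset per value of k, collected in a list.
k_values <- 1:3
imputes <- lapply(k_values, function(k) impute_y_knn(toy, k))
names(imputes) <- paste0("k_", k_values)
```

Here `imputes` has one element per entry of `k_values`, so the imputations for different k can be compared side by side.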

impute_nbrs(
  data_ref,
  data_new = NULL,
  cols = dplyr::everything(),
  k_neighbors = 10,
  aggregate = TRUE,
  fun_aggr_ctns = NULL,
  fun_aggr_intg = NULL,
  fun_aggr_catg = NULL,
  nthread = getOption("gd_num_thread"),
  epsilon = 1e-08,
  verbose = FALSE
)

`data_ref`	a data frame.
`data_new`	an optional data frame. If supplied, then `data_ref` will be used as a reference dataset for `data_new` and the output will contain imputed values for `data_new`. If not supplied, the output will contain imputed values for `data_ref`.
`cols`	columns that should be imputed and/or used to impute other columns. Supports tidy select functions (see examples).
`k_neighbors`	a numeric vector indicating how many neighbors should be used to impute missing values.
`aggregate`	a logical value. If `TRUE`, then neighbors will be aggregated to generate imputations. If `FALSE`, then one neighbor will be sampled at random to fill in each missing value. Using `aggregate = FALSE` can be helpful if you are conducting multiple imputation.
`fun_aggr_ctns`	a function used to aggregate neighbors for continuous variables. If unspecified, the `mean()` function is used.
`fun_aggr_intg`	a function used to aggregate neighbors for integer-valued variables. If unspecified, the `medn_est()` function is used. This function returns the median of neighbor values, rounded to the nearest integer. `medn_est_conserve()` goes one step further: it identifies which neighbor value is closest to the median and returns that value. Both of these options can be helpful for integer-valued columns if you want to make sure the imputed values do not contain impossible quantities, e.g. no. of children = 3/4.
`fun_aggr_catg`	a function used to aggregate neighbors for categorical variables. If unspecified, the `mode_est()` function is used.
`nthread`	number of threads to use for parallelization. By default, 2 threads are used on a dual-core machine. On any other machine, n-1 cores are used so that your machine doesn't freeze during a big computation. The maximum number of threads is determined using `omp_get_max_threads` at the C level.
`epsilon`	computed numbers (variable ranges) smaller than `epsilon` are treated as zero.
`verbose`	a logical value. If `TRUE`, output is printed on the console. If `FALSE`, nothing is printed.
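The difference between the aggregation strategies described above can be illustrated with a few lines of base R. The three helpers below are hypothetical stand-ins written for this illustration. They follow the descriptions of `medn_est()`, `medn_est_conserve()`, and `mode_est()` given above but are not the ipa implementations, and the neighbor values are made up.

```r
# Values of an integer-valued variable among 4 nearest neighbors (made-up data).
nbr_counts <- c(1, 2, 5, 6)

# Hypothetical stand-ins, not the ipa implementations.
round_median <- function(x) round(median(x))
nearest_to_median <- function(x) x[which.min(abs(x - median(x)))]
most_common <- function(x) names(which.max(table(x)))

mean(nbr_counts)              # 3.5, not a valid count
round_median(nbr_counts)      # 4, a whole number but not an observed value
nearest_to_median(nbr_counts) # 2, an observed value (ties go to the first neighbor)

# Categorical neighbors are aggregated by taking the most common value.
most_common(c("a", "b", "b", "c")) # "b"
```

The rounded median always yields a whole number, while the closest-to-median variant additionally guarantees that the imputed value was actually observed among the neighbors.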

A list of imputed datasets with the same length as `k_neighbors`.

data(diabetes, package = 'ipa')

trn <- diabetes$missing[1:25, ]
tst <- diabetes$missing[26:50, ]

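# impute missing values in trn, once for each number of neighbors in 1:5
trn_imputes <- impute_nbrs(data_ref = trn, k_neighbors = 1:5)
# impute missing values in tst, using trn as the reference data
tst_imputes <- impute_nbrs(data_ref = trn, data_new = tst, k_neighbors = 1:5)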