impute_nbrs: Nearest neighbor imputation


View source: R/impute_nbrs.R

Description

This function performs nearest neighbor imputation and accepts a sequence of neighbor counts instead of a single value. One imputed dataset is created for each value of k_neighbors (the number of nearest neighbors, k).

Usage

impute_nbrs(
  data_ref,
  data_new = NULL,
  cols = dplyr::everything(),
  k_neighbors = 10,
  aggregate = TRUE,
  fun_aggr_ctns = NULL,
  fun_aggr_intg = NULL,
  fun_aggr_catg = NULL,
  nthread = getOption("gd_num_thread"),
  epsilon = 1e-08,
  verbose = FALSE
)

Arguments

data_ref

a data frame.

data_new

an optional data frame. If supplied, then data_ref will be used as a reference dataset for data_new and the output will contain imputed values for data_new. If not supplied, the output will contain imputed values for data_ref.

cols

columns that should be imputed and/or used to impute other columns. Supports tidy select functions (see examples).

k_neighbors

a numeric vector of neighbor counts (k) used to impute missing values. One imputed dataset is created for each value in the vector.

aggregate

a logical value. If TRUE, then neighbors will be aggregated to generate imputations. If FALSE, then one neighbor will be sampled at random to impute each missing value. Using aggregate = FALSE can be helpful if you are conducting multiple imputation.
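The difference between the two settings can be illustrated with a minimal sketch. It is not the package's internal code: the neighbor values are made up, and mean() stands in for whichever aggregation function applies to the column.

```r
# Illustration only: hypothetical values of one variable among 3 neighbors.
set.seed(1)
nbr_values <- c(4.1, 3.8, 5.0)

# aggregate = TRUE: neighbors are combined into a single imputed value.
imputed_aggregate <- mean(nbr_values)

# aggregate = FALSE: one neighbor's value is drawn at random, so repeated
# runs give different imputations (useful for multiple imputation).
imputed_sampled <- sample(nbr_values, size = 1)
```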

fun_aggr_ctns

a function used to aggregate neighbors for continuous variables. If unspecified, the mean() function is used.

fun_aggr_intg

a function used to aggregate neighbors for integer-valued variables. If unspecified, the medn_est() function is used. This function returns the median of neighbor values, rounded to the nearest integer. medn_est_conserve() goes one step further: it identifies which neighbor value is closest to the median and returns that value. Both options can be helpful for integer-valued columns if you want to make sure the imputed values do not contain impossible quantities, e.g., number of children = 3/4.
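The two behaviors described above can be sketched as follows. The function names median_rounded_sketch() and median_conserve_sketch() are hypothetical stand-ins written for illustration; they are not the package's medn_est() or medn_est_conserve(), whose handling of ties and missing values may differ.

```r
# Hypothetical stand-in: median of neighbor values, rounded to an integer.
median_rounded_sketch <- function(x) {
  as.integer(round(stats::median(x, na.rm = TRUE)))
}

# Hypothetical stand-in: the observed neighbor value closest to the median.
# Ties are broken by taking the first closest value (an assumption).
median_conserve_sketch <- function(x) {
  x <- x[!is.na(x)]
  m <- stats::median(x)
  x[which.min(abs(x - m))]
}

nbr_values <- c(1L, 2L, 4L, 7L)     # median is 3
median_rounded_sketch(nbr_values)   # 3, a value no neighbor actually has
median_conserve_sketch(nbr_values)  # 2, the first neighbor value nearest to 3
```

The second variant always returns a value that was observed among the neighbors, which is the sense in which it is more conservative.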

fun_aggr_catg

a function used to aggregate neighbors for categorical variables. If unspecified, the mode_est() function is used.
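As a rough illustration of mode-based aggregation, the sketch below returns the most frequent category among the neighbors. mode_sketch() is a hypothetical name, not the package's mode_est(), and its tie-breaking rule (first category in sorted order) is an assumption.

```r
# Hypothetical stand-in: most frequent category among neighbor values.
mode_sketch <- function(x) {
  counts <- table(x)                # counts per category, names sorted
  names(counts)[which.max(counts)]  # first category with the highest count
}

mode_sketch(c("a", "b", "b", "c"))  # "b"
```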

nthread

Number of threads to use for parallelization. By default, 2 threads are used on a dual-core machine; on any other machine, n-1 cores are used so that your machine does not freeze during a large computation. The maximum number of threads is determined using omp_get_max_threads at the C level.
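The default rule described above can be written out as a small sketch. default_threads_sketch() is a hypothetical helper for illustration, not a function of the package, and the floor of one thread for a single-core machine is an assumption.

```r
# Hypothetical helper restating the documented default:
# 2 threads on a dual-core machine, otherwise n - 1 (at least 1, assumed).
default_threads_sketch <- function(n_cores) {
  if (n_cores == 2L) 2L else max(1L, n_cores - 1L)
}

default_threads_sketch(2L)  # 2
default_threads_sketch(8L)  # 7
```

Because the default is read with getOption("gd_num_thread"), setting that option, for example options(gd_num_thread = 2), changes the value used when nthread is not supplied.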

epsilon

Computed numbers (variable ranges) smaller than epsilon are treated as zero.

verbose

logical value. If TRUE, output is printed on the console. If FALSE, nothing is printed.

Value

a list of imputed datasets with one element for each value of k_neighbors.

Examples

data(diabetes, package = 'ipa')

trn <- diabetes$missing[1:25, ]
tst <- diabetes$missing[26:50, ]

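# One imputed dataset per value of k_neighbors (1 through 5)
trn_imputes <- impute_nbrs(data_ref = trn, k_neighbors = 1:5)
# Impute tst using trn as the reference dataset
tst_imputes <- impute_nbrs(data_ref = trn, data_new = tst, k_neighbors = 1:5)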

bcjaeger/midy documentation built on May 3, 2020, 3:55 p.m.