impute_missing_values: Impute missing values in a dataframe and add missingness...


View source: R/impute_missing_values.R

Description

Impute missing values. By default (type = "standard"), numerics are median-imputed and factors are mode-imputed; alternatively, k-nearest neighbors imputation (type = "knn") can be used. Optionally add missingness indicators.
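To make the median/mode approach concrete, here is a minimal base R sketch of the idea. It is not the package's implementation: `impute_sketch` and the `toy` data frame are hypothetical names used only for illustration, and the sketch assumes numeric and factor (or character) columns only.

```r
# Hypothetical helper, NOT part of ck37r: median/mode imputation plus
# missingness indicators, written in base R for illustration only.
impute_sketch <- function(df, prefix = "miss_") {
  # names(df) is evaluated once, so newly added indicator columns are not revisited.
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next

    # Record where values were missing before filling them in.
    df[[paste0(prefix, col)]] <- as.integer(missing)

    if (is.numeric(df[[col]])) {
      # Numeric column: fill with the median of the observed values.
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      # Factor or character column: fill with the most frequent observed value.
      counts <- table(df[[col]])
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

# Toy data (made up for this sketch).
toy <- data.frame(
  x = c(1, NA, 3, 10),
  g = factor(c("a", "b", NA, "a"))
)
imputed <- impute_sketch(toy)
print(imputed)
```

The indicator columns preserve the information that a value was originally missing, which can itself be predictive in downstream models.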

Usage

impute_missing_values(
  data,
  type = "standard",
  add_indicators = TRUE,
  prefix = "miss_",
  skip_vars = NULL,
  all_vars = FALSE,
  remove_constant = TRUE,
  remove_collinear = TRUE,
  values = NULL,
  h2o_glrm = NULL,
  glrm_k = 10L,
  verbose = FALSE
)

Arguments

data

Dataframe or matrix.

type

"knn" or "standard" (median/mode). NOTE: knn will result in the data being centered and scaled!

add_indicators

Add a series of missingness indicators.

prefix

String to add at the beginning of the name of each missingness indicator.

skip_vars

List of variable names to exclude from the imputation.

all_vars

Calculate imputation value for all variables, in cases where the imputation info may be used for future datasets.

remove_constant

Remove constant missingness indicators, if applicable.

remove_collinear

Remove collinear missingness indicators, if applicable.

values

Named list with imputation value to use from another dataset.

h2o_glrm

Optional h2o GLRM model for imputing on new data (e.g. a test set).

glrm_k

Number of principal components to estimate (up to the number of columns in the data).

verbose

If TRUE, display extra information during execution.
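The all_vars and values arguments described above concern reusing imputation values from one dataset on another (for example, imputing a test set with values learned on the training set). The following base R sketch illustrates that general idea only; it does not call the package, and `apply_values`, `train`, and `test` are hypothetical names with made-up data.

```r
# Made-up training and test data for illustration.
train <- data.frame(age = c(20, 30, NA, 50), bmi = c(22, NA, 30, 26))
test  <- data.frame(age = c(NA, 40),         bmi = c(25, NA))

# Learn imputation values (medians) on the training data only.
# The result is a named list, one entry per column.
train_values <- lapply(train, median, na.rm = TRUE)

# Hypothetical helper, NOT part of ck37r: fill NAs using a named list of values.
apply_values <- function(df, values) {
  for (col in intersect(names(df), names(values))) {
    df[[col]][is.na(df[[col]])] <- values[[col]]
  }
  df
}

# Impute the test set using the training medians rather than its own.
test_imputed <- apply_values(test, train_values)
print(test_imputed)
```

Computing a value for every column, even those with no missing values in the training data, is useful here because a later dataset may have gaps in columns that were complete during training.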

Value

List with the following elements:

See Also

missingness_indicators, preProcess

Examples

# Load a test dataset.
data(PimaIndiansDiabetes2, package = "mlbench")

# Check for missing values.
colSums(is.na(PimaIndiansDiabetes2))

# Impute missing data and add missingness indicators.
# Don't impute the outcome though.
result = impute_missing_values(PimaIndiansDiabetes2, skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result$data))


#############
# K-nearest neighbors imputation

result2 = impute_missing_values(PimaIndiansDiabetes2, type = "knn",
                                skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result2$data))

ck37r documentation built on Feb. 6, 2020, 5:09 p.m.