impute_missing_values: Impute missing values in a dataframe and add missingness...

View source: R/impute_missing_values.R

impute_missing_values    R Documentation

Impute missing values in a dataframe and add missingness indicators.

Description

Impute missing values. By default (type = "standard"), numerics are median-imputed and factors are mode-imputed; alternatively, k-nearest neighbors (knn) imputation can be used. Optionally add missingness indicators.
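To make the "standard" approach concrete, here is a minimal base R sketch of median/mode imputation with missingness indicators. It is not the package's implementation: the helper name sketch_impute and the toy data are hypothetical, and only the general idea (median for numerics, mode for factors, one 0/1 indicator per imputed variable) is taken from the description above.

```r
# Hypothetical helper (not part of ck37r): median-impute numerics,
# mode-impute factors, and optionally add 0/1 missingness indicators.
sketch_impute <- function(df, add_indicators = TRUE, prefix = "miss_") {
  values <- list()
  indicators <- list()
  for (var in names(df)) {
    missing <- is.na(df[[var]])
    # Skip variables that have nothing to impute.
    if (!any(missing)) next
    if (is.numeric(df[[var]])) {
      values[[var]] <- median(df[[var]], na.rm = TRUE)
    } else {
      # Mode: the most frequent observed level (table() ignores NAs).
      values[[var]] <- names(which.max(table(df[[var]])))
    }
    # Record where the value was missing before filling it in.
    indicators[[paste0(prefix, var)]] <- as.integer(missing)
    df[[var]][missing] <- values[[var]]
  }
  if (add_indicators && length(indicators) > 0) {
    df <- cbind(df, as.data.frame(indicators))
  }
  list(data = df, impute_values = values)
}

# Toy data, made up for illustration.
toy <- data.frame(
  age = c(30, NA, 50, 40),
  sex = factor(c("f", "m", NA, "m")),
  bmi = c(22, 25, 27, 24)
)
res <- sketch_impute(toy)

# No missing values remain; indicator columns record where they were.
colSums(is.na(res$data))
```

The indicators are built before the values are filled in, since the missingness pattern is lost once imputation has happened.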

Usage

impute_missing_values(
  data,
  type = "standard",
  add_indicators = TRUE,
  prefix = "miss_",
  skip_vars = NULL,
  all_vars = FALSE,
  remove_constant = TRUE,
  remove_collinear = TRUE,
  values = NULL,
  h2o_glrm = NULL,
  glrm_k = 10L,
  verbose = FALSE
)

Arguments

data

Dataframe or matrix.

type

"knn" or "standard" (median/mode). NOTE: knn will result in the data being centered and scaled!

add_indicators

Add a series of missingness indicators.

prefix

String to add at the beginning of the name of each missingness indicator.

skip_vars

Vector of variable names to exclude from the imputation (e.g. the outcome variable).

all_vars

Calculate an imputation value for every variable, including those with no missing values in this dataset, so that the imputation info can be applied to future datasets.

remove_constant

Remove constant missingness indicators, if applicable.

remove_collinear

Remove collinear missingness indicators, if applicable.

values

Named list of imputation values to use, taken from another dataset (e.g. values computed on a training set).

h2o_glrm

Optional h2o GLRM model for imputing on new data (e.g. a test set).

glrm_k

Number of principal components to estimate (at most the number of columns in the data).

verbose

If TRUE, display extra information during execution.
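The remove_constant and remove_collinear arguments can be illustrated with a short base R sketch. This is one possible way to prune indicators, not necessarily how the package does it: the helper sketch_prune_indicators and the toy indicator matrix are hypothetical, and the rank-based check via qr() is an assumed technique.

```r
# Hypothetical helper (not part of ck37r): drop constant indicators, then
# drop indicators that are linear combinations of the ones kept so far.
sketch_prune_indicators <- function(ind) {
  # Constant columns carry no information.
  is_constant <- vapply(ind, function(x) length(unique(x)) <= 1, logical(1))
  ind <- ind[, !is_constant, drop = FALSE]
  if (ncol(ind) < 2) return(ind)

  # QR with an intercept column: base R's default pivoting moves linearly
  # dependent columns past position `rank`.
  decomp <- qr(cbind(1, as.matrix(ind)))
  keep <- decomp$pivot[seq_len(decomp$rank)]
  # Drop the intercept column and shift indices back to match `ind`.
  keep <- sort(setdiff(keep, 1L)) - 1L
  ind[, keep, drop = FALSE]
}

# Toy indicators: miss_b duplicates miss_a, and miss_c is constant.
ind <- data.frame(
  miss_a = c(1, 0, 0, 1),
  miss_b = c(1, 0, 0, 1),
  miss_c = c(0, 0, 0, 0),
  miss_d = c(0, 1, 0, 0)
)
pruned <- sketch_prune_indicators(ind)
names(pruned)
```

Two variables that are always missing together produce identical indicators, so keeping both adds no information and can destabilize downstream regression models.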

Value

List with the following elements:

  • $data - imputed dataset.

  • $impute_info - if knn, caret preProcess object for imputing test data.

  • $impute_values - if standard, list of imputation values for each variable.
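The values argument and the $impute_values element suggest a train/test workflow in which imputation values are learned on one dataset and applied to another. The base R sketch below illustrates that idea only; the helper sketch_apply_values and the toy data are hypothetical and do not reflect the package's internals.

```r
# Hypothetical helper (not part of ck37r): fill NAs using a named list
# of imputation values computed elsewhere.
sketch_apply_values <- function(df, values) {
  for (var in intersect(names(values), names(df))) {
    missing <- is.na(df[[var]])
    df[[var]][missing] <- values[[var]]
  }
  df
}

# Toy data, made up for illustration.
train <- data.frame(glucose = c(90, NA, 110, 130), mass = c(25, 30, NA, 35))
test <- data.frame(glucose = c(NA, 100), mass = c(28, NA))

# Imputation values come from the training data only, and are computed
# for every variable so that any test-set gap can be filled.
train_values <- lapply(train, median, na.rm = TRUE)
test_imputed <- sketch_apply_values(test, train_values)
test_imputed
```

Computing the values on the training data alone avoids leaking information from the test set into the imputation step.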

See Also

missingness_indicators, preProcess

Examples


# Load a test dataset.
data(PimaIndiansDiabetes2, package = "mlbench")

# Check for missing values.
colSums(is.na(PimaIndiansDiabetes2))

# Impute missing data and add missingness indicators.
# Don't impute the outcome though.
result = impute_missing_values(PimaIndiansDiabetes2, skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result$data))


#############
# K-nearest neighbors imputation

result2 = impute_missing_values(PimaIndiansDiabetes2, type = "knn",
                                skip_vars = "diabetes")

# Confirm we have no missing data.
colSums(is.na(result2$data))


ck37/ck37r documentation built on April 29, 2023, 11:42 p.m.