impute: Impute missing data via fusion
In ummel/fusionModel: Data fusion and analysis of synthetic data in R

Description

A universal missing-data imputation tool that wraps successive calls to `train()` and `fuse()` under the hood. Designed for simplicity and ease of use.

Usage

impute(
  data,
  weight = NULL,
  ignore = NULL,
  cores = parallel::detectCores(logical = FALSE) - 1L
)

Arguments

`data`	A data frame with missing values.
`weight`	Optional name of observation weights column in `data`.
`ignore`	Optional names of columns in `data` to ignore. These variables are neither imputed nor used as predictors.
`cores`	Number of physical CPU cores used by `lightgbm`. LightGBM is parallel-enabled on all platforms if OpenMP is available.
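
The default for `cores` subtracts one from the detected number of physical cores. As a minimal sketch, and assuming you want to guard against machines where detection fails or only one core is reported, the value could be computed explicitly before calling `impute()`. The variable names `n_phys` and `cores` are illustrative only, and the guard is not part of the package.

```r
# Physical core count; detectCores() can return NA when it cannot be determined
n_phys <- parallel::detectCores(logical = FALSE)

# Leave one core free, but never request fewer than one
cores <- if (is.na(n_phys)) 1L else max(1L, n_phys - 1L)
cores

# Hypothetical call, assuming `data` is a data frame with missing values:
# result <- impute(data, cores = cores)
```

Passing the value explicitly also makes it easy to set `cores = 1` when `impute()` has to run inside an already parallel job.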

Details

Variables with missing values are imputed sequentially, beginning with the variable that has the fewest missing values. Because LightGBM models accommodate NA values in the predictor set, all available variables are used as potential predictors (excluding the `ignore` variables). For each call to `train()`, 80% of observations are randomly selected for training and the remaining 20% are used as a validation set to determine an appropriate number of tree learners. All other LightGBM model parameters are kept at the default values set in `train()`. Because `lightgbm` uses OpenMP multithreading, it is not advisable to use `impute()` inside a forked/parallel process when `cores > 1`.
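
The ordering and the 80/20 split described above can be illustrated in base R. This is only a sketch of the idea, not the package's internal code: it uses the built-in `airquality` data set as a stand-in, the names `n_miss`, `to_impute`, `train_idx`, and `valid_idx` are made up for illustration, and it assumes the split is drawn from the rows where the target variable is observed.

```r
# Stand-in data: airquality has NAs in Ozone and Solar.R
data <- airquality

# Count missing values per column; keep only columns that need imputation,
# ordered from fewest to most missing values
n_miss <- colSums(is.na(data))
to_impute <- names(sort(n_miss[n_miss > 0]))
to_impute  # "Solar.R" "Ozone"

# For the first target, split the rows with an observed target value
# into 80% training and 20% validation (assumption: observed rows only)
set.seed(1)
target <- to_impute[1]
obs <- which(!is.na(data[[target]]))
train_idx <- sample(obs, size = floor(0.8 * length(obs)))
valid_idx <- setdiff(obs, train_idx)

c(train = length(train_idx), valid = length(valid_idx))
```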

Value

A data frame with all missing values imputed.
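
A quick sanity check on the returned data frame might look like the following sketch. `check_imputed()` is a hypothetical helper, not part of fusionModel, and the loop that fills the NAs is a deliberately crude placeholder so the block runs without the package; it does not represent what `impute()` does. The check also assumes that observed values are expected to come back unchanged.

```r
# Hypothetical helper: confirm that `result` has no NAs and that values
# observed in `original` are the same in `result`
check_imputed <- function(original, result) {
  stopifnot(identical(names(original), names(result)),
            nrow(original) == nrow(result))
  unchanged <- vapply(names(original), function(v) {
    keep <- !is.na(original[[v]])
    all(original[[v]][keep] == result[[v]][keep])
  }, logical(1))
  list(no_missing = !anyNA(result), observed_unchanged = all(unchanged))
}

# Placeholder "imputation" for demonstration only: fill each NA with the
# first observed value in its column (NOT the method used by impute())
original <- airquality
result <- original
for (v in names(result)) {
  x <- result[[v]]
  x[is.na(x)] <- x[!is.na(x)][1]
  result[[v]] <- x
}

check_imputed(original, result)
```

In practice, `result` would be the output of `impute(original)` rather than the placeholder fill.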

Examples

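# Load the package, which provides impute() and the 'recs' example data
library(fusionModel)

# Create data frame with random NA values
?recs
data <- recs[, 2:7]

# Each column gets its own missing rate, drawn uniformly between 1% and 30%
miss <- replicate(ncol(data), runif(nrow(data)) < runif(1, 0.01, 0.3))
data[miss] <- NA

# Number of missing values per column
colSums(is.na(data))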

# Impute the missing values
result <- impute(data)

# Confirm that no missing values remain (should be FALSE)
anyNA(result)

ummel/fusionModel documentation built on June 1, 2025, 11 p.m.