Imputation Method vimpute
In VIM: Visualization and Imputation of Missing Values

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  fig.align = "center"
)

Introduction

This vignette demonstrates how to use the vimpute() function for flexible missing data imputation using machine learning models from the mlr3 ecosystem.

Function Arguments

data: A datatable or dataframe containing missing values to be imputed.
considered_variables: A character vector of variable names to be either imputed or used as predictors, excluding irrelevant columns from the imputation process.
method: A named list specifying the imputation method for each variable.
pmm: TRUE/FALSE indicating whether predictive mean matching is used. Provide as a list for each variable.
formula: If not all variables are used as predictors, or if transformations or interactions are required (applies to all X, for Y only transformations are possible). Only applicable for the methods "robust" andc "regularized". Provide as a list for each variable that requires specific conditions.
sequential: Specifies whether the imputation should be performed sequentially.
nseq: The number of sequential iterations, if sequentialis TRUE.
eps: The convergence threshold for sequential imputation.
imp_var: Specifies whether to add indicator variables for imputed values.
pred_history: If enabled, saves the prediction history.
tune: Whether to perform hyperparameter tuning.

Data

To demonstrate the function, the sleep dataset from the VIM package is used.

library(VIM)
library(data.table)

data <- as.data.table(VIM::sleep)
a <- aggr(sleep, plot = FALSE)
plot(a, numbers = TRUE, prop = FALSE)

The left plot shows the amount of missings for each column in the dataset sleep and the right plot shows how often each combination of missings occur. For example, there are 9 rows wich contain a missing in both NonD and Dream.

dataDS <- sleep[, c("Dream", "Sleep")]
marginplot(dataDS, main = "Missing Values")

The red boxplot on the left shows the distrubution of all values of Sleep where Dream contains a missing value. The blue boxplot on the left shows the distribution of the values of Sleep where Dream is observed.

Basic Usage

Default Imputation

In the basic usage, the vimpute() function performs imputation using the default settings. It uses the "ranger" method for all variables, applies predictive mean matching, and performs sequential imputation with a convergence threshold of 0.005.

result <- vimpute(
  data = data,
  pred_history = TRUE)

print(head(result$data, 3))

Results and information about missing/imputed values can be shown in the plot margins:

dataDS <- as.data.frame(result$data[, c("Dream", "Sleep", "Dream_imp", "Sleep_imp")])
marginplot(dataDS, delimiter = "_imp", main = "Imputation with Default Model")

The default output are the imputed dataset and the prediction history.

In this plot three differnt colors are used in the top-right. These colors represent the structure of missings.

brown points represent values where Dream was missing initially
beige points represent values where Sleep was missing initially
black points represent values where both Dream and Sleep were missing initially

Advanced Options

Parameter `method`

(default: "ranger" for all variables)

Specifies the machine learning method used for imputation of each variable. In this example, different imputation methods are specified for each variable. The NonD variable uses a robust method, Dream and Span are using ranger, Sleep uses xgboost, Gest uses a regularized method and class uses a robust method.

"robust": Robust regression models
lmrob for numeric variables: Implements MM-estimation for resistance to outliers
glmrob for binary factors: Uses robust estimators to reduce outlier influence
"regularized": Regularized regression (glmnet)
Uses elastic net regularization
Automatically handles multicollinearity
"ranger": Random Forest
Fast implementation of random forests
Handles non-linear relationships well
"xgboost": Gradient Boosted Trees
State-of-the-art tree boosting
Handles mixed data types well

result_mixed <- vimpute(
  data = data,
  method = list(NonD = "robust", Dream = "ranger", Sleep = "xgboost", Span = "ranger" , Gest = "regularized"),
  pred_history = TRUE
  )

dataDS <- as.data.frame(result_mixed$data[, c("Dream", "Sleep", "Dream_imp", "Sleep_imp")])
marginplot(dataDS, delimiter = "_imp", main = "Imputation with different Models for each Variable")

result_xgboost <- vimpute(
  data = data,
  method = setNames(as.list(rep("xgboost", ncol(data))), names(data)),
  pred_history = TRUE, verbose = FALSE
  )
dataDS_xgboost <- as.data.frame(result_xgboost$data[, c("Dream", "Sleep", "Dream_imp", "Sleep_imp")])
result_regularized <- vimpute(
  data = data,
  method = setNames(as.list(rep("regularized", ncol(data))), names(data)),
  pred_history = TRUE
  )
dataDS_regularized <- as.data.frame(result_regularized$data[, c("Dream", "Sleep", "Dream_imp", "Sleep_imp")])

The side-by-side margin plots compare the performance of two imputation methods: xgboost (left) and regularized (right):

par(mfrow = c(1, 2))
marginplot(dataDS_xgboost, delimiter = "_imp", main = "Imputation with xgboost")
marginplot(dataDS_regularized, delimiter = "_imp", main = "Imputation with Regularized")
par(mfrow = c(1, 1))

xgboost handles missing values with data-driven, uneven imputations that capture complex patterns but may be less stable, while regularized methods produce smoother, more conservative estimates that are less prone to overfitting. The key difference lies in flexibility (xgboost) versus robustness (regularization).

Parameter `pmm`

(default: TRUE for all numeric variables)

result <- vimpute(
  data = data,
  method = list(NonD = "robust", 
                Dream = "ranger", 
                Sleep = "xgboost", 
                Span = "ranger" , 
                Gest = "regularized"),
  pmm = list(NonD = FALSE, Dream = TRUE, Sleep = FALSE, Span = FALSE , Gest = TRUE)
  )

If TRUE, imputed values are restricted to actual observed values in the dataset, ensuring realism but potentially limiting variability. If FALSE, raw model predictions are used, allowing greater flexibility but risking implausible or extreme imputations.

Parameter `formula`

(default: FALSE)

Specifies custom model formulas for imputation of each variable, offering precise control over the imputation models.

Key Features:

Variable-Specific Models
Each formula specifies which predictors should be used for imputing a particular variable
Enables different predictor sets for different target variables
Example:
r formula = list( income ~ education + age, blood_pressure ~ weight + age )
Transformations Support
Handles common transformations on both sides of the formula:
- Response transformations: log(y), sqrt(y), exp(y), I(1/y)
- Predictor transformations: log(x1), poly(x2, 2), etc.
Example with transformations:
r formula = list( log(income) ~ poly(age, 2) + education, sqrt(blood_pressure) ~ weight + I(1/age) )
Interaction Terms
Supports interaction terms using : or * syntax (on the right side)
Example:
r formula = list( price ~ sqft * neighborhood + year_built )

Example Demonstration:

result <- vimpute(
  data = data,
  method = setNames(as.list(rep("regularized", ncol(data))), names(data))
  formula = list(
    NonD ~ Dream + Sleep,              # Linear combination
    Span ~ Dream:Sleep + Gest,         # With interaction term
    log(Gest) ~ Sleep + exp(Span)      # With transformations
  )
)

Interpreting the Example:

For NonD:
Uses linear combination of Dream and Sleep variables
Model: NonD = β₀ + β₁*Dream + β₂*Sleep + ε
For Span:
Includes interaction between Dream and Sleep
Plus main effect of Gest
Model: Span = β₀ + β₁*Dream*Sleep + β₂*Gest + ε
For Gest:
Uses log-transformed response
Predictors include Sleep and exponential of Span
Model: log(Gest) = β₀ + β₁*Sleep + β₂*exp(Span) + ε
For Sleep and Dream all other variables are used as predictors

Notes:

Only works with methods "robust" and "regularized"
All model.matrix-compatible functions work for predictors (for more information see ?model.matrix)
Response transformations (left side) are automatically back-transformed

Parameter `tune`

(default: FALSE)

result <- vimpute(
  data = data,
  tune = TRUE
  )

Whether to perform hyperparameter tuning (only possible if seq = TRUE):

When TRUE:
Conducts randomized parameter search (after half of the iterations)
Uses best performing configuration
When FALSE:
Uses default model parameters
Recommended: TRUE for optimal performance

Parameters `nseq` and `eps`

(default: 10 and default: 0.005)

result <- vimpute(
  data = data,
  nseq = 20,
  eps = 0.01
  )

nseq describes the number of sequential imputation iterations. Higher values:

Allow more refinement
Increase computation time

eps describes the convergence threshold for sequential imputation:

Stops early if changes between iterations < eps
Smaller values: Require more precise convergence but may need more iterations

Parameter `imp_var`

(default: TRUE)

result <- vimpute(
  data = data,
  imp_var = TRUE
  )

Creating indicator variables for imputed values adds "_imp" columns (TRUE/FALSE) to mark which data points were imputed. This is particularly useful for tracking imputation effects and conducting diagnostic analyses.

Parameter `pred_history`

(default: TRUE)

print(tail(result$pred_history, 9))

When enabled (TRUE), this option saves prediction trajectories in $pred_history, allowing users to track how imputed values evolve across iterations. This feature is particularly useful for diagnosing convergence issues.

Performance

In order to validate the performance of vimpute() the iris dataset is used. Firstly, some values are randomly set to NA.

library(reactable)

data(iris)
df <- as.data.table(iris)
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")
# randomly produce some missing values in the data
set.seed(1)
nbr_missing <- 50
y <- data.frame(row=sample(nrow(iris),size = nbr_missing,replace = T),
                col=sample(ncol(iris)-1,size = nbr_missing,replace = T))
y<-y[!duplicated(y),]
df[as.matrix(y)]<-NA

aggr(df)

sapply(df, function(x)sum(is.na(x)))

The data contains missing values across all variables, with some observations missing multiple values. The subsequent step involves variable imputation, and the following tables present the rounded first five imputation results for each variable.

For default model:

library(reactable)

data(iris)
df <- as.data.table(iris)
colnames(df) <- c("S.Length","S.Width","P.Length","P.Width","Species")

# Create complete copy before introducing NAs
complete_data <- df

# Randomly produce missing values
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y),]
df[as.matrix(y)] <- NA

# Perform imputation
result <- vimpute(data = df, pred_history = TRUE)

# Extracting the imputed columns from result$data
imputed_columns <- grep("_imp$", names(result$data), value = TRUE)

# Create a function to compare true and imputed values
compare_values <- function(true_data, pred_data, imputed_data, col_name) {
  comparison <- data.frame(
    True_Value = true_data[[col_name]],
    Imputed_Value = ifelse(imputed_data, pred_data[[col_name]], NA)
  )
  comparison <- comparison[!is.na(comparison$Imputed_Value), ]
  return(comparison)
}

# Initialize an empty list to store the comparison tables
comparison_list <- list()

# Loop through each imputed column and create a comparison table
for (imputed_col in imputed_columns) {
  col_name <- sub("_imp$", "", imputed_col)
  comparison_list[[col_name]] <- compare_values(complete_data, result$data, result$data[[imputed_col]], col_name)
}

# Prepare the results in a combined wide format, ensuring equal row numbers
results <- cbind(
  "TRUE1" = head(comparison_list[["S.Length"]][, "True_Value"], 5),
  "IMPUTED1" = head(comparison_list[["S.Length"]][, "Imputed_Value"], 5),
  "TRUE2" = head(comparison_list[["S.Width"]][, "True_Value"], 5),
  "IMPUTED2" = head(comparison_list[["S.Width"]][, "Imputed_Value"], 5),
  "TRUE3" = head(comparison_list[["P.Length"]][, "True_Value"], 5),
  "IMPUTED3" = head(comparison_list[["P.Length"]][, "Imputed_Value"], 5),
  "TRUE4" = head(comparison_list[["P.Width"]][, "True_Value"], 5),
  "IMPUTED4" = head(comparison_list[["P.Width"]][, "Imputed_Value"], 5)
)

# Print the combined wide format table
print(results)

# Load the reactable library
library(reactable)

# Create the reactable
reactable(results, columns = list(
  TRUE1 = colDef(name = "True"),
  IMPUTED1 = colDef(name = "Imputed"),
  TRUE2 = colDef(name = "True"),
  IMPUTED2 = colDef(name = "Imputed"),
  TRUE3 = colDef(name = "True"),
  IMPUTED3 = colDef(name = "Imputed"),
  TRUE4 = colDef(name = "True"),
  IMPUTED4 = colDef(name = "Imputed")
),
columnGroups = list(
  colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
  colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
  colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
  colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4"))
),
striped = TRUE,
highlight = TRUE,
bordered = TRUE
)

For xgboost model:

library(reactable)
library(VIM)
data(iris)

# Create complete copy before introducing NAs
complete_data <- iris
colnames(complete_data) <- c("S.Length","S.Width","P.Length","P.Width","Species")
df <- copy(complete_data)

# Randomly produce missing values
set.seed(1)
nbr_missing <- 50
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))
y <- y[!duplicated(y),]
df[as.matrix(y)] <- NA

# Perform imputation with proper method specification
result <- vimpute(
  data = df,
  method = setNames(lapply(names(df), function(x) "xgboost"),names(df)),
  pred_history = TRUE
)

# Extracting the imputed columns from result$data
imputed_columns <- grep("_imp$", names(result$data), value = TRUE)

# Create a function to compare true and imputed values
compare_values <- function(true_data, pred_data, imputed_data, col_name) {
  comparison <- data.frame(
    True_Value = true_data[[col_name]],
    Imputed_Value = ifelse(imputed_data, pred_data[[col_name]], NA)
  )
  comparison <- comparison[!is.na(comparison$Imputed_Value), ]
  return(comparison)
}

# Initialize an empty list to store the comparison tables
comparison_list <- list()

# Loop through each imputed column and create a comparison table
for (imputed_col in imputed_columns) {
  col_name <- sub("_imp$", "", imputed_col)
  comparison_list[[col_name]] <- compare_values(complete_data, result$data, result$data[[imputed_col]], col_name)
}

# Prepare the results in a combined wide format, ensuring equal row numbers
results <- cbind(
  "TRUE1" = head(comparison_list[["S.Length"]][, "True_Value"], 5),
  "IMPUTED1" = head(comparison_list[["S.Length"]][, "Imputed_Value"], 5),
  "TRUE2" = head(comparison_list[["S.Width"]][, "True_Value"], 5),
  "IMPUTED2" = head(comparison_list[["S.Width"]][, "Imputed_Value"], 5),
  "TRUE3" = head(comparison_list[["P.Length"]][, "True_Value"], 5),
  "IMPUTED3" = head(comparison_list[["P.Length"]][, "Imputed_Value"], 5),
  "TRUE4" = head(comparison_list[["P.Width"]][, "True_Value"], 5),
  "IMPUTED4" = head(comparison_list[["P.Width"]][, "Imputed_Value"], 5)
)

# Print the combined wide format table
print(results)

# Load the reactable library
library(reactable)

# Create the reactable
reactable(results, columns = list(
  TRUE1 = colDef(name = "True"),
  IMPUTED1 = colDef(name = "Imputed"),
  TRUE2 = colDef(name = "True"),
  IMPUTED2 = colDef(name = "Imputed"),
  TRUE3 = colDef(name = "True"),
  IMPUTED3 = colDef(name = "Imputed"),
  TRUE4 = colDef(name = "True"),
  IMPUTED4 = colDef(name = "Imputed")
),
columnGroups = list(
  colGroup(name = "S.Length", columns = c("TRUE1", "IMPUTED1")),
  colGroup(name = "S.Width", columns = c("TRUE2", "IMPUTED2")),
  colGroup(name = "P.Length", columns = c("TRUE3", "IMPUTED3")),
  colGroup(name = "P.Width", columns = c("TRUE4", "IMPUTED4"))
),
striped = TRUE,
highlight = TRUE,
bordered = TRUE
)

Any scripts or data that you put into this service are public.

VIM documentation built on Jan. 10, 2026, 9:13 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

VIM
Visualization and Imputation of Missing Values

Imputation Method vimpute
In VIM: Visualization and Imputation of Missing Values

Introduction

Function Arguments

Data

Basic Usage

Default Imputation

Advanced Options

Parameter `method`

Parameter `pmm`

Parameter `formula`

Parameter `tune`

Parameters `nseq` and `eps`

Parameter `imp_var`

Parameter `pred_history`

Performance

Try the VIM package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

VIM Visualization and Imputation of Missing Values

Imputation Method vimpute In VIM: Visualization and Imputation of Missing Values

Introduction

Function Arguments

Data

Basic Usage

Default Imputation

Advanced Options

Parameter method

Parameter pmm

Parameter formula

Parameter tune

Parameters nseq and eps

Parameter imp_var

Parameter pred_history

Performance

Try the VIM package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

VIM
Visualization and Imputation of Missing Values

Imputation Method vimpute
In VIM: Visualization and Imputation of Missing Values

Parameter `method`

Parameter `pmm`

Parameter `formula`

Parameter `tune`

Parameters `nseq` and `eps`

Parameter `imp_var`

Parameter `pred_history`