miceFast - Introduction and Advanced Usage

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

Overview

miceFast provides fast imputation methods built on C++/RcppArmadillo (Eddelbuettel & Sanderson, 2014). The main interfaces are:

  - fill_NA() and fill_NA_N(): convenience functions for data frames and data.tables, usable inside dplyr::mutate() and data.table := expressions;
  - the miceFast OOP module: a C++ class that operates directly on numeric matrices, for maximum performance;
  - helper functions: VIF(), pool(), compare_imp(), upset_NA(), and naive_fill_NA();
  - corrData: a module for generating correlated data for simulations.

For missing-data theory (MCAR/MAR/MNAR) and full MI workflows, see the companion vignette: Missing Data Mechanisms and Multiple Imputation. For a comprehensive treatment, see van Buuren (2018).

library(miceFast)
library(dplyr)
library(data.table)
library(ggplot2)
set.seed(123456)

Available Imputation Models

miceFast supports several models via the model argument in fill_NA() and fill_NA_N():

| Model | Type | Stochastic? | Use case |
|-------|------|-------------|----------|
| lm_pred | Linear regression | No | Deterministic point prediction. With only an Intercept column, equivalent to mean imputation. |
| lm_bayes | Bayesian linear regression | Yes | Draws coefficients from their posterior. Recommended for MI of continuous variables. |
| lm_noise | Linear regression + noise | Yes (improper) | Adds residual noise but omits parameter uncertainty. Useful for sensitivity analysis, not recommended as the primary MI model. |
| lda | Linear Discriminant Analysis | Approximate | For categorical variables. With a random ridge parameter it becomes stochastic. Suitable for MI but not strictly proper (uses ad-hoc perturbation, not a posterior draw). |
| pmm | Predictive Mean Matching | Yes (proper) | Imputes from observed values via nearest-neighbour matching on Bayesian-posterior predictions. Works for both continuous and categorical variables. Available through fill_NA_N() / OOP impute() / impute_N(). |

For MI you need a stochastic model. lm_bayes is strictly proper (Rubin, 1987): it draws both coefficients and residual variance from their posterior. pmm is also proper. It draws coefficients from the posterior for predictions on missing rows, then matches to the nearest observed values (Type II PMM; van Buuren, 2018). It works for both continuous and categorical variables and naturally preserves the observed data distribution. For categorical variables, lda with ridge = runif(1, ...) is an approximate approach. The random ridge creates between-imputation variability, but it is not a true posterior draw. lm_noise is improper. It adds residual noise but omits parameter uncertainty. Both lda + ridge and lm_noise work well in practice and are useful for sensitivity analysis. See the MI vignette.
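The matching idea behind PMM can be illustrated in a few lines of base R. This is a simplified sketch, not the package internals: it uses plain OLS predictions where the package's Type II PMM draws coefficients from the posterior, and it uses a fixed k.

```r
# Illustrative sketch of the PMM matching step: predict y for all rows,
# then replace each missing y with the observed y of one of its k nearest
# neighbours in predicted-value space.
set.seed(1)
y   <- airquality$Ozone
X   <- model.matrix(~ Wind + Temp, data = airquality)
obs <- !is.na(y)

fit  <- lm.fit(X[obs, ], y[obs])        # OLS here; lm_bayes would draw coefficients
pred <- drop(X %*% fit$coefficients)    # predictions for every row

k <- 5
imputed <- sapply(which(!obs), function(i) {
  d  <- abs(pred[obs] - pred[i])        # distance to each observed row
  nn <- order(d)[1:k]                   # k nearest observed neighbours
  sample(y[obs][nn], 1)                 # draw one observed value
})

# Every imputed value is an actually observed Ozone value,
# so the support of the observed distribution is preserved.
all(imputed %in% y[obs])
```

Because imputations are drawn from observed values, PMM never produces impossible values (e.g. negative Ozone), which is one reason it preserves the observed distribution so well.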

Example Data

The package ships with air_miss, an extended version of airquality with additional columns including a character variable, weights, groups, and a categorized Ozone variable.

data(air_miss)
str(air_miss)

Examining Missingness

upset_NA(air_miss, 6)

Checking for Collinearity

Before imputing, check Variance Inflation Factors. Values above ~10 suggest problematic collinearity that can destabilize imputation models. Consider dropping or combining redundant predictors before imputing.

VIF(
  air_miss,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups")
)
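For numeric predictors, the VIFs can be cross-checked by hand: the VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors. A minimal base-R sketch on complete cases (numeric columns only; the package's VIF() also handles character and grouping variables):

```r
# Hand-rolled VIF for numeric predictors: VIF_j = 1 / (1 - R^2_j),
# where R^2_j is from regressing column j on the remaining columns.
vif_by_hand <- function(X) {
  X <- X[complete.cases(X), , drop = FALSE]
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}

X <- as.matrix(airquality[, c("Solar.R", "Wind", "Temp")])
round(setNames(vif_by_hand(X), colnames(X)), 2)
```

Values near 1 indicate little collinearity; as the threshold of ~10 is approached, the standard errors of the affected coefficients inflate accordingly.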

Single Imputation with fill_NA()

fill_NA() imputes missing values in a single column using a specified model and predictors. It accepts column names (strings) or position indices and works inside dplyr::mutate() or data.table := expressions.

Basic usage (dplyr)

data(air_miss)

result <- air_miss %>%
  # Continuous variable: Bayesian linear model (stochastic)
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )) %>%
  # Categorical variable: LDA
  mutate(x_char_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  ))

head(result[, c("Ozone", "Ozone_imp", "x_character", "x_char_imp")])

With weights and grouped imputation

Grouping fits a separate model per group, which is useful when the relationship between variables varies across subpopulations. Weights allow for heteroscedasticity or survey designs.

data(air_miss)

result_grouped <- air_miss %>%
  group_by(groups) %>%
  do(mutate(.,
    Solar_R_imp = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = .[["weights"]]
    )
  )) %>%
  ungroup()

head(result_grouped[, c("Solar.R", "Solar_R_imp", "groups")])

Log-transformation

For right-skewed variables (like Ozone), use logreg = TRUE to impute on the log scale. The model fits on $\log(y)$ and back-transforms the predictions:

data(air_miss)

result_log <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp", "Intercept"),
    logreg = TRUE
  ))

# Compare distributions: log imputation avoids negative values
summary(result_log[c("Ozone", "Ozone_imp")])

Using column position indices

You can refer to columns by position instead of name. Check names(air_miss) to find the right positions:

data(air_miss)

result_pos <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  ))

head(result_pos[, c("Ozone", "Ozone_imp")])

Basic usage (data.table)

data(air_miss)
setDT(air_miss)

air_miss[, Ozone_imp := fill_NA(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp")
)]

air_miss[, x_char_imp := fill_NA(
  x = .SD,
  model = "lda",
  posit_y = "x_character",
  posit_x = c("Wind", "Temp")
)]

# With grouping
air_miss[, Solar_R_imp := fill_NA(
  x = .SD,
  model = "lm_pred",
  posit_y = "Solar.R",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]]
), by = .(groups)]

Multiple Imputation with fill_NA_N()

For lm_bayes and lm_noise, fill_NA_N() generates k stochastic draws per missing observation and returns their average. For pmm, k is the number of nearest observed neighbours from which one value is randomly selected (no averaging). In both cases the result is a single filled-in dataset. Between-imputation variance is lost, so Rubin's rules cannot be applied. For MI with continuous variables, use fill_NA() + pool() with lm_bayes. For MI with PMM (proper, works for both continuous and categorical variables), use the OOP interface: call impute("pmm", ...) in a loop (see the OOP section and the MI vignette).

dplyr

data(air_miss)

result_n <- air_miss %>%
  # PMM with 20 draws. Imputes from observed values.
  mutate(Ozone_pmm = fill_NA_N(
    x = .,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    k = 20
  )) %>%
  # lm_noise with 30 draws and weights
  mutate(Ozone_noise = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  ))

data.table with grouping

data(air_miss)
setDT(air_miss)

air_miss[, Ozone_pmm := fill_NA_N(
  x = .SD,
  model = "pmm",
  posit_y = "Ozone",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]],
  logreg = TRUE,
  k = 20
), by = .(groups)]

Comparing Imputations

Use compare_imp() to visually compare the distribution of observed vs. imputed values:

data(air_miss)

air_miss_imp <- air_miss %>%
  mutate(
    Ozone_bayes = fill_NA(x = ., model = "lm_bayes",
      posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")),
    Ozone_pmm = fill_NA_N(x = ., model = "pmm",
      posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp"), k = 20)
  )

compare_imp(air_miss_imp, origin = "Ozone", target = c("Ozone_bayes", "Ozone_pmm"))

Multiple Imputation with pool()

For proper statistical inference on incomplete data, use the MI workflow:

  1. Generate m completed datasets (each with a different stochastic imputation).
  2. Fit your analysis model on each completed dataset.
  3. Pool results using Rubin's rules via pool().

pool() works with any model that has coef() and vcov() methods.
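Rubin's rules themselves are simple arithmetic. A minimal base-R sketch for a single coefficient (pool() applies this to all coefficients and additionally computes degrees of freedom; the input numbers below are toy values, not results from air_miss):

```r
# Rubin's rules for one coefficient across m completed-data fits:
# pooled estimate = mean of the m estimates;
# total variance  = W + (1 + 1/m) * B, where W is the mean within-imputation
# variance and B the between-imputation variance of the estimates.
pool_one <- function(est, se) {
  m    <- length(est)
  qbar <- mean(est)               # pooled point estimate
  W    <- mean(se^2)              # within-imputation variance
  B    <- var(est)                # between-imputation variance
  Tvar <- W + (1 + 1/m) * B       # total variance
  c(estimate = qbar, se = sqrt(Tvar))
}

# Toy input: estimates and standard errors from m = 5 imputed-data fits
pool_one(est = c(1.9, 2.1, 2.0, 2.2, 1.8),
         se  = c(0.30, 0.28, 0.31, 0.29, 0.30))
```

Note that the pooled standard error exceeds each single-imputation standard error: the between-imputation term B is exactly the extra uncertainty that deterministic or single imputation would hide.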

data(air_miss)

# Select the 4 core variables: Ozone and Solar.R have NAs; Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Step 1: Generate m = 10 completed datasets.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# ensures all NAs are filled in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  df %>%
    mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
    mutate(Ozone   = fill_NA(., "lm_bayes", "Ozone",   c("Solar.R", "Wind", "Temp")))
})

# Step 2: Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Step 3: Pool using Rubin's rules
pool_res <- pool(fits)
pool_res
summary(pool_res)

Full Imputation: Filling All Variables and MI with Rubin's Rules

In practice you often need to impute every variable that has missing values, then run MI with proper pooling. The key is ordering: impute variables whose predictors are complete first, then variables that depend on the freshly imputed ones. This sequential chain resolves joint missingness in a single pass without iterative FCS.

Below is a complete workflow using air_miss.

data(air_miss)

# Define a function that fills all variables with NAs in one pass.
# Order matters: Solar.R first (Wind, Temp are complete), then Ozone
# (uses the freshly imputed Solar.R), then categorical variables.
impute_all <- function(data) {
  data %>%
    # Continuous: Solar.R (predictors fully observed)
    mutate(Solar.R = fill_NA(
      x = ., model = "lm_bayes",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp")
    )) %>%
    # Continuous: Ozone (Solar.R now complete)
    mutate(Ozone = fill_NA(
      x = ., model = "lm_bayes",
      posit_y = "Ozone",
      posit_x = c("Solar.R", "Wind", "Temp")
    )) %>%
    # Categorical: x_character (LDA with random ridge for stochasticity)
    mutate(x_character = fill_NA(
      x = ., model = "lda",
      posit_y = "x_character",
      posit_x = c("Wind", "Temp"),
      ridge = runif(1, 0.5, 50)
    )) %>%
    # Categorical: Ozone_chac (use tryCatch for safety with small groups)
    group_by(groups) %>%
    do(mutate(., Ozone_chac = tryCatch(
      fill_NA(
        x = ., model = "lda",
        posit_y = "Ozone_chac",
        posit_x = c("Temp", "Wind")
      ),
      error = function(e) .[["Ozone_chac"]]
    ))) %>%
    ungroup()
}

# MI: generate m = 10 completed datasets
set.seed(2024)
m <- 10
completed <- lapply(1:m, function(i) impute_all(air_miss))

# Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Pool using Rubin's rules
pool_result <- pool(fits)
pool_result
summary(pool_result)

The same workflow using data.table:

data(air_miss)
setDT(air_miss)

impute_all_dt <- function(dt) {
  d <- copy(dt)
  # Order: Solar.R first (predictors complete), then Ozone, then categoricals
  d[, Solar.R := fill_NA(
    x = .SD, model = "lm_bayes",
    posit_y = "Solar.R",
    posit_x = c("Wind", "Temp")
  )]
  d[, Ozone := fill_NA(
    x = .SD, model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )]
  d[, x_character := fill_NA(
    x = .SD, model = "lda",
    posit_y = "x_character", posit_x = c("Wind", "Temp"),
    ridge = runif(1, 0.5, 50)
  )]
  d[, Ozone_chac := tryCatch(
    fill_NA(
      x = .SD, model = "lda",
      posit_y = "Ozone_chac", posit_x = c("Temp", "Wind")
    ),
    error = function(e) .SD[["Ozone_chac"]]
  ), by = .(groups)]
  d
}

set.seed(2024)
completed_dt <- lapply(1:10, function(i) impute_all_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits_dt)

The miceFast OOP Module

For maximum performance and fine-grained control, use the C++ object directly. This interface operates on numeric matrices and uses column position indices.

Methods

| Method | Description |
|--------|-------------|
| set_data(x) | Set the data matrix (by reference). |
| set_g(g) | Set a grouping variable (positive numeric, no NAs). |
| set_w(w) | Set a weighting variable (positive numeric, no NAs). |
| impute(model, y, x) | Single imputation. All models including pmm (k = 1). |
| impute_N(model, y, x, k) | lm_bayes/lm_noise: averaged k draws. pmm: samples from k nearest observed values. |
| update_var(y, imps) | Permanently update a column with imputations. |
| vifs(y, x) | Variance Inflation Factors. |
| get_data() / get_g() / get_w() / get_index() | Retrieve data or metadata. |
| sort_byg() / is_sorted_byg() | Sort by grouping variable. |
| which_updated() | Check which columns have been updated. |

Note that update_var() permanently alters the data matrix by reference. When a grouping variable is set, data is automatically sorted on first imputation; use get_index() to recover the original row order.

Simple example

data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1, index = 1:nrow(airquality))
model <- new(miceFast)
model$set_data(data)

# Single imputation with linear model (col 1 = Ozone)
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Averaged multiple imputation for Solar.R (col 2, Bayesian, k=10 draws)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), k = 10)$imputations)

model$which_updated()
head(model$get_data(), 4)

With weights and groups

data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(airquality[, 5])

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), k = 10)$imputations)

# Recover original row order
head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

With unsorted groups

When groups are not pre-sorted, the data is automatically sorted on first imputation:

data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), 10)$imputations)

head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)

Full MI workflow with OOP

The OOP interface can also drive the full MI loop. Each iteration creates a fresh model, imputes all variables with missing values, and returns the completed matrix in the original row order.

This example uses lm_bayes; for PMM (which also works for categorical variables), simply replace "lm_bayes" with "pmm". PMM is proper and draws from the Bayesian posterior before matching to observed values.

data(air_miss)

# Prepare a numeric matrix with an intercept and row index
mat <- cbind(
  as.matrix(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]),
  intercept = 1,
  index = seq_len(nrow(air_miss))
)
groups <- as.numeric(air_miss$groups)

impute_oop <- function(mat, groups) {
  m <- new(miceFast)
  m$set_data(mat + 0)  # copy. set_data uses the matrix by reference.
  m$set_g(groups)
  # col 1 = Ozone, col 2 = Solar.R, col 3 = Wind, col 4 = Temp, col 5 = intercept
  m$update_var(2, m$impute("lm_bayes", 2, c(3, 4, 5))$imputations)  # Solar.R first: its predictors are complete
  m$update_var(1, m$impute("lm_bayes", 1, c(2, 3, 4))$imputations)  # Ozone next: uses the freshly imputed Solar.R
  completed <- m$get_data()[order(m$get_index()), ]
  as.data.frame(completed)
}

set.seed(2025)
completed <- lapply(1:10, function(i) impute_oop(mat, groups))

fits <- lapply(completed, function(d) lm(V1 ~ V3 + V4, data = d))
pool(fits)
summary(pool(fits))

Iterative FCS (Chained Equations) with miceFast

The mice package uses Fully Conditional Specification (FCS): it imputes each variable given all others and cycles through multiple iterations until convergence. miceFast can do exactly the same thing. The key is that update_var() modifies the data matrix in place, so each subsequent impute() call sees the freshly imputed values.

data.table (convenience functions)

The same logic works with fill_NA and :=, which also updates columns in place:

data(air_miss)
setDT(air_miss)

fcs_dt <- function(dt, n_iter = 5) {
  d <- copy(dt)
  na_ozone <- is.na(d$Ozone)
  na_solar <- is.na(d[["Solar.R"]])
  d <- naive_fill_NA(d)                          # initialise
  for (iter in seq_len(n_iter)) {
    set(d, which(na_ozone), "Ozone", NA_real_)   # restore NAs
    d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
    set(d, which(na_solar), "Solar.R", NA_real_)
    d[, Solar.R := fill_NA_N(.SD, "pmm", "Solar.R", c("Ozone", "Wind", "Temp", "Intercept"))]
  }
  d
}

set.seed(2025)
completed_dt <- lapply(1:10, function(i) fcs_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Wind + Temp, data = d))
pool(fits_dt)

With a monotone missing-data pattern a single pass (n_iter = 1) is sufficient. For complex interlocking patterns, 5--10 iterations is typical. The OOP interface avoids R-level data copies and is fastest for large datasets.
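Whether a pattern is (close to) monotone can be judged by tabulating the distinct missingness patterns; a minimal base-R sketch (1 = observed, 0 = missing; a pattern is monotone if the columns can be ordered so the zeros form a staircase):

```r
# Tabulate the distinct missingness patterns of a data frame.
pattern_table <- function(df) {
  pat    <- ifelse(is.na(df), 0L, 1L)               # 0 = missing, 1 = observed
  counts <- table(apply(pat, 1, paste, collapse = ""))
  sort(counts, decreasing = TRUE)
}

# airquality: only Ozone and Solar.R contain NAs
pattern_table(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
```

Here only four patterns occur and the variables can be ordered Solar.R before Ozone so that each pattern's missing set nests in the next, which is why the sequential single-pass imputation used earlier in this vignette suffices for these data.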


Generating Correlated Data with corrData

The corrData module generates correlated data for simulations. This is useful for creating test datasets with known properties.

# 3 continuous variables, 100 observations
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
  1.0, 0.7, 0.3,
  0.7, 1.0, 0.5,
  0.3, 0.5, 1.0
), nrow = 3)

cd <- new(corrData, 100, means, cor_matrix)
sim_data <- cd$fill("contin")
round(cor(sim_data), 2)

# Discrete data: the extra first argument sets the number of categories
cd2 <- new(corrData, 2, 200, means, cor_matrix)
sim_discrete <- cd2$fill("discrete")
head(sim_discrete)

References

Eddelbuettel, D., & Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics & Data Analysis, 71, 1054-1063.

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Boca Raton, FL: Chapman & Hall/CRC.
