```r
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
miceFast provides fast imputation methods built on C++/RcppArmadillo (Eddelbuettel & Sanderson, 2014). The main interfaces are:

- fill_NA() and fill_NA_N() work with dplyr and data.table pipelines.
- new(miceFast) provides an OOP interface for maximum control.
- pool() combines results from multiply imputed datasets using Rubin's rules (Rubin, 1987) with the Barnard-Rubin (1999) degrees-of-freedom adjustment.

For missing-data theory (MCAR/MAR/MNAR) and full MI workflows, see the companion vignette: Missing Data Mechanisms and Multiple Imputation. For a comprehensive treatment, see van Buuren (2018).
```r
library(miceFast)
library(dplyr)
library(data.table)
library(ggplot2)

set.seed(123456)
```
miceFast supports several models via the model argument in fill_NA() and fill_NA_N():
| Model | Type | Stochastic? | Use case |
|-------|------|-------------|----------|
| lm_pred | Linear regression | No | Deterministic point prediction. With only an Intercept column, equivalent to mean imputation. |
| lm_bayes | Bayesian linear regression | Yes | Draws coefficients from their posterior. Recommended for MI of continuous variables. |
| lm_noise | Linear regression + noise | Yes (improper) | Adds residual noise but omits parameter uncertainty. Useful for sensitivity analysis, not recommended as the primary MI model. |
| lda | Linear Discriminant Analysis | Approximate | For categorical variables. With a random ridge parameter it becomes stochastic. Suitable for MI but not strictly proper (uses ad-hoc perturbation, not a posterior draw). |
| pmm | Predictive Mean Matching | Yes (proper) | Imputes from observed values via nearest-neighbour matching on Bayesian-posterior predictions. Works for both continuous and categorical variables. Available through fill_NA_N() / OOP impute() / impute_N(). |
For MI you need a stochastic model. lm_bayes is strictly
proper (Rubin, 1987): it draws both coefficients and residual
variance from their posterior. pmm is also proper. It draws
coefficients from the posterior for predictions on missing rows, then
matches to the nearest observed values (Type II PMM; van Buuren, 2018).
It works for both continuous and categorical variables and naturally
preserves the observed data distribution. For categorical variables,
lda with ridge = runif(1, ...) is an approximate approach.
The random ridge creates between-imputation variability, but it is not a
true posterior draw. lm_noise is improper. It adds residual noise
but omits parameter uncertainty. Both lda + ridge and lm_noise work
well in practice and are useful for sensitivity analysis.
See the MI vignette.
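The difference between a proper posterior draw (lm_bayes-style) and a deterministic fit can be sketched in base R. This is an illustrative sketch of the standard normal-inverse-chi-square posterior for a linear model, not miceFast's internal C++ code; all variable names here are made up for the example.

```r
# Illustrative sketch (not miceFast internals) of a proper posterior draw:
# draw sigma^2 from its scaled inverse-chi-square posterior, then beta from
# N(beta_hat, sigma^2 * (X'X)^-1), and only then predict missing values.
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n))            # design matrix with intercept
y <- X %*% c(2, 3) + rnorm(n)

fit <- lm.fit(X, y)
beta_hat <- fit$coefficients
df_res <- n - ncol(X)
s2 <- sum(fit$residuals^2) / df_res
XtX_inv <- chol2inv(chol(crossprod(X)))

draw_beta <- function() {
  sigma2 <- df_res * s2 / rchisq(1, df_res)   # sigma^2 | y
  L <- chol(sigma2 * XtX_inv)
  beta_hat + drop(t(L) %*% rnorm(ncol(X)))    # beta | sigma^2, y
}

# Two draws differ; this between-draw variability is what makes MI proper.
b1 <- draw_beta()
b2 <- draw_beta()
isTRUE(all.equal(b1, b2))  # FALSE: each imputation uses fresh coefficients
```

A deterministic model (lm_pred) would reuse beta_hat every time, so all m imputed datasets would be identical and the between-imputation variance in Rubin's rules would collapse to zero.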
The package ships with air_miss, an extended version of airquality with additional columns
including a character variable, weights, groups, and a categorized Ozone variable.
```r
data(air_miss)
str(air_miss)
```
```r
upset_NA(air_miss, 6)
```
Before imputing, check Variance Inflation Factors. Values above ~10 suggest problematic collinearity that can destabilize imputation models. Consider dropping or combining redundant predictors before imputing.
```r
VIF(
  air_miss,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups")
)
```
## fill_NA()

fill_NA() imputes missing values in a single column using a specified model and predictors.
It accepts column names (strings) or position indices and works inside dplyr::mutate() or
data.table := expressions.
```r
data(air_miss)

result <- air_miss %>%
  # Continuous variable: Bayesian linear model (stochastic)
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )) %>%
  # Categorical variable: LDA
  mutate(x_char_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  ))

head(result[, c("Ozone", "Ozone_imp", "x_character", "x_char_imp")])
```
Grouping fits a separate model per group, which is useful when the relationship between variables varies across subpopulations. Weights allow for heteroscedasticity or survey designs.
```r
data(air_miss)

result_grouped <- air_miss %>%
  group_by(groups) %>%
  do(mutate(.,
    Solar_R_imp = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = .[["weights"]]
    )
  )) %>%
  ungroup()

head(result_grouped[, c("Solar.R", "Solar_R_imp", "groups")])
```
For right-skewed variables (like Ozone), use logreg = TRUE to impute on the log scale.
The model fits on $\log(y)$ and back-transforms the predictions:
```r
data(air_miss)

result_log <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp", "Intercept"),
    logreg = TRUE
  ))

# Compare distributions: log imputation avoids negative values
summary(result_log[c("Ozone", "Ozone_imp")])
```
You can refer to columns by position instead of name.
Check names(air_miss) to find the right positions:
```r
data(air_miss)

result_pos <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  ))

head(result_pos[, c("Ozone", "Ozone_imp")])
```
```r
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_imp := fill_NA(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp")
)]

air_miss[, x_char_imp := fill_NA(
  x = .SD,
  model = "lda",
  posit_y = "x_character",
  posit_x = c("Wind", "Temp")
)]

# With grouping
air_miss[, Solar_R_imp := fill_NA(
  x = .SD,
  model = "lm_pred",
  posit_y = "Solar.R",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]]
), by = .(groups)]
```
## fill_NA_N()

For lm_bayes and lm_noise, fill_NA_N() generates k stochastic draws
per missing observation and returns their average. For pmm, k is the
number of nearest observed neighbours from which one value is randomly selected
(no averaging). In both cases the result is a single filled-in dataset.
Between-imputation variance is lost, so Rubin's rules cannot be applied.
For MI with continuous variables, use fill_NA() + pool() with lm_bayes.
For MI with PMM (proper, works for both continuous and categorical
variables), use the OOP interface: call impute("pmm", ...) in a loop
(see the OOP section and the
MI vignette).
```r
data(air_miss)

result_n <- air_miss %>%
  # PMM with 20 draws. Imputes from observed values.
  mutate(Ozone_pmm = fill_NA_N(
    x = .,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    k = 20
  )) %>%
  # lm_noise with 30 draws and weights
  mutate(Ozone_noise = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  ))
```
The same PMM imputation with data.table, using weights and grouping:

```r
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_pmm := fill_NA_N(
  x = .SD,
  model = "pmm",
  posit_y = "Ozone",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]],
  logreg = TRUE,
  k = 20
), by = .(groups)]
```
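The k-nearest-neighbour step of PMM can be sketched in a few lines of base R. This is an illustrative toy, not miceFast internals; the vectors below are made-up values, and in real Type II PMM the predictions would come from posterior-draw coefficients.

```r
# Illustrative sketch of PMM's matching step (not miceFast internals).
set.seed(7)
y_obs <- c(12, 18, 23, 31, 40, 55)      # observed outcome values
yhat_obs <- c(14, 17, 25, 30, 42, 50)   # model predictions on observed rows
yhat_mis <- 28                          # prediction on one missing row
k <- 3

# Find the k observed rows whose predictions are closest to the missing
# row's prediction...
nearest <- order(abs(yhat_obs - yhat_mis))[1:k]
# ...and impute by sampling one of their *observed* values (no averaging).
imputed <- sample(y_obs[nearest], 1)
imputed %in% y_obs  # TRUE: PMM only ever returns observed values
```

Because the imputed value is always drawn from y_obs, PMM cannot produce impossible values (negative concentrations, unseen categories), which is why it preserves the observed distribution.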
Use compare_imp() to visually compare the distribution of observed vs. imputed values:
```r
data(air_miss)

air_miss_imp <- air_miss %>%
  mutate(
    Ozone_bayes = fill_NA(x = ., model = "lm_bayes", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp")),
    Ozone_pmm = fill_NA_N(x = ., model = "pmm", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp"), k = 20)
  )

compare_imp(air_miss_imp, origin = "Ozone", target = c("Ozone_bayes", "Ozone_pmm"))
```
## pool()

For proper statistical inference on incomplete data, use the MI workflow: generate several completed datasets, fit the analysis model on each, and combine the results with pool(). pool() works with any model that has coef() and vcov() methods.
```r
data(air_miss)

# Select the 4 core variables: Ozone and Solar.R have NAs; Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Step 1: Generate m = 10 completed datasets.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# ensures all NAs are filled in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  df %>%
    mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
    mutate(Ozone = fill_NA(., "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp")))
})

# Step 2: Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Step 3: Pool using Rubin's rules
pool_res <- pool(fits)
pool_res
summary(pool_res)
```
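The pooling arithmetic behind Rubin's rules is short enough to write out by hand. The sketch below pools a single coefficient using made-up estimates and squared standard errors; pool() does this for every coefficient and additionally applies the Barnard-Rubin small-sample adjustment to the degrees of freedom, which is omitted here.

```r
# Minimal sketch of Rubin's rules for one coefficient (illustrative values).
q <- c(1.92, 2.05, 1.98, 2.10, 1.95)       # point estimates from m = 5 fits
u <- c(0.040, 0.038, 0.042, 0.039, 0.041)  # their squared standard errors

m <- length(q)
q_bar <- mean(q)               # pooled point estimate
W <- mean(u)                   # within-imputation variance
B <- var(q)                    # between-imputation variance
T_var <- W + (1 + 1/m) * B     # total variance
r <- (1 + 1/m) * B / W         # relative increase in variance due to NAs
nu <- (m - 1) * (1 + 1/r)^2    # classic (large-sample) Rubin df

c(estimate = q_bar, se = sqrt(T_var), df = nu)
```

Note that T_var is always at least W: ignoring the between-imputation component B (as single imputation does) understates the standard error.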
In practice you often need to impute every variable that has missing values, then run MI with proper pooling. The key is ordering: impute variables whose predictors are complete first, then variables that depend on the freshly imputed ones. This sequential chain resolves joint missingness in a single pass without iterative FCS.
Below is a complete workflow using air_miss.
```r
data(air_miss)

# Define a function that fills all variables with NAs in one pass.
# Order matters: Solar.R first (Wind, Temp are complete), then Ozone
# (uses the freshly imputed Solar.R), then categorical variables.
impute_all <- function(data) {
  data %>%
    # Continuous: Solar.R (predictors fully observed)
    mutate(Solar.R = fill_NA(
      x = ., model = "lm_bayes",
      posit_y = "Solar.R", posit_x = c("Wind", "Temp")
    )) %>%
    # Continuous: Ozone (Solar.R now complete)
    mutate(Ozone = fill_NA(
      x = ., model = "lm_bayes",
      posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
    )) %>%
    # Categorical: x_character (LDA with random ridge for stochasticity)
    mutate(x_character = fill_NA(
      x = ., model = "lda",
      posit_y = "x_character", posit_x = c("Wind", "Temp"),
      ridge = runif(1, 0.5, 50)
    )) %>%
    # Categorical: Ozone_chac (use tryCatch for safety with small groups)
    group_by(groups) %>%
    do(mutate(., Ozone_chac = tryCatch(
      fill_NA(
        x = ., model = "lda",
        posit_y = "Ozone_chac", posit_x = c("Temp", "Wind")
      ),
      error = function(e) .[["Ozone_chac"]]
    ))) %>%
    ungroup()
}

# MI: generate m = 10 completed datasets
set.seed(2024)
m <- 10
completed <- lapply(1:m, function(i) impute_all(air_miss))

# Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Pool using Rubin's rules
pool_result <- pool(fits)
pool_result
summary(pool_result)
```
The same workflow using data.table:
```r
data(air_miss)
setDT(air_miss)

impute_all_dt <- function(dt) {
  d <- copy(dt)
  # Order: Solar.R first (predictors complete), then Ozone, then categoricals
  d[, Solar.R := fill_NA(
    x = .SD, model = "lm_bayes",
    posit_y = "Solar.R", posit_x = c("Wind", "Temp")
  )]
  d[, Ozone := fill_NA(
    x = .SD, model = "lm_bayes",
    posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp")
  )]
  d[, x_character := fill_NA(
    x = .SD, model = "lda",
    posit_y = "x_character", posit_x = c("Wind", "Temp"),
    ridge = runif(1, 0.5, 50)
  )]
  d[, Ozone_chac := tryCatch(
    fill_NA(
      x = .SD, model = "lda",
      posit_y = "Ozone_chac", posit_x = c("Temp", "Wind")
    ),
    error = function(e) .SD[["Ozone_chac"]]
  ), by = .(groups)]
  d
}

set.seed(2024)
completed_dt <- lapply(1:10, function(i) impute_all_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits_dt)
```
## miceFast OOP Module

For maximum performance and fine-grained control, use the C++ object directly. This interface operates on numeric matrices and uses column position indices.
| Method | Description |
|--------|-------------|
| set_data(x) | Set the data matrix (by reference). |
| set_g(g) | Set a grouping variable (positive numeric, no NAs). |
| set_w(w) | Set a weighting variable (positive numeric, no NAs). |
| impute(model, y, x) | Single imputation. All models including pmm (k = 1). |
| impute_N(model, y, x, k) | lm_bayes/lm_noise: averaged k draws. pmm: samples from k nearest observed values. |
| update_var(y, imps) | Permanently update a column with imputations. |
| vifs(y, x) | Variance Inflation Factors. |
| get_data() / get_g() / get_w() / get_index() | Retrieve data or metadata. |
| sort_byg() / is_sorted_byg() | Sort by grouping variable. |
| which_updated() | Check which columns have been updated. |
Note that update_var() permanently alters the data matrix by reference.
When a grouping variable is set, data is automatically sorted on first imputation;
use get_index() to recover the original row order.
```r
data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1, index = 1:nrow(airquality))

model <- new(miceFast)
model$set_data(data)

# Single imputation with linear model (col 1 = Ozone)
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Averaged multiple imputation for Solar.R (col 2, Bayesian, k = 10 draws)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), k = 10)$imputations)

model$which_updated()
head(model$get_data(), 4)
```
```r
data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(airquality[, 5])

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), k = 10)$imputations)

# Recover original row order
head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```
When groups are not pre-sorted, the data is automatically sorted on first imputation:
```r
data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality))
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), 10)$imputations)

head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```
The OOP interface can also drive the full MI loop. Each iteration creates a fresh model, imputes all variables with missing values, and returns the completed matrix in the original row order.
This example uses lm_bayes; for PMM (which also works for categorical variables),
simply replace "lm_bayes" with "pmm". PMM is proper and draws from the
Bayesian posterior before matching to observed values.
```r
data(air_miss)

# Prepare a numeric matrix with an intercept and row index
mat <- cbind(
  as.matrix(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]),
  intercept = 1,
  index = seq_len(nrow(air_miss))
)
groups <- as.numeric(air_miss$groups)

impute_oop <- function(mat, groups) {
  m <- new(miceFast)
  m$set_data(mat + 0)  # force a copy: set_data uses the matrix by reference
  m$set_g(groups)
  # col 1 = Ozone, col 2 = Solar.R, col 3 = Wind, col 4 = Temp, col 5 = intercept
  m$update_var(1, m$impute("lm_bayes", 1, c(2, 3, 4))$imputations)
  m$update_var(2, m$impute("lm_bayes", 2, c(3, 4, 5))$imputations)
  completed <- m$get_data()[order(m$get_index()), ]
  as.data.frame(completed)
}

set.seed(2025)
completed <- lapply(1:10, function(i) impute_oop(mat, groups))
fits <- lapply(completed, function(d) lm(V1 ~ V3 + V4, data = d))
pool(fits)
summary(pool(fits))
```
The mice package uses Fully Conditional Specification (FCS): it imputes each
variable given all others and cycles through multiple iterations until convergence.
miceFast can do exactly the same thing. The key is that update_var() modifies the
data matrix in place, so each subsequent impute() call sees the freshly imputed
values.
The same logic works with fill_NA and :=, which also updates columns in place:
```r
data(air_miss)
setDT(air_miss)

fcs_dt <- function(dt, n_iter = 5) {
  d <- copy(dt)
  na_ozone <- is.na(d$Ozone)
  na_solar <- is.na(d[["Solar.R"]])
  d <- naive_fill_NA(d)  # initialise
  for (iter in seq_len(n_iter)) {
    set(d, which(na_ozone), "Ozone", NA_real_)  # restore NAs
    d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
    set(d, which(na_solar), "Solar.R", NA_real_)
    d[, Solar.R := fill_NA_N(.SD, "pmm", "Solar.R", c("Ozone", "Wind", "Temp", "Intercept"))]
  }
  d
}

set.seed(2025)
completed_dt <- lapply(1:10, function(i) fcs_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Wind + Temp, data = d))
pool(fits_dt)
```
With a monotone missing-data pattern a single pass (n_iter = 1) is sufficient.
For complex interlocking patterns, 5--10 iterations are typical. The OOP interface
avoids R-level data copies and is fastest for large datasets.
## corrData

The corrData module generates correlated data for simulations.
This is useful for creating test datasets with known properties.
```r
# 3 continuous variables, 100 observations
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
  1.0, 0.7, 0.3,
  0.7, 1.0, 0.5,
  0.3, 0.5, 1.0
), nrow = 3)

cd <- new(corrData, 100, means, cor_matrix)
sim_data <- cd$fill("contin")
round(cor(sim_data), 2)
```
```r
# With 2 categorical variables: first argument is nr_cat
cd2 <- new(corrData, 2, 200, means, cor_matrix)
sim_discrete <- cd2$fill("discrete")
head(sim_discrete)
```
Creating matrices from data frames: R matrices hold only one type.
Convert factors with model.matrix():
```r
mtcars$cyl <- factor(mtcars$cyl)
model.matrix(~ ., data = mtcars, na.action = "na.pass")
```
Collinearity: Always check VIF() before imputing. High VIF (>10) indicates
collinearity that can destabilize imputation models.
Error handling with groups: Small groups may not have enough observations.
Wrap fill_NA() calls in tryCatch():
```r
tryCatch(
  fill_NA(x = ., model = "lda", posit_y = "y", posit_x = c("x1", "x2")),
  error = function(e) .[["y"]]
)
```
BLAS optimization: Install an optimized BLAS library for significant speedups:
```sh
sudo apt-get install libopenblas-dev
```

## References

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Barnard, J. and Rubin, D.B. (1999). Small-sample degrees of freedom with multiple imputation. Biometrika, 86(4), 948-955.
van Buuren, S. (2018). Flexible Imputation of Missing Data (2nd ed.). Chapman & Hall/CRC.
Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. Computational Statistics and Data Analysis, 71, 1054-1063.