knitr::opts_chunk$set(echo = FALSE)
This vignette introduces the miceFast package, which provides fast imputations in an object-oriented programming style.
The underlying methods rely on quantitative models with closed-form solutions implemented via linear algebra operations.
Additional utility functions are provided for easy integration with popular R packages such as dplyr and data.table.
Before you begin, ensure the miceFast package is installed. You may also want to install a high-performance BLAS library for significant speedups.
Recommended: Install an optimized BLAS library (potentially up to 100x faster):
sudo apt-get install libopenblas-dev
bash
cd /Library/Frameworks/R.framework/Resources/lib
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib libRblas.dylib
pkgs <- c("miceFast", "mice", "ggplot2", "dplyr", "data.table") inst <- lapply(pkgs, library, character.only = TRUE) set.seed(123456)
The miceFast package accelerates imputations for quantitative models (with closed-form solutions) by using efficient linear algebra routines. It offers substantial speed improvements in:
Note: For details on the performance-testing procedure, see the file
performance_validity.R
in theextdata
folder. Additional simulation plots (which you are free to modify) are located inextdata/images
.
Environment: R 4.2 on Mac M1.
Missing data is a common issue in data analysis. One approach is to delete any row containing missing observations, though this can degrade the quality or representativeness of your dataset. Another, often better, approach is to use imputation methods (such as multiple or regular imputations) to fill in missing data, leveraging observed independent variables.
Since R and Python are user-friendly but can be slower for heavy computation, miceFast uses efficient C++/RcppArmadillo routines under the hood to speed up the imputation process.
The miceFast package provides additional functions (fill_NA
and fill_NA_N
) that are designed to be stable in the presence of user-level mistakes or complex data structures. Below are minimal examples using dplyr and data.table.
data(air_miss) # airquality with additional variables
upset_NA(air_miss, 6)
Below we show how to:
tryCatch
.# VIF (Variance Inflation Factor) > ~10 suggests collinearity problems. VIF( air_miss, posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups") )
air_miss <- air_miss %>% # Impute with grouping and weights group_by(groups) %>% do(mutate(., Solar_R_imp = fill_NA( x = ., model = "lm_pred", posit_y = "Solar.R", posit_x = c("Wind", "Temp", "Intercept"), w = .[["weights"]] ) )) %>% ungroup() %>% # Impute a discrete variable mutate(x_character_imp = fill_NA( x = ., model = "lda", posit_y = "x_character", posit_x = c("Wind", "Temp") )) %>% # Log regression because Ozone is roughly log-normal mutate(Ozone_imp1 = fill_NA( x = ., model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept"), logreg = TRUE )) %>% # Impute by position indices mutate(Ozone_imp2 = fill_NA( x = ., model = "lm_bayes", posit_y = 1, posit_x = c(4, 6), logreg = TRUE )) %>% # Imputations (mean of 30 draws) mutate(Ozone_imp3 = fill_NA_N( x = ., model = "lm_noise", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 30 )) %>% mutate(Ozone_imp4 = fill_NA_N( x = ., model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 30 )) %>% group_by(groups) %>% # Single evaluation model do(mutate(., Ozone_imp5 = fill_NA( x = ., model = "lm_pred", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE ) )) %>% do(mutate(., Ozone_imp6 = fill_NA_N( x = ., model = "pmm", posit_y = "Ozone", posit_x = c("Intercept", "x_character_imp", "Wind", "Temp"), w = .[["weights"]], logreg = TRUE, k = 20 ) )) %>% ungroup() %>% # Combine imputations mutate(Ozone_imp_mix = rowMeans(select(., starts_with("Ozone_imp")))) %>% # Protect against errors in small groups group_by(groups) %>% do(mutate(., Ozone_chac_imp = tryCatch( fill_NA( x = ., model = "lda", posit_y = "Ozone_chac", posit_x = c("Intercept", "Month", "Day", "Temp", "x_character_imp"), w = .[["weights"]] ), error = function(e) .[["Ozone_chac"]] ) )) %>% ungroup()
# Compare original vs. imputed distributions compare_imp(air_miss, origin = "Ozone", target = "Ozone_imp_mix") # Or compare multiple imputation columns compare_imp(air_miss, origin = "Ozone", target = c("Ozone_imp2", "Ozone_imp_mix"))
Below is a similar approach using data.table syntax.
data(air_miss) setDT(air_miss)
# Imputations air_miss[, Solar_R_imp := fill_NA_N( x = .SD, model = "lm_bayes", posit_y = "Solar.R", posit_x = c("Wind","Temp","Intercept"), w = .SD[["weights"]], k = 100 ), by = .(groups)] %>% .[, x_character_imp := fill_NA( x = .SD, model = "lda", posit_y = "x_character", posit_x = c("Wind","Temp","groups") )] %>% .[, Ozone_imp1 := fill_NA( x = .SD, model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept"), logreg = TRUE )] %>% .[, Ozone_imp2 := fill_NA( x = .SD, model = "lm_bayes", posit_y = 1, posit_x = c(4,6), logreg = TRUE )] %>% .[, Ozone_imp3 := fill_NA_N( x = .SD, model = "lm_noise", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 30 )] %>% .[, Ozone_imp4 := fill_NA_N( x = .SD, model = "lm_bayes", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 30 )] %>% .[, Ozone_imp5 := fill_NA( x = .SD, model = "lm_pred", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE ), by = .(groups)] %>% .[, Ozone_imp6 := fill_NA_N( x = .SD, model = "pmm", posit_y = "Ozone", posit_x = c("Intercept","x_character_imp","Wind","Temp"), w = .SD[["weights"]], logreg = TRUE, k = 10 ), by = .(groups)] %>% # Average across a set of methods .[, Ozone_imp_mix := apply(.SD, 1, mean), .SDcols = Ozone_imp1:Ozone_imp6] %>% # tryCatch for small group issues or collinearity .[, Ozone_chac_imp := tryCatch( fill_NA( x = .SD, model = "lda", posit_y = "Ozone_chac", posit_x = c("Intercept","Month","Day","Temp","x_character_imp"), w = .SD[["weights"]] ), error = function(e) .SD[["Ozone_chac"]] ), by = .(groups)]
You can generate correlated data via corrData
:
# Constructors: new(corrData, nr_cat, n_obs, means, cor_matrix) new(corrData, n_obs, means, cor_matrix) # Some relevant methods: cd_obj$fill("type") # "contin", "binom", or "discrete"
nr_cat
: number of categories for a discrete dependent variable n_obs
: number of observations means
: centers for independent variables cor_matrix
: a positive-definite correlation matrix Example usage:
cd <- new(corrData, nr_cat = 3, n_obs = 1000, means = c(0,1,2), cor_matrix = some_matrix) generated_data <- cd$fill("discrete")
The miceFast module performs fast imputations via various built-in models. Below is a summary of its main methods:
set_data(x)
: Provide the data as a numeric matrix by reference. set_g(g)
: Provide a grouping variable (numeric vector, must be positive, no NAs). set_w(w)
: Provide a weighting variable (numeric vector, must be positive, no NAs). impute(model, posit_y, posit_x)
: Single imputation according to the chosen model. impute_N(model, posit_y, posit_x, k)
: Multiple imputations (averaged result) for certain models. update_var(posit_y, imputations)
: Permanently update the object’s data for the specified column with new imputations. vifs(posit_y, posit_x)
: Compute Variance Inflation Factors. sort_byg()
/ is_sorted_byg()
: Sort the data by the grouping variable (or check if sorted). which_updated()
: Check which variables have been permanently updated in the object.Supported models:
"lda"
: Linear discriminant analysis (for categorical posit_y
). "lm_pred"
: Standard linear model prediction (including a mean-imputation approach if you only include an intercept). "lm_bayes"
: Linear model with Bayesian noise. "lm_noise"
: Similar to "lm_bayes"
but with different noise assumptions. "pmm"
: Predictive Mean Matching (used via impute_N
). Note: For purely mean imputation, add an intercept column and use
"lm_pred"
. The LDA model requires at least 15 complete observations. Linear models require at least as many complete observations as the number of predictors.
data <- cbind(as.matrix(mice::nhanes), intercept = 1, index = 1:nrow(mice::nhanes)) model <- new(miceFast) model$set_data(data) # Single imputation, linear model imp_result <- model$impute("lm_pred", posit_y = 2, posit_x = 5) model$update_var(2, imp_result$imputations) # LDA imputation for a categorical variable model$update_var(3, model$impute("lda", 3, c(1,2))$imputations) # Multiple imputation example multi_imp <- model$impute_N("lm_bayes", 4, c(1,2,3), k = 10) model$update_var(4, multi_imp$imputations) # Check which variables were updated model$which_updated() # Always be cautious with 'update_var' # since it permanently alters the object and your data matrix.
data <- cbind(as.matrix(airquality[,-5]), intercept = 1, index = 1:nrow(airquality)) weights <- rgamma(nrow(data), 3, 3) groups <- as.numeric(airquality[,5]) model <- new(miceFast) model$set_data(data) model$set_w(weights) model$set_g(groups) # Impute with group-wise weighting model$update_var(1, model$impute("lm_pred", 1, 6)$imputations) model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), k = 10)$imputations) # Inspect updated data, returning to original order via model$get_index() head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
data <- cbind(as.matrix(airquality[, -5]), intercept = 1, index = 1:nrow(airquality)) weights <- rgamma(nrow(data), 3, 3) groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE)) model <- new(miceFast) model$set_data(data) model$set_w(weights) model$set_g(groups) # The data will be automatically sorted by 'groups' during the first imputation: model$update_var(1, model$impute("lm_pred", 1, 6)$imputations) model$update_var(2, model$impute_N("lm_bayes", 2, c(1,3,4,5,6), 10)$imputations) head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
Creating matrices from data frames: Remember that standard R matrices can only hold one data type (often numeric). Factor and character columns must be transformed (e.g., via model.matrix
).
r
mtcars$cyl <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)
mtcars_mat <- model.matrix(~ ., data = mtcars, na.action = "na.pass")
Variance Inflation Factors (VIF): VIF measures how much the variance of a regression coefficient is inflated due to collinearity. High values (>10) may indicate problematic collinearity.
```r
airquality2 <- airquality
airquality2$Temp2 <- airquality2$Temp^2
airquality2$Month <- factor(airquality2$Month)
# Using data.table: VIF( airquality2, posit_y = "Ozone", posit_x = c("Solar.R", "Wind", "Temp", "Month", "Day", "Temp2"), correct = FALSE ) ```
Exposing C++ functions and classes with Rcpp modules
Dirk Eddelbuettel and Romain François (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-modules.pdf
RcppArmadillo: Easily Extending R with High-Performance C++ Code
Dirk Eddelbuettel and Conrad Sanderson (2012)
https://dirk.eddelbuettel.com/papers/RcppArmadillo-intro.pdf
Extending R with C++: A Brief Introduction to Rcpp
Dirk Eddelbuettel and James Joseph Balamuta (2018)
https://dirk.eddelbuettel.com/code/rcpp/Rcpp-introduction.pdf
MICE: Multivariate Imputation by Chained Equations in R
Stef van Buuren (2013)
CSCE 666: Pattern Analysis
Ricardo Gutierrez-Osuna (Fall 2013)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.