misl

Note: This package is currently experimental and under active development. The API may change. Feedback and bug reports are welcome via GitHub Issues.

misl implements Multiple Imputation by Super Learning (MISL), a flexible approach to handling missing data that uses a stacked ensemble of machine learning algorithms to impute missing values across continuous, binary, and categorical variables.

Rather than relying on a single parametric imputation model, MISL builds a super learner for each incomplete variable using the tidymodels framework, combining learners such as linear/logistic regression, random forests, gradient boosted trees, and MARS to produce well-calibrated imputations.
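The stacking idea behind a super learner can be sketched in a few lines of base R. This is a generic illustration, not the package's implementation (which builds on tidymodels): the two candidate learners, the simulated data, and the clip-and-normalise weighting (a crude stand-in for non-negative least squares) are all assumptions made for the example.

```r
set.seed(1)
n <- 200
d <- data.frame(x = runif(n, -2, 2))
d$y <- sin(d$x) + rnorm(n, sd = 0.3)

# Two hypothetical candidate learners, each returning a fitted model.
learners <- list(
  linear = function(dat) lm(y ~ x, data = dat),
  cubic  = function(dat) lm(y ~ poly(x, 3), data = dat)
)

# Step 1: out-of-fold predictions from every learner (5-fold CV).
folds <- sample(rep(1:5, length.out = n))
z <- matrix(NA_real_, n, length(learners),
            dimnames = list(NULL, names(learners)))
for (k in 1:5) {
  held_out <- folds == k
  for (j in seq_along(learners)) {
    fit <- learners[[j]](d[!held_out, , drop = FALSE])
    z[held_out, j] <- predict(fit, newdata = d[held_out, , drop = FALSE])
  }
}

# Step 2: meta-learner. Regress the outcome on the out-of-fold predictions,
# clip negative weights to zero, and rescale so the weights sum to one.
w <- coef(lm(d$y ~ z - 1))
w <- pmax(w, 0)
w <- w / sum(w)
names(w) <- names(learners)

# Step 3: refit each learner on all rows and blend their predictions.
full_fits <- lapply(learners, function(f) f(d))
ensemble_pred <- sapply(full_fits, predict, newdata = d) %*% w

print(round(w, 3))
```

In an imputation setting the ensemble would be fitted on the observed rows of each incomplete variable and used to generate values for the missing rows; that step is not shown here.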

The method is described in:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904–1915. doi: 10.1177/09622802221104238

misl is not yet on CRAN. Install the development version from GitHub:

# install.packages("remotes")
remotes::install_github("JustinManjourides/misl")

The following backend packages are optional but recommended:

install.packages(c("ranger", "xgboost", "earth"))

library(misl)

# Simulate a dataset, then introduce missingness into each column
set.seed(42)
n <- 200
demo_data <- data.frame(
  age    = rnorm(n, mean = 50, sd = 10),
  weight = rnorm(n, mean = 70, sd = 15),
  smoker = rbinom(n, 1, 0.3),
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE))
)
demo_data[sample(n, 20), "age"]    <- NA
demo_data[sample(n, 15), "weight"] <- NA
demo_data[sample(n, 10), "smoker"] <- NA
demo_data[sample(n, 10), "group"]  <- NA

# Run MISL, naming the learners for each variable type explicitly
misl_imp <- misl(
  demo_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("rand_forest", "multinom_reg")
)

# misl_imp is a list with one element per imputation.
# The first completed dataset is:
completed_data <- misl_imp[[1]]$datasets

# The matching trace, which can be plotted to inspect convergence:
trace <- misl_imp[[1]]$trace
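Once the m completed datasets are available, a common next step is to fit the same analysis model on each and pool the results with Rubin's rules. The sketch below uses only base R and a stand-in object (`fake_imp`, hypothetical) that mimics the list-of-`$datasets` shape shown above, so it runs without misl installed; with real output, `misl_imp` would take its place, assuming each `$datasets` element is a complete data frame.

```r
# Stand-in for the object returned by misl(): a list of m elements, each
# holding a completed data frame in `$datasets` (simulated toy data).
set.seed(1)
m <- 5
fake_imp <- lapply(seq_len(m), function(i) {
  x <- rnorm(100)
  list(datasets = data.frame(x = x, y = 2 * x + rnorm(100)))
})

# Fit the same analysis model on every completed dataset.
fits <- lapply(fake_imp, function(imp) lm(y ~ x, data = imp$datasets))

# Per-imputation slope estimates and their squared standard errors.
est        <- vapply(fits, function(f) coef(f)[["x"]], numeric(1))
var_within <- vapply(fits, function(f) vcov(f)["x", "x"], numeric(1))

# Rubin's rules: combine within- and between-imputation variability.
q_bar <- mean(est)                 # pooled point estimate
u_bar <- mean(var_within)          # average within-imputation variance
b     <- var(est)                  # between-imputation variance
t_var <- u_bar + (1 + 1 / m) * b   # total variance

pooled <- c(estimate = q_bar, se = sqrt(t_var))
print(pooled)
```

The pooled standard error is never smaller than the average within-imputation standard error, because the between-imputation term reflects the extra uncertainty due to the missing values.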

Imputation across the m datasets is parallelised via the future framework. To enable parallel execution, set a plan before calling misl():

library(future)
plan(multisession, workers = 4)

misl_imp <- misl(demo_data, m = 5, maxit = 5)

plan(sequential)  # reset when done

# View all available learners
list_learners()

# Filter by outcome type
list_learners("continuous")
list_learners("categorical")

# Show only installed learners
list_learners(installed_only = TRUE)

If you use misl in your research, please cite the original paper:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super
learning. Statistical Methods in Medical Research. 31(10):1904–1915.
doi: 10.1177/09622802221104238

BibTeX:

@article{carpenito2022misl,
  author  = {Carpenito, T and Manjourides, J},
  title   = {{MISL}: Multiple imputation by super learning},
  journal = {Statistical Methods in Medical Research},
  year    = {2022},
  volume  = {31},
  number  = {10},
  pages   = {1904--1915},
  doi     = {10.1177/09622802221104238}
}