Introduction to misl
In misl: Multiple Imputation by Super Learning

knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = TRUE,
  warning   = FALSE,
  message   = FALSE
)
library(misl)

Overview

misl implements Multiple Imputation by Super Learning, a flexible approach to handling missing data described in:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904--1915. doi: 10.1177/09622802221104238

Rather than relying on a single parametric imputation model, misl builds a stacked ensemble of machine learning algorithms for each incomplete column, producing well-calibrated imputations for continuous, binary, categorical, and ordered categorical variables.

Installation

# Install from GitHub
remotes::install_github("JustinManjourides/misl")

# Optional backend packages for additional learners
install.packages(c("ranger", "xgboost", "earth", "MASS"))

Simulated data

We simulate a small dataset with four types of incomplete variables to demonstrate misl across all supported outcome types.

library(misl)

set.seed(42)
n <- 300

sim_data <- data.frame(
  # Continuous predictors (always observed)
  age    = round(rnorm(n, mean = 50, sd = 12)),
  bmi    = round(rnorm(n, mean = 26, sd = 4), 1),
  # Continuous outcome with missingness
  sbp    = round(120 + 0.4 * rnorm(n, mean = 50, sd = 12) +
                       0.3 * rnorm(n, mean = 26, sd = 4) + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Unordered categorical outcome with missingness
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                         prob = c(0.4, 0.35, 0.25))),
  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels  = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"]    <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"]  <- NA
sim_data[sample(n, 30), "health"] <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))

Built-in learners

Use list_learners() to see the available named learners, optionally filtered by outcome type:

knitr::kable(list_learners())

knitr::kable(list_learners("ordinal"))

Basic usage

The primary function is misl(). Supply a dataset and specify:

m -- the number of multiply imputed datasets to create
maxit -- the number of Gibbs sampling iterations per dataset
con_method, bin_method, cat_method, ord_method -- learners for each outcome type

Running misl() with a single named learner per outcome type is the fastest option and is well suited for exploratory work. Note that ordered factors are automatically detected and routed to ord_method:

misl_imp <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)

misl() returns a list of length m. Each element contains:

$datasets -- the fully imputed tibble
$trace -- a long-format tibble of convergence statistics

# Number of imputed datasets
length(misl_imp)

# Confirm no missing values remain
anyNA(misl_imp[[1]]$datasets)

# Confirm ordered factor is preserved
is.ordered(misl_imp[[1]]$datasets$health)
levels(misl_imp[[1]]$datasets$health)

Custom learners via parsnip

In addition to the built-in named learners, misl v2.0 accepts any parsnip-compatible model spec directly. This allows you to use any learner available in the tidymodels ecosystem without waiting for it to be added to the built-in registry.

Passing a custom spec

Simply pass a parsnip model spec in place of (or alongside) a named string. misl will automatically enforce the correct mode (regression vs classification) regardless of what is set on the spec:

library(parsnip)

# A random forest with custom hyperparameters
custom_rf <- rand_forest(trees = 500, mtry = 3) |>
  set_engine("ranger")

misl_custom <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list(custom_rf),
  bin_method = list(custom_rf),
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)

Mixing named learners and custom specs

Named strings and parsnip specs can be freely mixed in the same list. When multiple learners are supplied, misl uses cross-validation to build a stacked ensemble:

library(parsnip)

# Mix a named learner with a custom tuned spec
custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |>
  set_engine("xgboost")

misl_mixed <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", custom_xgb),
  bin_method = list("glm", custom_xgb),
  cat_method = list("multinom_reg", "rand_forest"),
  ord_method = list("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Using a learner not in the built-in registry

Any parsnip-supported engine can be used. For example, a support vector machine via the kernlab package:

library(parsnip)

# SVM - not in the built-in registry but works via parsnip
svm_spec <- svm_rbf() |>
  set_engine("kernlab")

misl_svm <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", svm_spec),
  bin_method = list("glm", svm_spec),
  cat_method = "multinom_reg",
  ord_method = "polr",
  cv_folds   = 3,
  seed       = 42
)

Ordinal outcomes and the polr learner

For ordered categorical variables, misl automatically detects ordered factors and routes them to ord_method. The default learner is "polr" (proportional odds logistic regression from the MASS package), which respects the ordering of the levels:

# Ensure your ordered variable is an ordered factor
sim_data$health <- factor(sim_data$health,
  levels  = c("Poor", "Fair", "Good", "Excellent"),
  ordered = TRUE
)

misl_ordinal <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",   # proportional odds model for ordered categories
  seed       = 42,
  quiet      = TRUE
)

# Imputed values respect the ordering
is.ordered(misl_ordinal[[1]]$datasets$health)
levels(misl_ordinal[[1]]$datasets$health)

Multiple learners and stacking

When multiple learners are supplied (whether named strings, parsnip specs, or a mix), misl uses cross-validation to learn optimal ensemble weights via the stacks package. Use cv_folds to reduce the number of folds and speed up computation:

misl_stack <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Analysing the imputed datasets

After imputation, fit your analysis model to each of the m datasets and pool the results using Rubin's rules. Here we implement pooling manually using base R:

# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m       <- length(models)
ests    <- sapply(models, function(fit) coef(fit))
vars    <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar   <- rowMeans(ests)                          # pooled estimate
U_bar   <- rowMeans(vars)                          # within-imputation variance
B       <- apply(ests, 1, var)                     # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B                # total variance

pooled <- data.frame(
  term      = names(Q_bar),
  estimate  = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low  = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)

For a full implementation of Rubin's rules including degrees of freedom and p-values, the mice package provides pool() and can be used directly with misl output:

library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)

Convergence: trace plots

The plot_misl_trace() function plots the mean or standard deviation of imputed values across iterations for a given variable, with one line per imputed dataset. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns — they are not available for categorical or ordinal columns.

# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp", ylab = "Mean imputed sbp (mm Hg)")

# Plot the standard deviation instead
plot_misl_trace(misl_imp, variable = "sbp", statistic = "sd")

Stable traces that mix well across datasets indicate the algorithm has converged.

Parallelism

The m imputed datasets are generated independently and can be run in parallel using the future framework. Set a parallel plan before calling misl():

library(future)

# Use all available cores
plan(multisession)

misl_parallel <- misl(
  sim_data,
  m          = 10,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  seed       = 42
)

# Always reset the plan when done
plan(sequential)

To limit the number of cores:

plan(multisession, workers = 4)

The largest speedup comes from running the m datasets simultaneously. Check how many cores are available with:

parallel::detectCores()

Session info

sessionInfo()

Any scripts or data that you put into this service are public.

misl documentation built on April 8, 2026, 9:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

misl
Multiple Imputation by Super Learning

Introduction to misl
In misl: Multiple Imputation by Super Learning

Overview

Installation

Simulated data

Built-in learners

Basic usage

Custom learners via parsnip

Passing a custom spec

Mixing named learners and custom specs

Using a learner not in the built-in registry

Ordinal outcomes and the polr learner

Multiple learners and stacking

Analysing the imputed datasets

Convergence: trace plots

Parallelism

Session info

Try the misl package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

misl Multiple Imputation by Super Learning

Introduction to misl In misl: Multiple Imputation by Super Learning

Overview

Installation

Simulated data

Built-in learners

Basic usage

Custom learners via parsnip

Passing a custom spec

Mixing named learners and custom specs

Using a learner not in the built-in registry

Ordinal outcomes and the polr learner

Multiple learners and stacking

Analysing the imputed datasets

Convergence: trace plots

Parallelism

Session info

Try the misl package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

misl
Multiple Imputation by Super Learning

Introduction to misl
In misl: Multiple Imputation by Super Learning