Introduction to misl

knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = TRUE,
  warning   = FALSE,
  message   = FALSE
)
library(misl)

Overview

misl implements Multiple Imputation by Super Learning, a flexible approach to handling missing data described in:

Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super learning. Statistical Methods in Medical Research. 31(10):1904--1915. doi: 10.1177/09622802221104238

Rather than relying on a single parametric imputation model, misl builds a stacked ensemble of machine learning algorithms for each incomplete column, producing well-calibrated imputations for continuous, binary, categorical, and ordered categorical variables.

Installation

# Install from GitHub
remotes::install_github("JustinManjourides/misl")

# Optional backend packages for additional learners
install.packages(c("ranger", "xgboost", "earth", "MASS"))

Simulated data

We simulate a small dataset with four types of incomplete variables to demonstrate misl across all supported outcome types.

library(misl)

set.seed(42)
n <- 300

# Draw the always-observed predictors first so the outcome can depend on them
age <- round(rnorm(n, mean = 50, sd = 12))
bmi <- round(rnorm(n, mean = 26, sd = 4), 1)

sim_data <- data.frame(
  # Continuous predictors (always observed)
  age    = age,
  bmi    = bmi,
  # Continuous outcome with missingness, dependent on age and bmi
  sbp    = round(120 + 0.4 * age + 0.3 * bmi + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Unordered categorical outcome with missingness
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                         prob = c(0.4, 0.35, 0.25))),
  # Ordered categorical outcome with missingness
  health = factor(sample(c("Poor", "Fair", "Good", "Excellent"), n,
                         replace = TRUE, prob = c(0.1, 0.2, 0.5, 0.2)),
                  levels  = c("Poor", "Fair", "Good", "Excellent"),
                  ordered = TRUE)
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"]    <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"]  <- NA
sim_data[sample(n, 30), "health"] <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
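Missingness as a proportion of n is often easier to read than raw counts; colMeans() over the logical NA matrix gives this directly (plain base R, not part of the misl API):

```r
# Proportion of missing values per column
round(colMeans(is.na(sim_data)), 2)
```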

Built-in learners

Use list_learners() to see the available named learners, optionally filtered by outcome type:

knitr::kable(list_learners())
knitr::kable(list_learners("ordinal"))

Basic usage

The primary function is misl(). Supply a dataset along with the number of imputed datasets (m), the number of iterations per dataset (maxit), and a learner for each outcome type (con_method, bin_method, cat_method, ord_method).

Running misl() with a single named learner per outcome type is the fastest option and is well suited to exploratory work. Ordered factors are automatically detected and routed to ord_method:

misl_imp <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)

misl() returns a list of length m; each element holds, among other components, a completed dataset in its datasets component:

# Number of imputed datasets
length(misl_imp)

# Confirm no missing values remain
anyNA(misl_imp[[1]]$datasets)

# Confirm ordered factor is preserved
is.ordered(misl_imp[[1]]$datasets$health)
levels(misl_imp[[1]]$datasets$health)
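Because each element carries its completed data in $datasets, the m datasets can be stacked into one long data frame with an imputation index, which is convenient for grouped summaries (a base-R sketch; the .imp column name is just a convention borrowed from mice, not something misl requires):

```r
# Stack the m completed datasets, tagging each row with its imputation number
completed <- do.call(rbind, lapply(seq_along(misl_imp), function(i) {
  cbind(.imp = i, misl_imp[[i]]$datasets)
}))

# One row per original observation per imputation
dim(completed)
table(completed$.imp)
```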

Custom learners via parsnip

In addition to the built-in named learners, misl v2.0 accepts any parsnip-compatible model spec directly. This allows you to use any learner available in the tidymodels ecosystem without waiting for it to be added to the built-in registry.

Passing a custom spec

Simply pass a parsnip model spec in place of (or alongside) a named string. misl will automatically enforce the correct mode (regression vs classification) regardless of what is set on the spec:

library(parsnip)

# A random forest with custom hyperparameters
custom_rf <- rand_forest(trees = 500, mtry = 3) |>
  set_engine("ranger")

misl_custom <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list(custom_rf),
  bin_method = list(custom_rf),
  cat_method = "multinom_reg",
  ord_method = "polr",
  seed       = 42,
  quiet      = TRUE
)
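The mode handling described above builds on parsnip's standard mode machinery; outside of misl you can inspect and set the mode yourself (these are ordinary parsnip functions, shown only to illustrate what "enforcing the mode" means):

```r
library(parsnip)

spec <- rand_forest(trees = 500) |> set_engine("ranger")
spec$mode                          # "unknown" until a mode is chosen

# misl derives the mode from the outcome type; done by hand it would be:
reg_spec <- set_mode(spec, "regression")
cls_spec <- set_mode(spec, "classification")
c(reg_spec$mode, cls_spec$mode)
```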

Mixing named learners and custom specs

Named strings and parsnip specs can be freely mixed in the same list. When multiple learners are supplied, misl uses cross-validation to build a stacked ensemble:

library(parsnip)

# Mix a named learner with a custom tuned spec
custom_xgb <- boost_tree(trees = 200, learn_rate = 0.05) |>
  set_engine("xgboost")

misl_mixed <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", custom_xgb),
  bin_method = list("glm", custom_xgb),
  cat_method = list("multinom_reg", "rand_forest"),
  ord_method = list("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Using a learner not in the built-in registry

Any parsnip-supported engine can be used. For example, a support vector machine via the kernlab package:

library(parsnip)

# SVM - not in the built-in registry but works via parsnip
svm_spec <- svm_rbf() |>
  set_engine("kernlab")

misl_svm <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = list("glm", svm_spec),
  bin_method = list("glm", svm_spec),
  cat_method = "multinom_reg",
  ord_method = "polr",
  cv_folds   = 3,
  seed       = 42
)

Ordinal outcomes and the polr learner

For ordered categorical variables, misl automatically detects ordered factors and routes them to ord_method. The default learner is "polr" (proportional odds logistic regression from the MASS package), which respects the ordering of the levels:

# Ensure your ordered variable is an ordered factor
sim_data$health <- factor(sim_data$health,
  levels  = c("Poor", "Fair", "Good", "Excellent"),
  ordered = TRUE
)

misl_ordinal <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  ord_method = "polr",   # proportional odds model for ordered categories
  seed       = 42,
  quiet      = TRUE
)

# Imputed values respect the ordering
is.ordered(misl_ordinal[[1]]$datasets$health)
levels(misl_ordinal[[1]]$datasets$health)
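To see the kind of model the "polr" learner fits, you can run proportional-odds regression directly with MASS::polr() on the complete cases — a standalone illustration, not misl's internal code:

```r
library(MASS)

# Proportional-odds fit on the complete cases of the simulated data
fit <- polr(health ~ age + bmi, data = na.omit(sim_data), Hess = TRUE)

# Predicted class probabilities sum to 1 and follow the level ordering
head(predict(fit, type = "probs"))
```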

Multiple learners and stacking

When multiple learners are supplied (whether named strings, parsnip specs, or a mix), misl uses cross-validation to learn optimal ensemble weights via the stacks package. Use cv_folds to reduce the number of folds and speed up computation:

misl_stack <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)

Analysing the imputed datasets

After imputation, fit your analysis model to each of the m datasets and pool the results using Rubin's rules. Here we implement pooling manually using base R:

# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group + health, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m       <- length(models)
ests    <- sapply(models, function(fit) coef(fit))
vars    <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar   <- rowMeans(ests)                          # pooled estimate
U_bar   <- rowMeans(vars)                          # within-imputation variance
B       <- apply(ests, 1, var)                     # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B                # total variance

pooled <- data.frame(
  term      = names(Q_bar),
  estimate  = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low  = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)
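The same pooled quantities yield the standard multiple-imputation diagnostics — the relative increase in variance due to nonresponse, the fraction of missing information, and Rubin's (1987) per-coefficient degrees of freedom (the formulas are standard; variable names continue the chunk above):

```r
# Relative increase in variance due to nonresponse
riv <- (1 + 1 / m) * B / U_bar
# Fraction of missing information (large-m approximation)
fmi <- riv / (1 + riv)
# Rubin's (1987) degrees of freedom for each coefficient
df  <- (m - 1) * (1 + 1 / riv)^2

round(data.frame(riv = riv, fmi = fmi, df = df), 3)
```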

For a full implementation of Rubin's rules including degrees of freedom and p-values, the mice package provides pool() and can be used directly with misl output:

library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)

Convergence: trace plots

The plot_misl_trace() function plots the mean or standard deviation of imputed values across iterations for a given variable, with one line per imputed dataset. Stable traces that mix well across datasets indicate convergence. Note that trace statistics are only computed for continuous and numeric binary columns — they are not available for categorical or ordinal columns.

# Plot mean of imputed sbp values across iterations for each dataset
plot_misl_trace(misl_imp, variable = "sbp", ylab = "Mean imputed sbp (mm Hg)")
# Plot the standard deviation instead
plot_misl_trace(misl_imp, variable = "sbp", statistic = "sd")

Parallelism

The m imputed datasets are generated independently and can be run in parallel using the future framework. Set a parallel plan before calling misl():

library(future)

# Use all available cores
plan(multisession)

misl_parallel <- misl(
  sim_data,
  m          = 10,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  ord_method = c("polr", "rand_forest"),
  seed       = 42
)

# Always reset the plan when done
plan(sequential)

To limit the number of cores:

plan(multisession, workers = 4)

The largest speedup comes from running the m datasets simultaneously. Check how many cores are available with:

parallel::detectCores()

Session info

sessionInfo()



misl documentation built on April 8, 2026, 9:07 a.m.