
    knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = FALSE)

This vignette accompanies the deprecation of rMIDAS. Existing projects can keep using rMIDAS, but new development should move to rMIDAS2. The source repository for the successor package is https://github.com/MIDASverse/rMIDAS2.
rMIDAS2 is the successor to rMIDAS. It re-implements the MIDAS multiple imputation algorithm with several improvements:
| | rMIDAS | rMIDAS2 |
|---|---|---|
| Backend | TensorFlow (Python, via reticulate) | PyTorch (Python, via local HTTP API) |
| Runtime R dependency on reticulate | Yes | No |
| Preprocessing | Manual (convert()) | Automatic |
| Python versions | 3.6--3.10 | 3.9+ |
| TensorFlow required | Yes (< 2.12) | No |
The API is deliberately simpler: most pipelines that required four function calls in rMIDAS need just one or two in rMIDAS2.

    # Remove rMIDAS (optional -- it can coexist)
    # remove.packages("rMIDAS")

    # Install rMIDAS2
    install.packages("rMIDAS2")

    # One-time Python backend setup
    library(rMIDAS2)
    install_backend()

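In scripts that run repeatedly, it can help to guard the install step so that it only runs when the package is missing. The following is a minimal sketch using base R's `requireNamespace()`; the helper name `needs_install()` is hypothetical and not part of either package.

```r
# Hypothetical helper: TRUE when `pkg` cannot be loaded from the
# current library paths, FALSE otherwise.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

# Usage (commented out so that sourcing this file never installs anything):
# if (needs_install("rMIDAS2")) install.packages("rMIDAS2")
```

The guard only covers the R package; `install_backend()` still needs to be run once after the first install.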
rMIDAS required configuring a reticulate Python environment with
TensorFlow:

    # --- rMIDAS ---
    library(rMIDAS)
    # Python environment configured automatically on first load,
    # or manually via set_python_env()

rMIDAS2 uses a standalone Python server -- no reticulate needed at runtime:

    # --- rMIDAS2 ---
    library(rMIDAS2)
    install_backend()  # one-time setup
    # The server starts automatically when you call any imputation function

rMIDAS required explicit preprocessing with convert(), where you
had to specify which columns were binary and which were categorical:

    # --- rMIDAS ---
    data(adult)
    adult_conv <- convert(adult,
                          bin_cols = c("income"),
                          cat_cols = c("workclass", "marital_status"),
                          minmax_scale = TRUE)

rMIDAS2 detects column types automatically -- just pass your data frame directly:

    # --- rMIDAS2 ---
    # No convert() step needed. Pass raw data to midas() or midas_fit().

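When migrating, it can be useful to check that the columns you used to list in `bin_cols` and `cat_cols` look the way you expect before handing over the raw data frame. The base R sketch below classifies columns with a simple heuristic. It is only an illustration and is not rMIDAS2's detection logic; `guess_col_types()` and the toy data are hypothetical.

```r
# Hypothetical helper, base R only. Heuristic:
#   exactly 2 distinct non-missing values -> "binary"
#   any other numeric column              -> "numeric"
#   everything else (character, factor)   -> "categorical"
guess_col_types <- function(df) {
  vapply(df, function(col) {
    n_unique <- length(unique(col[!is.na(col)]))
    if (n_unique == 2L) {
      "binary"
    } else if (is.numeric(col)) {
      "numeric"
    } else {
      "categorical"
    }
  }, character(1))
}

# Toy data with missing values (made-up rows, not the real adult data).
toy <- data.frame(
  age       = c(39, 50, NA, 53),
  income    = c("<=50K", ">50K", "<=50K", NA),
  workclass = c("Private", "State-gov", "Self-emp", "Private"),
  stringsAsFactors = FALSE
)

guess_col_types(toy)
```

Comparing this output with the arguments of an old `convert()` call is a quick way to spot columns whose coding may need attention, such as a numeric column that holds category codes.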
To train a model, rMIDAS used train():

    # --- rMIDAS ---
    mid <- train(adult_conv,
                 training_epochs = 20L,
                 layer_structure = c(256, 256, 256),
                 input_drop = 0.8,
                 learn_rate = 0.0004,
                 seed = 89L)

rMIDAS2 uses midas_fit() (or the all-in-one midas()):

    # --- rMIDAS2 ---
    fit <- midas_fit(adult,
                     epochs = 20L,
                     hidden_layers = c(256L, 128L, 64L),
                     corrupt_rate = 0.8,
                     lr = 0.001,
                     seed = 89L)

Parameter name changes:
| rMIDAS (train()) | rMIDAS2 (midas_fit()) | Notes |
|---|---|---|
| training_epochs | epochs | |
| layer_structure | hidden_layers | Default changed from 256-256-256 to 256-128-64 |
| input_drop | corrupt_rate | |
| learn_rate | lr | Default changed from 0.0004 to 0.001 |
| dropout_level | dropout_prob | |
| train_batch | batch_size | Default changed from 16 to 64 |
| cont_adj | num_adj | |
| softmax_adj | cat_adj | |
| binary_adj | bin_adj | |
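If existing scripts keep their `train()` settings in a named list, the renames in the table can be applied mechanically. The sketch below encodes the table as a lookup vector; `train_to_fit` and `rename_train_args()` are hypothetical names, not part of rMIDAS2.

```r
# Old train() argument names mapped to their midas_fit() equivalents,
# copied from the table above.
train_to_fit <- c(
  training_epochs = "epochs",
  layer_structure = "hidden_layers",
  input_drop      = "corrupt_rate",
  learn_rate      = "lr",
  dropout_level   = "dropout_prob",
  train_batch     = "batch_size",
  cont_adj        = "num_adj",
  softmax_adj     = "cat_adj",
  binary_adj      = "bin_adj"
)

# Hypothetical helper: rename the entries of a named argument list.
# Names that are not in the table (e.g. seed) pass through unchanged.
rename_train_args <- function(args) {
  old <- names(args)
  hit <- old %in% names(train_to_fit)
  names(args)[hit] <- unname(train_to_fit[old[hit]])
  args
}

old_args <- list(training_epochs = 20L,
                 layer_structure = c(256, 256, 256),
                 learn_rate = 0.0004,
                 seed = 89L)
new_args <- rename_train_args(old_args)
names(new_args)
```

The renamed list could then be passed on with something like `do.call(midas_fit, c(list(adult), new_args))`, assuming the values themselves are accepted unchanged. Carrying the old values over explicitly, as here, also keeps the rMIDAS layer structure and learning rate instead of picking up the new defaults.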
To generate imputations, rMIDAS used complete():

    # --- rMIDAS ---
    imps <- complete(mid, m = 10)
    # Returns a list of 10 data.frames
    head(imps[[1]])

rMIDAS2 uses midas_transform():

    # --- rMIDAS2 ---
    imps <- midas_transform(fit, m = 10)
    # Returns a list of 10 data.frames
    head(imps[[1]])

Or skip midas_fit() + midas_transform() entirely and use the
all-in-one midas():

    # --- rMIDAS2 (all-in-one) ---
    result <- midas(adult, m = 10, epochs = 20)
    head(result$imputations[[1]])

The combine() interface has changed. rMIDAS took a formula and a list of completed data frames:

    # --- rMIDAS ---
    combine("income ~ age + hours_per_week", imps)

rMIDAS2 takes a model ID and an outcome variable name. Independent variables default to all other columns:

    # --- rMIDAS2 ---
    combine(fit, y = "income")

    # Specify predictors explicitly:
    combine(fit, y = "income", ind_vars = c("age", "hours_per_week"))

The output format is the same: a data frame with columns term,
estimate, std.error, statistic, df, and p.value.
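To show how a table with these columns can arise, the sketch below pools a linear model over a list of completed data sets using the textbook form of Rubin's rules, then derives 95% confidence intervals from the pooled columns. It uses base R only and toy data with crude stand-in "imputations". It is not the implementation of combine() in either package, and the exact degrees-of-freedom formula used there may differ.

```r
set.seed(1)
m <- 5
n <- 100

# Toy data: pretend 20 values of y were missing and were filled in
# m times with crude random draws (a stand-in for real imputations).
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
miss <- sample(n, 20)
imps <- lapply(seq_len(m), function(i) {
  y_i <- y
  y_i[miss] <- 1 + 2 * x[miss] + rnorm(length(miss))
  data.frame(x = x, y = y_i)
})

# Fit the same model on each completed data set.
fits <- lapply(imps, function(d) lm(y ~ x, data = d))
est  <- sapply(fits, coef)                        # terms x m matrix
vars <- sapply(fits, function(f) diag(vcov(f)))   # squared standard errors

# Rubin's rules.
q_bar <- rowMeans(est)             # pooled estimate
w     <- rowMeans(vars)            # within-imputation variance
b     <- apply(est, 1, var)        # between-imputation variance
t_var <- w + (1 + 1 / m) * b       # total variance
df    <- (m - 1) * (1 + w / ((1 + 1 / m) * b))^2
stat  <- q_bar / sqrt(t_var)

pooled <- data.frame(
  term      = names(q_bar),
  estimate  = unname(q_bar),
  std.error = sqrt(unname(t_var)),
  statistic = unname(stat),
  df        = unname(df),
  p.value   = unname(2 * pt(abs(stat), df = df, lower.tail = FALSE))
)

# 95% confidence intervals from the pooled columns.
crit <- qt(0.975, df = pooled$df)
pooled$conf.low  <- pooled$estimate - crit * pooled$std.error
pooled$conf.high <- pooled$estimate + crit * pooled$std.error
pooled
```

The last three lines work on any data frame that has `estimate`, `std.error`, and `df` columns, so they should carry over to a combine() result with the column names listed above.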
For overimputation, rMIDAS required re-specifying the data and column types:

    # --- rMIDAS ---
    overimpute(adult,
               binary_columns = c("income"),
               softmax_columns = c("workclass", "marital_status"),
               training_epochs = 20L,
               spikein = 0.3)

rMIDAS2 runs overimputation on an already-fitted model:

    # --- rMIDAS2 ---
    diag <- overimpute(fit, mask_frac = 0.1)
    diag$mean_rmse
    diag$rmse  # per-column RMSE

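The idea behind overimputation is to hide a fraction of the observed values, impute them, and compare the imputations with the truth. The base R sketch below illustrates this on toy numeric data, with column-mean imputation standing in for the fitted model. The function names are hypothetical, and the result fields `rmse` and `mean_rmse` only mirror the names shown above; how rMIDAS2 computes them internally is an assumption here.

```r
set.seed(42)

# Toy numeric data (hypothetical; not the adult data set).
dat <- data.frame(a = rnorm(200, mean = 10), b = runif(200))

# Stand-in imputer: fill each column's NAs with the column mean.
# A real run would use the fitted MIDAS model here instead.
impute_mean <- function(df) {
  for (j in names(df)) {
    df[[j]][is.na(df[[j]])] <- mean(df[[j]], na.rm = TRUE)
  }
  df
}

overimpute_sketch <- function(df, mask_frac = 0.1, imputer = impute_mean) {
  masked <- df
  held_out <- list()
  for (j in names(df)) {
    obs <- which(!is.na(df[[j]]))
    idx <- sample(obs, size = max(1L, floor(mask_frac * length(obs))))
    masked[[j]][idx] <- NA      # hide values whose truth is known
    held_out[[j]] <- idx
  }
  filled <- imputer(masked)
  # Per-column RMSE, computed on the held-out cells only.
  rmse <- vapply(names(df), function(j) {
    idx <- held_out[[j]]
    sqrt(mean((filled[[j]][idx] - df[[j]][idx])^2))
  }, numeric(1))
  list(rmse = rmse, mean_rmse = mean(rmse))
}

res <- overimpute_sketch(dat, mask_frac = 0.1)
res$rmse
res$mean_rmse
```

Because RMSE is scale-dependent, per-column values are easier to compare when the columns are on similar scales.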
rMIDAS2 adds imp_mean(), which computes the element-wise mean
across all imputations -- useful as a quick single point estimate:

    # --- rMIDAS2 only ---
    mean_df <- imp_mean(fit)
    head(mean_df)

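To make "element-wise mean" concrete, the base R sketch below averages a toy list of numeric data frames cell by cell. The values are made up, and the sketch covers numeric columns only; how imp_mean() treats categorical columns is not covered here.

```r
# Toy list of m = 3 "imputed" data frames with identical shape
# (hypothetical values).
imps <- list(
  data.frame(age = c(30, 41), hours = c(40, 38)),
  data.frame(age = c(30, 43), hours = c(40, 36)),
  data.frame(age = c(30, 45), hours = c(40, 40))
)

# Element-wise mean: add the data frames cell by cell, then divide by m.
mean_df <- Reduce(`+`, imps) / length(imps)
mean_df
#>   age hours
#> 1  30    40
#> 2  43    38
```

Averaging discards the variation between imputations, so the result suits quick inspection rather than inference.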
rMIDAS2 runs a background Python server that should be stopped when you are done:

    # --- rMIDAS2 ---
    stop_server()

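In longer scripts, an error partway through can skip the shutdown call. One way to guard against that is base R's `on.exit()`, sketched below with stand-in functions so that it runs without a server. The wrapper name `with_cleanup()` is hypothetical.

```r
# Hypothetical wrapper: run `work()` and always call `cleanup()` afterwards,
# even if `work()` throws an error.
with_cleanup <- function(work, cleanup) {
  on.exit(cleanup(), add = TRUE)
  work()
}

# Demonstration with stand-in functions (no server involved).
stopped <- FALSE
result <- tryCatch(
  with_cleanup(
    work    = function() stop("imputation failed"),
    cleanup = function() stopped <<- TRUE
  ),
  error = function(e) conditionMessage(e)
)
stopped  # TRUE: cleanup ran despite the error
```

With rMIDAS2 loaded, the same pattern would look like `with_cleanup(function() midas(adult, m = 5), stop_server)`, using only functions shown in this vignette.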
Below is a full rMIDAS pipeline and its rMIDAS2 equivalent.

    # --- rMIDAS ---
    library(rMIDAS)
    data(adult)
    adult <- adult[1:1000, ]

    # 1. Preprocess
    adult_conv <- convert(adult,
                          bin_cols = c("income"),
                          cat_cols = c("workclass", "marital_status"),
                          minmax_scale = TRUE)

    # 2. Train
    mid <- train(adult_conv, training_epochs = 20L, seed = 89L)

    # 3. Generate imputations
    imps <- complete(mid, m = 5)

    # 4. Analyse
    combine("income ~ age + hours_per_week", imps)


    # --- rMIDAS2 ---
    library(rMIDAS2)
    data(adult)
    adult <- adult[1:1000, ]

    # 1. Fit and impute (no preprocessing needed)
    result <- midas(adult, m = 5, epochs = 20, seed = 89L)

    # 2. Analyse
    combine(result, y = "income", ind_vars = c("age", "hours_per_week"))

    # 3. Clean up
    stop_server()

| Task | rMIDAS | rMIDAS2 |
|---|---|---|
| Install Python env | Automatic / set_python_env() | install_backend() |
| Preprocess data | convert(data, bin_cols, cat_cols) | Not needed |
| Train model | train(data, training_epochs, ...) | midas_fit(data, epochs, ...) |
| Generate imputations | complete(model, m) | midas_transform(model, m) |
| Train + impute (one step) | Not available | midas(data, m, epochs, ...) |
| Mean imputation | Not available | imp_mean(model) |
| Rubin's rules | combine(formula, df_list) | combine(model, y, ind_vars) |
| Overimputation | overimpute(data, ...) | overimpute(model, mask_frac) |
| Shutdown | Not needed | stop_server() |