```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
This vignette shows you how to create a `brew` object, a container for multiple imputed datasets and the primary focus of the `ipa` package.
ipa

The `ipa` package helps create multiple imputed datasets for prediction models. `ipa` offers classical imputation functions, such as nearest neighbor imputation, in addition to more scalable functions, e.g., the `softImpute` algorithm. `ipa` also allows you to assess imputed datasets in various ways:

- You can identify a single missing data strategy that most accurately imputes missing values or results in the most accurate prediction model.
- You can blend imputation strategies to eke out the best possible set(s) of imputed data for your analysis.
- You can combine imputation strategies, creating a framework for multiple imputation, which sometimes provides even more accurate prediction models than a well-chosen single imputation strategy.
True to its name, `ipa` has a workflow based around the process of brewing. The main steps involved are:

- `brew`: create a container to hold your imputations
- `spice`: set primary imputation parameters
- `mash`: set secondary imputation parameters
- `stir`: fit imputation model(s) to training data
- `ferment`: apply imputation models to testing data
- `bottle`: convert imputed values to imputed datasets (`tibble` or `matrix`)
- `sip`: evaluate the accuracy of imputed values for each variable, separately (requires observed data)
- `chug`: evaluate the accuracy of downstream prediction models
- `distill`: reduce multiple imputed datasets into one (based on accuracy from `sip` or `chug`)
- `flight`: create a set of multiply imputed data optimized for prediction accuracy (in development)
As you read this list, you may think 'wow - that sounds like a dumb idea for data analysis', and I get it - whimsical programming interfaces aren't for everyone. That is why `ipa` also provides a family of `impute_` functions that can do at least as much as the `brew` functions.
These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data originally described 1046 participants who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors among African Americans in central Virginia. A total of 403 participants were screened for diabetes, which was defined as glycosylated hemoglobin > 7.0. Among those 403 participants, 366 provided complete data for the variables in these data.
We'll work with the complete (`diab_complete`) and incomplete (`diab_missing`) versions of the `diabetes` data described above. `diab_missing` has about 40% missing values in each column except for diabetes status, which is complete. Missing values occur completely at random. We'll develop a prediction model to classify whether someone has prevalent type II diabetes based on clinical predictors.
```r
library(ipa)
library(dplyr)
library(purrr)
library(yardstick)

# data for this vignette
data("diabetes", package = 'ipa')

# will only use a subset of columns for simplicity
keep_cols <- c('diabetes', 'chol', 'hdl', 'age', 'gender', 'frame', 'waist')

# only a subset of rows have >= 2 observed values for these columns
keep_rows <- apply(
  X = diabetes$missing[, keep_cols],
  MARGIN = 1,
  FUN = function(x) sum(!is.na(x)) > 1
)

# data is a list of complete / incomplete data,
# so we need to apply changes with map
diabetes <- map(diabetes, ~ .x[keep_rows, keep_cols])

diab_complete <- diabetes$complete
diab_missing  <- diabetes$missing
```
A common task in prediction model development is handling missing values. `ipa` is designed to help analysts consider more options as they engage with this task. This vignette shows some of `ipa`'s features in the context of imputing missing values in the `diabetes` data. Here is a glimpse of the complete version of the data:
```r
glimpse(diab_complete)
```
And here is the amputed data, missing roughly 40% of the original values.
```r
glimpse(diab_missing)
```
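As a quick check (ours, not part of the original vignette), we can verify the per-column missingness rate; every predictor should be near 40%, while the outcome should be fully observed:

```r
# proportion of missing values in each column of the incomplete data
sapply(diab_missing, function(x) mean(is.na(x)))
```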
We'll work with a training set and testing set in this example:
```r
# random seed for reproducing these results
set.seed(730)

train_index <- sample(nrow(diab_missing), 200)

training <- diab_missing[train_index, ]
testing  <- diab_missing[-train_index, ]
```
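Since everything downstream assumes the outcome column is complete, a small assertion (our addition) can guard against surprises before brewing begins:

```r
# stop early if the outcome has missing values in either split
stopifnot(!anyNA(training$diabetes), !anyNA(testing$diabetes))
```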
`ipa_brew` objects are created using `brew_<flavor>` functions. These objects are lists containing:

- `data`: training data with missing values
- `pars`: tuning parameters for imputation
- `lims`: upper and lower bounds for `pars`
- `wort`: a container for imputed data (initially empty)

Here we use `brew_nbrs` to make a brew that will use k-nearest neighbors to impute missing values.
```r
brew_init <- brew_nbrs(data = training, outcome = diabetes)
```
Printing your `brew` (at this stage) will show the original data with missing values, along with some additional information about the brew's flavor.

```r
brew_init
```
It is often helpful (especially when starting to use `ipa`) to receive ongoing messages that tell you what is going on with your brew. At any point in the brewing process, you can set the verbosity of your brew using `verbose_on()` and `verbose_off()`. For example:

```r
brew_init <- verbose_on(brew_init, level = 1)
```
`spice` lets you control the primary parameters for imputing data with your brew. By primary parameters, we mean the parameters that govern how many imputed datasets are created. For nearest neighbor imputation, the primary tuning parameter is the number of nearest neighbors.
The `spice` function has an argument called `with` that depends on your brew's flavor. The general idea is to use `with = spicer_<flavor>` for a `<flavor>` brew, so we use `spicer_nbrs` for our `nbrs` brew. The advantage of using `spicer` functions is that, if you use RStudio, you can hit the tab key and automatically see a list of the relevant parameters for your brew.
For brewing veterans, `spice` also has a `...` input that lets you specify parameters directly instead of using `with = spicer_nbrs()`. This lets you write more concise code if you already know which parameters go with your brew's flavor and don't need any help remembering them. To clarify this, we use the `spice` function both ways below.
```r
# how to spice if you want help remembering what parameters to use
spcd_brew <- spice(brew_init, with = spicer_nbrs(k_neighbors = 1:50))

# a more concise spice for those who don't need auto-completion
spcd_brew <- spice(brew_init, k_neighbors = 1:50)

# take a look at the brew's parameters (what we added using spice)
spcd_brew$pars
```
`mash` allows you to set secondary parameters for imputing data with your brew. By secondary, we mean parameters that do not impact the number of imputed datasets created.
For a nearest neighbors brew, `mash` allows us to set the functions that will be used to aggregate continuous, integer, or categorical values from the $k$ nearest neighbors. Just like `spice`, we can use `masher_<flavor>` for a `<flavor>` brew, or we can write input arguments directly using `...`.
```r
# how to mash if you want help remembering what parameters to use
mshd_brew <- mash(spcd_brew, with = masher_nbrs(fun_aggr_ctns = median))

# a more concise mashing call if you don't need help remembering parameters
mshd_brew <- mash(spcd_brew, fun_aggr_ctns = median)
```
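If you also want to control how categorical values are aggregated, `masher_nbrs` should accept an analogous argument; the name `fun_aggr_catg` below is our assumption, so check `?masher_nbrs` before relying on it:

```r
# fun_aggr_catg is an assumed argument name (verify with ?masher_nbrs);
# here, categorical values are aggregated by majority vote among neighbors
mshd_brew_catg <- mash(
  spcd_brew,
  fun_aggr_ctns = median,
  fun_aggr_catg = function(x) names(which.max(table(x)))
)
```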
Once the parameters are set, it's time to fit imputation models with `stir`. This step will probably require more computation time than the other steps in the `brew` workflow. The `timer` argument can be set to `TRUE` in case you'd like to monitor the time taken to fit the models used for imputation.
```r
stirred_brew <- stir(mshd_brew, timer = TRUE)
stirred_brew
```
The printed output of the `brew` has changed. Originally, printing our `brew` showed the data we wanted to impute, but now printing the brew object shows a `data.table` containing an id column (`impute`), imputation parameters (`k_neighbors`), and imputed values for the training data (`iv_training`).
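Since the `wort` is just one element of the brew list, you can also pull it out and inspect it directly (a quick aside, not in the original vignette):

```r
# the wort is a data.table with one row per imputation;
# iv_training appears to be a list column, so (assuming rows are ordered
# by k_neighbors) row 10 should hold the 10-nearest-neighbors imputations
stirred_brew$wort
stirred_brew$wort$iv_training[[10]]
```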
Notably, the `brew`'s structure has not changed (it is still a list with four components), but `print.ipa_brew()` now shows `brew$wort` instead of `brew$data`. The `print` method for `ipa_brew` objects is organized this way because focus normally shifts to the imputed data after running imputation.
`ferment` imputes missing values in testing data and has only three arguments, making it very easy to use:

- `brew`: an `ipa_brew` object
- `data_new`: testing data with missing values
- `timer`: a logical value indicating whether to monitor the time needed to impute missing values in `data_new`
```r
ferm_brew <- ferment(stirred_brew, data_new = testing, timer = TRUE)

# the brew's wort now has an additional column, iv_testing,
# which contains imputed values for testing data
ferm_brew
```
Under the hood, `ferment` does a lot of opinionated work based on the following principles:

Missing data strategies should use the training data to impute missing values in the testing data, and should never use the testing data to impute the training data. `ipa` applies two methods related to this principle:
Impute with fit: Models developed from the training data are applied to create imputations for missing values in the testing data. This is the strategy applied by nearest neighbor imputation. Some imputation strategies cannot do this by design (e.g., `softImpute`), and `ipa` applies the 'stack and fit' method for these, as sketched below.
Stack and fit: Training data are imputed using only the training data. To impute testing data, the unimputed training and testing data are stacked. The same imputation models that were used to impute training data are re-fitted to the stacked dataset, and testing data are imputed based on these models. This strategy may require a lot more computing time than 'impute with fit', and is therefore only applied when needed.
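To make 'stack and fit' concrete, here is a schematic sketch using the `softImpute` package directly. This illustrates the idea rather than `ipa`'s internal code, and the column subset, rank, and lambda values are arbitrary choices:

```r
library(softImpute)

# numeric predictors with missing values (illustrative subset)
numeric_cols <- c('chol', 'hdl', 'age', 'waist')
x_train <- as.matrix(training[, numeric_cols])
x_test  <- as.matrix(testing[, numeric_cols])

# softImpute cannot 'impute with fit', so stack training and testing rows
x_stack <- rbind(x_train, x_test)

# re-fit the same imputation model to the stacked data
fit <- softImpute(x_stack, rank.max = 2, lambda = 1)

# complete the stacked matrix, then keep only the testing rows
x_complete <- complete(x_stack, fit)
x_test_imputed <- x_complete[-seq_len(nrow(x_train)), ]
```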
Outcome columns should not be used to impute missing values in training data. In statistical inference, it is widely known that outcome columns should be used to impute missing data in predictors, but the circumstances are different for prediction models. The difference is that outcome columns will naturally be missing in testing data from the real world. By 'real world', I mean settings where the outcome hasn't happened yet, e.g., predicting a patient's 10-year risk for having a stroke. A missing data strategy that depends on accessing outcome data will not be viable here.
The same modeling strategy that was used to impute training data should also be used to impute testing data. In other words, we don't recommend imputing the training data with `softImpute` and then imputing the testing data with k-nearest neighbors.
What if you don't agree with these principles? It is hard to break from them using functions in the `brew` family. However, the more flexible `ipa` functions (e.g., `impute_nbrs`, `impute_soft`) can break these principles if you force them.
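For example, a direct call might look like the sketch below. The argument names (`data_ref`, `data_new`, `k_neighbor`) are our guesses at the interface, so consult `?impute_nbrs` for the actual signature:

```r
# hypothetical sketch -- argument names are assumptions (see ?impute_nbrs)
direct_imputes <- impute_nbrs(
  data_ref = training,  # reference data used to find neighbors
  data_new = testing,   # data to impute
  k_neighbor = 1:50     # same neighbor grid used in the brew above
)
```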
After imputation, your brew is ready to be bottled. The `bottle` function will:

- replace the lists of imputed values with lists of imputed datasets in the `wort` of the brew,
- bind the outcome column(s) to these imputed datasets, and
- format the imputed datasets as `tibble`s or matrices.
```r
imputes <- bottle(ferm_brew, type = 'tibble')
imputes
```
The `brew` functions in `ipa` are designed to be used with the `%>%` operator. A typical workflow for creating a brew looks like this:
```r
imputes <- training %>%                # start with training data
  brew_nbrs(outcome = diabetes) %>%    # initialize brew; separate outcomes
  spice(k_neighbors = 1:50) %>%        # set primary tuning parameters
  mash(fun_aggr_ctns = median) %>%     # set secondary tuning parameters
  stir(timer = TRUE) %>%               # impute training data
  ferment(data_new = testing) %>%      # impute testing data
  bottle(type = 'tibble')              # turn imputed values into datasets
```
We now have a total of 100 imputed data sets: 50 training sets and 50 testing sets using 1, 2, 3, ..., 50 neighbors to impute missing values.
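Before moving on, here is a rough sketch (ours, not the vignette's) of the kind of downstream assessment these imputed datasets enable: fit a logistic model to each imputed training set and score it on the paired testing set. The sketch assumes the bottled wort has list columns named `training` and `testing`, and that `diabetes` is a factor whose second level indicates a case; verify both against the printed output above.

```r
# assumed structure: imputes$wort$training / imputes$wort$testing are
# list columns of imputed datasets (verify by printing imputes)
model_aucs <- map2_dbl(
  imputes$wort$training,
  imputes$wort$testing,
  function(trn, tst) {
    fit <- glm(diabetes ~ ., data = trn, family = binomial)
    prd <- predict(fit, newdata = tst, type = 'response')
    # predicted probabilities refer to the second factor level
    roc_auc_vec(truth = tst$diabetes, estimate = prd, event_level = 'second')
  }
)
```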
In the next vignette, we will talk about `sip` and `chug`, which will help you assess your imputation strategies.