knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%" )
ipa
)The goal of ipa
is to make imputation in predictive modeling workflows more straightforward and efficient. The main functions in ipa
are
brew
Create a container to hold your imputations
spice
(optional) set parameters that govern the number of imputations for the given brew
mash
fit models that will provide imputations
ferment
impute missing values in training and (optionally) testing data
bottle
output the imputed datasets in a tibble.
You can install the development version of ipa
from GitHub with:
# install.packages("devtools") devtools::install_github("bcjaeger/ipa")
These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine. The data originally described 1046 participants who were interviewed in a study to understand the prevalence of obesity, diabetes, and other cardiovascular risk factors in central Virginia for African Americans. A total of 403 participants were screened for diabetes, which was defined as glycosolated hemoglobin > 7.0. Among those 403 participants, 366 provided complete data for the variables in these data.
First, we'll load some packages and set a seed for reproducibility
library(tidymodels) library(tidyverse) library(ipa) data("diabetes", package = 'ipa') diab_complete <- diabetes$complete diab_missing <- diabetes$missing glimpse(diab_missing)
K-nearest-neighbors (KNN) is a flexible and useful method for imputation of missing data. Briefly, each missing value is imputed by aggregating or randomly sampling a value from the K observations with greatest similarity to the current obervation. Conventional methods for KNN provide imputations using a single value of K, which makes it hard to identify an optimal value of K. Ideally, one would generate imputed datasets using a number of different values of K, and then use whichever K provided the most accurate imputed values or the most accurate prediction model (usually, these K are the same or similar). This is one of the things ipa
does.
Generally, the ipa
workflow follows the same patterned steps. Here is an example of this workflow using k-nearest-neighbors. A more detailed explanation of the steps below is written in the Getting started
vignette.
nbrs_brew <- brew_nbrs(diab_missing, outcome = diabetes) %>% verbose_on(level = 1) %>% # makes the brew noisy spice(with = spicer_nbrs(k_neighbors = 1:35)) %>% # tuning parameters mash() %>% # creates imputed values ferment() %>% # turns values into a data frame bottle(type = 'tibble') # turns data into tibbles, adds outcome column nbrs_brew
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.