scrimp_mdl: Score imputations based on accuracy of downstream models
In bcjaeger/midy: Imputation for Predictive Analytics


Imputation of missing data is generally completed in order to fit downstream models that require complete input data. For supervised learning analyses, a key goal is to develop models with optimal accuracy, so analysts will likely want to use whatever imputation strategy provides the most accurate downstream model. scrimp_mdl facilitates this comparison.

scrimp_mdl(train_imputed, test_imputed, outcome, .fun = NULL, .fun_args = NULL)

`train_imputed`	an imputed data frame with training data.
`test_imputed`	an imputed data frame with testing data.
`outcome`	column name(s) of outcomes. These values can be provided as symbols (e.g., outcome = c(a,b,c) for multiple outcomes or outcome = a for one outcome) or character values (e.g., outcome = c('a','b','c') for multiple outcomes or outcome = 'a' for a single outcome).
`.fun`	a function with at least three inputs: `.trn`, `.tst`, and `outcome`. `scrimp_mdl()` will call your function as follows: `.fun(.trn = train_imputed, .tst = test_imputed, outcome = outcome, ...)`, where `...` is filled in by `.fun_args`. Generally, `.fun` should (1) develop a prediction model using `.trn`, (2) create predicted values for observations in `.tst`, and (3) evaluate the predictions using a summary measure (e.g., R-squared, area underneath the ROC curve, Brier score, etc.). See the example below, where a function using random forests is applied.
`.fun_args`	a named list with additional arguments for `.fun`.
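
The three steps expected of `.fun` can be sketched with base R alone. The snippet below is a minimal illustration, not part of the package: `brier_fun`, the simulated data, and the `family` argument are hypothetical, and `do.call()` only mimics the documented call `.fun(.trn, .tst, outcome, ...)`. It assumes `outcome` is a single character column name and that the outcome is coded 0/1.

```r
# Hypothetical scoring function: logistic regression scored by Brier score.
# Assumes `outcome` is one character column name and the outcome is 0/1.
brier_fun <- function(.trn, .tst, outcome, family = "binomial") {
  # (1) develop a prediction model using the imputed training data
  fml <- as.formula(paste(outcome, "~ ."))
  mdl <- glm(fml, data = .trn, family = family)
  # (2) create predicted probabilities for the imputed testing data
  prd <- predict(mdl, newdata = .tst, type = "response")
  # (3) summarize accuracy: mean squared difference (lower is better)
  mean((prd - .tst[[outcome]])^2)
}

# Simulated complete data standing in for imputed training/testing sets
set.seed(1)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(dat$x1 - 0.5 * dat$x2))
trn <- dat[1:200, ]
tst <- dat[201:300, ]

# Mimic the documented call, with `...` filled in from a named list
fun_args <- list(family = "binomial")
score <- do.call(
  brier_fun,
  c(list(.trn = trn, .tst = tst, outcome = "y"), fun_args)
)
score
```

Because the required inputs are matched by name, the extra arguments in the named list can be anything the function itself understands.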

If you supply your own function using .fun, scrimp_mdl will return the output of .fun. If .fun is left unspecified, a named list is returned with the following components:

model_cv: a model tuned by cross-validation and fitted to the training data
preds_cv: the model's predicted values for internal testing data
preds_ex: the model's predicted values for external testing data
score_ex: a numeric value indicating external prediction accuracy

The list's contents will vary depending on .fun_args. By default, when .fun is unspecified, .fun_args is governed by the net_args() function, which includes inputs of keep_mdl and keep_prd. The default value for keep_mdl is FALSE while that of keep_prd is TRUE, so users who want the returned list to include model_cv should write .fun_args = net_args(keep_mdl = TRUE) when calling scrimp_mdl.

data("diabetes")
trn <- diabetes$missing[1:200, ]
tst <- diabetes$missing[-c(1:200), ]

imputes <- brew_soft(trn, outcome = diabetes) %>%
  mash(with = masher_soft(si_maxit = 1000)) %>%
  stir() %>%
  ferment(data_new = tst) %>%
  bottle() %>%
  .[['wort']] %>%
  .[5, list(training, testing)]

# use the default glmnet logistic regression model
dflt_output <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = diabetes)

# use default glmnet and include fitted model in list output
include_mdls <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = diabetes,
  .fun_args = net_args(keep_mdl = TRUE))

## Not run: 
# write your own function:
# note the function inputs can be ordered however you like, but the
# names of the inputs **must** include .trn, .tst, and outcome
rngr_fun <- function(outcome, .trn, .tst, num_trees){

  # make a model formula
  formula <- as.formula(paste(outcome, '~ .'))
  # fit a random forest with ranger (probability = TRUE -> predicted probs)
  mdl <- ranger::ranger(formula = formula, data = .trn,
    probability = TRUE, num.trees = num_trees)
  # prediction from ranger returns matrix with two columns, we need the 2nd.
  prd <- predict(mdl, data = .tst)$predictions[ , 2, drop = TRUE]
  # compute model AUC
  yardstick::roc_auc_vec(.tst[[outcome]], prd)

}

scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = 'diabetes',
  .fun = rngr_fun,
  .fun_args = list(num_trees = 100)
)

## End(Not run)
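
The comparison this function is meant to support (scoring several imputations with the same downstream model) can be sketched without the package. Everything below is an assumption made for illustration: the simulated data, the mean and median fills standing in for real imputation strategies, and the helper names `impute_with` and `brier_fun`. In practice each pair of imputed training and testing sets would be passed to `scrimp_mdl()` instead.

```r
# Simulated data with values missing completely at random in x1
set.seed(2)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(1.5 * dat$x1 - dat$x2))
dat$x1[sample(n, size = 100)] <- NA
trn <- dat[1:300, ]
tst <- dat[301:400, ]

# Hypothetical single-value imputer; fill values come from training data only
impute_with <- function(data, fill) {
  data$x1[is.na(data$x1)] <- fill
  data
}
fills <- list(
  mean   = mean(trn$x1, na.rm = TRUE),
  median = median(trn$x1, na.rm = TRUE)
)

# Same downstream model and metric for every strategy (Brier score)
brier_fun <- function(.trn, .tst, outcome) {
  mdl <- glm(as.formula(paste(outcome, "~ .")), data = .trn,
             family = binomial())
  prd <- predict(mdl, newdata = .tst, type = "response")
  mean((prd - .tst[[outcome]])^2)
}

# One score per imputation strategy; lower Brier score is better
scores <- vapply(
  fills,
  function(fill) brier_fun(impute_with(trn, fill), impute_with(tst, fill), "y"),
  numeric(1)
)
scores
```

Holding the model and the metric fixed means any difference between the scores is attributable to the imputation step alone.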