scrimp_mdl: Score imputations based on accuracy of downstream models


View source: R/scrimp.R

Description

Imputation of missing data is generally completed in order to fit downstream models that require complete input data. For supervised learning analyses, a key goal is to develop models with optimal accuracy, so analysts will likely want to use whatever imputation strategy provides the most accurate downstream model. scrimp_mdl facilitates this comparison.

Usage

scrimp_mdl(train_imputed, test_imputed, outcome, .fun = NULL, .fun_args = NULL)

Arguments

train_imputed

an imputed data frame with training data.

test_imputed

an imputed data frame with testing data.

outcome

column name(s) of outcomes. These values can be provided as symbols (e.g., outcome = c(a,b,c) for multiple outcomes or outcome = a for one outcome) or character values (e.g., outcome = c('a','b','c') for multiple outcomes or outcome = 'a' for a single outcome).

.fun

a function with at least three inputs: .trn, .tst, and outcome. scrimp_mdl() will call your function as follows: .fun(.trn = train_imputed, .tst = test_imputed, outcome = outcome, ...), where ... is filled in by .fun_args. Generally, .fun should

  1. develop a prediction model using .trn

  2. create predicted values for observations in .tst

  3. evaluate the predictions using a summary measure (e.g., R-squared, area underneath the ROC curve, Brier score, etc.).

See example below where a function using random forests is applied.
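Separately from the random forest example in the Examples section, the three steps above can be sketched using only base R. This is a minimal illustration, not part of the package: the function name lm_rsq_fun and the use of mtcars are hypothetical, and it assumes outcome reaches .fun as a character string, as in the call shown in the Examples.

```r
# Hypothetical scoring function; only the argument names .trn, .tst, and
# outcome come from the documentation above.
lm_rsq_fun <- function(.trn, .tst, outcome) {
  # 1. develop a prediction model using .trn
  mdl <- lm(reformulate(".", response = outcome), data = .trn)
  # 2. create predicted values for observations in .tst
  prd <- predict(mdl, newdata = .tst)
  # 3. evaluate the predictions with out-of-sample R-squared
  obs <- .tst[[outcome]]
  1 - sum((obs - prd)^2) / sum((obs - mean(obs))^2)
}

# Stand-in for imputed training and testing data (complete, no missing values)
dat <- mtcars[, c("mpg", "wt", "hp")]
trn <- dat[1:22, ]
tst <- dat[23:32, ]

rsq <- lm_rsq_fun(.trn = trn, .tst = tst, outcome = "mpg")
```

A function of this shape could then be supplied as .fun = lm_rsq_fun. Returning a single number makes it straightforward to compare scores across imputation strategies.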

.fun_args

a named list with additional arguments for .fun.
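To illustrate how a named list can fill in the ... of a call to .fun, here is a small base R sketch. The helper call_with_args and the scoring function mean_rmse_fun are hypothetical and only mimic the calling convention described above; the actual internals of scrimp_mdl may differ.

```r
# Hypothetical helper that mimics the documented calling convention:
# .fun(.trn = ..., .tst = ..., outcome = ..., <entries of .fun_args>)
call_with_args <- function(.fun, .trn, .tst, outcome, .fun_args = NULL) {
  base_args <- list(.trn = .trn, .tst = .tst, outcome = outcome)
  # c() on a list and NULL returns the list unchanged, so NULL is safe here
  do.call(.fun, c(base_args, .fun_args))
}

# Hypothetical scoring function with one extra argument, `digits`
mean_rmse_fun <- function(.trn, .tst, outcome, digits = 2) {
  prd <- mean(.trn[[outcome]])  # predict the training mean for every row
  round(sqrt(mean((.tst[[outcome]] - prd)^2)), digits)
}

res <- call_with_args(
  .fun = mean_rmse_fun,
  .trn = mtcars[1:20, ],
  .tst = mtcars[21:32, ],
  outcome = "mpg",
  .fun_args = list(digits = 1)
)
```

Each name in .fun_args must match an argument of .fun (here, digits); otherwise the call fails with an unused argument error.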

Value

If you supply your own function using .fun, scrimp_mdl will return the output of .fun. If .fun is left unspecified, a named list is returned.

The list's contents vary depending on .fun_args. By default, when .fun is unspecified, .fun_args is governed by the net_args() function, which includes the inputs keep_mdl and keep_prd. The default value of keep_mdl is FALSE and that of keep_prd is TRUE, so users who want the returned list to include the fitted model should write .fun_args = net_args(keep_mdl = TRUE) when calling scrimp_mdl.

Examples

data("diabetes")
trn <- diabetes$missing[1:200, ]
tst <- diabetes$missing[-c(1:200), ]

imputes <- brew_soft(trn, outcome = diabetes) %>%
  mash(with = masher_soft(si_maxit = 1000)) %>%
  stir() %>%
  ferment(data_new = tst) %>%
  bottle() %>%
  .[['wort']] %>%
  .[5, list(training, testing)]

# use the default glmnet logistic regression model
dflt_output <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = diabetes)

# use default glmnet and include fitted model in list output
include_mdls <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = diabetes,
  .fun_args = net_args(keep_mdl = TRUE))

## Not run: 
# write your own function:
# note the function inputs can be ordered however you like, but the
# names of the inputs **must** include .trn, .tst, and outcome
rngr_fun <- function(outcome, .trn, .tst, num_trees){

  # make a model formula
  formula <- as.formula(paste(outcome, '~ .'))
  # fit a random forest with ranger (probability = TRUE -> predicted probs)
  mdl <- ranger::ranger(formula = formula, data = .trn,
    probability = TRUE, num.trees = num_trees)
  # prediction from ranger returns matrix with two columns, we need the 2nd.
  prd <- predict(mdl, data = .tst)$predictions[ , 2, drop = TRUE]
  # compute model AUC
  yardstick::roc_auc_vec(.tst[[outcome]], prd)

}

scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed  = imputes$testing[[1]],
  outcome = 'diabetes',
  .fun = rngr_fun,
  .fun_args = list(num_trees = 100)
)

## End(Not run)

bcjaeger/midy documentation built on May 3, 2020, 3:55 p.m.