Missing data are usually imputed so that downstream models requiring
complete input data can be fit. In supervised learning, a key goal is
to develop models with optimal accuracy, so analysts will likely want
to use whichever imputation strategy yields the most accurate
downstream model. scrimp_mdl facilitates this comparison.
scrimp_mdl(train_imputed, test_imputed, outcome, .fun = NULL, .fun_args = NULL)
train_imputed: an imputed data frame with training data.
test_imputed: an imputed data frame with testing data.
outcome: column name(s) of outcomes. These values can be provided as symbols (e.g., outcome = c(a,b,c) for multiple outcomes or outcome = a for one outcome) or as character values (e.g., outcome = c('a','b','c') for multiple outcomes or outcome = 'a' for a single outcome).
.fun: a function with at least three inputs: .trn, .tst, and outcome. See the example below, where a function using random forests is applied.
.fun_args: a named list with additional arguments for .fun.
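The shape of a user-supplied .fun can be illustrated without any modeling packages. The sketch below is not part of the package: glm_fun, its cutoff argument, and the simulated data are hypothetical, and the do.call line only imitates how a named list such as .fun_args might be forwarded, which is an assumption about the internals rather than documented behavior.

```r
# Hypothetical scorer. Only the input names .trn, .tst, and outcome are
# taken from this page; `cutoff` is an illustrative extra argument.
glm_fun <- function(.trn, .tst, outcome, cutoff = 0.5) {
  # build the formula `outcome ~ .` from the outcome column name
  formula <- stats::reformulate(".", response = outcome)
  # fit a logistic regression on the (already imputed) training data
  mdl <- stats::glm(formula, data = .trn, family = stats::binomial())
  # predicted probabilities for the testing data
  prd <- stats::predict(mdl, newdata = .tst, type = "response")
  # proportion of testing rows classified correctly
  mean((prd > cutoff) == (.tst[[outcome]] == 1))
}

# simulated complete data standing in for imputed training/testing sets
set.seed(1)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
sim <- data.frame(x1 = x1, x2 = x2, y = rbinom(n, 1, plogis(x1 - x2)))
trn <- sim[1:200, ]
tst <- sim[201:300, ]

# assumed calling pattern: required inputs plus a named list of extras
extra_args <- list(cutoff = 0.5)
acc <- do.call(glm_fun,
               c(list(.trn = trn, .tst = tst, outcome = "y"), extra_args))
```

Because the function returns a single number, it can be compared directly across imputation strategies; any richer return value would work as well, since the output of .fun is passed back unchanged.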
If you supply your own function using .fun, scrimp_mdl will return the
output of .fun. If .fun is left unspecified, a named list is returned
with components
model_cv: a model tuned by cross-validation and fitted to the training data
preds_cv: the model's predicted values for internal testing data
preds_ex: the model's predicted values for external testing data
score_ex: a numeric value indicating external prediction accuracy
The list's contents will vary depending on .fun_args. By default, when
.fun is unspecified, .fun_args is governed by the net_args() function,
which includes the inputs keep_mdl and keep_prd. The default value of
keep_mdl is FALSE, while that of keep_prd is TRUE, so users who want
the returned list to include the fitted model should write
.fun_args = net_args(keep_mdl = TRUE) when calling scrimp_mdl.
data("diabetes")

trn <- diabetes$missing[1:200, ]
tst <- diabetes$missing[-c(1:200), ]

imputes <- brew_soft(trn, outcome = diabetes) %>%
  mash(with = masher_soft(si_maxit = 1000)) %>%
  stir() %>%
  ferment(data_new = tst) %>%
  bottle() %>%
  .[['wort']] %>%
  .[5, list(training, testing)]

# use the default glmnet logistic regression model
dflt_output <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed = imputes$testing[[1]],
  outcome = diabetes)

# use default glmnet and include fitted model in list output
include_mdls <- scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed = imputes$testing[[1]],
  outcome = diabetes,
  .fun_args = net_args(keep_mdl = TRUE))
## Not run: 

# write your own function:
# note the function inputs can be ordered however you like, but the
# names of the inputs **must** include .trn, .tst, and outcome
rngr_fun <- function(outcome, .trn, .tst, num_trees){

  # make a model formula
  formula <- as.formula(paste(outcome, '~ .'))
  # fit a random forest with ranger (probability = TRUE -> predicted probs)
  mdl <- ranger::ranger(formula = formula, data = .trn,
                        probability = TRUE, num.trees = num_trees)
  # prediction from ranger returns matrix with two columns, we need the 2nd.
  prd <- predict(mdl, data = .tst)$predictions[ , 2, drop = TRUE]
  # compute model AUC
  yardstick::roc_auc_vec(.tst[[outcome]], prd)

}

scrimp_mdl(
  train_imputed = imputes$training[[1]],
  test_imputed = imputes$testing[[1]],
  outcome = 'diabetes',
  .fun = rngr_fun,
  .fun_args = list(num_trees = 100)
)

## End(Not run)