step_woe | R Documentation
step_woe()
creates a specification of a recipe step that will transform
nominal data into a numeric representation based on its weight of evidence
against a binary outcome.
step_woe(
recipe,
...,
role = "predictor",
outcome,
trained = FALSE,
dictionary = NULL,
Laplace = 1e-06,
prefix = "woe",
keep_original_cols = FALSE,
skip = FALSE,
id = rand_id("woe")
)
recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
...
One or more selector functions to choose which variables will be used to compute the components.
role
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the new WoE component columns created from the original variables will be used as predictors in a model.
outcome
The bare name of the binary outcome encased in vars().
trained
A logical to indicate if the quantities for preprocessing have been estimated.
dictionary
A tbl. A map of levels and WoE values. It must have the same layout as the output returned from dictionary().
Laplace
The Laplace smoothing parameter. A value usually applied to avoid -Inf/Inf results from a predictor category with only one outcome class. Set to 0 to allow Inf/-Inf. The default is 1e-6. Also known as the 'pseudocount' parameter of the Laplace smoothing technique.
prefix
A character string that will be the prefix of the resulting new variables. See notes below.
keep_original_cols
A logical to keep the original variables in the output. Defaults to FALSE.
skip
A logical. Should the step be skipped when the recipe is baked by bake()?
id
A character string that is unique to this step to identify it.
WoE is a transformation of a group of variables that produces a new set of features. The formula is

woe_c = log( P(X = c | Y = 1) / P(X = c | Y = 0) )

where c goes from 1 to C, the levels of a given nominal predictor variable X.
These components are designed to transform nominal variables into numerical
ones with the property that the order and magnitude reflect the association
with a binary outcome. To apply it to numerical predictors, it is advisable
to discretize the variables before running WoE. Here, each variable is
binned so that a WoE value can be assigned to each bin later. This can be
achieved by using step_discretize().
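As a small worked example of the formula above (with made-up counts, not taken from any real data): suppose level "A" of a predictor occurs in 2 of the 2 events (Y = 1) and in 1 of the 3 non-events (Y = 0).

```r
# Worked example of woe_c = log(P(X = c | Y = 1) / P(X = c | Y = 0))
# using made-up counts: level "A" covers 2/2 events and 1/3 non-events.
p_a_given_1 <- 2 / 2
p_a_given_0 <- 1 / 3
woe_a <- log(p_a_given_1 / p_a_given_0)
woe_a  # log(3), about 1.099: "A" is more prevalent among events
```

A positive WoE means the level is more common among events than non-events; a negative WoE means the reverse.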
The argument Laplace is a small quantity added to the proportions of 1's
and 0's to avoid log(p/0) or log(0/p) results. The numerical
WoE versions will have names that begin with woe_ followed by the
respective original names of the variables. See Good (1985).
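To see why the smoothing matters, here is a minimal base-R sketch of a WoE computation with a pseudocount. This is illustrative only, not the internals of step_woe(); in particular, the exact placement of the pseudocount in the denominators is an assumption of this sketch.

```r
# Illustrative WoE with Laplace smoothing (a sketch, NOT step_woe()'s
# exact internals): a pseudocount keeps log() finite when a level never
# occurs with one of the two outcome classes.
woe_sketch <- function(x, y, laplace = 1e-6) {
  x <- factor(x)
  n1 <- table(x[y == 1])  # level counts among events (y == 1)
  n0 <- table(x[y == 0])  # level counts among non-events (y == 0)
  p1 <- (n1 + laplace) / (sum(n1) + laplace)  # smoothed P(X = c | Y = 1)
  p0 <- (n0 + laplace) / (sum(n0) + laplace)  # smoothed P(X = c | Y = 0)
  log(as.numeric(p1) / as.numeric(p0))
}

x <- c("A", "A", "A", "B", "B")
y <- c(1, 1, 0, 0, 0)
woe_sketch(x, y, laplace = 0)  # level "B" gives -Inf: no B's with y == 1
woe_sketch(x, y)               # finite, strongly negative value for "B"
```

With laplace = 0, level "B" has no observations in the Y = 1 class, so its WoE is log(0) = -Inf; any positive pseudocount replaces that with a large but finite negative value.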
One can pass a custom dictionary tibble to step_woe(). It must have
the same structure as the output from dictionary() (see the examples). If
not provided, it will be created automatically. The role of this tibble is to
store the mapping between the levels of each nominal predictor and its WoE
values. You may want to tweak this object, for example to fix the ordering of
the levels of a given predictor. One easy way to do this is by tweaking an
output returned from dictionary().
An updated version of recipe with the new step added to the
sequence of existing steps (if any). For the tidy method, a tibble with
the WoE dictionary used to map categories to WoE values.
When you tidy() this step, a tibble is returned with columns
terms, value, n_tot, n_bad, n_good, p_bad, p_good, woe, outcome, and id:

terms: character, the selectors or variables selected
value: character, level of the outcome
n_tot: integer, total number
n_bad: integer, number of bad examples
n_good: integer, number of good examples
p_bad: numeric, p of bad examples
p_good: numeric, p of good examples
woe: numeric, weight of evidence
outcome: character, name of outcome variable
id: character, id of this step

See dictionary() for more information.
This step has 1 tuning parameter:
Laplace
: Laplace Correction (type: double, default: 1e-06)
The underlying operation does not allow for case weights.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, Second Edition. Springer.
Good, I. J. (1985), "Weight of evidence: A brief survey", Bayesian Statistics, 2, pp.249-270.
library(embed)
library(dplyr)
library(modeldata)
data("credit_data")
set.seed(111)
in_training <- sample(1:nrow(credit_data), 2000)
credit_tr <- credit_data[in_training, ]
credit_te <- credit_data[-in_training, ]
rec <- recipe(Status ~ ., data = credit_tr) %>%
step_woe(Job, Home, outcome = vars(Status))
woe_models <- prep(rec, training = credit_tr)
# the encoding:
bake(woe_models, new_data = credit_te %>% slice(1:5), starts_with("woe"))
# the original data
credit_te %>%
slice(1:5) %>%
dplyr::select(Job, Home)
# the details:
tidy(woe_models, number = 1)
# Example of custom dictionary + tweaking
# custom dictionary
woe_dict_custom <- credit_tr %>% dictionary(Job, Home, outcome = "Status")
woe_dict_custom[4, "woe"] <- 1.23 # tweak
# passing custom dict to step_woe()
rec_custom <- recipe(Status ~ ., data = credit_tr) %>%
step_woe(
Job, Home,
outcome = vars(Status), dictionary = woe_dict_custom
) %>%
prep()
rec_custom_baked <- bake(rec_custom, new_data = credit_te)
rec_custom_baked %>%
dplyr::filter(woe_Job == 1.23) %>%
head()