View source: R/target_encoding_lab.R
target_encoding_lab | R Documentation |
Target encoding involves replacing the values of categorical variables with numeric ones derived from a "target variable", usually a model's response.
In essence, target encoding works as follows:
1. Group all cases that share the same value of the categorical variable.
2. Compute a statistic of the target variable across the cases in each group.
3. Assign the value of the statistic to every case in the group.
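The three steps can be sketched in base R as follows. This is a minimal illustration on made-up data, not the package's implementation: the data frame, the column names `zone` and `y`, and the use of the mean as the group statistic are all assumptions chosen for the example.

```r
# Toy data (hypothetical values): a categorical predictor and a numeric response.
df <- data.frame(
  zone = c("a", "a", "b", "b", "b", "c"),
  y = c(1, 3, 10, 20, 30, 5),
  stringsAsFactors = FALSE
)

# Steps 1 and 2: group the cases by category and compute the group statistic
# (here, the mean of the response).
group_means <- tapply(X = df$y, INDEX = df$zone, FUN = mean)

# Step 3: assign to each case the statistic of its group,
# looking it up by category name.
df$zone__encoded_mean <- as.numeric(group_means[as.character(df$zone)])

print(df)
```

The lookup by name in step 3 is what turns one value per group into one value per row.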
The following methods to compute the group statistic are implemented:
"mean" (implemented in target_encoding_mean()): Encodes categorical values with the group means of the response. Variables encoded with this method are identified with the suffix "__encoded_mean". Overfitting can be controlled via the argument smoothing, an integer threshold expressed in number of rows. Groups above this threshold are encoded with the group mean, while groups below it are encoded with a weighted mean of the group's mean and the global mean. This method is named "mean smoothing" in the relevant literature.
"rank" (implemented in target_encoding_rank()): Returns the rank of the group as an integer, with 1 assigned to the group with the lowest mean of the response variable. Variables encoded with this method are identified with the suffix "__encoded_rank".
"loo" (implemented in target_encoding_loo()): Known as the "leave-one-out method" in the literature, it encodes each case with the mean of the response variable across all other cases in its group. This method controls overfitting better than "mean". Variables encoded with this method are identified with the suffix "__encoded_loo".
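As a rough base R sketch of what the three methods compute, consider the toy example below. It does not reproduce the package's code. The data and column names are hypothetical, and the smoothing weight (group size divided by the threshold, capped at 1) is an assumption made for illustration; the weighting actually used by target_encoding_mean() may differ.

```r
# Toy data (hypothetical values): one categorical predictor, one numeric response.
df <- data.frame(
  zone = c("a", "a", "b", "b", "b", "c", "c"),
  y = c(1, 3, 10, 20, 30, 4, 6),
  stringsAsFactors = FALSE
)

# Per-case group size, group sum, and group mean.
n_group <- ave(df$y, df$zone, FUN = length)
sum_group <- ave(df$y, df$zone, FUN = sum)
mean_group <- sum_group / n_group

# "mean" with smoothing. ASSUMPTION: linear weight n / smoothing, capped at 1,
# so groups at or above the threshold keep their own mean and smaller groups
# are pulled towards the global mean.
smoothing <- 3
weight <- pmin(n_group / smoothing, 1)
df$zone__encoded_mean <- weight * mean_group + (1 - weight) * mean(df$y)

# "rank": 1 for the group with the lowest mean of the response.
means_by_group <- tapply(df$y, df$zone, mean)
groups_by_mean <- names(means_by_group)[order(means_by_group)]
df$zone__encoded_rank <- match(df$zone, groups_by_mean)

# "loo": mean of the response over all other cases in the same group.
# Single-case groups would yield NaN (0 / 0) and need separate handling.
df$zone__encoded_loo <- (sum_group - df$y) / (n_group - 1)

print(df)
```

In the leave-one-out column, cases from the same group receive different values because each case's own response is excluded from its encoding.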
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
target_encoding_lab(
df = NULL,
response = NULL,
predictors = NULL,
methods = c("loo", "mean", "rank"),
smoothing = 0,
white_noise = 0,
seed = 0,
overwrite = FALSE,
quiet = FALSE
)
df |
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL. |
response |
(optional; character string) Name of a numeric response variable in |
predictors |
(optional; character vector) Names of the predictors to select from |
methods |
(optional; character vector or NULL). Name of the target encoding methods. If NULL, target encoding is ignored, and |
smoothing |
(optional; integer vector) Argument of the method "mean". Groups smaller than this number have their means pulled towards the mean of the response across all cases. Default: 0 |
white_noise |
(optional; numeric vector) Argument of the methods "mean", "rank", and "loo". Maximum white noise to add, expressed as a fraction of the range of the response variable. Range from 0 to 1. Default: |
seed |
(optional; integer vector) Random seed to facilitate reproducibility when |
overwrite |
(optional; logical) If |
quiet |
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE |
data frame
Blas M. Benito, PhD
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. doi: 10.1145/507533.507538
Other target_encoding:
target_encoding_mean()
data(
vi,
vi_predictors
)
#subset to limit example run time
vi <- vi[1:1000, ]
#applying all methods for a continuous response
df <- target_encoding_lab(
df = vi,
response = "vi_numeric",
predictors = "koppen_zone",
methods = c(
"mean",
"loo",
"rank"
),
white_noise = c(0, 0.1, 0.2)
)
#identify encoded predictors
predictors.encoded <- grep(
pattern = "__encoded_",
x = colnames(df),
fixed = TRUE,
value = TRUE
)