View source: R/target_encoding_lab.R
| target_encoding_lab | R Documentation |
Target encoding maps the values of categorical variables (of class character or factor) to numeric using another numeric variable as reference. The encoding methods implemented here are:
"mean" (implemented in target_encoding_mean()): Maps each category to the average of the reference numeric variable across the cases in that category. Variables encoded with this method are identified with the suffix "__encoded_mean". Overfitting can be controlled via the argument smoothing. The integer value of this argument is a threshold in number of rows: categories with more rows than the threshold are encoded with the group mean, while categories with fewer rows are encoded with a weighted mean of the group mean and the global mean. This method is named "mean smoothing" in the relevant literature.
"rank" (implemented in target_encoding_rank()): Returns the rank of the group as an integer, with 1 assigned to the group with the lowest mean of the reference variable. Variables encoded with this method are identified with the suffix "__encoded_rank".
"loo" (implemented in target_encoding_loo()): Known as the "leave-one-out method" in the literature, it encodes each categorical value with the mean of the response variable across all other group cases. This method controls overfitting better than "mean". Variables encoded with this method are identified with the suffix "__encoded_loo".
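The three encodings described above can be illustrated on a toy data frame with base R only. This is a minimal sketch, not the package's implementation: the data frame `toy` and its values are hypothetical, smoothing is not applied, and ties or single-row groups may be handled differently by the package functions.

```r
# Hypothetical toy data: one categorical predictor and a numeric response.
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "c", "c", "c"),
  y = c(1, 2, 3, 10, 20, 4, 5, 6),
  stringsAsFactors = FALSE
)

# "mean": each row receives the mean of y within its group (no smoothing).
toy$group__encoded_mean <- ave(toy$y, toy$group, FUN = mean)

# "rank": groups are ordered by their mean of y; 1 is the lowest mean.
group_means <- tapply(toy$y, toy$group, mean)
group_ranks <- rank(group_means, ties.method = "first")
toy$group__encoded_rank <- as.integer(group_ranks[as.character(toy$group)])

# "loo": mean of y over all other rows of the same group.
# In this sketch, a group with a single row yields NaN (division by zero).
group_sum <- ave(toy$y, toy$group, FUN = sum)
group_n <- ave(toy$y, toy$group, FUN = length)
toy$group__encoded_loo <- (group_sum - toy$y) / (group_n - 1)

print(toy)
```

Within a group, "mean" and "rank" give every row the same value, whereas "loo" gives each row a slightly different value because the row's own response is excluded from the average.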
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
target_encoding_lab(
df = NULL,
response = NULL,
predictors = NULL,
encoding_method = "loo",
smoothing = 0,
overwrite = FALSE,
quiet = FALSE,
...
)
df |
(required; dataframe, tibble, or sf) A dataframe with responses
(optional) and predictors. Must have at least 10 rows for pairwise
correlation analysis, and |
response |
(optional, character string) Name of a numeric response variable in |
predictors |
(optional; character vector or NULL) Names of the
predictors in |
encoding_method |
(optional; character vector or NULL). Name of the target encoding methods. One or several of: "mean", "rank", "loo". If NULL, target encoding is ignored, and |
smoothing |
(optional; integer vector) Argument of the method "mean". Groups smaller than this number have their means pulled towards the mean of the response across all cases. Default: 0 |
overwrite |
(optional; logical) If TRUE, the original predictors in |
quiet |
(optional; logical) If FALSE, messages are printed. Default: FALSE. |
... |
(optional) Internal args (e.g. |
dataframe
Blas M. Benito, PhD
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. doi: 10.1145/507533.507538
Other target_encoding:
target_encoding_loo()
data(vi_smol)
#applying all methods for a continuous response
df <- target_encoding_lab(
df = vi_smol,
response = "vi_numeric",
predictors = "koppen_zone",
encoding_method = c(
"mean",
"loo",
"rank"
)
)
#identify encoded predictors
predictors.encoded <- grep(
pattern = "__encoded_",
x = colnames(df),
value = TRUE
)
head(df[, predictors.encoded])