target_encoding_lab: Convert categorical predictors to numeric via target encoding
In collinear: Automated Multicollinearity Management

target_encoding_lab

R Documentation

Convert categorical predictors to numeric via target encoding

Description

Target encoding maps the values of categorical variables (of class character or factor) to numeric using another numeric variable as reference. The encoding methods implemented here are:

"mean" (implemented in target_encoding_mean()): Maps each category to the mean of the reference numeric variable across the cases in that category. Variables encoded with this method are identified with the suffix "__encoded_mean". Overfitting is controlled via the argument smoothing, an integer threshold expressed in number of rows: categories with more rows than this threshold are encoded with the group mean, while smaller categories are encoded with a weighted mean of the group mean and the global mean. This method is named "mean smoothing" in the relevant literature.
"rank" (implemented in target_encoding_rank()): Returns the rank of the group as an integer, with 1 assigned to the group with the lowest mean of the reference variable. Variables encoded with this method are identified with the suffix "__encoded_rank".
"loo" (implemented in target_encoding_loo()): Known as the "leave-one-out method" in the literature, it encodes each case with the mean of the response variable across all other cases in the same group. This method controls overfitting better than "mean". Variables encoded with this method are identified with the suffix "__encoded_loo".
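
The three methods above can be illustrated with a minimal base R sketch on toy data. This is not the package's implementation: the helpers encode_mean(), encode_rank(), and encode_loo() are hypothetical names, the smoothing weight n / smoothing is an assumed weighting chosen only for illustration, and the global-mean fallback for single-row groups in the leave-one-out case is likewise an assumption.

```r
# Toy data: one categorical predictor and one numeric reference variable.
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "c"),
  y = c(1, 2, 3, 10, 20, 100)
)

# "mean": group mean of y, optionally pulled towards the global mean.
# ASSUMPTION: groups with fewer rows than `smoothing` get the weight
# n / smoothing; the package may use a different weighting.
encode_mean <- function(x, y, smoothing = 0) {
  group_mean <- ave(y, x, FUN = mean)
  group_n <- ave(y, x, FUN = length)
  w <- if (smoothing > 0) pmin(1, group_n / smoothing) else rep(1, length(y))
  w * group_mean + (1 - w) * mean(y)
}

# "rank": integer rank of each group's mean, 1 = lowest mean.
encode_rank <- function(x, y) {
  means <- tapply(y, x, mean)
  ranks <- rank(as.numeric(means), ties.method = "first")
  names(ranks) <- names(means)
  as.integer(ranks[as.character(x)])
}

# "loo": mean of y across all OTHER rows of the same group.
# ASSUMPTION: single-row groups fall back to the global mean.
encode_loo <- function(x, y) {
  group_sum <- ave(y, x, FUN = sum)
  group_n <- ave(y, x, FUN = length)
  loo <- (group_sum - y) / (group_n - 1)
  loo[group_n == 1] <- mean(y)
  loo
}

# Encoded columns named with the suffixes described above.
toy$group__encoded_mean <- encode_mean(toy$group, toy$y)
toy$group__encoded_rank <- encode_rank(toy$group, toy$y)
toy$group__encoded_loo <- encode_loo(toy$group, toy$y)
toy
```

In this sketch the "mean" and "rank" encodings are constant within a group, whereas "loo" differs from row to row because each row's own value of y is excluded from its encoding.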

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

Usage

target_encoding_lab(
  df = NULL,
  response = NULL,
  predictors = NULL,
  encoding_method = "loo",
  smoothing = 0,
  overwrite = FALSE,
  quiet = FALSE,
  ...
)

Arguments

`df`	(required; dataframe, tibble, or sf) A dataframe with responses (optional) and predictors. Must have at least 10 rows for pairwise correlation analysis, and `10 * (length(predictors) - 1)` for VIF. Default: NULL.
`response`	(optional; character string) Name of a numeric response variable in `df`. Default: NULL.
`predictors`	(optional; character vector or NULL) Names of the predictors in `df`. If NULL, all columns except `response` and constant/near-zero-variance columns are used. Default: NULL.
`encoding_method`	(optional; character vector or NULL) Names of the target encoding methods. One or several of: "mean", "rank", "loo". If NULL, target encoding is skipped and `df` is returned unmodified. Default: "loo".
`smoothing`	(optional; integer vector) Argument of the method "mean". Groups smaller than this number have their means pulled towards the mean of the response across all cases. Default: 0.
`overwrite`	(optional; logical) If TRUE, the original predictors in `df` are overwritten with their encoded versions; in this case only a single value each of encoding method, smoothing, white noise, and seed is allowed. If FALSE, encoded predictors with their descriptive names are added to `df`. Default: FALSE.
`quiet`	(optional; logical) If FALSE, messages are printed. Default: FALSE.
`...`	(optional) Internal args (e.g. `function_name` for `validate_arg_function_name`, a precomputed correlation matrix `m`, or cross-validation args for `preference_order`).

Value

dataframe

Author(s)

Blas M. Benito, PhD

References

Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. doi: 10.1145/507533.507538

Examples


data(vi_smol)

#applying all methods for a continuous response
df <- target_encoding_lab(
  df = vi_smol,
  response = "vi_numeric",
  predictors = "koppen_zone",
  encoding_method = c(
    "mean",
    "loo",
    "rank"
  )
)

#identify encoded predictors
predictors.encoded <- grep(
  pattern = "__encoded_",
  x = colnames(df),
  value = TRUE
)

head(df[, predictors.encoded])

collinear documentation built on Dec. 8, 2025, 5:06 p.m.