factor.encoder: Encoder for Qualitative Variables
In midr: Learning from Black-Box Models by Maximum Interpretation Decomposition

Description

factor.encoder() creates an encoder for a qualitative (factor or character) variable. The encoder converts the variable into a one-hot encoded (dummy) design matrix.

factor.frame() is a helper function to create a "factor.frame" object that defines the encoding scheme.

Usage

factor.encoder(
  x,
  k,
  use.catchall = TRUE,
  catchall = "(others)",
  tag = "x",
  frame = NULL,
  weights = NULL
)

factor.frame(levels, catchall = "(others)", tag = "x")

Arguments

`x`	a vector to be encoded as a qualitative variable.
`k`	an integer specifying the maximum number of distinct levels to retain (including the catch-all level). If `k` is not positive, all unique values of `x` are used.
`use.catchall`	logical. If `TRUE`, less frequent levels are grouped into the catch-all level.
`catchall`	a character string for the catch-all level.
`tag`	the name of the variable.
`frame`	a "factor.frame" object or a character vector that explicitly defines the levels of the variable.
`weights`	an optional numeric vector of sample weights for `x`.
`levels`	a vector to be used as the levels of the variable.

Details

This function is designed to handle qualitative data for use in the MID model's linear system formulation.

The primary mechanism is one-hot encoding. Each unique level of the input variable becomes a column in the output matrix. For a given observation, the column corresponding to its level is assigned a 1, and all other columns are assigned 0.
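The one-hot scheme described above can be sketched in base R. This is only an illustration and not midr's implementation: `one_hot()` is a hypothetical helper, and giving `NA` an all-zero row is an assumption of this sketch rather than documented behavior of the package.

```r
# Hypothetical helper (not part of midr): one-hot encode `x` against fixed levels.
one_hot <- function(x, levels) {
  x <- as.character(x)
  # Compare every observation with every level; TRUE where they match.
  mat <- outer(x, levels, FUN = "==")
  # NA observations compare as NA; in this sketch they match no level.
  mat[is.na(mat)] <- FALSE
  # Convert the logical matrix to 0/1 while keeping its dimensions.
  storage.mode(mat) <- "numeric"
  colnames(mat) <- levels
  mat
}

x <- c("setosa", "virginica", "setosa", NA)
m <- one_hot(x, levels = c("setosa", "versicolor", "virginica"))
m
```

Each row of `m` has at most one entry equal to 1, in the column named after that observation's level.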

When a variable has many unique levels (high cardinality), you can use the use.catchall = TRUE and k arguments. The k - 1 most frequent levels each keep their own column, while all other, less frequent levels are consolidated into a single catch-all level (named "(others)" by default). This limits the number of columns in the design matrix and helps keep MID models from becoming overly complex.
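The level-consolidation idea can be illustrated with a minimal base R sketch. `lump_levels()` is a hypothetical helper, not a midr function; it assumes unweighted frequencies, leaves `NA` untouched, and skips consolidation when the variable already has at most `k` levels. All three choices are assumptions of this sketch.

```r
# Hypothetical helper (not part of midr): keep the k - 1 most frequent levels
# and relabel everything else with a catch-all label.
lump_levels <- function(x, k, catchall = "(others)") {
  x <- as.character(x)
  # Unweighted counts per level, most frequent first (NA is not counted).
  freq <- sort(table(x), decreasing = TRUE)
  # Nothing to consolidate if k is not positive or the levels already fit.
  if (k <= 0 || length(freq) <= k) return(x)
  keep <- names(freq)[seq_len(k - 1)]
  # Relabel rare, non-missing values; NA stays NA.
  x[!is.na(x) & !(x %in% keep)] <- catchall
  x
}

x <- c("a", "a", "a", "b", "b", "c", "d", NA)
lumped <- lump_levels(x, k = 3)
table(lumped)
```

With `k = 3`, the two most frequent levels ("a" and "b") are retained and the remaining ones fall into "(others)", so the encoded matrix would have three columns instead of four.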

Value

factor.encoder() returns an object of class "encoder". This is a list containing the following components:

`frame`	a "factor.frame" object containing the encoding information (levels).
`encode`	a function to convert a vector `x` into a one-hot encoded matrix.
`n`	the number of encoding levels (i.e., columns in the design matrix).
`type`	a character string describing the encoding type: "factor" or "null".

factor.frame() returns a "factor.frame" object containing the encoding information.

Examples

# Create an encoder for a qualitative variable
data(iris, package = "datasets")
enc <- factor.encoder(x = iris$Species, use.catchall = FALSE, tag = "Species")
enc

# Encode a vector containing an unseen level ("ensata") and NA
enc$encode(x = c("setosa", "virginica", "ensata", NA, "versicolor"))

# Create an encoder with a pre-defined encoding frame
frm <- factor.frame(c("setosa", "virginica"), catchall = "other iris")
enc <- factor.encoder(x = iris$Species, frame = frm)
enc
enc$encode(c("setosa", "virginica", "ensata", NA, "versicolor"))

# Create an encoder with a character vector specifying the levels
enc <- factor.encoder(x = iris$Species, frame = c("setosa", "versicolor"))
enc$encode(c("setosa", "virginica", "ensata", NA, "versicolor"))

midr documentation built on Sept. 11, 2025, 1:07 a.m.