knitr::opts_chunk$set(
  collapse = TRUE,
  comment = " # ",
  fig.path = "tools/README-"
)

CVRTSEncoder is a categorical variable encoding for supervised learning.

This package is still in research and development. Functionality and interfaces may change.

Re-encode a set of categorical variables jointly as a spectral projection of the trajectory of modeling residuals. This is intended as a succinct numeric linear representation of a set of categorical variables in a manner that is useful for supervised learning.

The concept is a y-aware encoding of the trajectory of non-linear model residuals in terms of the target categorical variables.
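As a rough illustration of the idea, the base R sketch below builds a small "trajectory" of residuals, averages it per level of the joint categorical variables, and projects the result onto its leading singular vectors. The choice of three logistic models, the per-level averaging, and all variable names are assumptions made for illustration; this is not the package's implementation, and it makes no attempt to guard against over-fitting.

```r
# Illustrative sketch only: NOT the CVRTSEncoder implementation.
d <- iris
d$y <- as.numeric(d$Species == "versicolor")

# Joint categorical key built from the rounded width columns.
key <- paste(round(d$Sepal.Width), round(d$Petal.Width), sep = "|")

# A hypothetical "trajectory": models of increasing flexibility.
formulas <- list(
  y ~ 1,
  y ~ Sepal.Length,
  y ~ Sepal.Length + Petal.Length
)

# One column of residuals per model in the trajectory.
resid_traj <- sapply(formulas, function(f) {
  m <- glm(f, data = d, family = binomial)
  d$y - predict(m, newdata = d, type = "response")
})

# Mean residual per joint level, per trajectory step.
level_sums <- rowsum(resid_traj, key)
level_n <- table(key)[rownames(level_sums)]
level_means <- level_sums / as.vector(level_n)

# Spectral projection: keep the two leading right singular vectors.
sv <- svd(level_means)
level_codes <- level_means %*% sv$v[, 1:2, drop = FALSE]

# Map the per-level codes back onto the rows of the data.
enc <- level_codes[key, , drop = FALSE]
colnames(enc) <- c("c1", "c2")
head(enc)
```

Each row of `enc` is a two-number code for that row's combination of categorical levels, which can then be offered to a linear model alongside the numeric variables.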

The idea is an extension of the vtreat coding concepts, the re-encoding concepts of JavaLogistic, and the y-aware scaling concepts of Nina Zumel and John Mount.

The core idea is: other models factor the quantity to be explained into an explainable portion and a residual portion (with respect to the given model). Each of these components is possibly useful for modeling.
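A minimal base R sketch of this factoring, assuming a plain logistic regression as the "given model" (an illustrative choice, not necessarily what the package uses internally):

```r
# Illustrative sketch: split an outcome into explained and residual parts.
d <- iris
d$y <- as.numeric(d$Species == "versicolor")

m <- glm(y ~ Sepal.Length + Petal.Length, data = d, family = binomial)

explained <- as.numeric(predict(m, newdata = d, type = "response"))
residual <- d$y - explained

# The two parts add back up to the outcome.
stopifnot(isTRUE(all.equal(d$y, explained + residual)))

# Residuals summarized by a categorical variable the model did not see;
# per-level means away from zero hint at signal the model left behind.
tapply(residual, round(d$Petal.Width), mean)
```

If the per-level residual means differ noticeably from zero, the categorical variable carries information the first model did not capture, which is what a residual-based encoding tries to expose.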

library("CVRTSEncoder")
library("wrapr")

data <- iris
avars <- c("Sepal.Length", "Petal.Length")
evars <- c("Sepal.Width", "Petal.Width")
dep_var <- "Species"
dep_target <- "versicolor"
for(vi in evars) {
  data[[vi]] <- as.character(round(data[[vi]]))
}
str(data)

cross_enc <- estimate_residual_encoding_c(
  data = data,
  avars = avars,
  evars = evars,
  dep_var = dep_var,
  dep_target = dep_target,
  n_comp = 4
)
enc <- prepare(cross_enc$coder, data)
data <- cbind(data, enc)
data %.>%
  head(.) %.>% 
  knitr::kable(.)

f0 <- wrapr::mk_formula(dep_var, avars, outcome_target = dep_target)
print(f0)

model0 <- glm(f0, data = data, family = binomial)
summary(model0)

data$pred0 <- predict(model0, newdata = data, type = "response")
table(data$Species, data$pred0>0.5)

newvars <- c(avars, colnames(enc))
f <- wrapr::mk_formula(dep_var, newvars, outcome_target = dep_target)
print(f)

model <- glmnet::cv.glmnet(as.matrix(data[, newvars, drop = FALSE]), 
                           as.numeric(data[[dep_var]]==dep_target), 
                           family = "binomial")
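# cv.glmnet methods select the penalty with `s`, not `lambda`
coef(model, s = "lambda.min")
# type = "response" returns probabilities, so the 0.5 threshold matches pred0
data$pred <- as.numeric(predict(model, newx = as.matrix(data[, newvars, drop = FALSE]), s = "lambda.min", type = "response"))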
table(data$Species, data$pred>0.5)


WinVector/CVRTSEncoder documentation built on June 7, 2019, 9:53 a.m.