Using vtreat with Regression Problems

Nina Zumel and John Mount updated February 2020

Note: this is a description of the R version of vtreat; the same example for the Python version of vtreat can be found here.

Preliminaries

Load modules/packages.

library(vtreat)
packageVersion('vtreat')
suppressPackageStartupMessages(library(ggplot2))
library(WVPlots)
library(rqdatatable)

Generate example data.

set.seed(2020)

make_data <- function(nrows) {
    d <- data.frame(x = 5*rnorm(nrows))
    d['y'] = sin(d[['x']]) + 0.01*d[['x']] + 0.1*rnorm(n = nrows)
    d[4:10, 'x'] = NA                  # introduce NAs
    d['xc'] = paste0('level_', 5*round(d$y/5, 1))
    d['x2'] = rnorm(n = nrows)
    d[d['xc']=='level_-1', 'xc'] = NA  # introduce a NA level
    return(d)
}

d = make_data(500)

d %.>%
  head(.) %.>%
  knitr::kable(.)

Some quick data exploration

Check how many levels xc has, and their distribution (including NA).

unique(d['xc'])
table(d$xc, useNA = 'always')

Find the mean value of y

mean(d[['y']])

Plot of y versus x.

ggplot(d, aes(x=x, y=as.numeric(y))) + 
  geom_line()

Build a transform appropriate for regression problems.

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or NaNs.

First create the data treatment transform object, in this case a treatment for a regression problem.

unpack[
  transform = treatments,
  d_prepared = crossFrame
  ] <- vtreat::mkCrossFrameNExperiment(
    dframe = d,                              # data to learn transform from
    varlist = setdiff(colnames(d), c('y')),  # columns to transform
    outcomename = 'y'                        # outcome variable
  )

Let's look at the top of d_prepared. Notice that the new treated data frame includes only new derived variables (along with y). The derived variables will be discussed below.

d_prepared %.>%
  head(.) %.>%
  knitr::kable(.)

Note that for the training data d, crossFrame is not the same as prepare(transform, d); the second call can lead to nested model bias in some situations, and is not recommended. For other, later data not seen during transform design, prepare(transform, o) is the appropriate step.

vtreat versions 1.5.1 and newer issue a warning if you use the incorrect transform pattern on your original training data:

d_prepared_wrong <- prepare(transform, d)
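The nested model bias mentioned above can be illustrated with a small base R simulation. This is a simplified sketch, not vtreat's implementation: the toy data, the three-fold split, and the hand-rolled impact codes are all assumptions made for illustration. The outcome is pure noise, so any apparent fit comes only from scoring rows with statistics that were computed from those same rows.

```r
set.seed(2020)
n <- 300

# Pure noise: xc has many levels and no true relationship to y.
toy <- data.frame(
  xc = sample(paste0('level_', 1:100), n, replace = TRUE),
  y = rnorm(n),
  stringsAsFactors = FALSE)

# Naive impact code: each row is scored with a level mean that includes its own y.
naive_code <- ave(toy$y, toy$xc, FUN = mean) - mean(toy$y)

# Cross-validated impact code: each row is scored using only rows from other folds.
fold <- sample(rep(1:3, length.out = n))
cross_code <- numeric(n)
for (k in 1:3) {
  train <- toy[fold != k, ]
  level_means <- tapply(train$y, train$xc, mean) - mean(train$y)
  idx <- match(toy$xc[fold == k], names(level_means))
  codes <- as.numeric(level_means)[idx]
  codes[is.na(codes)] <- 0  # levels unseen in the other folds get a neutral code
  cross_code[fold == k] <- codes
}

naive_rsq <- cor(naive_code, toy$y)^2
cross_rsq <- cor(cross_code, toy$y)^2
print(c(naive = naive_rsq, cross = cross_rsq))
```

The naive code appears to explain a substantial share of a pure-noise outcome, while the out-of-fold code does not. That gap is the reason to fit downstream models on the cross-frame rather than on prepare(transform, d).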

The prepared data should be clean: completely numeric, with no missing values.
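A check along these lines can confirm that claim. The helper name is_clean_frame and the toy data frames are hypothetical, introduced only for this sketch; in practice you would pass d_prepared.

```r
# Hypothetical helper (not part of vtreat): TRUE when every column is
# numeric and no cell is NA or NaN.
is_clean_frame <- function(df) {
  all_numeric <- all(vapply(df, is.numeric, logical(1)))
  no_missing <- !any(vapply(df, anyNA, logical(1)))
  all_numeric && no_missing
}

# Toy stand-ins for a prepared and an unprepared data frame.
toy_clean <- data.frame(x = c(1, 2, 3), x_isBAD = c(0, 1, 0))
toy_dirty <- data.frame(x = c(1, NA, 3), xc = c('a', 'b', NA),
                        stringsAsFactors = FALSE)

clean_ok <- is_clean_frame(toy_clean)
dirty_ok <- is_clean_frame(toy_dirty)
print(c(clean = clean_ok, dirty = dirty_ok))
```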

The Score Frame

Now examine the score frame, which gives information about each new variable, including its type, which original variable it is derived from, its (cross-validated) significance as a one-variable linear model for the outcome, and the (cross-validated) R-squared of its corresponding linear model.

score_frame <- transform$scoreFrame

# look at a subset of the columns
cols = c("varName", "origName", "code", "rsq", "sig", "varMoves", "default_threshold", "recommended")
knitr::kable(score_frame[, cols])

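Notice that the variable xc has been converted to multiple variables: indicator variables (type lev), one for each common level, plus one variable each of types catN (xc_catN), catP, and catD.

The variable x has been converted to two variables: a clean numeric version of x (type clean) and an indicator of where x was missing (type isBAD).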

Any or all of these new variables are available for downstream modeling.

The recommended column indicates which variables are non-constant (varMoves == TRUE) and have a significance value smaller than default_threshold. See the section Deriving the Default Thresholds below for the reasoning behind the default thresholds. Recommended columns are intended as advice about which variables appear most likely to be useful in a downstream model. This advice attempts to be conservative, to reduce the possibility of mistakenly eliminating variables that may in fact be useful (although it can still mistakenly eliminate variables that have a real but non-linear relationship to the output).
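The rule described above can be written out directly. The sketch below uses a toy score frame with made-up values (not output from vtreat) and assumes the rule is exactly "the variable varies, and its significance is below the threshold", using the varMoves, sig, and default_threshold column names from the score frame shown earlier.

```r
# Toy score frame with hypothetical values, for illustration only.
toy_scores <- data.frame(
  varName = c('v_signal', 'v_noise', 'v_constant'),
  varMoves = c(TRUE, TRUE, FALSE),
  sig = c(1e-6, 0.4, 1e-6),
  default_threshold = c(0.1, 0.1, 0.1),
  stringsAsFactors = FALSE)

# Assumed rule: non-constant and significance below the threshold.
toy_scores$recommended <- toy_scores$varMoves &
  (toy_scores$sig < toy_scores$default_threshold)

print(toy_scores)
```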

Let's look at the recommended and not recommended variables:

# recommended variables
score_frame[score_frame[['recommended']], 'varName', drop = FALSE]  %.>%
  knitr::kable(.)
# not recommended variables
score_frame[!score_frame[['recommended']], 'varName', drop = FALSE] %.>%
  knitr::kable(.)

A Closer Look at the catN variables

Variables of type catN are the outputs of a one-variable linear regression of a categorical variable (in our example, xc) against the centered output on the (cross-validated) treated training data.
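The basic idea behind such a code can be sketched by hand in base R: for each level, take the mean of the outcome within that level and subtract the overall mean. The toy data and the column name xc_impact are hypothetical, and vtreat's actual estimate may differ in detail (for example, through the cross-validation described in this article); this sketch only shows the underlying quantity.

```r
toy <- data.frame(
  xc = c('a', 'a', 'b', 'b', 'b', 'c'),
  y = c(1, 3, 10, 12, 14, 5),
  stringsAsFactors = FALSE)

grand_mean <- mean(toy$y)

# Per-level mean of y, centered by the grand mean: a plain impact code
# with no cross-validation or smoothing.
impact <- tapply(toy$y, toy$xc, mean) - grand_mean

# Map each row's level to its code.
toy$xc_impact <- as.numeric(impact)[match(toy$xc, names(impact))]
print(toy)
```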

Let's look at the relationship between xc_catN and y (actually y_centered, a centered version of y).

d_prepared['y_centered'] <- d_prepared$y - mean(d_prepared$y)

WVPlots::ScatterHist(
  d_prepared, 
  xvar = 'xc_catN',
  yvar = 'y_centered',
  smoothmethod = 'identity',
  estimate_sig = TRUE,
  title = 'Relationship between xc_catN and y')

This indicates that xc_catN is strongly predictive of the outcome. Note that the score frame also reported the R-squared between xc_catN and y_centered, which is fairly large.

score_frame[score_frame$varName == 'xc_catN', ]$rsq

Note also that the impact code values are jittered; this is because d_prepared is a "cross-frame": that is, the result of a cross-validated estimation process. Hence, the impact coding of xc is a function of both the value of xc and the cross-validation fold of the datum's row. When transform is applied to new data, there will be only one value of impact code for each (common) level of xc. We can see this by applying the transform to the data frame d as if it were new data.

# the scores for the rows in the cross_frame where xc == 'level_1'
# jittered
summary(d_prepared$xc_catN[(!is.na(d$xc)) & (d$xc == 'level_1')])

dtmp = prepare(transform, d)
dtmp['y_centered'] = dtmp$y - mean(dtmp$y)

# the scores for the rows of "new" prepared data where xc == 'level_1'
# constant
summary(dtmp$xc_catN[(!is.na(d$xc)) & (d$xc == 'level_1')])

Variables of type catN are useful when dealing with categorical variables with a very large number of possible levels. For example, a categorical variable with 10,000 possible values potentially converts to 10,000 indicator variables, which may be unwieldy for some modeling methods. Using a single numerical variable of type catN may be a preferable alternative.

Using the Prepared Data in a Model

Of course, what we really want to do with the prepared training data is to fit a model jointly with all the (recommended) variables. Let's try fitting a linear regression model to d_prepared.

model_vars <- score_frame$varName[score_frame$recommended]
# to use all the variables:
# model_vars <- score_frame$varName

f <- wrapr::mk_formula('y', model_vars)

model = lm(f, data = d_prepared)

# now predict
d_prepared['prediction'] = predict(
  model,
  newdata = d_prepared)

# look at the fit (on the training data)
WVPlots::ScatterHist(
  d_prepared, 
  xvar = 'prediction',
  yvar = 'y',
  smoothmethod = 'identity',
  estimate_sig = TRUE,
  title = 'Relationship between prediction and y')

Now apply the model to new data.

# create the new data
dtest <- make_data(450)

# prepare the new data with vtreat
dtest_prepared = prepare(transform, dtest)
# dtest %.>% transform is an alias for prepare(transform, dtest)

# apply the model to the prepared data
dtest_prepared['prediction'] = predict(
  model,
  newdata = dtest_prepared)

# compare the predictions to the outcome (on the test data)
WVPlots::ScatterHist(
  dtest_prepared, 
  xvar = 'prediction',
  yvar = 'y',
  smoothmethod = 'identity',
  estimate_sig = TRUE,
  title = 'Relationship between prediction and y')

# get r-squared
sigr::wrapFTest(dtest_prepared, 
                predictionColumnName = 'prediction',
                yColumnName = 'y',
                nParameters = length(model_vars) + 1)
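As a cross-check on a reported R-squared, the usual definition (one minus the ratio of residual to total sum of squares) can be computed by hand. The helper name r_squared and the toy vectors are hypothetical; in practice you would pass dtest_prepared$prediction and dtest_prepared$y.

```r
# Hypothetical helper: R-squared as 1 - (residual sum of squares / total sum of squares).
r_squared <- function(prediction, y) {
  rss <- sum((y - prediction)^2)
  tss <- sum((y - mean(y))^2)
  1 - rss / tss
}

toy_y <- c(1, 2, 3, 4, 5)
toy_pred <- c(1.1, 1.9, 3.2, 3.8, 5.0)

toy_rsq <- r_squared(toy_pred, toy_y)
print(toy_rsq)
```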

Parameters for mkCrossFrameNExperiment

We've tried to set the defaults for all parameters so that vtreat is usable out of the box for most applications.

suppressPackageStartupMessages(library(printr))
args("mkCrossFrameNExperiment")

Some parameters of note include:

codeRestriction: The types of synthetic variables that vtreat will (potentially) produce. By default, all possible applicable types will be produced. See Types of prepared variables below.

minFraction (default: 0.02): For categorical variables, indicator variables (type lev) are only produced for levels that are present at least minFraction of the time. A consequence of this is that 1/minFraction is the maximum number of indicators that will be produced for a given categorical variable. To make sure that all possible indicator variables are produced, set minFraction = 0.

splitFunction: The cross validation method used by vtreat. Most people won't have to change this.

ncross (default: 3): The number of folds to use for cross-validation.

missingness_imputation: The function or value that vtreat uses to impute or "fill in" missing numerical values. The default is mean. To change the imputation function or use different functions/values for different columns, see the Imputation example.

customCoders: For passing in user-defined transforms for custom data preparation. Won't be needed in most situations, but see here for an example of applying a GAM transform to input variables.
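To make the effect of minFraction concrete, the following base R sketch picks out the levels of a toy categorical column that occur at least a given fraction of the time. The data and the variable names are hypothetical, and the comparison is only an approximation of the rule as stated in the parameter description, not a copy of vtreat's internal logic.

```r
# Toy categorical column with 100 rows: two common levels and two rare ones.
toy_xc <- c(rep('a', 60), rep('b', 36), rep('c', 3), 'd')
min_fraction <- 0.02  # mirrors the documented default of minFraction

# Fraction of rows taken by each level.
level_freq <- prop.table(table(toy_xc))

# Levels frequent enough to receive their own indicator variable.
common_levels <- names(level_freq)[level_freq >= min_fraction]
print(common_levels)

# No more than 1/min_fraction levels can each reach that frequency.
max_indicators <- 1 / min_fraction
print(max_indicators)
```

Level d appears in only 1% of the rows, so under this rule it would not get an indicator of its own.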

Types of prepared variables

clean: Produced from numerical variables: a clean numerical variable with no NAs or missing values

lev: Produced from categorical variables, one for each (common) level: for each level of the variable, indicates if that level was "on"

catP: Produced from categorical variables: indicates how often each level of the variable was "on" (its prevalence)

catN: Produced from categorical variables: score from a one-dimensional model of the centered output as a function of the explanatory variable

catD: Produced from categorical variables: deviation of outcome as a function of the explanatory variable

isBAD: Produced for numerical variables: an indicator variable that marks when the original variable was missing or NaN.

More on the coding types can be found here.
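Several of these types can be approximated by hand, which may help in reading the prepared columns. The sketch below is an illustration in base R under simple assumptions (mean imputation, raw frequencies); the toy data and variable names are hypothetical and the code is not vtreat's implementation.

```r
toy <- data.frame(
  x = c(1, NA, 3, 4),
  xc = c('a', 'a', 'b', 'a'),
  stringsAsFactors = FALSE)

# isBAD-style indicator: 1 where the original numeric value was missing.
x_isBAD <- as.numeric(is.na(toy$x))

# clean-style variable: missing values replaced by the mean of the observed values.
x_clean <- ifelse(is.na(toy$x), mean(toy$x, na.rm = TRUE), toy$x)

# lev-style indicator for the level 'a'.
xc_is_a <- as.numeric(toy$xc == 'a')

# catP-style variable: how prevalent each row's level is in the data.
level_freq <- prop.table(table(toy$xc))
xc_prevalence <- as.numeric(level_freq)[match(toy$xc, names(level_freq))]

print(data.frame(x_clean, x_isBAD, xc_is_a, xc_prevalence))
```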

Example: Produce only a subset of variable types

In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types clean, isBAD, and lev, and no catN, catP, or catD variables.

unpack[
  transform_thin = treatments,
  d_prepared_thin = crossFrame
  ] <- vtreat::mkCrossFrameNExperiment(
    dframe = d,                                    # data to learn transform from
    varlist = setdiff(colnames(d), c('y', 'y_centered')),  # columns to transform
    outcomename = 'y',                             # outcome variable
    codeRestriction = c('lev',                     # transforms we want
                        'clean',
                        'isBAD')
  )

score_frame_thin <- transform_thin$scoreFrame

d_prepared_thin %.>%
  head(.) %.>%
  knitr::kable(.)
# no catX variables
knitr::kable(score_frame_thin[,cols])

Deriving the Default Thresholds

While machine learning algorithms are generally tolerant to a reasonable number of irrelevant or noise variables, too many irrelevant variables can lead to serious overfit; see this article for an extreme example, one we call "Bad Bayes". The default threshold is an attempt to eliminate obviously irrelevant variables early.

Imagine that you have a pure noise dataset, where none of the n inputs are related to the output. If you treat each variable as a one-variable model for the output, and look at the significance of each model, these significance values will be uniformly distributed in the range [0, 1]. You want to pick the weakest possible significance threshold that eliminates as many noise variables as possible. A moment's thought should convince you that a threshold of 1/n allows only one variable through, in expectation.

This leads to the general-case heuristic that a significance threshold of 1/n on your variables should allow only one irrelevant variable through, in expectation (along with all the relevant variables). Hence, 1/n used to be our recommended threshold, when we originally developed the R version of vtreat.
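The 1/n heuristic can be checked with a short simulation. This sketch generates pure-noise inputs, uses the p-value from cor.test as a stand-in for the significance of a one-variable linear model (an assumption; vtreat's own significance calculation may differ in detail), and counts how many variables pass a 1/n threshold. All names and sizes here are chosen for illustration.

```r
set.seed(2020)
n_vars <- 100   # number of pure-noise input variables (n in the text)
n_rows <- 200   # rows per simulated data set
n_reps <- 200   # number of simulated data sets

# Count how many noise variables have significance below 1/n_vars.
count_passing <- function() {
  y <- rnorm(n_rows)
  x <- matrix(rnorm(n_rows * n_vars), nrow = n_rows)
  sigs <- apply(x, 2, function(col) cor.test(col, y)$p.value)
  sum(sigs < 1 / n_vars)
}

passing <- replicate(n_reps, count_passing())
mean_passing <- mean(passing)
print(mean_passing)
```

Averaged over many simulated data sets, the number of noise variables that pass should be close to one, which is the expectation argument made above.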

We noticed, however, that this biases the filtering against numerical variables, since there are at most two derived variables (of types clean and isBAD) for every numerical variable in the original data. Categorical variables, on the other hand, are expanded to many derived variables: several indicators (one for every common level), plus catN, catP, and catD. So we now reweight the thresholds.

Suppose you have a (treated) data set with ntreat different types of vtreat variables (clean, lev, etc.). There are nT variables of type T. Then the default threshold for all the variables of type T is 1/(ntreat * nT). This reweighting helps to reduce the bias against any particular type of variable. The heuristic is still that the set of recommended variables will allow at most one noise variable, in expectation, into the set of candidate variables.
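The reweighted threshold is easy to compute by hand. The sketch below uses a hypothetical vector of variable types, standing in for the code column of a score frame, and applies the formula from the paragraph above. The variable names are illustrative only.

```r
# Hypothetical derived-variable types, one entry per derived variable.
toy_codes <- c('clean', 'clean', 'isBAD',
               'lev', 'lev', 'lev', 'lev',
               'catN', 'catP', 'catD')

n_treat <- length(unique(toy_codes))   # number of distinct types (ntreat)
n_per_type <- table(toy_codes)         # number of variables of each type (nT)

# Threshold for each variable: 1 / (ntreat * nT), using the count for its own type.
n_t <- as.numeric(n_per_type)[match(toy_codes, names(n_per_type))]
toy_threshold <- 1 / (n_treat * n_t)
print(data.frame(code = toy_codes, threshold = toy_threshold))

# If every variable were pure noise, the expected number passing is the
# sum of the thresholds.
expected_noise <- sum(toy_threshold)
print(expected_noise)
```

Each type contributes 1/ntreat to the expected count regardless of how many variables it has, so the total stays at one while no single type dominates the budget.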

As noted above, because vtreat estimates variable significances using linear methods by default, some variables with a non-linear relationship to the output may fail to pass the threshold. In this case, you may not wish to filter the variables to be used in the models to only recommended variables (as we did in the main example above), but instead use all the variables, or select the variables to use by your own criteria.

Conclusion

In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that vtreat transforms are essentially one liners.

The preparation commands are organized as follows:

These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.



WinVector/vtreat documentation built on Aug. 29, 2023, 4:49 a.m.