vtreat Nested Model Bias Warning

For quite a while we have been teaching estimating variable re-encodings on the exact same data they are later naively using to train a model on leads to an undesirable nested model bias. The vtreat package (both the R version and Python version) both incorporate a cross-frame method that allows one to use all the training data both to build learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent PyData LA talk).

The next version of vtreat will warn the user if they have improperly used the same data for both vtreat impact code inference and downstream modeling. So in addition to us warning you not to do this, the package now also checks and warns against this situation.

Set up the Example

This example is copied from some of our classification documentation.

Load modules/packages.

# For this example we want vtreat version 1.5.1 or newer
# remotes::install_github("WinVector/vtreat")
library(vtreat)

packageVersion("vtreat")

Generate example data.

make_data <- function(nrows) {
    d <- data.frame(x = 5*rnorm(nrows))
    d['y'] = sin(d['x']) + 0.1*rnorm(n = nrows)
    d[4:10, 'x'] = NA                  # introduce NAs
    d['xc'] = paste0('level_', 5*round(d$y/5, 1))
    d['x2'] = rnorm(n = nrows)
    d[d['xc']=='level_-1', 'xc'] = NA  # introduce a NA level
    d['yc'] = d[['y']]>0.5
    return(d)
}

training_data = make_data(500)

training_data %.>%
  head(.) %.>%
  knitr::kable(.)

Demonstrate the Warning

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or NAs.

First create the data treatment transform design object, in this case a treatment for a binomial classification problem.

We use the training data training_data to fit the transform and the return a treated training set: completely numeric, with no missing values.

unpack[
  transform = treatments,
  train_prepared = crossFrame
  ] <- vtreat::mkCrossFrameCExperiment(
    dframe = training_data,                                    # data to learn transform from
    varlist = setdiff(colnames(training_data), c('y', 'yc')),  # columns to transform
    outcomename = 'yc',                            # outcome variable
    outcometarget = TRUE,                          # outcome of interest
    verbose = FALSE
  )

train_prepared is prepared in the correct way to use the same training data for inferring the impact-coded variables, using the returned $crossFrame from mkCrossFrameCExperiment().

We prepare new test or application data as follows.

test_data <- make_data(100)

test_prepared <- prepare(transform, test_data)

The issue is: for training data we should not call prepare(), but instead use the cross-frame that is produces during transform design.

The point is we should not do the following:

train_prepared_wrong <- prepare(transform, training_data)

Notice we now get a warning that we should not have done this, and in doing so we may have a nested model bias data leak.

And that is the new nested model bias warning feature.

The Python-version of this document can be found here.



WinVector/vtreat documentation built on Aug. 29, 2023, 4:49 a.m.