Nina Zumel, John Mount October 2019
These are notes on controlling the cross-validation plan in the R version of vtreat; for notes on the Python version of vtreat, please see here.
vtreat
By default, R vtreat uses a y-stratified randomized k-way cross-validation when creating and evaluating complex synthetic variables. Here we start with a simple k-way cross-validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross-validation scheme for your modeling projects. In this document, we'll show how to replace the cross-validation scheme in vtreat.
library(wrapr)
library(rqdatatable)
## Loading required package: rquery
library(vtreat)
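Before getting to the example, it may help to see what a vtreat cross-validation plan actually looks like. This sketch calls kWayCrossValidation directly on a tiny example (this particular splitter ignores its dframe and y arguments, so NULL is acceptable) and inspects the returned structure.

```r
library(vtreat)

# kWayCrossValidation(nRows, nSplits, dframe, y) returns a cross plan:
# a list of folds, each a list holding disjoint 'train' and 'app'
# (application) row-index vectors.
plan <- kWayCrossValidation(nRows = 10, nSplits = 5, NULL, NULL)

length(plan)       # number of folds
names(plan[[1]])   # each fold carries 'train' and 'app' indices

# every row appears exactly once as an application ('app') row
sort(unlist(lapply(plan, function(fold) fold$app)))
```

Each row is scored ("applied to") exactly once, using a model fit only on that fold's training rows; this is the mechanism behind vtreat's cross frames.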
As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:
n_row <- 1000
set.seed(2019)
d <- data.frame(
  x = rnorm(n = n_row),
  y = rbinom(n = n_row, size = 1, prob = 0.05)
)
summary(d)
## x y
## Min. :-3.23608 Min. :0.000
## 1st Qu.:-0.72730 1st Qu.:0.000
## Median :-0.13212 Median :0.000
## Mean :-0.07818 Mean :0.047
## 3rd Qu.: 0.59856 3rd Qu.:0.000
## Max. : 3.54146 Max. :1.000
First, try preparing this data using vtreat.
#
# create the treatment plan
#
k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayCrossValidation,
  verbose = FALSE)
# prepare the training data
prepared_unstratified <- treatment_unstratified$crossFrame
Let’s look at the distribution of the target outcome in each of the cross-validation groups:
# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
  d[label_column] <- 0
  for(i in seq_along(cross_plan)) {
    app <- cross_plan[[i]][['app']]
    d[app, label_column] <- i
  }
  return(d)
}
# label the rows
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)
# print(head(prepared_unstratified))
# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
  project(.,
          sum %:=% sum(y),
          mean %:=% mean(y),
          size %:=% n(),
          groupby = 'group')
unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)
knitr::kable(unstratified_summary)
| group | sum | mean  | size |
| ----: | --: | ----: | ---: |
|     2 |   9 | 0.045 |  200 |
|     3 |  13 | 0.065 |  200 |
|     4 |   7 | 0.035 |  200 |
|     1 |  12 | 0.060 |  200 |
|     5 |   6 | 0.030 |  200 |
# standard deviation of target prevalence per cross-val fold
std_unstratified <- sd(unstratified_summary[['mean']])
std_unstratified
## [1] 0.01524795
The target prevalence in the cross-validation groups can vary fairly widely with respect to the "true" prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.
In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:
treatment_stratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayStratifiedY,
  verbose = FALSE)
# prepare the training data
prepared_stratified <- treatment_stratified$crossFrame

# examine the target prevalence
prepared_stratified <- label_rows(prepared_stratified, treatment_stratified$evalSets)
stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)
knitr::kable(stratified_summary)
| group | sum | mean  | size |
| ----: | --: | ----: | ---: |
|     5 |   9 | 0.045 |  200 |
|     1 |  10 | 0.050 |  200 |
|     3 |  10 | 0.050 |  200 |
|     4 |   9 | 0.045 |  200 |
|     2 |   9 | 0.045 |  200 |
# standard deviation of target prevalence
std_stratified <- sd(stratified_summary[['mean']])
std_stratified
## [1] 0.002738613
The target prevalences in the stratified cross-validation groups are much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.
std_unstratified/std_stratified
## [1] 5.567764
If you want to cross-validate under another scheme (for example, stratifying on the prevalences of an input class), you can write your own custom cross-validation scheme and pass it into vtreat in a similar fashion as above. Your cross-validation scheme must have the same signature as vtreat's kWayCrossValidation.
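As a sketch of what such a function looks like, here is a minimal custom splitter with the required signature `function(nRows, nSplits, dframe, y)`. It builds contiguous sequential folds, as one might want for time-ordered data; the name `sequentialSplit` and its fold logic are our own illustration, not part of vtreat.

```r
# A custom split function must return a list of nSplits folds, each a
# list with disjoint 'train' and 'app' row-index vectors, with every
# row appearing exactly once as an 'app' row. A 'splitmethod' attribute
# on the plan is optional but conventional.
sequentialSplit <- function(nRows, nSplits, dframe, y) {
  # assign each row to a contiguous block of roughly equal size
  fold_id <- cut(seq_len(nRows), breaks = nSplits, labels = FALSE)
  plan <- lapply(seq_len(nSplits), function(i) {
    app <- which(fold_id == i)
    list(train = setdiff(seq_len(nRows), app), app = app)
  })
  attr(plan, 'splitmethod') <- 'sequential'
  plan
}
```

This function can then be supplied as `splitFunction = sequentialSplit` to mkCrossFrameCExperiment, exactly as kWayCrossValidation and kWayStratifiedY were above.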
Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
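One way to share folds (a sketch of an assumed usage pattern, not an official vtreat API) is to compute a single plan up front with kWayStratifiedY and hand vtreat a wrapper that simply returns that fixed plan. The precomputed plan must match the row count and fold count vtreat requests, or vtreat may substitute its own plan. Here `d` and `k` are as defined earlier, and `use_fixed_plan` is our own name.

```r
library(vtreat)

# build one stratified plan up front (this splitter ignores dframe)
fixed_plan <- kWayStratifiedY(nrow(d), k, NULL, d$y)

# wrapper with the standard split-function signature that always
# returns the precomputed plan
use_fixed_plan <- function(nRows, nSplits, dframe, y) {
  fixed_plan
}

# vtreat designs its synthetic variables on these shared folds ...
treatment_shared <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = use_fixed_plan,
  verbose = FALSE)

# ... and downstream modeling can loop over the very same folds, e.g.:
# for(fold in fixed_plan) { fit on d[fold$train, ]; score d[fold$app, ] }
```

Because both stages see identical folds, no row is ever used to both design a synthetic variable and evaluate a model fit on that variable within the same fold.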
More notes on controlling vtreat cross-validation can be found here.
Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as those in vtreat); we have some notes on this here.