Nina Zumel, John Mount October 2019
These are notes on controlling the cross-validation plan in the R version of vtreat; for notes on the Python version of vtreat, please see here.
vtreat
By default, R vtreat uses a y-stratified randomized k-way cross-validation when creating and evaluating complex synthetic variables. Here we start with a simple k-way cross-validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross-validation scheme for your modeling projects. In this document, we'll show how to replace the cross-validation scheme in vtreat.
library(wrapr)
library(rqdatatable)
## Loading required package: rquery
library(vtreat)
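Before getting to the example, it may help to see what a vtreat cross-validation plan actually looks like. This sketch calls kWayCrossValidation directly on a tiny example (this particular splitter ignores its dframe and y arguments, so NULL is acceptable) and inspects the returned structure.

```r
library(vtreat)

# kWayCrossValidation(nRows, nSplits, dframe, y) returns a cross plan:
# a list of folds, each a list holding disjoint 'train' and 'app'
# (application) row-index vectors.
plan <- kWayCrossValidation(nRows = 10, nSplits = 5, NULL, NULL)

length(plan)       # number of folds
names(plan[[1]])   # each fold carries 'train' and 'app' indices

# every row appears exactly once as an application ('app') row
sort(unlist(lapply(plan, function(fold) fold$app)))
```

Each row is scored ("applied to") exactly once, using a model fit only on that fold's training rows; this is the mechanism behind vtreat's cross frames.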
As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:
n_row <- 1000
set.seed(2019)
d <- data.frame(
  x = rnorm(n = n_row),
  y = rbinom(n = n_row, size = 1, prob = 0.05)
)
summary(d)
## x y
## Min. :-3.23608 Min. :0.000
## 1st Qu.:-0.72730 1st Qu.:0.000
## Median :-0.13212 Median :0.000
## Mean :-0.07818 Mean :0.047
## 3rd Qu.: 0.59856 3rd Qu.:0.000
## Max. : 3.54146 Max. :1.000
First, try preparing this data using vtreat.
#
# create the treatment plan
#
k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayCrossValidation,
  verbose = FALSE)
# prepare the training data
prepared_unstratified <- treatment_unstratified$crossFrame
Let’s look at the distribution of the target outcome in each of the cross-validation groups:
# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
  d[label_column] <- 0
  for(i in seq_along(cross_plan)) {
    app <- cross_plan[[i]][['app']]
    d[app, label_column] <- i
  }
  return(d)
}
# label the rows
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)
# print(head(prepared_unstratified))
# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
  project(.,
          sum %:=% sum(y),
          mean %:=% mean(y),
          size %:=% n(),
          groupby = 'group')
unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)
knitr::kable(unstratified_summary)
| group | sum | mean  | size |
| ----: | --: | ----: | ---: |
|     2 |   9 | 0.045 |  200 |
|     3 |  13 | 0.065 |  200 |
|     4 |   7 | 0.035 |  200 |
|     1 |  12 | 0.060 |  200 |
|     5 |   6 | 0.030 |  200 |
# standard deviation of target prevalence per cross-val fold
std_unstratified <- sd(unstratified_summary[['mean']])
std_unstratified
## [1] 0.01524795
The target prevalence in the cross-validation groups can vary fairly widely with respect to the "true" prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.
In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:
treatment_stratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayStratifiedY,
  verbose = FALSE)
# prepare the training data
prepared_stratified <- treatment_stratified$crossFrame

# examine the target prevalence
prepared_stratified <- label_rows(prepared_stratified, treatment_stratified$evalSets)
stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)
knitr::kable(stratified_summary)
| group | sum | mean  | size |
| ----: | --: | ----: | ---: |
|     5 |   9 | 0.045 |  200 |
|     1 |  10 | 0.050 |  200 |
|     3 |  10 | 0.050 |  200 |
|     4 |   9 | 0.045 |  200 |
|     2 |   9 | 0.045 |  200 |
# standard deviation of target prevalence
std_stratified <- sd(stratified_summary[['mean']])
std_stratified
## [1] 0.002738613
The target prevalences in the stratified cross-validation groups are much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.
std_unstratified/std_stratified
## [1] 5.567764
If you want to cross-validate under another scheme (for example, stratifying on the prevalences of an input class), you can write your own custom cross-validation scheme and pass it into vtreat in a similar fashion as above. Your cross-validation scheme must have the same signature as vtreat's kWayCrossValidation.
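As a sketch of what such a function looks like, here is a minimal custom splitter with the required signature `function(nRows, nSplits, dframe, y)`. It builds contiguous sequential folds, as one might want for time-ordered data; the name `sequentialSplit` and its fold logic are our own illustration, not part of vtreat.

```r
# A custom split function must return a list of nSplits folds, each a
# list with disjoint 'train' and 'app' row-index vectors, with every
# row appearing exactly once as an 'app' row. A 'splitmethod' attribute
# on the plan is optional but conventional.
sequentialSplit <- function(nRows, nSplits, dframe, y) {
  # assign each row to a contiguous block of roughly equal size
  fold_id <- cut(seq_len(nRows), breaks = nSplits, labels = FALSE)
  plan <- lapply(seq_len(nSplits), function(i) {
    app <- which(fold_id == i)
    list(train = setdiff(seq_len(nRows), app), app = app)
  })
  attr(plan, 'splitmethod') <- 'sequential'
  plan
}
```

This function can then be supplied as `splitFunction = sequentialSplit` to mkCrossFrameCExperiment, exactly as kWayCrossValidation and kWayStratifiedY were above.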
Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
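One way to share folds (a sketch of an assumed usage pattern, not an official vtreat API) is to compute a single plan up front with kWayStratifiedY and hand vtreat a wrapper that simply returns that fixed plan. The precomputed plan must match the row count and fold count vtreat requests, or vtreat may substitute its own plan. Here `d` and `k` are as defined earlier, and `use_fixed_plan` is our own name.

```r
library(vtreat)

# build one stratified plan up front (this splitter ignores dframe)
fixed_plan <- kWayStratifiedY(nrow(d), k, NULL, d$y)

# wrapper with the standard split-function signature that always
# returns the precomputed plan
use_fixed_plan <- function(nRows, nSplits, dframe, y) {
  fixed_plan
}

# vtreat designs its synthetic variables on these shared folds ...
treatment_shared <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = use_fixed_plan,
  verbose = FALSE)

# ... and downstream modeling can loop over the very same folds, e.g.:
# for(fold in fixed_plan) { fit on d[fold$train, ]; score d[fold$app, ] }
```

Because both stages see identical folds, no row is ever used to both design a synthetic variable and evaluate a model fit on that variable within the same fold.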
More notes on controlling vtreat cross-validation can be found here.
Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as those in vtreat); we have some notes on this here.