Nina Zumel, John Mount October 2019
These are notes on controlling the cross-validation plan in the R version of vtreat; for notes on the Python version of vtreat, please see here.
By default, R vtreat uses a y-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables.
Here we start with a simple k-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in vtreat.
library(wrapr)
library(rqdatatable)
library(vtreat)
As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:
n_row <- 1000

set.seed(2019)

d <- data.frame(
  x = rnorm(n = n_row),
  y = rbinom(n = n_row, size = 1, prob = 0.05)
)

summary(d)
First, try preparing this data using vtreat.
#
# create the treatment plan
#

k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayCrossValidation,
  verbose = FALSE)

# prepare the training data
prepared_unstratified = treatment_unstratified$crossFrame
Let's look at the distribution of the target outcome in each of the cross-validation groups:
# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
  d[label_column] = 0
  for(i in seq_along(cross_plan)) {
    app = cross_plan[[i]][['app']]
    d[app, label_column] = i
  }
  return(d)
}

# label the rows
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)
# print(head(prepared_unstratified))

# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
  project(.,
          sum %:=% sum(y),
          mean %:=% mean(y),
          size %:=% n(),
          groupby='group')

unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)
knitr::kable(unstratified_summary)
# standard deviation of target prevalence per cross-val fold
std_unstratified = sd(unstratified_summary[['mean']])
std_unstratified
The target prevalence in the cross validation groups can vary fairly widely with respect to the "true" prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.
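As a rough check on how much fold-to-fold variation to expect, the spread can be approximated with the binomial standard error. This is only a back-of-the-envelope sketch: it assumes each fold behaves like an independent sample of 200 rows (1000 rows split 5 ways), which is not exactly true for folds that partition one fixed data set, so the realized value of std_unstratified will differ.

```r
# Rough binomial approximation (an assumption, not vtreat output)
p <- 0.05              # "true" target prevalence used to generate y
fold_size <- 1000 / 5  # n_row / k from the example above

# approximate standard deviation of per-fold prevalence
approx_sd <- sqrt(p * (1 - p) / fold_size)
approx_sd

# the same spread expressed relative to the prevalence itself
approx_sd / p
```

Under these assumptions the per-fold prevalence moves by roughly 0.015 around 0.05, which is about 30% in relative terms.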
In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:
treatment_stratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayStratifiedY,
  verbose = FALSE)

# prepare the training data
prepared_stratified = treatment_stratified$crossFrame

# examine the target prevalence
prepared_stratified = label_rows(prepared_stratified, treatment_stratified$evalSets)

stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)
knitr::kable(stratified_summary)
# standard deviation of target prevalence
std_stratified = sd(stratified_summary[['mean']])
std_stratified
The target prevalences in the stratified cross-validation groups are much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.
std_unstratified/std_stratified
If you want to cross-validate under another scheme (for example, stratifying on the prevalence of an input class), you can write your own custom cross-validation scheme and pass it into vtreat in the same way as above. Your cross-validation scheme must have the same signature as vtreat's kWayCrossValidation.
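As an illustration, here is a minimal sketch of a custom splitter that stratifies on an input column. It assumes the signature (nRows, nSplits, dframe, y) and a return value that is a list of splits, each with train and app index vectors (the same app entries that label_rows reads above). The function name stratify_on_input and the column name grp are hypothetical, chosen only for this example.

```r
# Hypothetical custom splitter: balance a categorical input column across folds.
# Assumed signature, mirroring kWayCrossValidation: (nRows, nSplits, dframe, y).
# The column name 'grp' is an assumption made for this sketch.
stratify_on_input <- function(nRows, nSplits, dframe, y) {
  strata <- as.character(dframe[['grp']])
  fold <- integer(nRows)
  offset <- 0L
  for (s in unique(strata)) {
    idx <- which(strata == s)
    idx <- idx[sample.int(length(idx))]  # shuffle rows within the stratum
    # deal rows to folds round-robin, continuing where the last stratum stopped
    fold[idx] <- ((seq_along(idx) - 1L + offset) %% nSplits) + 1L
    offset <- offset + length(idx)
  }
  # one list(train, app) entry per fold
  lapply(seq_len(nSplits), function(i) {
    list(train = which(fold != i), app = which(fold == i))
  })
}

# small demonstration on made-up data
set.seed(2019)
demo <- data.frame(
  grp = sample(c('a', 'b', 'c'), 100, replace = TRUE, prob = c(0.7, 0.2, 0.1))
)
plan <- stratify_on_input(nrow(demo), 5, demo, NULL)

# count of each class in the application set of each fold
sapply(plan, function(p) table(factor(demo$grp[p$app], levels = c('a', 'b', 'c'))))
```

A function like this could then be supplied as splitFunction = stratify_on_input in place of kWayStratifiedY, provided the data frame given to vtreat contains the grp column.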
Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
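The following sketch shows one way a shared plan might be reused in a later modeling step, by producing out-of-fold predictions with base R's glm. To keep it self-contained it builds a stand-in plan by hand; in the workflow above one would presumably use treatment_stratified$evalSets and model on the cross-frame columns instead. The names cross_plan and pred are hypothetical.

```r
set.seed(2019)
n_row <- 1000
d <- data.frame(
  x = rnorm(n = n_row),
  y = rbinom(n = n_row, size = 1, prob = 0.05)
)

# Stand-in plan with the same train/app layout as the evalSets used above
# (an assumption for this sketch; not produced by vtreat here)
fold <- sample(rep(seq_len(5), length.out = n_row))
cross_plan <- lapply(seq_len(5), function(i) {
  list(train = which(fold != i), app = which(fold == i))
})

# fit on each split's training rows, predict only on its application rows
d$pred <- NA_real_
for (split in cross_plan) {
  model <- glm(y ~ x, data = d[split$train, , drop = FALSE], family = binomial)
  d$pred[split$app] <- predict(
    model,
    newdata = d[split$app, , drop = FALSE],
    type = 'response')
}

summary(d$pred)
```

Because every prediction comes from a model that never saw that row, and the folds match those used for variable design, no row's outcome informs both its treated variables and its own prediction.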
More notes on controlling vtreat cross-validation can be found here.
Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as seen in vtreat); we have some notes on this here.