We will show how in some situations using "more data in cross-validation" can be harmful.
Our example: an outcome (`y`) that is independent of a high-complexity categorical variable (`x`). We will combine this with a variable that is a noisy constant and leave-one-out cross-validation (which is a deterministic procedure) to get a bad result (failing to notice over-fit).
    library("vtreat")

    set.seed(352355)
    nrow <- 100
    d <- data.frame(x = sample(paste0('lev_', seq_len(nrow)), size = nrow, replace = TRUE),
                    y = rnorm(nrow),
                    stringsAsFactors = FALSE)
Introduce a deliberately bad custom coder.
This coder is bad in several ways:
`y`, instead of a conditional difference of the dependent variable from the cross-validated mean of the dependent variable.

    # @param v character scalar: variable name
    # @param vcol character vector, independent or input variable values
    # @param y numeric, dependent or outcome variable to predict
    # @param weights row/example weights
    # @return scored training data column
    bad_coder_noisy_constant <- function(
      v, vcol,
      y,
      weights) {
      # Notice we are returning a constant, independent of vcol!
      # this should not look informative.
      meanY <- sum(y*weights)/sum(weights)
      meanY + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
    }

    # @param v character scalar: variable name
    # @param vcol character vector, independent or input variable values
    # @param y numeric, dependent or outcome variable to predict
    # @param weights row/example weights
    # @return scored training data column
    bad_coder_noisy_conditional <- function(
      v, vcol,
      y,
      weights) {
      # Note: ignores weights
      agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
      map <- agg$y
      names(map) <- agg$x
      map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
    }

    # @param v character scalar: variable name
    # @param vcol character vector, independent or input variable values
    # @param y numeric, dependent or outcome variable to predict
    # @param weights row/example weights
    # @return scored training data column
    bad_coder_noisy_delta <- function(
      v, vcol,
      y,
      weights) {
      # Note: ignores weights
      agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
      map <- agg$y - mean(y)
      names(map) <- agg$x
      map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
    }

    customCoders <- list(
      'n.bad_coder_noisy_constant' = bad_coder_noisy_constant,
      'n.bad_coder_noisy_conditional' = bad_coder_noisy_conditional,
      'n.bad_coder_noisy_delta' = bad_coder_noisy_delta)

    codeRestriciton <- c('bad_coder_noisy_constant',
                         'bad_coder_noisy_conditional',
                         'bad_coder_noisy_delta',
                         'catN')

`vtreat` works correctly on this example in the design/prepare pattern, and rejects the bad custom variable.
    treatplanN <- designTreatmentsN(d,
                                    varlist = c('x'),
                                    outcomename = 'y',
                                    customCoders = customCoders,
                                    codeRestriction = codeRestriciton,
                                    verbose = FALSE)

    knitr::kable(treatplanN$scoreFrame)
Notice `vtreat` correctly identified none of the variables as being significant.
    treatedD <- prepare(treatplanN, d)

    summary(lm(y ~ x_bad_coder_noisy_constant, data = treatedD))
However, specifying `oneWayHoldout` as the cross-validation technique introduces sampling variation that is correlated with the outcome. This causes the value in the synthetic cross-frame (used both for calculating variable significances and returned to the user for further training) to have a spurious correlation with the outcome. The completely deterministic structure of leave-one-out holdout itself represents an information leak that poisons results.
    cfeBad <- mkCrossFrameNExperiment(d,
                                      varlist = c('x'),
                                      outcomename = 'y',
                                      customCoders = customCoders,
                                      codeRestriction = codeRestriciton,
                                      splitFunction = oneWayHoldout,
                                      verbose = FALSE)

    knitr::kable(cfeBad$treatments$scoreFrame)
Notice the bad constant coder was (falsely) reported as usable and (falsely) appears useful on the cross-frame. Also notice that the normal coders, such as the impact code (fully rejected by `vtreat`) and the levels codes, were properly rejected.
What happened is:
`y` into the bad coder. Essentially the leave-one-out cross validation is consuming a number of degrees of freedom equal to the number of different data sets it presents (one per data row).
`vtreat` impact/effects coders are careful to return the difference from the cross-validation segment mean (which would be zero for all constant values).
In the failing example the value returned for data-row `k` is essentially the mean of all rows except the `k`-th row, due to the leave-one-out holdout. Call this estimate `e(k)` (the estimate assigned to the `k`-th row).
The coding-estimate for the `k`-th row is essentially `(1/(n-1)) sum(i = 1, ...,n; i not k) y(i)` (where `n` is the number of training data rows, and `y(i)` is the `i`-th dependent value). That is, the coder builds its coding of the `k`-th row by averaging all of the training dependent values it is allowed to see under the leave-one-out cross validation procedure. In an isolated sense its calculation of the `k`-th row is independent of `y(k)`, as that value was not shown to the procedure at that time.
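The leave-one-out coding described above can be sketched in base R, without `vtreat`. This is only an illustration under stated assumptions: `constant_coder` is a hypothetical stand-in for a constant-returning coder (it is not part of `vtreat`), the seed is arbitrary, and the noise term of the bad coders is left out to keep the arithmetic exact.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
n <- 100
y <- rnorm(n)

# Hypothetical stand-in for a constant coder: it ignores the input
# variable and returns the mean of whatever outcome values it is shown.
constant_coder <- function(y_train) {
  mean(y_train)
}

# Leave-one-out: to code row k, the coder sees every row except row k.
e <- vapply(
  seq_len(n),
  function(k) constant_coder(y[-k]),
  numeric(1))

# The same estimates computed in one vectorized step.
e_direct <- (sum(y) - y) / (n - 1)

print(all.equal(e, e_direct))
```

Each `e[k]` is computed without ever reading `y[k]`, yet the vectorized form makes clear that the full vector `e` is built from `y` itself.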
However, by algebra this estimate `e(k)` is also equal to `(n/(n-1)) mean(y) - y(k)/(n-1)`. So for a step in the procedure that also knows `mean(y)` (such as, say, the `lm()` linear regression models shown above, and the variable significance procedures used to build the `scoreFrame`s), we have `y(k) = sum(y) - (n-1) e(k)`. Or in vector form (`y` and `e` being the vectors, all other terms scalars): `y = sum(y) - (n-1) e`. Jointly for all rows the dependent variable `y` is a simple linear function of the estimates `e`, even though each estimate `e(k)` was constructed with no knowledge of the dependent value `y(k)` in the same row.
Or: to an observer that knows `n` and `mean(y)` (and hence `sum(y)`), `e(k)` completely determines `y(k)` even though it was constructed without knowledge of `y(k)`.
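A minimal base R sketch of this inversion follows. It assumes simulated data with an arbitrary seed, and it adds the same `1.0e-3*runif()` style noise the bad coders use; it illustrates the algebra and is not a reproduction of the `vtreat` scoring code.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
n <- 100
y <- rnorm(n)

# Leave-one-out means: e[k] is the mean of all y except y[k].
e <- (sum(y) - y) / (n - 1)

# An observer who knows n and mean(y) can invert the estimates exactly.
y_recovered <- n * mean(y) - (n - 1) * e
print(all.equal(y, y_recovered))

# With small noise added (as in the bad coders) the relation is no
# longer exact, but a linear fit still explains almost all of y.
e_noisy <- e + 1.0e-3 * runif(n)
fit <- lm(y ~ e_noisy)
r_squared <- summary(fit)$r.squared
slope <- unname(coef(fit)["e_noisy"])

print(r_squared)  # close to 1
print(slope)      # negative: larger y[k] means a smaller e[k]
```

The fitted slope is negative because holding out a large `y(k)` lowers the mean of the remaining rows.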
This failing is because:
Fully nested cross-simulation (where even the last step is under the cross-control and enumerating excluded sets of training rows) is likely too cumbersome (requiring more code coordination) and expensive (upping the size of the sets of rows we have to exclude) to force on implementers who are also unlikely to see any benefit in non-degenerate cases. The partially nested cross-simulation used in `vtreat` is likely a good practical compromise (though we may explore full-nesting for the score frame estimates, as that is a step completely under `vtreat` control).
The current `vtreat` procedures are very strong and fully "up to the job" of assisting in construction of best possible machine learning models. However, in certain degenerate cases (near-constant encoding combined with completely deterministic cross-validation; neither of which is a default behavior of `vtreat`) the cross validation system itself can introduce an information leak that promotes over-fit for some custom coders. `vtreat`'s built-in coders are estimates of conditional changes from apparent mean (not estimates of conditional values), so tend to avoid the above issues.
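The difference between coding a value and coding a change from the segment mean can be sketched for the constant case in base R. This is an illustration of the idea only, under the assumption of simulated data and an arbitrary seed; it is not `vtreat`'s internal implementation, and the names `value_code` and `delta_code` are made up for this example.

```r
set.seed(2024)  # arbitrary seed, chosen only for this sketch
n <- 100
y <- rnorm(n)

# Value-style coding of a constant: the leave-one-out mean itself.
value_code <- vapply(
  seq_len(n),
  function(k) mean(y[-k]),
  numeric(1))

# Delta-style coding of a constant: the estimate minus the mean of the
# same training segment, which is zero for every row.
delta_code <- vapply(
  seq_len(n),
  function(k) {
    y_train <- y[-k]
    mean(y_train) - mean(y_train)
  },
  numeric(1))

print(cor(y, value_code))    # -1 up to floating point error
print(all(delta_code == 0))  # TRUE
```

The value-style column carries the leave-one-out leak in full, while the delta-style column is identically zero and so has nothing to leak.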