We will show how in some situations using "more data in cross-validation" can be harmful.

Our example: an outcome (y) that is independent of a high-complexity categorical variable (x). We will combine this with a custom coder that returns a noisy constant and with leave-one-out cross-validation (a deterministic procedure) to produce a bad result: failing to notice over-fit.

library("vtreat")

set.seed(352355)

nrow <- 100
d <- data.frame(x = sample(paste0('lev_', seq_len(nrow)), size = nrow, replace = TRUE),
                y = rnorm(nrow),
                stringsAsFactors = FALSE)

Introduce deliberately bad custom coders.

These coders are bad in several ways: the first returns a constant independent of the input vcol, the other two memorize unregularized per-level means of the outcome (and ignore the supplied weights), and all three add a small amount of noise to sneak past vtreat's constant-column detector.

# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_constant <- function(
  v, vcol, 
  y, 
  weights) {
  # Notice we are returning a constant, independent of vcol!
  # This should not look informative.
  meanY <- sum(y*weights)/sum(weights)
  meanY + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}

# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_conditional <- function(
  v, vcol, 
  y, 
  weights) {
  # Note: ignores weights
  agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
  map <- agg$y
  names(map) <- agg$x
  map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}

# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_delta <- function(
  v, vcol, 
  y, 
  weights) {
  # Note: ignores weights
  agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
  map <- agg$y - mean(y)
  names(map) <- agg$x
  map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}

customCoders <- list(
  'n.bad_coder_noisy_constant' = bad_coder_noisy_constant,
  'n.bad_coder_noisy_conditional' = bad_coder_noisy_conditional,
  'n.bad_coder_noisy_delta' = bad_coder_noisy_delta)

codeRestriction <- c('bad_coder_noisy_constant', 
                     'bad_coder_noisy_conditional', 
                     'bad_coder_noisy_delta',
                     'catN')

vtreat works correctly on this example in the design/prepare pattern, and rejects the bad custom variables.

treatplanN <- designTreatmentsN(d, 
                                varlist = c('x'),
                                outcomename = 'y',
                                customCoders = customCoders, 
                                codeRestriction = codeRestriction,
                                verbose = FALSE)
knitr::kable(treatplanN$scoreFrame)

Notice vtreat correctly identified none of the variables as being significant.

treatedD <- prepare(treatplanN, d)
summary(lm(y ~ x_bad_coder_noisy_constant, data = treatedD))

However, specifying oneWayHoldout as the cross-validation technique introduces sampling variation that is correlated with the outcome. This causes the values in the synthetic cross-frame (used both for calculating variable significances and returned to the user for further training) to have a spurious correlation with the outcome. The completely deterministic structure of leave-one-out holdout is itself an information leak that poisons results.

cfeBad <- mkCrossFrameNExperiment(d, 
                                  varlist = c('x'),
                                  outcomename = 'y',
                                  customCoders = customCoders,
                                  codeRestriction = codeRestriction,
                                  splitFunction = oneWayHoldout,
                                  verbose = FALSE)
knitr::kable(cfeBad$treatments$scoreFrame)

Notice the bad constant coder was (falsely) reported as usable and (falsely) appears useful on the cross-frame. Also notice that the standard coders, such as the impact code (which vtreat fully rejected above) and level codes, were properly rejected.
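
For illustration we can mirror the earlier regression, this time on the returned cross-frame (a sketch; the column name x_bad_coder_noisy_constant is assumed to match the design/prepare naming shown above). The spurious correlation described above should show up as an apparently significant fit.

summary(lm(y ~ x_bad_coder_noisy_constant, data = cfeBad$crossFrame))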

What happened is:

In the failing example the value returned for data-row k is essentially the mean of all rows except the k-th, due to the leave-one-out holdout. Call this estimate e(k) (the estimate assigned to the k-th row).

The coding-estimate for the k-th row is essentially (1/(n-1)) sum(i = 1, ..., n; i != k) y(i) (where n is the number of training data rows, and y(i) is the i-th dependent value). That is, the coder builds its coding of the k-th row by averaging all of the training dependent values it is allowed to see under the leave-one-out cross-validation procedure. In an isolated sense its calculation for the k-th row is independent of y(k), as that value was not shown to the procedure at that time.

However, by algebra this estimate e(k) is also equal to (n/(n-1)) mean(y) - y(k)/(n-1). So any step in the procedure that also knows mean(y) (such as the lm() linear regression models shown above, or the variable significance procedures used to build the scoreFrames) can recover y(k) = sum(y) - (n-1) e(k). Or in vector form (y and e being the vectors, all other terms scalars): y = sum(y) - (n-1) e. Jointly, over all rows, the dependent variable y is a simple linear function of the estimates e, even though each individual estimate e(k) was built with no knowledge of the dependent value y(k) in the same row.

Or: to an observer that knows n and mean(y) (and hence sum(y)), e(k) completely determines y(k), even though e(k) was constructed without knowledge of y(k).
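
This relation is easy to confirm numerically. The following sketch (using the d built above) computes the leave-one-out means directly and recovers the original outcomes exactly.

# Illustrative check of the algebra above (not part of the original workflow).
y <- d$y
n <- length(y)
e <- (sum(y) - y) / (n - 1)        # leave-one-out means e(k)
recovered <- sum(y) - (n - 1) * e  # y(k) = sum(y) - (n-1) * e(k)
max(abs(recovered - y))            # essentially zero (up to floating point)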

This failing remains possible because:

Fully nested cross-simulation (where even the last step is under cross-control and excluded sets of training rows are enumerated) is likely too cumbersome (requiring more code coordination) and too expensive (increasing the size of the sets of rows that must be excluded) to force on implementers, who are also unlikely to see any benefit in non-degenerate cases. The partially nested cross-simulation used in vtreat is likely a good practical compromise (though we may explore full nesting for the score-frame estimates, as that is a step completely under vtreat's control).

The current vtreat procedures are very strong and fully "up to the job" of assisting in the construction of the best possible machine learning models. However, in certain degenerate cases (a near-constant encoding combined with completely deterministic cross-validation; neither of which is a default behavior of vtreat) the cross-validation system itself can introduce an information leak that promotes over-fit for some custom coders. vtreat's built-in coders are estimates of conditional changes from the apparent mean (not estimates of conditional values), so they tend to avoid the above issue.
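
As a practical note, simply not overriding the split function avoids the degenerate deterministic holdout. A sketch of the same experiment under vtreat's default randomized cross-validation plan (results not shown here; with a non-degenerate split the bad coders are expected to be rejected, as in the design/prepare case):

cfeDefault <- mkCrossFrameNExperiment(d, 
                                      varlist = c('x'),
                                      outcomename = 'y',
                                      customCoders = customCoders,
                                      codeRestriction = codeRestriction,
                                      verbose = FALSE)
knitr::kable(cfeDefault$treatments$scoreFrame)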


