John Mount, Win-Vector LLC 2020-01-28
We will show how in some situations using “more data in cross-validation” can be harmful.
Our example: an outcome (y) that is independent of a high-complexity categorical variable (x). We will combine this with a variable that is a noisy constant and leave-one-out cross-validation (which is a deterministic procedure) to get a bad result (failing to notice over-fit).
library("vtreat")
set.seed(352355)
nrow <- 100
d <- data.frame(x = sample(paste0('lev_', seq_len(nrow)), size = nrow, replace = TRUE),
                y = rnorm(nrow),
                stringsAsFactors = FALSE)
Introduce a deliberately bad custom coder. This coder is bad in several ways: it returns values that do not depend on the input variable at all, and it returns (a noisy version of) the mean of the dependent variable y, instead of a conditional difference of the dependent variable from the cross-validated mean of the dependent variable.
# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_constant <- function(
    v, vcol,
    y,
    weights) {
  # Notice we are returning a constant, independent of vcol!
  # this should not look informative.
  meanY <- sum(y*weights)/sum(weights)
  meanY + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}
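As a quick sanity check, we can call the coder directly on the example data; the returned column is essentially constant (within the injected 1e-3 noise) and ignores x entirely:
w <- rep(1, length(d$y))  # unit row weights
scores <- bad_coder_noisy_constant('x', d$x, d$y, w)
range(scores - mean(d$y))  # every score is within about 1e-3 of the grand mean of y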
# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_conditional <- function(
    v, vcol,
    y,
    weights) {
  # Note: ignores weights
  agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
  map <- agg$y
  names(map) <- agg$x
  map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}
# @param v character scalar: variable name
# @param vcol character vector, independent or input variable values
# @param y numeric, dependent or outcome variable to predict
# @param weights row/example weights
# @return scored training data column
bad_coder_noisy_delta <- function(
    v, vcol,
    y,
    weights) {
  # Note: ignores weights
  agg <- aggregate(y ~ x, data = data.frame(x = vcol, y = y), FUN = mean)
  map <- agg$y - mean(y)
  names(map) <- agg$x
  map[vcol] + 1.0e-3*runif(length(y)) # noise to sneak past constant detector
}
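For comparison, the conditional coder returns per-level means of y while the delta coder returns per-level differences from the grand mean of y, so the two codings should differ by roughly mean(d$y) (a small illustrative check, reusing the unit weights w from above):
cond  <- bad_coder_noisy_conditional('x', d$x, d$y, w)
delta <- bad_coder_noisy_delta('x', d$x, d$y, w)
range(cond - delta - mean(d$y))  # near zero, up to the injected 1e-3 noise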
customCoders <- list(
  'n.bad_coder_noisy_constant' = bad_coder_noisy_constant,
  'n.bad_coder_noisy_conditional' = bad_coder_noisy_conditional,
  'n.bad_coder_noisy_delta' = bad_coder_noisy_delta)
codeRestriction <- c('bad_coder_noisy_constant',
                     'bad_coder_noisy_conditional',
                     'bad_coder_noisy_delta',
                     'catN')
vtreat correctly works on this example in the design/prepare pattern, and rejects the bad custom variables.
treatplanN <- designTreatmentsN(d,
                                varlist = c('x'),
                                outcomename = 'y',
                                customCoders = customCoders,
                                codeRestriction = codeRestriction,
                                verbose = FALSE)
knitr::kable(treatplanN$scoreFrame)
| varName | varMoves | rsq | sig | needsSplit | extraModelDegrees | origName | code |
| :--- | :--- | ---: | ---: | :--- | ---: | :--- | :--- |
| x_bad_coder_noisy_constant | TRUE | 0.0020199 | 0.6570383 | TRUE | 66 | x | bad_coder_noisy_constant |
| x_bad_coder_noisy_conditional | TRUE | 0.0012397 | 0.7280086 | TRUE | 66 | x | bad_coder_noisy_conditional |
| x_bad_coder_noisy_delta | TRUE | 0.0013310 | 0.7185787 | TRUE | 66 | x | bad_coder_noisy_delta |
| x_catN | TRUE | 0.0013300 | 0.7186823 | TRUE | 66 | x | catN |
Notice vtreat correctly identified none of the variables as being significant.
treatedD <- prepare(treatplanN, d)
## Warning in prepare.treatmentplan(treatplanN, d): possibly called prepare() on
## same data frame as designTreatments*()/mkCrossFrame*Experiment(), this can lead
## to over-fit. To avoid this, please use mkCrossFrame*Experiment$crossFrame.
summary(lm(y ~ x_bad_coder_noisy_constant, data= treatedD))
##
## Call:
## lm(formula = y ~ x_bad_coder_noisy_constant, data = treatedD)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.48154 -0.68676 -0.06004 0.60649 2.85883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.05 100.17 0.759 0.450
## x_bad_coder_noisy_constant -322.30 425.85 -0.757 0.451
##
## Residual standard error: 1.044 on 98 degrees of freedom
## Multiple R-squared: 0.005811, Adjusted R-squared: -0.004334
## F-statistic: 0.5728 on 1 and 98 DF, p-value: 0.451
However, specifying oneWayHoldout as the cross-validation technique introduces sampling variation that is correlated with the outcome. This causes the values in the synthetic cross-frame (used both for calculating variable significances and returned to the user for further training) to have a spurious correlation with the outcome. The completely deterministic structure of leave-one-out holdout itself represents an information leak that poisons results.
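To make the determinism concrete, here is a small sketch of the split plan oneWayHoldout produces. This assumes vtreat's documented split-function calling convention (nRows, nSplits, dframe, y); the example in vtreat's documentation passes the latter three arguments as NULL for this splitter.
# each application set is a single row and its training set is all other rows;
# the same plan is produced every time, with no randomness involved
str(oneWayHoldout(4, NULL, NULL, NULL))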
cfeBad <- mkCrossFrameNExperiment(d,
                                  varlist = c('x'),
                                  outcomename = 'y',
                                  customCoders = customCoders,
                                  codeRestriction = codeRestriction,
                                  splitFunction = oneWayHoldout,
                                  verbose = FALSE)
knitr::kable(cfeBad$treatments$scoreFrame)
| varName | varMoves | rsq | sig | needsSplit | extraModelDegrees | origName | code |
| :--- | :--- | ---: | ---: | :--- | ---: | :--- | :--- |
| x_bad_coder_noisy_constant | TRUE | 0.9996395 | 0.0000000 | TRUE | 66 | x | bad_coder_noisy_constant |
| x_bad_coder_noisy_conditional | TRUE | 0.0008197 | 0.7773529 | TRUE | 66 | x | bad_coder_noisy_conditional |
| x_bad_coder_noisy_delta | TRUE | 0.0018823 | 0.6682155 | TRUE | 66 | x | bad_coder_noisy_delta |
| x_catN | TRUE | 0.0018844 | 0.6680390 | TRUE | 66 | x | catN |
Notice the bad constant coder was (falsely) reported as usable and (falsely) appears useful on the cross-frame. Also notice the normal coders, such as the impact code (which was fully rejected by vtreat) and the levels codes, were properly rejected.
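To see the leak end to end, one can repeat the earlier regression, this time on the synthetic cross-frame returned by the experiment; consistent with the near-1 rsq reported in the score frame above, the bad constant-coder column now looks strongly related to the outcome (a check added here for illustration, output not reproduced):
# refit the earlier model on the cross-frame instead of the prepare()d data
summary(lm(y ~ x_bad_coder_noisy_constant, data = cfeBad$crossFrame))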
What happened is:

- The deterministic leave-one-out cross-validation leaked information about y into the bad coder's output column. Essentially the leave-one-out cross-validation is consuming a number of degrees of freedom equal to the number of different data sets it presents (one per data row).
- vtreat's impact/effects coders are careful to return the difference from the cross-validation segment mean (which would be zero for all constant values).

In the failing example the value returned for data-row k is essentially the mean of all rows except the k-th row, due to the leave-one-out holdout. Call this estimate e(k) (the estimate assigned to the k-th row).
The coding-estimate for the k-th row is essentially (1/(n-1)) sum(i = 1, ..., n; i not k) y(i) (where n is the number of training data rows, and y(i) is the i-th dependent value). That is, the coder builds its coding of the k-th row by averaging all of the training dependent values it is allowed to see under the leave-one-out cross-validation procedure. In an isolated sense its calculation of the k-th row is independent of y(k), as that value was not shown to the procedure at that time.
However, by algebra this estimate e(k) is also equal to (n/(n-1)) mean(y) - y(k)/(n-1). So for any step in the procedure that also knows mean(y) (such as, say, the lm() linear regression models shown above, or the variable significance procedures used to build the scoreFrames) we have y(k) = sum(y) - (n-1) e(k). Or in vector form (y and e being the vectors, all other terms scalars): y = sum(y) - (n-1) e. Jointly, for all rows, the dependent variable y is a simple linear function of the estimates e, even though each estimate e(k) was constructed with no knowledge of the dependent value y(k) in the same row. Or: to an observer that knows n and mean(y) (and hence sum(y)), e(k) completely determines y(k), even though it was constructed without knowledge of y(k).
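A small numeric check of this algebra: build the leave-one-out estimates e directly and confirm that sum(y) - (n-1) e reproduces the outcome vector exactly.
# verify y(k) = sum(y) - (n-1) * e(k) for leave-one-out means
y <- d$y
n <- length(y)
e <- vapply(seq_len(n), function(k) mean(y[-k]), numeric(1))  # e(k): mean of all rows except row k
max(abs((sum(y) - (n - 1) * e) - y))  # zero, up to floating point error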
This failing is because the final step (fitting models and estimating variable significances on the encoded data) is not itself under cross-validation control: it sees every encoded value together with the full outcome vector at once, which is exactly the combination the algebra above exploits.
Fully nested cross-simulation (where even the last step is under the cross-control and enumerating excluded sets of training rows) is likely too cumbersome (requiring more code coordination) and expensive (upping the size of the sets of rows we have to exclude) to force on implementers, who are also unlikely to see any benefit in non-degenerate cases. The partially nested cross-simulation used in vtreat is likely a good practical compromise (though we may explore full-nesting for the score frame estimates, as that is a step completely under vtreat's control).
The current vtreat procedures are very strong and fully “up to the job” of assisting in the construction of the best possible machine learning models. However, in certain degenerate cases (a near-constant encoding combined with completely deterministic cross-validation; neither of which is a default behavior of vtreat) the cross-validation system itself can introduce an information leak that promotes over-fit for some custom coders. vtreat's built-in coders are estimates of conditional changes from the apparent mean (not estimates of conditional values), so they tend to avoid the above issues.
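As a final sketch, re-running the experiment without the splitFunction override, so that vtreat uses its default randomized cross-validation plan, is an easy way to confirm the issue is tied to the deterministic leave-one-out choice: under a randomized k-way plan the constant coder's encoded column only varies with the handful of fold means, so it cannot reproduce the near-perfect fit seen above.
# same experiment, but with vtreat's default (randomized) cross-validation plan
cfeGood <- mkCrossFrameNExperiment(d,
                                   varlist = c('x'),
                                   outcomename = 'y',
                                   customCoders = customCoders,
                                   codeRestriction = codeRestriction,
                                   verbose = FALSE)
knitr::kable(cfeGood$treatments$scoreFrame)
This randomized splitting, combined with coders that return differences from segment means, is why vtreat's defaults avoid the leak described in this note.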