rare_level_sampler | R Documentation |
Provides a sampling function to be supplied to the sampfrac argument of function pre, ensuring that each level of the specified factor(s) is present in each sample.
rare_level_sampler(factors, data, sampfrac = 0.5, warning = FALSE)
factors |
Character vector with name(s) of factors with rare levels. |
data |
Data frame containing the factor(s) named in factors. |
sampfrac |
numeric value |
warning |
logical. Whether a warning should be printed if observations with rare factor levels are added to the training sample of the current iteration. |
Categorical predictor variables (factors) with rare levels may be problematic in boosting algorithms that employ sampling (which function pre does by default). If the sample in a given boosting iteration does not contain any observations with a given (rare) level of a factor, while this level is present in the full training dataset, and the factor is selected for splitting in the tree, then no prediction can be generated for that level of the factor, resulting in an error. Note that boosting methods other than pre that also employ sampling (e.g., gbm or xgboost) may not generate an error in such cases, but also do not document how intermediate predictions are generated in such a case. It is likely that these methods use one-hot encoding of factors, which from the perspective of model interpretation introduces new problems, especially when the aim is to obtain a sparse set of rules, as in pre.
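To get a feel for how often a subsample can miss a rare level, the following base-R sketch computes that probability for a hypothetical setting. The numbers (105 observations, 5 of them with the rare level, subsamples of 52 drawn without replacement, 500 iterations) are illustrative assumptions and say nothing about the internals of pre.

```r
# Hypothetical setting: all numbers are chosen for illustration only
n_total <- 105   # number of training observations
n_rare  <- 5     # observations carrying the rare level
n_samp  <- 52    # subsample size (roughly sampfrac = 0.5), without replacement

# Probability that a single subsample contains no observation with the rare
# level: hypergeometric probability of drawing zero "rare" observations
p_miss <- dhyper(0, m = n_rare, n = n_total - n_rare, k = n_samp)

# Probability that at least one of n_iter independent subsamples misses it
n_iter <- 500
p_any_miss <- 1 - (1 - p_miss)^n_iter

print(c(p_miss = p_miss, p_any_miss = p_any_miss))
```

Even when the per-sample probability is small, it compounds over many boosting iterations, which is why the issue tends to surface when fitting a full ensemble.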
With function pre(), the rare-factor-level issue, if encountered, can be dealt with by the user in one of the following ways (in no particular order):
Use a sampling function that guarantees inclusion of rare factor levels in each sample. E.g., use rare_level_sampler, which yields a sampling function that creates training samples guaranteed to include each level of the specified factor(s). Advantage: No loss of information, easy to implement, guaranteed to solve the issue. Disadvantage: May result in oversampling of observations with rare factor levels, potentially biasing results. The bias is likely small, though; it will be larger for smaller sample sizes and sampling fractions, and for larger numbers of rare levels. A larger number of rare levels will also increase computational demands.
Specify learnrate = 0. This results in a (su)bagging approach instead of a boosting approach. Advantage: Eliminates the rare-factor-level issue completely, because intermediate predictions need not be computed. Disadvantage: Boosting with a low learning rate often yields better predictive accuracy than (su)bagging.
Data pre-processing: Before running function pre(), combine rare factor levels with other levels of the factors. Advantage: Limited loss of information. Disadvantage: Likely, but not guaranteed, to solve the issue.
Data pre-processing: Apply one-hot encoding to the predictor matrix before applying function pre(). This can easily be done with function model.matrix. Advantage: Guaranteed to solve the error, easy to implement. Disadvantage: One-hot encoding increases the number of predictor variables, which may reduce interpretability and, probably to a lesser extent, accuracy.
Data pre-processing: Remove observations with rare factor levels from the dataset before running function pre(). Advantage: Guaranteed to solve the error. Disadvantage: Removing observations results in a loss of information, and may bias the results.
Increase the value of the sampfrac argument of function pre(). Advantage: Easy to implement. Disadvantage: Larger samples are more likely, but not guaranteed, to contain all possible factor levels, so this is not guaranteed to solve the issue.
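The three data pre-processing options listed above can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the cut-off of 10 observations for calling a level "rare", the choice to merge the rare level into "virginica", and all object names are hypothetical and should be adapted to the data at hand.

```r
# Toy data constructed for illustration: "versicolor" is a rare level
dat <- rbind(iris[iris$Species != "versicolor", ],
             iris[iris$Species == "versicolor", ][1:5, ])

# Hypothetical rule: levels with fewer than 10 observations count as rare
counts <- table(dat$Species)
rare <- names(counts)[as.vector(counts) < 10]

# Option: combine rare levels with another level.
# Assigning duplicated level names merges the corresponding levels.
dat_merged <- dat
lev <- levels(dat_merged$Species)
lev[lev %in% rare] <- "virginica"   # arbitrary merge target, for illustration
levels(dat_merged$Species) <- lev

# Option: one-hot encode the factor with model.matrix().
# Dropping the intercept (- 1) yields one indicator column per level.
dummies <- model.matrix(~ Species - 1, data = dat)
dat_onehot <- cbind(dat[, names(dat) != "Species"], dummies)

# Option: remove observations with rare levels and drop the unused levels
dat_removed <- droplevels(dat[!dat$Species %in% rare, ])

table(dat_merged$Species)
colnames(dat_onehot)
table(dat_removed$Species)
```

Note that merging only helps if the merged level is itself no longer rare, which is one reason this option is likely, but not guaranteed, to solve the issue.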
A sampling function that generates sub- or bootstrap samples as usual in function pre, but checks whether all levels of the specified factor(s) are present and, if not, adds observations with the missing levels. If warning = TRUE, a warning is issued when observations are added.
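The behaviour described above can be approximated with a small closure. The sketch below is not the implementation used in package pre: sketch_sampler and its warn argument are hypothetical names, it draws subsamples without replacement only, and it adds exactly one observation per missing level. It only assumes, as in the examples on this page, that the returned function is called with arguments n and weights and returns row indices.

```r
# Hypothetical helper, NOT the implementation in package 'pre'
sketch_sampler <- function(factors, data, sampfrac = 0.5, warn = FALSE) {
  function(n, weights) {
    # Draw a subsample of row indices without replacement
    ids <- sample(seq_len(n), size = round(sampfrac * n), replace = FALSE,
                  prob = weights)
    for (f in factors) {
      x <- data[[f]]
      observed <- levels(droplevels(factor(x)))  # levels present in full data
      present  <- unique(as.character(x[ids]))   # levels present in subsample
      for (lev in setdiff(observed, present)) {
        candidates <- which(x == lev)
        # Index into candidates to avoid sample()'s length-one special case
        ids <- c(ids, candidates[sample.int(length(candidates), 1)])
        if (warn) {
          warning("Added an observation with level '", lev,
                  "' of factor '", f, "'.")
        }
      }
    }
    ids
  }
}

# Usage on toy data in which "versicolor" is rare
dat <- rbind(iris[iris$Species != "versicolor", ],
             iris[iris$Species == "versicolor", ][1:5, ])
samp <- sketch_sampler("Species", data = dat, sampfrac = 0.5)
set.seed(1)
ids <- samp(n = nrow(dat), weights = rep(1, nrow(dat)))
table(dat$Species[ids])
```

Because observations are appended after sampling, the returned sample can be slightly larger than sampfrac times n, which is the oversampling of rare levels mentioned among the disadvantages.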
pre
## Create dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))
## Set up sampling function
samp_func <- rare_level_sampler(c("Species", "factor2"), data = dat,
                                sampfrac = .51, warning = TRUE)
## Illustrate what it does
N <- nrow(dat)
wts <- rep(1, times = nrow(dat))
set.seed(3)
dat[samp_func(n = N, weights = wts), ] # single sample
for (i in 1:500) dat[samp_func(n = N, weights = wts), ]
warnings() # to illustrate warnings that may occur when fitting a full PRE
## Illustrate use with function pre:
## (Note: low ntrees value merely to reduce computation time for the example)
set.seed(42)
# iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20) # would yield error
iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20,
                sampfrac = samp_func) # should work