Provides a sampling function to be supplied to the sampfrac
argument of function
pre, making sure that each level of the specified factor(s)
is present in each sample.
rare_level_sampler(factors, data, sampfrac = 0.5, warning = FALSE)
factors: Character vector with name(s) of factors with rare levels.
sampfrac: Numeric value > 0 and ≤ 1. Specifies
the fraction of randomly selected training observations used to produce each
tree. Values < 1 will result in sampling without replacement (i.e.,
subsampling), a value of 1 will result in sampling with replacement
(i.e., bootstrap sampling). Alternatively, a sampling function may be supplied,
which should take arguments n and weights.
warning: Logical. Whether a warning should be issued if observations with rare factor levels are added to the training sample of the current iteration.
Categorical predictor variables (factors) with rare levels may be problematic
in boosting algorithms employing sampling (which is employed by default in function pre).
If a sample in a given boosting iteration does not have any observations with a given
(rare) level of a factor, while this level is present in the full training dataset, and
the factor is selected for splitting in the tree, then no prediction for that level of the factor
can be generated, resulting in an error. Note that boosting methods other than
pre that also
employ sampling (e.g.,
xgboost) may not generate an error in such cases,
but also do not document how intermediate predictions are generated in such a case. It is likely that
these methods use one-hot-encoding of factors, which from a perspective of model interpretation
introduces new problems, especially when the aim is to obtain a sparse set of rules as in 'pre'.
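To get a feel for how often a rare level can be absent from a subsample, the following sketch computes that probability for a single level with the hypergeometric distribution and checks it by simulation. The numbers (105 observations, 5 of which have the rare level, sampling fraction 0.5) are assumptions chosen to resemble the example dataset further below. The code uses base R only and does not call pre.

```r
## Assumed setup: N observations, k of which carry the rare level,
## subsample of size n drawn without replacement
N <- 105
k <- 5
n <- round(0.5 * N)

## Exact probability that the subsample contains none of the k observations
p_missing <- dhyper(0, m = k, n = N - k, k = n)

## Check the exact value by simulation
set.seed(1)
has_rare <- rep(c(TRUE, FALSE), times = c(k, N - k))
sim_missing <- mean(replicate(10000, !any(has_rare[sample.int(N, n)])))

print(c(exact = p_missing, simulated = sim_missing))
```

Even a small per-sample probability p adds up over the many trees of an ensemble: with B independently drawn samples, the chance that at least one of them misses the level is 1 - (1 - p)^B.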
With function pre(), the rare-factor-level issue, if encountered, can be dealt with by the user
in one of the following ways (in no particular order):
Use a sampling function that guarantees inclusion of rare factor levels in each sample. E.g.,
rare_level_sampler, yielding a sampling function which creates training samples
guaranteed to include each level of specified factor(s). Advantage: No loss of information, easy to implement,
guaranteed to solve the issue. Disadvantage: May result in oversampling
of observations with rare factor levels, potentially biasing results. The bias is likely small though, and
will be larger for smaller sample sizes and sampling fractions, and for larger numbers of rare
levels. The latter will also increase computational demands.
Specify learnrate = 0. This results in a (su)bagging instead of boosting approach.
Advantage: Eliminates the rare-factor-level issue completely, because intermediate predictions
need not be computed. Disadvantage: Boosting with low learning rate often improves predictive accuracy.
Data pre-processing: Before running function
pre(), combine rare factor levels
with other levels of the factors. Advantage: Limited loss of information. Disadvantage: Likely, but
not guaranteed to solve the issue.
Data pre-processing: Apply one-hot encoding to the predictor matrix before applying function 'pre()'. This can easily be
done through applying function
model.matrix. Advantage: Guaranteed to solve the error,
easy to implement. Disadvantage: One-hot encoding increases the number of predictor variables,
which may reduce interpretability and, probably to a lesser extent, accuracy.
Data pre-processing: Remove observations with rare factor levels from the dataset
before running function
pre(). Advantage: Guaranteed to solve the error. Disadvantage:
Removing observations results in a loss of information, and may bias the results.
Increase the value of the
sampfrac argument of function
pre(). Advantage: Easy to
implement. Disadvantage: Larger samples are more likely but not guaranteed to contain all possible
factor levels, thus not guaranteed to solve the issue.
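Two of the pre-processing options listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the choice to merge 'versicolor' into 'virginica' is arbitrary, and the object names dat_merged, X and dat_onehot are hypothetical. Which levels to combine is a substantive decision that depends on the data at hand.

```r
## Example data: iris with only 5 'versicolor' observations
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])

## Option: combine a rare level with another level of the factor.
## Assigning an existing level name merges the two levels.
## (Merging 'versicolor' into 'virginica' is an arbitrary choice for illustration.)
dat_merged <- dat
levels(dat_merged$Species)[levels(dat_merged$Species) == "versicolor"] <- "virginica"
print(table(dat_merged$Species))

## Option: one-hot encode the factor with model.matrix.
## "- 1" removes the intercept, so Species gets one indicator column per level.
X <- model.matrix(Petal.Width ~ . - 1, data = dat)
dat_onehot <- data.frame(Petal.Width = dat$Petal.Width, X)
print(colnames(dat_onehot))
```

Either data frame could then be passed to pre() in place of the original data.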
A sampling function, which generates sub- or bootstrap samples as usual in function pre, but
checks whether all levels of the specified factor(s) are present and adds observations with those levels if not.
If warning = TRUE, a warning is issued when observations are added.
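The behavior described above can be sketched in base R. The generator below, make_level_sampler, is hypothetical and is not the implementation used by rare_level_sampler. In particular, adding exactly one randomly chosen observation per missing level is an assumption made for this illustration. The returned function takes arguments n and weights, like the sampling function in the example below.

```r
## Hypothetical generator; NOT the implementation used by rare_level_sampler
make_level_sampler <- function(factors, data, sampfrac = 0.5) {
  function(n, weights = rep(1, times = n)) {
    ## Ordinary subsample (sampfrac < 1) or bootstrap sample (sampfrac == 1)
    ids <- sample.int(n, size = round(sampfrac * n),
                      replace = sampfrac == 1, prob = weights)
    for (f in factors) {
      x <- droplevels(data[[f]])
      ## Levels present in the full data but absent from this sample
      missing_levels <- setdiff(levels(x), as.character(x[ids]))
      for (lev in missing_levels) {
        candidates <- which(x == lev)
        ## Assumption: add one randomly chosen observation with the missing level
        ids <- c(ids, candidates[sample.int(length(candidates), 1)])
      }
    }
    ids
  }
}

## Usage on a dataset with a rare level
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
samp <- make_level_sampler("Species", data = dat, sampfrac = 0.5)
set.seed(1)
ids <- samp(n = nrow(dat), weights = rep(1, times = nrow(dat)))
print(table(dat$Species[ids]))
```

Because missing levels are appended after the regular draw, a sample may be slightly larger than round(sampfrac * n), which is the oversampling mentioned as a disadvantage above.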
## Load the package that provides pre() and rare_level_sampler()
library("pre")

## Create dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))

## Set up sampling function
samp_func <- rare_level_sampler(c("Species", "factor2"), data = dat,
                                sampfrac = .51, warning = TRUE)

## Illustrate behavior of sampling function
N <- nrow(dat)
wts <- rep(1, times = nrow(dat))
set.seed(3)
dat[samp_func(n = N, weights = wts), ] # single sample
for (i in 1:500) dat[samp_func(n = N, weights = wts), ]
warnings() # to illustrate warnings that may occur when fitting a full PRE

## Illustrate use of function generator with function pre:
## (Note: low ntrees value merely to reduce computation time for the example)
set.seed(42)
# iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20) # would yield error
iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20,
                sampfrac = samp_func) # should work