Description
For a given data matrix and its corresponding vector of labels, we calculate the bootstrap cross-validation (BCV) error rate from Fu, Carroll, and Wang (2005) for a given classifier.
Usage

errorest_bcv(x, y, train, classify, num_bootstraps = 50,
             num_folds = 10, hold_out = NULL, ...)
Arguments

x: a matrix of n observations (rows) and p features (columns)

y: a vector of n class labels

train: a function that builds the classifier (see Details)

classify: a function that classifies observations using the classifier constructed by train (see Details)

num_bootstraps: the number of bootstrap replications

num_folds: the number of cross-validation folds; ignored if hold_out is specified (i.e., not NULL)

hold_out: the hold-out size for cross-validation (see Details)

...: additional arguments passed to the function specified in train
Details

To calculate the BCV error rate, we sample from the data with replacement to obtain a bootstrapped training data set. We then compute a cross-validation error rate with the given classifier (specified in train) on the bootstrapped training data set. We repeat this process num_bootstraps times to obtain a set of bootstrapped cross-validation error rates and report their average. The errorest_cv function is used to compute the cross-validation (CV) error rate estimator for each bootstrap iteration.
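As a rough sketch (not the package's actual implementation), the procedure can be outlined in a few lines of R; here cv_error is a hypothetical placeholder for a single cross-validated error computation on the supplied data:

# Illustrative outline of the BCV estimator; 'cv_error' is a hypothetical
# function computing one cross-validation error rate on the given data.
bcv_sketch <- function(x, y, cv_error, num_bootstraps = 50) {
  n <- nrow(x)
  boot_errors <- replicate(num_bootstraps, {
    # Draw a bootstrap sample of the observations (with replacement)
    idx <- sample(n, n, replace = TRUE)
    # Cross-validate the classifier on the bootstrapped training set
    cv_error(x[idx, , drop = FALSE], y[idx])
  })
  # The BCV estimate is the average of the bootstrapped CV error rates
  mean(boot_errors)
}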
Fu et al. (2005) note that the BCV method works well because it effectively bags the classification error. Furthermore, consider the leave-one-out (LOO) error rate estimator: for small sample sizes the data are sparse, so the left-out observation has a high probability of being far in distance from the remaining training data. Hence, the LOO error rate estimator has a large variance for small data sets.
Rather than partitioning the observations into folds, an alternative convention is to specify the hold-out size for each test data set; this convention is equivalent to the notion of folds. We allow the user to specify either option with the hold_out and num_folds arguments. The num_folds argument is the default option but is ignored if the hold_out argument is specified (i.e., is not NULL).
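For instance, with the 150 iris observations used in the Examples section below, 10 folds hold out 150 / 10 = 15 observations per test set, so the following two calls (assuming iris_x, iris_y, and lda_wrapper are defined as in the Examples) request equivalent partitions:

# 10 folds of 150 observations hold out 15 observations each,
# so these two calls specify the same test-set size.
errorest_bcv(x = iris_x, y = iris_y, train = MASS::lda, classify = lda_wrapper,
             num_folds = 10)
errorest_bcv(x = iris_x, y = iris_y, train = MASS::lda, classify = lda_wrapper,
             hold_out = 15)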
We expect that the first two arguments of the classifier function given in train are x and y, corresponding to the data matrix and the vector of their labels. Additional arguments can be passed to the train function. The returned object should be a classifier that will be passed to the function given in the classify argument.
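As a sketch of this convention, a trainer could wrap MASS::lda so that its first two arguments are the data matrix and the label vector (lda_train is an illustrative name, not part of the package):

# First two arguments are the data matrix and the label vector;
# any further arguments (e.g. a prior) are forwarded to lda().
lda_train <- function(x, y, ...) {
  MASS::lda(x, grouping = y, ...)
}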
We stay with the usual R convention for the classify function. We expect that this function takes two arguments: (1) an object argument, which contains the trained classifier returned from the function specified in train; and (2) a newdata argument, which contains a matrix of observations to be classified. The matrix should have rows corresponding to the individual observations and columns corresponding to the features (covariates).
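Putting the two conventions together, the illustrative lda_train above can be paired with a classify function that returns only the predicted class labels (assuming iris_x and iris_y are defined as in the Examples section):

# 'lda_classify' satisfies the (object, newdata) convention by returning
# only the predicted class labels from predict().
lda_classify <- function(object, newdata) predict(object, newdata)$class

errorest_bcv(x = iris_x, y = iris_y, train = lda_train, classify = lda_classify)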
Value

the BCV error rate estimate
References

Fu, W.J., Carroll, R.J., and Wang, S. (2005), "Estimating misclassification error with small samples via bootstrap cross-validation," Bioinformatics, vol. 21, no. 9, pp. 1979-1986.
Examples

require('MASS')
iris_x <- data.matrix(iris[, -5])
iris_y <- iris[, 5]

# Because predict() for an 'lda' classifier returns several objects in a list,
# we provide a wrapper function that returns only the class labels.
lda_wrapper <- function(object, newdata) { predict(object, newdata)$class }

set.seed(42)
errorest_bcv(x = iris_x, y = iris_y, train = MASS::lda,
             classify = lda_wrapper)
# Output: 0.02213333
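Additional arguments supplied via ... are forwarded to the training function. For instance, a uniform class prior could be passed through to lda() while also increasing the number of bootstrap replications (an illustrative call; its result will differ from the value above):

# Extra arguments are passed on to the 'train' function,
# here a uniform class prior for lda().
set.seed(42)
errorest_bcv(x = iris_x, y = iris_y, train = MASS::lda, classify = lda_wrapper,
             num_bootstraps = 100, prior = c(1, 1, 1) / 3)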