Cross-validation for model evaluation
Description
Estimate the prediction error of a model via (repeated) K-fold cross-validation. The model can be supplied as an object returned by a model fitting function, as a model fitting function itself, or as an unevaluated function call to a model fitting function.
Usage
cvFit(object, ...)
## Default S3 method:
cvFit(object, data = NULL, x = NULL,
y, cost = rmspe, K = 5, R = 1,
foldType = c("random", "consecutive", "interleaved"),
folds = NULL, names = NULL, predictArgs = list(),
costArgs = list(), envir = parent.frame(), seed = NULL,
...)
## S3 method for class 'function'
cvFit(object, formula, data = NULL,
x = NULL, y, args = list(), cost = rmspe, K = 5, R = 1,
foldType = c("random", "consecutive", "interleaved"),
folds = NULL, names = NULL, predictArgs = list(),
costArgs = list(), envir = parent.frame(), seed = NULL,
...)
## S3 method for class 'call'
cvFit(object, data = NULL, x = NULL, y,
cost = rmspe, K = 5, R = 1,
foldType = c("random", "consecutive", "interleaved"),
folds = NULL, names = NULL, predictArgs = list(),
costArgs = list(), envir = parent.frame(), seed = NULL,
...)

Arguments
object 
the fitted model for which to estimate the
prediction error, a function for fitting a model, or an
unevaluated function call for fitting a model (see

formula 
a 
data 
a data frame containing the variables
required for fitting the models. This is typically used
if the model in the function call is described by a
formula.
x 
a numeric matrix containing the predictor variables. This is typically used if the function call for fitting the models requires the predictor matrix and the response to be supplied as separate arguments. 
y 
a numeric vector or matrix containing the response. 
args 
a list of additional arguments to be passed to the model fitting function. 
cost 
a cost function measuring prediction loss.
It should expect the observed values of the response to
be passed as the first argument and the predicted values
as the second argument, and must return either a
nonnegative scalar value, or a list with the first
component containing the prediction error and the second
component containing the standard error. The default is
to use the root mean squared prediction error (see
rmspe).
K 
an integer giving the number of groups into
which the data should be split (the default is five).
Keep in mind that this should be chosen such that all
groups are of approximately equal size. Setting 
R 
an integer giving the number of replications for repeated K-fold cross-validation. This is ignored for leave-one-out cross-validation and other non-random splits of the data. 
foldType 
a character string specifying the type of
folds to be generated. Possible values are
"random" (the default), "consecutive" or "interleaved".
folds 
an object of class 
names 
an optional character vector giving names for the arguments containing the data to be used in the function call (see “Details”). 
predictArgs 
a list of additional arguments to be
passed to the predict method of the fitted models.
costArgs 
a list of additional arguments to be
passed to the prediction loss function cost.
envir 
the 
seed 
optional initial seed for the random number
generator (see 
... 
additional arguments to be passed down. 
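As a rough illustration of the interface described for cost, a user-defined loss might look like the sketch below. The names meanAbsError and meanAbsErrorSE are hypothetical and not part of the package, and the standard error formula is only one plausible choice.

```r
# Hypothetical cost function: mean absolute prediction error.
# Observed values come first, predicted values second.
meanAbsError <- function(y, yHat) {
  mean(abs(y - yHat))
}

# Hypothetical variant returning a list: prediction error first,
# standard error second (here: standard error of the mean absolute residual).
meanAbsErrorSE <- function(y, yHat) {
  absRes <- abs(y - yHat)
  list(mean(absRes), sd(absRes) / sqrt(length(absRes)))
}

y    <- c(1, 2, 3, 4)
yHat <- c(1.5, 2, 2, 4)
meanAbsError(y, yHat)
meanAbsErrorSE(y, yHat)
```

A function of this shape could then be supplied via cost = meanAbsError.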
Details
(Repeated) K-fold cross-validation is performed in the following way. The data are first split into K previously obtained blocks of approximately equal size. Each of the K data blocks is left out once to fit the model, and predictions are computed for the observations in the left-out block with the predict method of the fitted model. Thus a prediction is obtained for each observation. The response variable and the obtained predictions for all observations are then passed to the prediction loss function cost to estimate the prediction error. For repeated cross-validation, this process is replicated and the estimated prediction errors from all replications as well as their average are included in the returned object.
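The procedure can be sketched by hand in base R. This is a minimal illustration, not the package's implementation; it assumes simulated data, a linear model fitted with lm(), random folds, and the root mean squared prediction error as loss.

```r
# Minimal base-R sketch of (non-repeated) K-fold cross-validation.
set.seed(1234)
n <- 50
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n)

K <- 5
# assign each observation to one of K blocks of (nearly) equal size
fold <- sample(rep(seq_len(K), length.out = n))

yHat <- numeric(n)
for (k in seq_len(K)) {
  test <- fold == k
  # fit the model without block k
  fit <- lm(y ~ x, data = dat[!test, , drop = FALSE])
  # predict the observations in the left-out block
  yHat[test] <- predict(fit, newdata = dat[test, , drop = FALSE])
}

# pass observed and predicted values to the loss function
cvError <- sqrt(mean((dat$y - yHat)^2))
cvError
```

Repeating the loop with different random fold assignments and averaging the resulting errors would correspond to repeated cross-validation.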
Furthermore, if the response is a vector but the predict method of the fitted models returns a matrix, the prediction error is computed for each column. A typical use case for this behavior would be if the predict method returns predictions from an initial model fit and stepwise improvements thereof.
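A small sketch of the column-wise computation, using made-up values and the root mean squared prediction error written out directly (the column names initial and improved are purely illustrative):

```r
# Response is a vector, predictions are a matrix with one column per
# model variant; the loss is computed column by column.
y <- c(1, 2, 3, 4)
yHat <- cbind(initial  = c(1, 2, 3, 5),
              improved = c(1, 2, 3, 4))
rmspeByColumn <- apply(yHat, 2, function(col) sqrt(mean((y - col)^2)))
rmspeByColumn
```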
If formula or data are supplied, all variables required for fitting the models are added as one argument to the function call, which is the typical behavior of model fitting functions with a formula interface. In this case, the accepted values for names depend on the method. For the function method, a character vector of length two should be supplied, with the first element specifying the argument name for the formula and the second element specifying the argument name for the data (the default is to use c("formula", "data")). Note that names for both arguments should be supplied even if only one is actually used. For the other methods, which do not have a formula argument, a character string specifying the argument name for the data should be supplied (the default is to use "data").
If x is supplied, on the other hand, the predictor matrix and the response are added as separate arguments to the function call. In this case, names should be a character vector of length two, with the first element specifying the argument name for the predictor matrix and the second element specifying the argument name for the response (the default is to use c("x", "y")). It should be noted that the formula or data arguments take precedence over x.
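The two ways of adding data to a function call can be illustrated with base R. This is only a sketch of the general mechanism, not necessarily how cvFit builds its calls internally; lm() and lm.fit() stand in for a formula-based and a matrix-based fitting function, and dataName and xyNames play the role of names.

```r
set.seed(1)
dat <- data.frame(x = rnorm(20))
dat$y <- 3 * dat$x + rnorm(20)

# formula interface: the data are added as a single argument
fitCall <- call("lm", formula = y ~ x)
dataName <- "data"
fitCall[[dataName]] <- dat
fit1 <- eval(fitCall)

# separate predictor matrix and response: two arguments are added
xyCall <- call("lm.fit")
xyNames <- c("x", "y")
xyCall[[xyNames[1]]] <- cbind(1, dat$x)  # intercept column plus predictor
xyCall[[xyNames[2]]] <- dat$y
fit2 <- eval(xyCall)

# both fits estimate the same coefficients
all.equal(unname(coef(fit1)), unname(fit2$coefficients))
```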
Value
An object of class "cv" with the following components:
n 
an integer giving the number of observations. 
K 
an integer giving the number of folds. 
R 
an integer giving the number of replications. 
cv 
a numeric vector containing the respective estimated prediction errors. For repeated cross-validation, those are average values over all replications. 
se 
a numeric vector containing the respective estimated standard errors of the prediction loss. 
reps 
a numeric matrix in which each column contains the respective estimated prediction errors from all replications. This is only returned for repeated cross-validation. 
seed 
the seed of the random number generator before cross-validation was performed. 
call 
the matched function call. 
Author(s)
Andreas Alfons
See Also
cvTool, cvSelect, cvTuning, cvFolds, cost
Examples
library("cvTools")
library("robustbase")
data("coleman")

## via model fit
# fit an MM regression model
fit <- lmrob(Y ~ ., data = coleman)
# perform cross-validation
cvFit(fit, data = coleman, y = coleman$Y, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)

## via model fitting function
# perform cross-validation
# note that the response is extracted from 'data' in
# this example and does not have to be supplied
cvFit(lmrob, formula = Y ~ ., data = coleman, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)

## via function call
# set up function call
call <- call("lmrob", formula = Y ~ .)
# perform cross-validation
cvFit(call, data = coleman, y = coleman$Y, cost = rtmspe,
    K = 5, R = 10, costArgs = list(trim = 0.1), seed = 1234)
