crossValidation: Run a Cross Validation Experiment
In DMwR: Functions and data for "Data Mining with R"


Performs a cross validation experiment of a learning system on a given data set. The function is completely generic: the system to evaluate is supplied as a user-defined function that takes care of the learning, the testing, and the calculation of the statistics that the user wants to estimate through cross validation.

crossValidation(sys, ds, sets, itsInfo = FALSE)

`sys`	`sys` is an object of the class `learner` representing the system to evaluate.
`ds`	`ds` is an object of the class `dataset` representing the data set to be used in the evaluation.
`sets`	`sets` is an object of the class `cvSettings` representing the cross validation experimental settings to use.
`itsInfo`	Boolean value determining whether the object returned by the function should include, as an attribute, a list with one component per iteration of the experimental process. Each component contains whatever information the user-defined function decides to return in addition to the standard error statistics. See the Details section for more information.

This function carries out a cross validation experiment of a given learning system on a given data set. The goal of the experiment is to estimate the value of a set of evaluation statistics by means of cross validation. k-fold cross validation estimates are obtained by randomly partitioning the given data set into k subsets of equal size. A learn+test process is then repeated k times. At each iteration, one of the k partitions is left aside as the test set and the model is obtained with a training set formed by the remaining k-1 partitions, so that each partition serves as the test set exactly once. In the end, the average of the k scores obtained across the iterations is the cross validation estimate.
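The k-fold procedure described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used inside crossValidation(): the helper name kfold.sketch is hypothetical, the learner is fixed to lm(), the only statistic is the mean absolute error, and the way DMwR actually builds its partitions may differ.

```r
## Hypothetical helper (not part of DMwR): manual k-fold CV using base R only
kfold.sketch <- function(form, data, k = 10, seed = 1234) {
    set.seed(seed)
    n <- nrow(data)
    ## Randomly assign each row to one of k folds of (nearly) equal size
    fold <- sample(rep(seq_len(k), length.out = n))
    target <- all.vars(form)[1]          # name of the response variable
    scores <- sapply(seq_len(k), function(i) {
        train <- data[fold != i, , drop = FALSE]   # the remaining k-1 partitions
        test  <- data[fold == i, , drop = FALSE]   # the partition left aside
        model <- lm(form, train)
        preds <- predict(model, test)
        mean(abs(test[[target]] - preds))          # MAE on the held-out fold
    })
    ## The CV estimate is the average of the k per-fold scores
    c(avg = mean(scores), std = sd(scores))
}

data(swiss)
kfold.sketch(Infant.Mortality ~ ., swiss, k = 10)
```

The returned vector mirrors the avg and std rows of the summary printed by summary() on a cvRun object, but the numbers will not match those of crossValidation() because the learner and the partitioning are different.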

It is the user's responsibility to decide which statistics are to be evaluated on each iteration and how they are calculated. This is done by creating a function that will be called by this cross validation routine at each iteration of the cross validation process. This user-defined function must assume that it will receive, as its first 3 arguments, a formula, a training set and a testing set, respectively. It should also assume that it may receive any other set of parameters, which should be passed on to the learning algorithm. The result of this user-defined function should be a named vector with the values of the statistics to be estimated, as obtained by the learner when trained with the given training set and tested on the given test set. See the Examples section below for an example of such a function.
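As a complement, here is a minimal sketch of a function that follows this contract using base R only. The name cv.lm and the choice of lm() as the learner are assumptions made for illustration; any function with the same first three arguments and a named-vector result should fit the description above.

```r
## Hypothetical user-defined function: formula, training set, testing set,
## then any extra parameters, which are passed on to the learner
cv.lm <- function(form, train, test, ...) {
    model <- lm(form, train, ...)
    preds <- predict(model, test)
    truth <- test[[all.vars(form)[1]]]   # true values of the response
    ## Named vector with the statistics to be estimated
    c(mae  = mean(abs(truth - preds)),
      rmse = sqrt(mean((truth - preds)^2)))
}

## Stand-alone check on a single manual split (no DMwR needed)
data(swiss)
idx <- seq_len(37)
cv.lm(Infant.Mortality ~ ., swiss[idx, ], swiss[-idx, ])
```

Calling the function directly on one split, as done here, is a cheap way of confirming that it returns a properly named vector before handing it to the cross validation routine.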

If the itsInfo parameter is set to TRUE, then the cvRun object that is the result of the function will have an attribute named itsInfo containing extra information from the individual iterations of the cross validation process. This information can be accessed with the function attr(), e.g. attr(returnedObject,'itsInfo'). For this information to be collected in this attribute, the user-defined function must return the vector of evaluation statistics with an associated attribute named itInfo (note that it is "itInfo" and not "itsInfo" as above), which should be a list containing whatever information the user wants to collect on each iteration. This apparently complex infrastructure allows you to pass whatever information you wish from each iteration of the experimental process. A typical example is the case where you want to check the individual predictions of the model on each test case of each iteration. You could pass this vector of predictions as a component of the list forming the attribute itInfo of the statistics returned by your user-defined function. At the end of the experimental process you will be able to inspect and use these predictions through the attribute itsInfo of the cvRun object returned by the crossValidation() function. See the Examples section on the help page of the function holdOut() for an illustration of this possibility.
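The attribute mechanism can be sketched as follows, again in base R. The function name cv.lm.info and the components preds and truth are hypothetical choices; the only name fixed by the text above is the attribute itInfo. The commented lines at the end show how the collected information would presumably be read back with DMwR installed; they mirror the calls in the Examples section and were not run here.

```r
## Hypothetical user-defined function that also returns per-iteration information
cv.lm.info <- function(form, train, test, ...) {
    model <- lm(form, train, ...)
    preds <- predict(model, test)
    truth <- test[[all.vars(form)[1]]]
    stats <- c(mae = mean(abs(truth - preds)))
    ## The attribute must be named "itInfo" (singular) and must be a list
    structure(stats, itInfo = list(preds = preds, truth = truth))
}

## Stand-alone check on a single manual split
data(swiss)
res <- cv.lm.info(Infant.Mortality ~ ., swiss[1:37, ], swiss[38:47, ])
attr(res, "itInfo")$preds    # predictions for the 10 held-out rows

## Assumed usage with DMwR (not run):
## run <- crossValidation(learner('cv.lm.info', pars=list()),
##                        dataset(Infant.Mortality ~ ., swiss),
##                        cvSettings(1,10,1234), itsInfo = TRUE)
## attr(run, 'itsInfo')      # one list component per iteration
```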

The result of the function is an object of class cvRun.

Luis Torgo ltorgo@dcc.fc.up.pt

Torgo, L. (2010) Data Mining with R: learning with case studies, CRC Press (ISBN: 9781439810187).

http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR

experimentalComparison, cvRun, cvSettings, monteCarlo, holdOut, loocv, bootstrap

## Estimating the mean absolute error and the normalized mean squared
## error of rpart on the swiss data, using one repetition of 10-fold CV
data(swiss)

## First the user defined function (note: can have any name)
cv.rpart <- function(form, train, test, ...) {
    require(rpart)
    model <- rpart(form, train, ...)
    preds <- predict(model, test)
    regr.eval(resp(form, test), preds,
              stats=c('mae','nmse'), train.y=resp(form, train))
}

## Now the evaluation
eval.res <- crossValidation(learner('cv.rpart',pars=list()),
                            dataset(Infant.Mortality ~ ., swiss),
                            cvSettings(1,10,1234))

## Check a summary of the results
summary(eval.res)

## Plot them
## Not run: 
plot(eval.res)

## End(Not run)

Loading required package: lattice
Loading required package: grid

 1 x 10 - Fold Cross Validation run with seed =  1234 
Repetition  1 
Loading required package: rpart
Fold:  1  2  3  4  5  6  7  8  9  10

== Summary of a Cross Validation Experiment ==

 1 x 10 - Fold Cross Validation run with seed =  1234 

* Data set ::  swiss
* Learner  ::  cv.rpart  with parameters 

* Summary of Experiment Results:

              mae      nmse
avg     2.2051353 1.0632209
std     0.6742875 1.0395728
min     1.3034226 0.4793407
max     3.4877820 3.9397800
invalid 0.0000000 0.0000000