Performs a hold out experiment of a learning system on a given data set. The function is completely generic: the system to evaluate is supplied as a user-defined function that takes care of the learning, the testing and the calculation of the statistics that the user wants to estimate using the hold out method.
holdOut(sys, ds, sets, itsInfo = F)
sys |
|
ds |
|
sets |
|
itsInfo |
Boolean value determining whether the object returned by the function should include, as an attribute, a list with one component per iteration of the experimental process. Each component contains whatever information the user-defined function decides to return on top of the standard error statistics. See the Details section for more information. |
This function carries out a hold out experiment of a given learning system on a given data set. The goal of the experiment is to estimate the value of a set of evaluation statistics by means of the hold out method. Hold out estimates are obtained by randomly dividing the given data set into two separate partitions, one used for obtaining the prediction model and the other for testing it. This learn+test process is repeated k times. The hold out estimate is the average of the k scores obtained on the individual repetitions.
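The procedure described above can be sketched in a few lines of base R. This is only an illustration of the idea, not the code used by holdOut(): the helper name hold_out_sketch, its arguments, the use of lm() as the learner and of the mean absolute error as the single statistic are all assumptions made for the example.

```r
# Hypothetical helper (not part of the package): repeated hold out in base R.
hold_out_sketch <- function(form, data, reps = 10, test_frac = 0.3, seed = 1234) {
  set.seed(seed)
  n <- nrow(data)
  n_test <- round(test_frac * n)
  scores <- numeric(reps)
  for (i in seq_len(reps)) {
    test_idx <- sample(n, n_test)              # random split of the rows
    train <- data[-test_idx, , drop = FALSE]
    test  <- data[test_idx, , drop = FALSE]
    model <- lm(form, data = train)            # learn on the training partition
    preds <- predict(model, newdata = test)    # predict on the test partition
    truth <- test[[all.vars(form)[1]]]         # response is the first variable
    scores[i] <- mean(abs(truth - preds))      # mean absolute error of this repetition
  }
  # the hold out estimate is the average of the per-repetition scores
  list(scores = scores, estimate = mean(scores))
}

res <- hold_out_sketch(Infant.Mortality ~ ., swiss)
res$estimate
```

Each repetition draws a fresh random split, so the spread of res$scores gives a rough idea of how much the estimate depends on the particular partition.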
It is the user's responsibility to decide which statistics are to be evaluated on each iteration and how they are calculated. This is done by writing a function that the hold out routine will call at each repetition of the learn+test process. This user-defined function must assume that its first 3 arguments are a formula, a training set and a testing set, respectively. It should also assume that it may receive any other set of parameters, which should be passed on to the learning algorithm. The result of this user-defined function should be a named vector with the values of the statistics to be estimated, obtained by the learner when trained on the given training set and tested on the given test set. See the Examples section below for an example of such functions.
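As a minimal sketch of the contract just described, the function below takes a formula, a training set and a test set, forwards any extra arguments to the learner and returns a named vector of statistics. The name user.lm, the choice of lm() as the learner and the two statistics are illustrative assumptions; only base R is used, so the function can be tried by hand on a fixed split.

```r
# Hypothetical user-defined function; name, learner and statistics are illustrative.
user.lm <- function(form, train, test, ...) {
  model <- lm(form, data = train, ...)        # extra parameters go to the learner
  preds <- predict(model, newdata = test)
  truth <- model.response(model.frame(form, data = test))
  # named vector with the statistics to be estimated
  c(mae = mean(abs(truth - preds)),
    mse = mean((truth - preds)^2))
}

# Calling it by hand on a fixed split of the swiss data
train <- swiss[1:33, ]
test  <- swiss[34:47, ]
stats <- user.lm(Infant.Mortality ~ ., train, test)
stats
```

The names of the returned vector are what identify each statistic, so they should be stable across repetitions.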
If the itsInfo parameter is set to TRUE, the hldRun object returned by the function will have an attribute named itsInfo containing extra information from the individual repetitions of the hold out process. This information can be accessed with the function attr(), e.g. attr(returnedObject,'itsInfo'). For this information to be collected, the user-defined function must return the vector of evaluation statistics with an associated attribute named itInfo (note that it is "itInfo" and not "itsInfo" as above), which should be a list containing whatever information the user wants to collect on each repetition. This apparently complex infrastructure allows you to pass whatever information you wish from each iteration of the experimental process. A typical example is the case where you want to check the individual predictions of the model on each test case of each repetition. You could pass this vector of predictions as a component of the list forming the attribute itInfo of the statistics returned by your user-defined function. At the end of the experimental process you will be able to inspect or use these predictions through the attribute itsInfo of the hldRun object returned by holdOut(). See the Examples section for an illustration of this facility.
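The attribute mechanism can be illustrated in base R without the package. In the sketch below, one_rep stands in for a user-defined function and the collecting step is an assumption about how such information could be gathered, not the package's actual code; the prediction values are placeholders.

```r
# One repetition's result: a named vector carrying an 'itInfo' attribute
one_rep <- function(i) {
  preds <- c(a = i, b = i + 1)               # placeholder predictions
  structure(c(mae = i / 10), itInfo = list(predictions = preds))
}

reps <- lapply(1:3, one_rep)

# Assumed collecting step: bind the statistics row-wise and gather the
# per-repetition extras into a list stored under 'itsInfo'
scores <- do.call(rbind, reps)
result <- structure(scores, itsInfo = lapply(reps, attr, "itInfo"))

# Retrieval works exactly as described in the text
info <- attr(result, "itsInfo")
info[[2]]$predictions
```

The per-repetition attribute (itInfo) is singular because it describes one iteration, while the collected attribute (itsInfo) holds one entry per iteration.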
The result of the function is an object of class hldRun.
Luis Torgo ltorgo@dcc.fc.up.pt
Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).
http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR
experimentalComparison, hldRun, hldSettings, monteCarlo, crossValidation, loocv, bootstrap
## Estimating the mean absolute error and the normalized mean squared
## error of rpart on the swiss data, using 10 repetitions of 70%-30%
## Hold Out experiment
library(DMwR)   # provides holdOut(), learner(), dataset(), hldSettings(), regr.eval(), resp()
data(swiss)

## First the user defined function (note: can have any name)
hld.rpart <- function(form, train, test, ...) {
    require(rpart)
    model <- rpart(form, train, ...)
    preds <- predict(model, test)
    regr.eval(resp(form, test), preds,
              stats=c('mae','nmse'), train.y=resp(form, train))
}

## Now the evaluation
eval.res <- holdOut(learner('hld.rpart',pars=list()),
                    dataset(Infant.Mortality ~ ., swiss),
                    hldSettings(10,0.3,1234))

## Check a summary of the results
summary(eval.res)

## Plot them
## Not run:
plot(eval.res)
## End(Not run)

## An illustration of the use of the itsInfo parameter.
## In this example the goal is to be able to check values predicted on
## each iteration of the experimental process (e.g. checking for extreme
## values)

## We need a different user-defined function that exports this
## information as an attribute
hld.rpart <- function(form, train, test, ...) {
    require(rpart)
    model <- rpart(form, train, ...)
    preds <- predict(model, test)
    eval.stats <- regr.eval(resp(form, test), preds,
                            stats=c('mae','nmse'),
                            train.y=resp(form,train))
    structure(eval.stats,itInfo=list(predictions=preds))
}

## Now let's run the experimental comparison
eval.res <- holdOut(learner('hld.rpart',pars=list()),
                    dataset(Infant.Mortality ~ ., swiss),
                    hldSettings(10,0.3,1234),
                    itsInfo=TRUE)

## getting the information with the predictions for all 10 repetitions
info <- attr(eval.res,'itsInfo')
## checking the predictions on the 5th repetition
info[[5]]
Loading required package: lattice
Loading required package: grid
10 x 70 %/ 30 % Holdout run with seed = 1234
Repetition 1Loading required package: rpart
Repetition 2
Repetition 3
Repetition 4
Repetition 5
Repetition 6
Repetition 7
Repetition 8
Repetition 9
Repetition 10
== Summary of a Hold Out Experiment ==
10 x 70 %/ 30 % Holdout run with seed = 1234
* Data set :: swiss
* Learner :: hld.rpart with parameters:
* Summary of Experiment Results:
              mae      nmse
avg     2.4461178 1.1084398
std     0.2211913 0.2426966
min     2.1378088 0.7752864
max     2.7664827 1.5103750
invalid 0.0000000 0.0000000
10 x 70 %/ 30 % Holdout run with seed = 1234
Repetition 1
Repetition 2
Repetition 3
Repetition 4
Repetition 5
Repetition 6
Repetition 7
Repetition 8
Repetition 9
Repetition 10
Val de Ruz    Aubonne     Boudry  Echallens    Conthey     Sarine     Lavaux
  21.75000   19.54167   21.75000   19.54167   19.54167   21.75000   19.54167
  Grandson   Lausanne      Broye    Payerne    Veveyse       Orbe   Le Locle
  21.75000   17.26667   19.54167   19.54167   19.54167   17.26667   21.75000