A general abstraction of the k-fold cross validation procedure.
proc: the procedure to be k-fold cross validated.
k: the number of folds.
data: a matrix or data.frame from which the folds will be created.
params: a list or data.frame. If
.rngSeed: the seed set before randomly generating fold indices.
.chunkSize: the number of parameter combinations to be processed at once (see help for
.doSEQ: logical flag indicating whether cross validation should be run sequentially or with
This function leverages foreach and
iter to perform k-fold cross validation in a distributed
fashion (provided a parallel backend is registered).
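As a rough picture of what cross validating proc over a grid of parameters involves, here is a sequential base-R sketch. It is not the package's implementation: cv_sketch, toy_proc, the random fold assignment, and the averaging of the per-fold values are all assumptions made for illustration.

```r
# Hypothetical sequential sketch; NOT the actual kFoldCV implementation.
cv_sketch <- function(proc, k, data, params, seed = 1) {
  set.seed(seed)
  # assign each row of data to one of k folds at random (an assumption)
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  grid <- expand.grid(params)
  # outer loop: one parameter combination per iteration
  vapply(seq_len(nrow(grid)), function(i) {
    args <- as.list(grid[i, , drop = FALSE])
    # inner loop: hold out each fold in turn
    scores <- vapply(seq_len(k), function(j) {
      held_out <- folds == j
      do.call(proc, c(list(data = data[!held_out, , drop = FALSE],
                           newdata = data[held_out, , drop = FALSE]),
                      args))
    }, numeric(1))
    mean(scores)  # averaging the fold values is an assumption
  }, numeric(1))
}

# toy procedure: squared error of predicting a shrunken training mean
toy_proc <- function(data, newdata, shrink) {
  mean((newdata$y - shrink * mean(data$y))^2)
}
toy_data <- data.frame(y = c(1, 2, 3, 4, 5, 6))
res <- cv_sketch(toy_proc, k = 3, data = toy_data,
                 params = list(shrink = c(0.5, 1)))
```

Here res holds one value per row of expand.grid(params), which matches the shape of the result described for kFoldCV.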
Because the heart of this function is a pair of nested foreach loops,
one should be careful of "over-parallelization". That is, if the routine
inside proc is already natively parallel, then invoking this
routine around proc distributes an already distributed computation.
This may not yield the speed gains you would expect.
One workaround, assuming proc is parallelized using foreach,
is to create a wrapper around proc that calls
registerDoSEQ. For example,
proC <- function(...) {registerDoSEQ(); proc(...)}
Alternatively, you could run kFoldCV sequentially by setting
.doSEQ to TRUE.
For a procedure proc <- function(data, newdata, arg1, ..., argN){...},
cross-validating a single N-tuple of arguments
c(arg1, ..., argN) may be very quick. The time it takes
to send off proc, the data, and the appropriate combinations of
params may then exceed the actual computation time. In this case,
consider changing .chunkSize from 1 to n
(where n is any reasonable integer value that would justify
passing the data to a distant node).
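To make the idea of processing several parameter combinations at once concrete, the following base-R sketch groups the rows of a parameter grid into chunks. How kFoldCV actually forms its chunks is not shown on this page; chunk_size is a hypothetical stand-in for .chunkSize and the grouping rule is an assumption.

```r
# Hypothetical illustration of chunking; not the package's own code.
params <- list(cost = c(1, 10, 100), gamma = c(0.01, 0.1))
grid <- as.matrix(expand.grid(params))  # 6 combinations, one per row

chunk_size <- 4  # stand-in for .chunkSize
# rows 1-4 get chunk id 1, rows 5-6 get chunk id 2
chunk_id <- ceiling(seq_len(nrow(grid)) / chunk_size)
chunks <- split(seq_len(nrow(grid)), chunk_id)

# each element is the block of combinations one task would carry
tasks <- lapply(chunks, function(rows) grid[rows, , drop = FALSE])
```

With a chunk size of 1 this would produce six tasks, each paying the cost of shipping the data; with a chunk size of 4 it produces two.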
a vector whose length is equal to nrow(params), if
params is a data.frame, or the number of combinations of elements of
params if it's a list. The i-th component corresponds to the k-fold
cross-validated value of proc evaluated with parameters from the i-th
combination of params.
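The expected length of the result can be checked ahead of time in base R. This sketch assumes that list combinations are enumerated with expand.grid, as the note on numeric parameters suggests; params_list and params_df are made-up inputs.

```r
# params given as a list: every combination of elements is evaluated
params_list <- list(k = c(1, 3, 5, 7), weight = c(0, 1))
combos <- expand.grid(params_list)
n_from_list <- nrow(combos)      # 4 * 2 combinations

# params given as a data.frame: one evaluation per row
params_df <- data.frame(k = c(1, 3), weight = c(0, 1))
n_from_df <- nrow(params_df)

# expand.grid varies the first element fastest, so the second
# combination pairs the second k with the first weight
second <- combos[2, ]
```

Binding the result next to expand.grid(params), as the examples do, therefore lines each value up with its parameter combination.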
The current implementation assumes that entries in params are
numeric so that as.matrix(expand.grid(params)) is a numeric matrix
with named columns. A workaround for passing character parameters is
to translate the character parameter to an integer, and write a wrapper
for proc that translates the integer back to the appropriate
string. See the example below.
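The coercion behind this note can be checked directly in base R. In this sketch the names kernels and decode_kernel are hypothetical; only as.matrix and expand.grid come from the text above.

```r
# all-numeric parameters give a numeric matrix
numeric_params <- list(kernel = 1:2, cost = c(10, 50))
m_num <- as.matrix(expand.grid(numeric_params))

# one character parameter turns the whole matrix into character
char_params <- list(kernel = c("linear", "radial"), cost = c(10, 50))
m_chr <- as.matrix(expand.grid(char_params, stringsAsFactors = FALSE))

# integer coding: keep a lookup vector and decode inside the wrapper
kernels <- c("linear", "radial")
decode_kernel <- function(code) kernels[[code]]
```

Because a matrix holds a single type, one character column forces cost to be stored as text as well, which is why the integer code plus a decoding wrapper is the safer route.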
# simple example with k-NN where we can build our own wrapper
library(class)
data(iris)
.iris <- iris[, 5:1] # put response as first column
# make a wrapper for class::knn
f <- function(data, newdata, k) {
  preds <- knn(train=data[, -1],
               test=newdata[, -1],
               cl=data[, 1],
               k=k)
  mean(preds == newdata[, 1])
}
params <- list(k=c(1, 3, 5, 7))
accuracy <- kFoldCV(f, 10, .iris, params, .rngSeed=407)
data.frame(expand.grid(params), accuracy=accuracy)
# look at a more complicated example:
# cross validate an svm with different kernels and different models
library(e1071)
g <- function(data, newdata, kernel, cost, gamma, formula) {
  # translate the integer codes back to a kernel name and a formula
  kern <- switch(kernel, "linear", "radial", stop("invalid kernel"))
  form <- switch(formula,
                 as.formula(Species ~ .),
                 as.formula(Species ~ Petal.Length + Petal.Width),
                 as.formula(Petal.Length ~ .),
                 stop('invalid formula'))
  svmWrapper <- function(data, newdata, kernel, cost, gamma, form) {
    svmObj <- svm(formula=form, data=data, kernel=kernel,
                  cost=cost, gamma=gamma)
    predict(svmObj, newdata)
  }
  preds <- svmWrapper(data, newdata, kernel=kern, cost=cost,
                      gamma=gamma, form=form)
  if (formula != 3) {
    # classification: proportion of correct predictions
    mean(preds == newdata[["Species"]])
  } else {
    # regression: mean squared error
    mean((preds - newdata[["Petal.Length"]])^2)
  }
}
params <- list(kernel=1:2, cost=c(10, 50), gamma=0.01, formula=1)
accuracy <- kFoldCV(g, 10, iris, params)
data.frame(expand.grid(params), metric=accuracy)