generic.cv: Cross-validation for a generic supervised learning model

Description
This function runs cross-validation for a given supervised learning model, which is specified by a training function, a prediction function, and a metric function. The user may need to write wrappers around these functions so that they satisfy the format requirements described below. This function works on both in-memory and in-database data.
Arguments
train
A training function. Its first argument must be the data object used for training. Any further arguments can be supplied through params.
predict
A prediction function. It must have exactly two arguments: the fitted model (the first argument) and the new data to predict on (the second argument).
metric
A metric function. It must have exactly two arguments: the first is the prediction, and the second is the data that contains the actual values. This function should measure the difference between the predicted and actual values and produce a single numeric value.
data
The data used for cross-validation, either an in-memory data.frame or a data table in the database (as produced by, for example, as.db.data.frame).
params
A list, default is NULL. Each element is a vector of candidate values for one argument of train; cross-validation is run for each resulting parameter set.
k
An integer, default is 10. The number of cross-validation folds.
approx.cut
A logical value, default is TRUE. Whether to cut the data into k pieces of approximately equal size using random integer labels instead of exactly equal pieces; see Details.
verbose
A logical value. Whether to print progress messages during the computation.
find.min
A logical value. Whether the optimum metric value is the minimum (TRUE) or the maximum (FALSE) of the computed metric values.
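The train/predict/metric contract described above can be illustrated with a minimal in-memory sketch in plain R. The helper name generic_cv_sketch and its internals are hypothetical: they only mimic the interface of this function on a data.frame, not its in-database implementation.

```r
## Hypothetical sketch of the train/predict/metric contract on an
## in-memory data.frame; only illustrates the interface.
generic_cv_sketch <- function(train, predict, metric, data, k = 10) {
  fold <- sample(rep_len(seq_len(k), nrow(data)))  # random fold label per row
  errs <- vapply(seq_len(k), function(i) {
    fit  <- train(data[fold != i, , drop = FALSE])         # train on k-1 folds
    pred <- predict(fit, data[fold == i, , drop = FALSE])  # predict held-out fold
    metric(pred, data[fold == i, , drop = FALSE])          # single numeric value
  }, numeric(1))
  list(err = mean(errs), err.std = sd(errs))
}

res <- generic_cv_sketch(
  train   = function(d) lm(mpg ~ wt, data = d),
  predict = function(fit, d) predict(fit, newdata = d),
  metric  = function(p, d) mean((d$mpg - p)^2),
  data    = mtcars, k = 5)
str(res)  # a list with err and err.std
```

As in the real function, the wrappers adapt lm and stats::predict to the required two-argument forms.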
Details

When approx.cut is FALSE, in order to cut the data table into k equal pieces, a column of unique row ids is attached to the data so that the data can be cut using different ranges of the row id. For example, for a data table with 1000 rows, the ids 1-100, 101-200, ..., cut the data into 10 pieces. The ids are randomly assigned to the rows for cross-validation to use. The original data is not touched in this process; instead, all the data is copied to a new temporary table in which the id column is created. Because a unique id must be randomly assigned to each row, this process cannot be easily parallelized.
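The exact cut can be sketched in plain R (the variable names are illustrative only):

```r
## Exact cut: each row gets a unique, randomly permuted id; consecutive
## id ranges then form folds of exactly equal size.
n <- 1000; k <- 10
id <- sample.int(n)            # unique random id for each row
fold <- ceiling(id * k / n)    # ids 1-100 -> fold 1, 101-200 -> fold 2, ...
table(fold)                    # every fold has exactly 100 rows
```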
When approx.cut is TRUE, which is the default, a column of uniform random integers, instead of consecutive unique integers, is created in the temporary table, and the same method is applied to cut the data using ranges of this column, for example 1-100, 101-200, etc. The resulting k pieces of data are therefore only approximately equal in size. For big data sets, however, the differences are relatively small and should not affect the result. Because this process does not need to generate a unique id for each row, it can be easily parallelized, which makes it much faster for big data sets.
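The approximate cut amounts to drawing an independent uniform random fold label for each row, as in this plain-R sketch:

```r
## Approximate cut: each row independently gets a uniform random label in
## 1..k, so fold sizes are only approximately n/k; no unique id has to be
## generated, so the labelling parallelizes trivially.
set.seed(1)
n <- 100000; k <- 10
fold  <- sample.int(k, n, replace = TRUE)
sizes <- tabulate(fold, nbins = k)
range(sizes)   # all close to n/k = 10000
```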
Value

If params is NULL, this function returns a list that contains two elements, err and err.std: the cross-validation error and its standard deviation.
If params is not NULL, this function returns a cv.generic object, which is a list that contains the following items:
metric
A list of the metric values computed for each parameter set; for example, the average metric values can be accessed as metric$avg (see Examples).
params
A data.frame that contains all the parameter sets.
best
The fitted model that has the optimum metric value.
best.params
A list, the set of parameters that produces the optimum metric value.
Author(s)

Author: Predictive Analytics Team at Pivotal Inc.
Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io
See Also

generic.bagging does the bootstrap aggregate computation.
Examples

## Not run:
## set up the database connection
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)
## ----------------------------------------------------------------------
dat <- as.db.data.frame(abalone, conn.id = cid, verbose = FALSE)
err <- generic.cv( function(data) {
madlib.lm(rings ~ . - id - sex, data = data)
},
predict,
function(predicted, data) {
lookat(mean((data$rings - predicted)^2))
}, data = dat, verbose = FALSE)
## ----------------------------------------------------------------------
x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100, 0.1, 2)
dat <- data.frame(x, y)
delete("eldata", conn.id = cid)
z <- as.db.data.frame(dat, "eldata", conn.id = cid, verbose = FALSE)
g <- generic.cv(
train = function (data, alpha, lambda) {
madlib.elnet(y ~ ., data = data, family = "gaussian",
alpha = alpha, lambda = lambda,
control = list(random.stepsize=TRUE))
},
predict = predict,
metric = function (predicted, data) {
lk(mean((data$y - predicted)^2))
},
data = z,
params = list(alpha=1, lambda=seq(0,0.2,0.1)),
k = 5, find.min = TRUE, verbose = FALSE)
plot(g$params$lambda, g$metric$avg, type = 'b')
g$best
## ----------------------------------------------------------------------
db.disconnect(cid, verbose = FALSE)
## End(Not run)