generic.cv: Generic cross-validation for supervised learning algorithms


Description

This function runs cross-validation for a given supervised learning model, which is specified by a training function, a prediction function, and a metric function. The user might need to write wrappers for these functions so that they satisfy the format requirements described below. This function works on both in-memory and in-database data.

Usage

generic.cv(train, predict, metric, data, params = NULL, k = 10,
approx.cut = TRUE, verbose = TRUE, find.min = TRUE)

Arguments

train

A training function. Its first argument must be a db.obj object, which is the wrapper for the data in the database. Given the data, it produces the model. It can also have other parameters that specify the model, and these parameters must appear in the list params.

predict

A prediction function. It must have only two arguments, which are the fitted model (the first argument) and the new data input for prediction (the second argument).

metric

A metric function. It must have only two arguments. The first argument is the prediction and the second is the data that contains the actual values. This function should measure the difference between the predicted and actual values and produce a single numeric value.

data

The data used for cross-validation: either a db.obj object, which wraps the data in the database, or a data.frame, which contains data in memory.

params

A list, default is NULL. The values of the parameters used by the training function. Each element of the list is an array of values for one parameter. The value arrays for different parameters do not have to be the same length. Shorter arrays are circularly expanded to the length of the longest element.
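The circular expansion described above can be illustrated in base R. This is a minimal sketch of the recycling rule only, not PivotalR's internal code; the variable names n and expanded are hypothetical.

```r
# Illustration only: recycle shorter value arrays to the longest length.
params <- list(alpha = 1, lambda = seq(0, 0.2, 0.1))

# Length of the longest value array
n <- max(lengths(params))

# rep_len() repeats each array circularly until it has n elements
expanded <- as.data.frame(lapply(params, rep_len, length.out = n))

# Each row of 'expanded' is one parameter set: alpha is repeated
# to match the three lambda values.
print(expanded)
```

With these inputs there are three parameter sets, one per value of lambda, all sharing alpha = 1.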

k

An integer, default is 10. The number of cross-validation folds.

approx.cut

A logical value, default is TRUE. Whether to cut the data into k pieces in an approximate way, which is faster than the exact way. For big data sets, the approximate cut does not affect the result. See Details for more.

verbose

A logical value, default is TRUE. Whether to print progress messages while the function runs.

find.min

A logical value, default is TRUE. Whether the best set of parameters is the one that produces the model with the minimum metric value. A model is then trained on the whole data set using the best set of parameters. If it is FALSE, the parameter set with the maximum metric value is used instead. This is ignored if params is NULL.
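The contract between the train, predict, and metric functions can be sketched with a small in-memory cross-validation loop in base R. The helper simple_cv below is hypothetical and is not how PivotalR implements generic.cv; it only shows how the three functions are expected to fit together. Taking err as the mean and err.std as the sample standard deviation of the per-fold metric values is an assumption.

```r
# Hypothetical helper (not part of PivotalR): plain k-fold cross-validation
# that uses the same train / predict / metric calling conventions.
simple_cv <- function(train, predict, metric, data, k = 10) {
  # Randomly assign every row to one of k folds of near-equal size
  fold <- sample(rep_len(seq_len(k), nrow(data)))
  errs <- vapply(seq_len(k), function(i) {
    fit  <- train(data[fold != i, , drop = FALSE])   # train(data) -> model
    held <- data[fold == i, , drop = FALSE]
    metric(predict(fit, held), held)                 # single numeric value
  }, numeric(1))
  # Assumption: summarize the folds by mean and sample standard deviation
  list(err = mean(errs), err.std = sd(errs))
}

set.seed(42)
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200, sd = 0.5)

res <- simple_cv(
  train   = function(data) lm(y ~ x, data = data),
  predict = predict,
  metric  = function(predicted, data) mean((data$y - predicted)^2),
  data = dat, k = 5)

print(res)
```

The training function takes the data as its first argument, the prediction function takes the fitted model and the new data, and the metric function takes the predictions and the data holding the actual values, as the argument descriptions above require.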

Details

To cut the data table into k equal pieces, a column containing a unique id for every row must be attached to the data, so that the data can be cut using different ranges of the row id. For example, for a data table with 1000 rows, the id ranges 1-100, 101-200, ..., 901-1000 cut the data into 10 pieces. The ids should be assigned to the rows at random before they are used for cross-validation. The original data is not modified in this process; instead, all the data is copied to a new temporary table, and the id column is created there. Because a unique id has to be randomly assigned to each row, this process cannot be easily parallelized.
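The exact cut can be mimicked in memory with base R. This is a sketch of the idea only; in the database the ids live in a temporary table, and the variable names here are hypothetical.

```r
# Illustration only: exact cut of 1000 rows into 10 folds of 100 rows.
set.seed(1)
n <- 1000
k <- 10

# A random permutation of 1..n gives every row a unique, random id
id <- sample.int(n)

# Map id ranges 1-100, 101-200, ..., 901-1000 to folds 1..10
fold <- (id - 1) %/% (n %/% k) + 1

# Every fold has exactly n / k rows
print(table(fold))
```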

When approx.cut is TRUE, which is the default, a column of uniformly distributed random integers, rather than consecutive integers, is created in the temporary table. The data is then cut in the same way, using ranges of this column such as 1-100, 101-200, etc. The resulting k pieces are therefore only approximately equal in size. For big data sets, however, the differences are relatively small and should not affect the result. This process does not generate unique ids for the rows, but it can be easily parallelized, so it is much faster for big data sets.
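The approximate cut can be sketched the same way in base R. The range of the random integers (1 to 1000 here) is an assumption made for illustration; the range actually used in the database is not stated above.

```r
# Illustration only: approximate cut using non-unique random integers.
set.seed(1)
n <- 100000
k <- 10

# Each row draws its own integer independently, so this step needs no
# coordination between rows (assumed range: 1..1000)
rid <- sample.int(1000, n, replace = TRUE)

# Same range-based cut as in the exact case: 1-100, 101-200, ...
fold <- (rid - 1) %/% 100 + 1

# Fold sizes are close to n / k but not identical
sizes <- as.integer(table(fold))
print(sizes)
print(max(abs(sizes - n / k)) / (n / k))  # largest relative deviation
```

Because each row is assigned independently, the fold sizes fluctuate around n / k, and the relative deviation shrinks as the number of rows grows.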

Value

If params is NULL, this function returns a list that contains two elements, err and err.std, which are the error and its standard deviation.

If params is not NULL, this function returns a cv.generic object, which is a list that contains the following items:

metric

A list, which contains

- avg The average metric value for each set of parameters.

- std The standard deviation for the metric values of each set of parameters.

- vals A matrix that contains all the measured metric values, whose rows correspond to the different sets of parameters and whose columns correspond to the different folds of cross-validation.

params

A data.frame, which contains all the parameter sets.

best

The fit that has the optimum metric value.

best.params

A list, the set of parameters that produces the optimum metric value.
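The relationship between vals, avg, std, and best.params can be sketched in base R with made-up numbers. The metric values below are invented for illustration, and the use of the sample standard deviation for std is an assumption.

```r
# Illustration only: summarize a (parameter sets) x (folds) metric matrix
# and pick the best parameter set. All numbers are made up.
vals <- matrix(c(1.2, 1.0, 1.1,
                 0.8, 0.9, 0.7),
               nrow = 2, byrow = TRUE)

metric <- list(
  avg  = rowMeans(vals),        # average metric per parameter set
  std  = apply(vals, 1, sd),    # assumed: sample standard deviation
  vals = vals)

# One row per parameter set, matching the rows of vals
params <- data.frame(alpha = 1, lambda = c(0, 0.1))

# find.min = TRUE selects the smallest average, FALSE the largest
find.min <- TRUE
best.idx <- if (find.min) which.min(metric$avg) else which.max(metric$avg)
best.params <- as.list(params[best.idx, ])

print(best.params)
```

Here the second parameter set has the lower average metric, so it is the one selected when find.min is TRUE.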

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

See Also

generic.bagging does the bootstrap aggregate computation.

Examples

## Not run: 



## set up the database connection
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

## ----------------------------------------------------------------------

dat <- as.db.data.frame(abalone, conn.id = cid, verbose = FALSE)

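## Linear regression on the abalone data; the metric is the mean squared error
err <- generic.cv(
    train = function(data) {
        madlib.lm(rings ~ . - id - sex, data = data)
    },
    predict = predict,
    metric = function(predicted, data) {
        lookat(mean((data$rings - predicted)^2))
    },
    data = dat, verbose = FALSE)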

## ----------------------------------------------------------------------

x <- matrix(rnorm(100*20),100,20)
y <- rnorm(100, 0.1, 2)

dat <- data.frame(x, y)
delete("eldata", conn.id = cid)
z <- as.db.data.frame(dat, "eldata", conn.id = cid, verbose = FALSE)

## Elastic net: cross-validate over the values of lambda with alpha fixed at 1
g <- generic.cv(
    train = function (data, alpha, lambda) {
        madlib.elnet(y ~ ., data = data, family = "gaussian",
                     alpha = alpha, lambda = lambda,
                     control = list(random.stepsize = TRUE))
    },
    predict = predict,
    metric = function (predicted, data) {
        ## lk is the short form of lookat
        lk(mean((data$y - predicted)^2))
    },
    data = z,
    params = list(alpha = 1, lambda = seq(0, 0.2, 0.1)),
    k = 5, find.min = TRUE, verbose = FALSE)

plot(g$params$lambda, g$metric$avg, type = 'b')

g$best

## ----------------------------------------------------------------------

db.disconnect(cid, verbose = FALSE)

## End(Not run)

PivotalR documentation built on March 13, 2021, 1:06 a.m.