CV.SuperLearner: Computes V-fold cross-validation of the Super Learner


Description

Computes the V-fold cross-validation estimates from the Super Learner. The function splits the data into V folds and calls SuperLearner.

Usage

CV.SuperLearner(Y, X, SL.library, outside.V = 20, inside.V = 20,
  shuffle = TRUE, verbose = FALSE, family = gaussian(), method = "NNLS",
  id = NULL, save.fit.library = FALSE, trim.logit = 0.001,
  obsWeights = NULL, stratifyCV = FALSE, ...)

Arguments

Y

The outcome variable

X

The predictor variables

SL.library

The library of prediction algorithms to be used in SuperLearner

outside.V

An integer for the number of folds to split the data into

inside.V

An integer for the number of folds each Super Learner should use

shuffle

A logical value indicating whether the rows of the data should be shuffled before the data splits

verbose

A logical value to produce additional output

family

Currently allows gaussian or binomial to describe the distribution of the outcome.

method

Loss function for combining the predictions in the library. Currently either "NNLS" (the default), "NNLS2", or "NNloglik". NNLS and NNLS2 are non-negative least squares based on the Lawson-Hanson algorithm and the dual method of Goldfarb and Idnani, respectively. NNLS and NNLS2 will work for both gaussian and binomial outcomes. NNloglik is a non-negative binomial likelihood maximization using the BFGS quasi-Newton optimization method. A minimal sketch of the NNLS combination step is given below, after the argument list.

id

Cluster identification variable. For the cross-validation splits used to find the weights for each prediction algorithm, id forces observations in the same cluster to be in the same validation fold.

obsWeights

Observation weights

save.fit.library

A logical value indicating whether to save the fit of each algorithm in the library on the full data set. This must be TRUE for predict.SuperLearner to work.

trim.logit

Only used if method = "NNloglik". Specifies a truncation level for the logit function, for numerical stability.

stratifyCV

A logical value for the cross-validation splits. If TRUE and the family is binomial, the splits will stratify on the outcome to give (roughly) equal proportions of the outcome in all splits. Currently this will not work in combination with a cluster id.

...

Additional arguments passed on to SuperLearner
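
Below is a minimal sketch of the "NNLS" combination step referenced above (an illustration, not the package's internal code): given an n x K matrix Z of cross-validated predictions from the K library algorithms, non-negative weights are found by the Lawson-Hanson algorithm and then normalized to sum to one. The placeholder data and the normalization convention are assumptions.

library(nnls)
set.seed(1)
n <- 100; K <- 3
Z <- matrix(rnorm(n * K), n, K)  # placeholder cross-validated predictions
Y <- rnorm(n)                    # placeholder outcome
fit <- nnls(Z, Y)                # Lawson-Hanson non-negative least squares
alpha <- fit$x                   # non-negative weights
alpha <- alpha / sum(alpha)      # normalize to sum to one (assumed convention)
alpha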

Details

See SuperLearner for details on the Super Learner.
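
The outer loop is, conceptually, the following sketch (not the actual implementation; the fold construction and the SL.predict component name are assumptions for illustration, with Y, X, and SL.library as in the Examples):

V <- 10
folds <- split(sample(seq_len(length(Y))), rep(seq_len(V), length.out = length(Y)))
pred.SL <- rep(NA, length(Y))
for (v in seq_len(V)) {
  valid <- folds[[v]]
  fit <- SuperLearner(Y = Y[-valid], X = X[-valid, , drop = FALSE],
                      newX = X[valid, , drop = FALSE], SL.library = SL.library)
  pred.SL[valid] <- fit$SL.predict  # assumed name of the Super Learner predictions
}
mean((Y - pred.SL)^2)  # honest cross-validated risk estimate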

Value

CV.fit.SL

A list containing the output from each SuperLearner fit.

pred.SL

The V-fold cross-validated Super Learner predictions for the outcome. These can be used to estimate the honest cross-validated risk.

pred.discreteSL

The V-fold cross-validated discrete super learner predictions for the outcome. The discrete super learner selects the algorithm with the minimum internal cross-validated risk estimate. See the output value whichDiscreteSL for the algorithm name selected in each fold.

whichDiscreteSL

The prediction algorithm selected as the discrete super learner in each of the outside V folds.

pred.library

The V-fold cross-validated predictions for the outcome from all algorithms in the library.

coef.SL

A matrix of the SuperLearner coefficients across the V folds.

folds

A list with the cross-validation splits.

call

The function call.
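
A short sketch of working with these components (testCV is assumed to be an object returned by CV.SuperLearner, as in the Examples):

mean((Y - testCV$pred.SL)^2)   # honest CV risk of the super learner
table(testCV$whichDiscreteSL)  # how often each algorithm was the discrete super learner
colMeans(testCV$coef.SL)       # average algorithm weights across the V folds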

Author(s)

Eric C Polley ecpolley@berkeley.edu

References

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2007) Super Learner. Statistical Applications in Genetics and Molecular Biology, 6(1), Article 25. http://www.bepress.com/sagmb/vol6/iss1/art25

See Also

SuperLearner

Examples

## Not run:  
## simulate data
set.seed(23432)
## training set
n <- 200
p <- 20
X <- matrix(rnorm(n*p), nrow=n, ncol=p)
colnames(X) <- paste("X",1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + X[, 2]^2 - X[, 3] + X[, 1]*X[, 4] + X[, 5] + X[, 6] - X[, 7] + rnorm(n)

## test set
m <- 1000
newX <- matrix(rnorm(m*p), nrow=m, ncol=p)
colnames(newX) <- paste("X",1:p, sep="")
newX <- data.frame(newX)
newY <- newX[, 1] + newX[, 2]^2 - newX[, 3] + newX[, 1]*newX[, 4] + newX[, 5] + newX[, 6] - newX[, 7] + rnorm(m)

## generate Library and run Super Learner
SL.library <- c("SL.glmnet","SL.glm","SL.randomForest")
test <- SuperLearner(Y=Y, X=X, newX=newX, SL.library=SL.library, verbose=TRUE, V=20)
test
testCV <- CV.SuperLearner(Y=Y, X=X, SL.library=SL.library, verbose=TRUE, outside.V=10, inside.V = 20)
testCV
## compare SuperLearner honest CV risk with discrete super learner CV risk
mean((Y - testCV$pred.SL)^2)
mean((Y - testCV$pred.discreteSL)^2)
apply(testCV$pred.library, 2, function(x) mean((Y - x)^2))
summary(testCV)

## Binary outcome:
set.seed(1)
N <- 200
X <- matrix(rnorm(N*10), N, 10)
X <- as.data.frame(X)
Y <- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))

SL.library <- c("SL.glmnet","SL.glm","SL.randomForest", "SL.knn20", "SL.knn30", "SL.knn40", "SL.knn50", "SL.glmnet.alpha50", "SL.gam", "SL.gam.3")

testCV.NNLS <- CV.SuperLearner(Y=Y, X=X, SL.library=SL.library, verbose=TRUE, outside.V=10, inside.V = 20, method = "NNLS", family = binomial())
summary(testCV.NNLS)

testCV.NNloglik <- CV.SuperLearner(Y=Y, X=X, SL.library=SL.library, verbose=TRUE, outside.V=10, inside.V = 20, method = "NNloglik", family = binomial())
summary(testCV.NNloglik)
 

## End(Not run)
