twostageSL: Two-stage Super Learner Prediction

Description

A prediction function for the two-stage super learner. twostageSL takes a training set pair (X, Y) and returns predicted values for the rows of a validation set (newX).

Usage

twostageSL(
  Y,
  X,
  newX = NULL,
  library.2stage,
  library.1stage,
  twostage,
  family.1 = binomial,
  family.2 = gaussian,
  family.single = gaussian,
  method = "method.CC_LS.scale",
  id = NULL,
  verbose = FALSE,
  control = list(),
  cvControl = list(),
  obsWeights = NULL,
  env = parent.frame()
)

Arguments

Y

The outcome in the training data set. Must be a numeric vector.

X

The predictor variables in the training data set, usually a data.frame.

newX

The predictor variables in the validation data set. The structure should match X. If missing, uses X for newX.

library.2stage

Candidate prediction algorithms for the two-stage super learner. A list containing the prediction algorithms at stage 1 and stage 2; the algorithms for each stage are given either as a character vector or as a list of character vectors. See the Examples below for the structure. A list of functions included in the twostageSL package can be found with twostage_listWrappers.

library.1stage

Candidate prediction algorithms for the standard super learner. Either a character vector of prediction algorithms or a list of character vectors. See the Examples below for the structure. A list of functions included in the twostageSL package can be found with twostage_listWrappers.

twostage

logical; TRUE to fit the two-stage super learner, FALSE to fit the standard super learner.

family.1

Error distribution of the stage 1 outcome for two-stage super learner. Currently only allows binomial (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.

family.2

Error distribution of the stage 2 outcome for two-stage super learner. Currently only allows gaussian (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.

family.single

Error distribution of the outcome for standard super learner. Currently only allows gaussian (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.

method

The method used to estimate the coefficients of the two-stage super learner, that is, the model that combines the individual algorithms in the library. The only built-in option at present is "method.CC_LS.scale" (default), a scaled version of CC_LS. CC_LS.scale uses Goldfarb and Idnani's quadratic programming algorithm to find the convex combination of weights that minimizes the squared error loss. It also divides the quadratic objective by a large constant to shrink the very large matrix and vector that define the quadratic program.

id

Optional cluster identification variable. For the cross-validation splits, id forces observations in the same cluster to be in the same validation fold. id is passed to the prediction and screening algorithms in library.2stage and library.1stage, but be sure to check the individual wrappers as many of them ignore the information.

verbose

logical; TRUE for printing progress during the computation (helpful for debugging).

control

A list of parameters to control the estimation process. Parameters include saveFitLibrary and trimLogit. See twostageSL.control for details.

cvControl

A list of parameters to control the cross-validation process. Parameters include V, stratifyCV, shuffle and validRows. See twostageSL.CV.control for details.

obsWeights

Optional observation weights variable. As with id above, obsWeights is passed to the prediction and screening algorithms, but many of the built in wrappers ignore (or can't use) the information. If you are using observation weights, make sure the library you specify uses the information.

env

Environment containing the learner functions. Defaults to the calling environment.
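
The id argument above keeps all observations from one cluster in the same validation fold. The following base-R sketch shows one way such folds could be built; it is not the package's internal code, and make_cluster_folds is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper (not part of twostageSL): assign whole clusters to folds
make_cluster_folds <- function(id, V = 5, shuffle = TRUE) {
  clusters <- unique(id)
  if (length(clusters) < V) stop("need at least V distinct clusters")
  if (shuffle) clusters <- clusters[sample.int(length(clusters))]
  # Deal clusters out to folds in round-robin order
  fold_of_cluster <- rep_len(seq_len(V), length(clusters))
  fold_of_row <- fold_of_cluster[match(id, clusters)]
  # One vector of validation row numbers per fold
  split(seq_along(id), factor(fold_of_row, levels = seq_len(V)))
}

set.seed(1)
id <- rep(1:20, each = 3)   # 20 clusters of 3 observations each
folds <- make_cluster_folds(id, V = 5)
```

The result is a list of row-number vectors, one per fold, which is the same general shape as the validRows component described under Value.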

Details

twostageSL fits the two-stage super learner prediction algorithm. The weights for each algorithm in library.2stage and library.1stage are estimated, along with the fit of each algorithm.
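
As a rough illustration of the two-stage idea, the sketch below assumes the usual two-part construction: a stage-1 binomial model estimates P(Y > 0 | X), a stage-2 gaussian model estimates E[Y | Y > 0, X], and the two predictions are multiplied. Plain glm fits stand in for the learner libraries, and all variable names are made up for the example; this is not the package's internal code.

```r
set.seed(42)
n <- 2000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
positive <- rbinom(n, 1, plogis(0.5 + dat$x1))
dat$y <- ifelse(positive == 1, 10 + dat$x1 + dat$x2 + rnorm(n), 0)
dat$nonzero <- as.integer(dat$y > 0)

# Stage 1: binomial model for whether the outcome is non-zero
stage1 <- glm(nonzero ~ x1 + x2, family = binomial, data = dat)
# Stage 2: gaussian model fitted to the non-zero outcomes only
stage2 <- glm(y ~ x1 + x2, family = gaussian, data = dat[dat$y > 0, ])

newdat <- data.frame(x1 = c(-1, 0, 1), x2 = c(0, 0, 0))
p_nonzero <- predict(stage1, newdata = newdat, type = "response")
mean_if_nonzero <- predict(stage2, newdata = newdat, type = "response")

# Two-part prediction: probability of a non-zero outcome times its conditional mean
two_part_pred <- p_nonzero * mean_if_nonzero
```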

Prescreening algorithms first rank the variables in X based on either a univariate regression p-value or the randomForest variable importance. A subset of the variables in X is then selected using a pre-defined cut-off, and the algorithms in library.2stage and library.1stage are fit on this subset.
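
The p-value variant of this screening step can be sketched in base R as follows. screen_by_pvalue is a hypothetical function, not one of the package's screening wrappers, and the cut-off of 0.1 and the minimum of two retained variables are assumptions chosen for the example.

```r
# Hypothetical screener (not a twostageSL wrapper): keep variables whose
# univariate correlation test with Y has a p-value below `cutoff`
screen_by_pvalue <- function(Y, X, cutoff = 0.1, min_keep = 2) {
  pvals <- vapply(X, function(x) cor.test(x, Y)$p.value, numeric(1))
  keep <- pvals < cutoff
  # Fall back to the best-ranked variables if too few pass the cut-off
  if (sum(keep) < min_keep) {
    keep[order(pvals)[seq_len(min_keep)]] <- TRUE
  }
  keep
}

set.seed(7)
n <- 500
X <- data.frame(matrix(rnorm(n * 5), nrow = n))   # columns X1..X5
Y <- 2 * X[, 1] - 3 * X[, 2] + rnorm(n)

which_screen <- screen_by_pvalue(Y, X)            # named logical vector
X_screened <- X[, which_screen, drop = FALSE]     # subset passed on to the learners
```

The logical vector plays the same role as one row of the whichScreen matrix described under Value: it records which variables passed the screen.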

The twostageSL package contains a few prediction and screening algorithm wrappers. The full list of wrappers can be viewed with listWrappers(). The design of the twostageSL package is such that the user can easily add their own wrappers.
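
A user-written wrapper might look like the sketch below. It assumes that twostageSL follows the SuperLearner wrapper convention, in which a prediction wrapper receives Y, X, newX, family, obsWeights and id and returns a list with elements pred and fit; check the package's own wrappers before relying on this. SL.wmean is a hypothetical name.

```r
# Hypothetical wrapper, written against the SuperLearner-style convention
# (assumed here): return a list with `pred` (predictions for newX) and `fit`
SL.wmean <- function(Y, X, newX, family, obsWeights, id, ...) {
  mu <- weighted.mean(Y, w = obsWeights)
  pred <- rep.int(mu, times = nrow(newX))
  fit <- list(object = mu)
  class(fit) <- "SL.wmean"
  list(pred = pred, fit = fit)
}

# Matching predict method so the stored fit can score new rows later
predict.SL.wmean <- function(object, newdata, ...) {
  rep.int(object$object, times = nrow(newdata))
}

# Calling the wrapper directly is a quick check before adding it to a library
X <- data.frame(a = 1:4)
out <- SL.wmean(Y = c(0, 0, 2, 6), X = X, newX = X[1:2, , drop = FALSE],
                family = gaussian(), obsWeights = rep(1, 4), id = NULL)
```

Under that assumption, the wrapper would then be referenced by name in a library, for example c("SL.glm", "SL.wmean").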

Value

An object with S3 class twostageSL containing:

call

The matched call.

libraryNames

A character vector with the names of the algorithms in the library. The format is 'predictionAlgorithm_screeningAlgorithm' with '_All' used to denote the prediction algorithm run on all variables in X.

library.Num

Number of prediction algorithms in library.2stage and library.1stage.

orig.library

Returns the prediction algorithms and screening algorithms in each stage of library.2stage and library.1stage separately.

SL.library

Returns the prediction algorithms and screening algorithms in library.2stage and library.1stage.

SL.predict

The predicted values from the two-stage super learner for the rows in newX.

coef

Coefficients for the two-stage super learner.

library.predict

A matrix with the predicted values from each algorithm in library.2stage and library.1stage for the rows in newX.

Z

The Z matrix (the cross-validated predicted values for each algorithm in library.2stage and library.1stage).

cvRisk

A numeric vector with the V-fold cross-validated risk estimate for each algorithm in library.2stage and library.1stage. Note that this does not contain the CV risk estimate for the two-stage super learner, only the individual algorithms in the library.

family

Returns the family.1, family.2 and family.single values from above.

fitLibrary

A list with the fitted objects for each algorithm in library.2stage and library.1stage on the full training data set.

cvfitLibrary

A list with the fitted objects for each algorithm in library.2stage and library.1stage on each of the V cross-validation training data sets.

varNames

A character vector with the names of the variables in X.

validRows

A list containing the row numbers for the V-fold cross-validation step.

number0

A data frame giving the number of zeros in each of the V folds.

method

A list with the method functions.

whichScreen

A logical matrix indicating which variables passed each screening algorithm.

control

The control list.

cvControl

The cvControl list.

errorsInCVLibrary

A logical vector indicating if any algorithms experienced an error within the CV step.

errorsInLibrary

A logical vector indicating if any algorithms experienced an error on the full data.

data

The data frame containing the predictor variables and the outcome in the training data set.

env

Environment passed into function which will be searched to find the learner functions. Defaults to the calling environment.

times

A list that contains the execution time of the twostageSL, plus separate times for model fitting and prediction.
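
To make the relationship between Z, cvRisk and coef concrete, the sketch below computes all three on simulated data, assuming squared-error loss and a convex combination of the columns of Z. It uses optim with a softmax parametrization as a stand-in for the Goldfarb and Idnani quadratic programming routine named under method, so it illustrates the objective rather than the package's solver. convex_weights is a hypothetical helper.

```r
# Hypothetical helper: convex-combination least squares via optim()
# (the softmax keeps weights non-negative and summing to one)
convex_weights <- function(Z, Y) {
  softmax <- function(theta) {
    e <- exp(theta - max(theta))
    e / sum(e)
  }
  loss <- function(theta) mean((Y - Z %*% softmax(theta))^2)
  opt <- optim(rep(0, ncol(Z)), loss, method = "BFGS")
  w <- softmax(opt$par)
  names(w) <- colnames(Z)
  w
}

set.seed(11)
n <- 300
truth <- rnorm(n)
# Stand-in for cross-validated predictions from three learners
Z <- cbind(good = truth + rnorm(n, sd = 0.3),
           fair = truth + rnorm(n, sd = 1),
           poor = rnorm(n))
Y <- truth + rnorm(n, sd = 0.2)

cv_risk <- colMeans((Z - Y)^2)    # per-algorithm cross-validated risk
coef_sl <- convex_weights(Z, Y)   # ensemble weights
sl_pred <- drop(Z %*% coef_sl)    # combined prediction
```

As noted for cvRisk above, the per-algorithm risks describe the individual learners only; the risk of the combined prediction has to be estimated separately.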

Author(s)

Ziyue Wu

References

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2008) Super Learner, Statistical Applications in Genetics and Molecular Biology, 6, article 25.

See Also

SuperLearner.

Examples

library(twostageSL)

## simulate data
set.seed(123)

## training set
n <- 10000
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- rep(NA,n)
## probability of outcome being non-zero
prob <- plogis(1 + X[,1] + X[,2] + X[,1]*X[,2])
g <- rbinom(n,1,prob)
## assign zero outcome
ind <- g==0
Y[ind] <- 0
## assign non-zero outcome
ind <- g==1
Y[ind] <- 10 + X[ind, 1] + sqrt(abs(X[ind, 2] * X[ind, 3])) + X[ind, 2] - X[ind, 3] + rnorm(sum(ind))

## test set
m <- 1000
newX <- matrix(rnorm(m*p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep="")
newX <- data.frame(newX)
newY <- rep(NA,m)
## probability of outcome being non-zero
newprob <- plogis(1 + newX[,1] + newX[,2] + newX[,1]*newX[,2])
newg <- rbinom(m,1,newprob)
## assign zero outcome
newind <- newg==0
newY[newind] <- 0
## assign non-zero outcome
newind <- newg==1
newY[newind] <- 10 + newX[newind, 1] + sqrt(abs(newX[newind, 2] * newX[newind, 3])) + newX[newind, 2] - newX[newind, 3] + rnorm(sum(newind))

## generate the Library
twostage.library <- list(stage1=c("SL.glm","SL.mean","SL.earth"),
                        stage2=c("SL.glm","SL.mean","SL.earth"))
onestage.library <- c("SL.glm","SL.mean","SL.earth")

## run the twostage super learner
two <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = TRUE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
two
## run the standard super learner
one <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = FALSE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
one

## library with screening
twostage.library <- list(stage1=list(c("SL.glm","screen.glmnet"),
                                    c("SL.earth","screen.corP"),
                                    c("SL.mean","All")),
                        stage2=list(c("SL.glm","screen.glmnet"),
                                    c("SL.earth","screen.corP"),
                                    c("SL.mean","All")))
onestage.library <- list(c("SL.glm","screen.glmnet"),
                        c("SL.earth","screen.corP"),
                        c("SL.mean","All"))

## run the twostage super learner
two <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = TRUE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
two
## run the standard super learner
one <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = FALSE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
one

wuziyueemory/twostageSL documentation built on Oct. 19, 2020, 3:45 p.m.