twostageSL: Two-stage Super Learner Prediction
In wuziyueemory/twostageSL: Two-Stage Super Learner Prediction Function


A Prediction Function for the Two-stage Super Learner. The twostageSL function takes a training set pair (X,Y) and returns the predicted values based on a validation set.

twostageSL(
  Y,
  X,
  newX = NULL,
  library.2stage,
  library.1stage,
  twostage,
  family.1 = binomial,
  family.2 = gaussian,
  family.single = gaussian,
  method = "method.CC_LS.scale",
  id = NULL,
  verbose = FALSE,
  control = list(),
  cvControl = list(),
  obsWeights = NULL,
  env = parent.frame()
)

`Y`	The outcome in the training data set. Must be a numeric vector.
`X`	The predictor variables in the training data set, usually a data.frame.
`newX`	The predictor variables in the validation data set. The structure should match X. If missing, uses X for newX.
`library.2stage`	Candidate prediction algorithms for the two-stage super learner. A list containing the prediction algorithms for stage 1 and stage 2; the algorithms for each stage are given either as a character vector or as a list of character vectors. See the Examples below for the structure. A list of functions included in the `twostageSL` package can be found with `twostage_listWrappers`.
`library.1stage`	Candidate prediction algorithms for the standard super learner. Either a character vector of prediction algorithms or a list of character vectors. See the Examples below for the structure. A list of functions included in the `twostageSL` package can be found with `twostage_listWrappers`.
`twostage`	logical; TRUE to fit the two-stage super learner; FALSE to fit the standard super learner.
`family.1`	Error distribution of the stage 1 outcome for two-stage super learner. Currently only allows `binomial` (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.
`family.2`	Error distribution of the stage 2 outcome for two-stage super learner. Currently only allows `gaussian` (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.
`family.single`	Error distribution of the outcome for standard super learner. Currently only allows `gaussian` (default) to describe the error distribution. Link function information will be ignored and should be contained in the method argument below.
`method`	The method used to estimate the coefficients of the two-stage super learner, that is, the model that combines the individual algorithms in the library. The only built-in option at present is "method.CC_LS.scale" (default), a scaled version of CC_LS. CC_LS.scale uses Goldfarb and Idnani's quadratic programming algorithm to find the convex combination of weights that minimizes the squared error loss. It also divides the quadratic objective by a large constant so that the entries of the matrix and vector in the quadratic program stay at a manageable size.
`id`	Optional cluster identification variable. For the cross-validation splits, `id` forces observations in the same cluster to be in the same validation fold. `id` is passed to the prediction and screening algorithms in library.2stage and library.1stage, but be sure to check the individual wrappers as many of them ignore the information.
`verbose`	logical; TRUE for printing progress during the computation (helpful for debugging).
`control`	A list of parameters to control the estimation process. Parameters include `saveFitLibrary` and `trimLogit`. See `twostageSL.control` for details.
`cvControl`	A list of parameters to control the cross-validation process. Parameters include `V`, `stratifyCV`, `shuffle` and `validRows`. See `twostageSL.CV.control` for details.
`obsWeights`	Optional observation weights variable. As with `id` above, `obsWeights` is passed to the prediction and screening algorithms, but many of the built in wrappers ignore (or can't use) the information. If you are using observation weights, make sure the library you specify uses the information.
`env`	Environment containing the learner functions. Defaults to the calling environment.
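
The convex-combination step described for `method` can be illustrated outside the package. The following is a minimal sketch, not the package's own code: it assumes the `quadprog` package is installed (its `solve.QP` implements the Goldfarb and Idnani algorithm), and the data, the function name `convex_weights` and the scaling constant 1000 are made up for illustration.

```r
library(quadprog)  # assumption: quadprog is installed

set.seed(1)
n <- 200
truth <- rnorm(n)
Y <- truth + rnorm(n, sd = 0.2)
# Hypothetical cross-validated predictions from three learners (columns of Z)
Z <- cbind(A = truth + rnorm(n, sd = 0.3),
           B = truth + rnorm(n, sd = 1.0),
           C = 0.5 * truth + rnorm(n, sd = 0.5))

# Hypothetical helper: weights w >= 0 with sum(w) = 1 minimizing ||Y - Z w||^2.
# `scale` divides the quadratic objective; it does not change the minimizer.
convex_weights <- function(Z, Y, scale = 1) {
  k <- ncol(Z)
  Dmat <- crossprod(Z) / scale           # Z'Z
  dvec <- drop(crossprod(Z, Y)) / scale  # Z'Y
  Amat <- cbind(rep(1, k), diag(k))      # column 1: sum(w) = 1; others: w_j >= 0
  bvec <- c(1, rep(0, k))
  sol <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)
  w <- pmax(sol$solution, 0)             # clean up tiny negative round-off
  w / sum(w)
}

w_raw    <- convex_weights(Z, Y)
w_scaled <- convex_weights(Z, Y, scale = 1000)  # arbitrary illustrative constant
round(rbind(w_raw, w_scaled), 4)
```

Because dividing the objective by a positive constant leaves its minimizer unchanged, the two rows should agree up to numerical tolerance; the scaling only affects the magnitude of the numbers handed to the solver.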

twostageSL fits the two-stage super learner prediction algorithm. The weights for each algorithm in library.2stage and library.1stage are estimated, along with the fit of each algorithm.
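
This page does not spell out how a stage 1 and a stage 2 learner are combined into one prediction. A common construction for outcomes with many zeros, assumed here and not confirmed by this page, multiplies the stage 1 estimate of P(Y > 0 | X) by the stage 2 estimate of E[Y | Y > 0, X]. A minimal base R sketch with one learner per stage (all object names are illustrative):

```r
set.seed(42)
n <- 500
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
pos <- rbinom(n, 1, plogis(1 + X$X1))                 # 1 = non-zero outcome
Y <- ifelse(pos == 1, 10 + X$X1 + X$X2 + rnorm(n), 0)
dat <- data.frame(X, Y = Y, pos = pos)

# Stage 1 (binomial): probability that the outcome is non-zero
stage1 <- glm(pos ~ X1 + X2, data = dat, family = binomial())
# Stage 2 (gaussian): mean outcome, fit on the non-zero observations only
stage2 <- lm(Y ~ X1 + X2, data = dat[dat$pos == 1, ])

newX <- data.frame(X1 = c(-1, 0, 1), X2 = c(0, 0, 0))
p_pos  <- predict(stage1, newdata = newX, type = "response")
mu_pos <- predict(stage2, newdata = newX)

# Assumed combination rule: P(Y > 0 | X) * E[Y | Y > 0, X]
pred <- p_pos * mu_pos
pred
```

With several learners per stage, each stage 1 and stage 2 pairing would give one such candidate prediction, and the weights are then estimated over the candidates.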

Screening algorithms can be paired with the prediction algorithms. They first rank the variables in X by either a univariate regression p-value or the randomForest variable importance, then keep the subset of variables that passes a pre-defined cut-off. The algorithms in library.2stage and library.1stage are then fit on this subset of the X variables.
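
The p-value screening idea can be sketched in base R. This is an illustration only: the function name `screen_corP_sketch` is hypothetical, and the thresholds `minPvalue` and `minscreen` are illustrative choices, not values documented on this page.

```r
# Hypothetical screener: keep variables whose univariate correlation test
# with Y has a p-value at or below minPvalue; always keep at least
# `minscreen` variables (those with the smallest p-values).
screen_corP_sketch <- function(Y, X, minPvalue = 0.1, minscreen = 2) {
  pvals <- vapply(X, function(x) cor.test(x, Y)$p.value, numeric(1))
  keep <- pvals <= minPvalue
  if (sum(keep) < minscreen) {
    keep[rank(pvals) <= minscreen] <- TRUE
  }
  keep  # named logical vector, one entry per column of X
}

set.seed(10)
n <- 300
X <- data.frame(matrix(rnorm(n * 5), nrow = n))
colnames(X) <- paste0("X", 1:5)
Y <- 2 * X$X1 - 3 * X$X2 + rnorm(n)

keep <- screen_corP_sketch(Y, X)
keep
X_screened <- X[, keep, drop = FALSE]  # the subset handed to the prediction algorithm
```

The logical vector plays the same role as one column of the `whichScreen` matrix in the returned object: it records which variables passed the screen.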

The twostageSL package contains a few prediction and screening algorithm wrappers. The full list of wrappers can be viewed with listWrappers(). The design of the twostageSL package is such that the user can easily add their own wrappers.
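
As an illustration of a user-written wrapper, the sketch below follows the convention used by the SuperLearner package: a function taking `Y`, `X`, `newX`, `family`, `obsWeights` and `...`, returning a list with `pred` and `fit`, plus a matching `predict` method. That twostageSL accepts wrappers of exactly this form is an assumption here, and the name `SL.lm_sketch` is hypothetical.

```r
# Hypothetical prediction wrapper built on lm()
SL.lm_sketch <- function(Y, X, newX, family, obsWeights, ...) {
  fit.lm <- lm(Y ~ ., data = X, weights = obsWeights)
  pred <- predict(fit.lm, newdata = newX)
  fit <- list(object = fit.lm)
  class(fit) <- "SL.lm_sketch"   # lets predict() dispatch to the method below
  list(pred = pred, fit = fit)
}

# Matching predict method for the fitted wrapper object
predict.SL.lm_sketch <- function(object, newdata, ...) {
  predict(object$object, newdata = newdata)
}

# Calling the wrapper directly, as a quick check before adding it to a library
set.seed(3)
n <- 100
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Y <- 1 + X$X1 - X$X2 + rnorm(n)
out <- SL.lm_sketch(Y, X, newX = X[1:5, ], family = gaussian(),
                    obsWeights = rep(1, n))
out$pred
```

Under that assumption, the wrapper would be referenced by name in a library, for example `c("SL.glm", "SL.lm_sketch")`, with the function defined in the environment passed as `env`.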

An object with S3 class twostageSL containing:

`call`	The matched call.
`libraryNames`	A character vector with the names of the algorithms in the library. The format is 'predictionAlgorithm_screeningAlgorithm' with '_All' used to denote the prediction algorithm run on all variables in X.
`library.Num`	Number of prediction algorithms in `library.2stage` and `library.1stage`.
`orig.library`	Returns the prediction algorithms and screening algorithms in each stage of `library.2stage` and `library.1stage` separately.
`SL.library`	Returns the prediction algorithms and screening algorithms in `library.2stage` and `library.1stage`.
`SL.predict`	The predicted values from the two-stage super learner for the rows in `newX`.
`coef`	Coefficients for the two-stage super learner.
`library.predict`	A matrix with the predicted values from each algorithm in `library.2stage` and `library.1stage` for the rows in `newX`.
`Z`	The Z matrix (the cross-validated predicted values for each algorithm in `library.2stage` and `library.1stage`).
`cvRisk`	A numeric vector with the V-fold cross-validated risk estimate for each algorithm in `library.2stage` and `library.1stage`. Note that this does not contain the CV risk estimate for the two-stage super learner, only the individual algorithms in the library.
`family`	Returns the `family.1`, `family.2` and `family.single` values supplied in the call.
`fitLibrary`	A list with the fitted objects for each algorithm in `library.2stage` and `library.1stage` on the full training data set.
`cvfitLibrary`	A list with the fitted objects for each algorithm in `library.2stage` and `library.1stage` on each of the `V` cross-validation training data sets.
`varNames`	A character vector with the names of the variables in `X`.
`validRows`	A list containing the row numbers for the V-fold cross-validation step.
`number0`	A data frame giving the number of zeros in each of the `V` folds.
`method`	A list with the method functions.
`whichScreen`	A logical matrix indicating which variables passed each screening algorithm.
`control`	The `control` list.
`cvControl`	The `cvControl` list.
`errorsInCVLibrary`	A logical vector indicating if any algorithms experienced an error within the CV step.
`errorsInLibrary`	A logical vector indicating if any algorithms experienced an error on the full data.
`data`	The data frame containing the predictor variables and the outcome in the training data set.
`env`	Environment passed into function which will be searched to find the learner functions. Defaults to the calling environment.
`times`	A list that contains the execution time of the twostageSL, plus separate times for model fitting and prediction.
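
To make the relationship between `validRows`, `Z` and `cvRisk` concrete, here is a minimal base R sketch that builds cross-validated predictions by hand for two simple learners and computes their squared-error risk. It illustrates the idea only and is not the package's internal code; the column names merely imitate the 'predictionAlgorithm_screeningAlgorithm' format.

```r
set.seed(7)
n <- 200
V <- 5
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
Y <- 1 + 2 * X$X1 + rnorm(n)
dat <- data.frame(X, Y = Y)

# Row numbers of each validation fold (analogous to `validRows`)
validRows <- split(sample(seq_len(n)), rep(seq_len(V), length.out = n))

# Cross-validated predictions, one column per learner (analogous to `Z`)
Z <- matrix(NA_real_, nrow = n, ncol = 2,
            dimnames = list(NULL, c("lm_All", "mean_All")))
for (rows in validRows) {
  train <- dat[-rows, ]
  Z[rows, "lm_All"]   <- predict(lm(Y ~ X1 + X2, data = train),
                                 newdata = dat[rows, ])
  Z[rows, "mean_All"] <- mean(train$Y)
}

# Cross-validated squared-error risk per learner (analogous to `cvRisk`)
cvRisk <- colMeans((Z - Y)^2)
cvRisk
```

Each row of `Z` is predicted by a model that never saw that row during fitting, which is why `cvRisk` is an honest estimate for the individual learners.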

Ziyue Wu

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2007) Super Learner, Statistical Applications in Genetics and Molecular Biology, 6, article 25.

SuperLearner.

library(twostageSL)

## simulate data
set.seed(123)

## training set
n <- 10000
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- rep(NA,n)
## probability of outcome being non-zero (g == 1)
prob <- plogis(1 + X[,1] + X[,2] + X[,1]*X[,2])
g <- rbinom(n,1,prob)
## assign zero outcome
ind <- g==0
Y[ind] <- 0
## assign non-zero outcome
ind <- g==1
Y[ind] <- 10 + X[ind, 1] + sqrt(abs(X[ind, 2] * X[ind, 3])) + X[ind, 2] - X[ind, 3] + rnorm(sum(ind))

## test set
m <- 1000
newX <- matrix(rnorm(m*p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep="")
newX <- data.frame(newX)
newY <- rep(NA,m)
## probability of outcome being non-zero (newg == 1)
newprob <- plogis(1 + newX[,1] + newX[,2] + newX[,1]*newX[,2])
newg <- rbinom(m,1,newprob)
## assign zero outcome
newind <- newg==0
newY[newind] <- 0
## assign non-zero outcome
newind <- newg==1
newY[newind] <- 10 + newX[newind, 1] + sqrt(abs(newX[newind, 2] * newX[newind, 3])) + newX[newind, 2] - newX[newind, 3] + rnorm(sum(newind))

## generate the Library
twostage.library <- list(stage1=c("SL.glm","SL.mean","SL.earth"),
                        stage2=c("SL.glm","SL.mean","SL.earth"))
onestage.library <- c("SL.glm","SL.mean","SL.earth")

## run the twostage super learner
two <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = TRUE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
two
## run the standard super learner
one <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = FALSE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
one

## library with screening
twostage.library <- list(stage1=list(c("SL.glm","screen.glmnet"),
                                    c("SL.earth","screen.corP"),
                                    c("SL.mean","All")),
                        stage2=list(c("SL.glm","screen.glmnet"),
                                    c("SL.earth","screen.corP"),
                                    c("SL.mean","All")))
onestage.library <- list(c("SL.glm","screen.glmnet"),
                        c("SL.earth","screen.corP"),
                        c("SL.mean","All"))

## run the twostage super learner
two <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = TRUE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
two
## run the standard super learner
one <- twostageSL(Y=Y,
                 X=X,
                 newX = newX,
                 library.2stage = twostage.library,
                 library.1stage = onestage.library,
                 twostage = FALSE,
                 family.1=binomial,
                 family.2=gaussian,
                 family.single=gaussian,
                 cvControl = list(V = 5))
one
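
The simulated test outcome `newY` is not used in the calls above. One natural follow-up, sketched here with a hypothetical helper `mse`, is to compare the test-set mean squared error of the two fits. The helper is demonstrated on toy vectors so that it runs on its own; the commented lines show how it might be applied to the fitted objects, assuming the fits completed and `SL.predict` holds the predictions for `newX` as described under Value.

```r
# Hypothetical helper: mean squared error between observed and predicted values
mse <- function(observed, predicted) {
  mean((observed - as.numeric(predicted))^2)
}

# Toy check: squared errors are 1, 1 and 0
toy_mse <- mse(c(0, 10, 12), c(1, 9, 12))
toy_mse

# With the objects from the examples above (not run here):
# mse(newY, two$SL.predict)   # two-stage super learner
# mse(newY, one$SL.predict)   # standard super learner
```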