
SampleSplitSuperLearner | R Documentation |

A Prediction Function for the Super Learner

The `SuperLearner` function takes a training set pair (X, Y) and returns the predicted values based on a validation set. `SampleSplitSuperLearner` uses sample-split validation, whereas `SuperLearner` uses V-fold cross-validation.

```
SampleSplitSuperLearner(Y, X, newX = NULL, family = gaussian(), SL.library,
  method = "method.NNLS", id = NULL, verbose = FALSE,
  control = list(), split = 0.8, obsWeights = NULL)
```

`Y` |
The outcome in the training data set. Must be a numeric vector. |

`X` |
The predictor variables in the training data set, usually a data.frame. |

`newX` |
The predictor variables in the validation data set. The structure should match X. If missing, uses X for newX. |

`SL.library` |
Either a character vector of prediction algorithms or a list containing character vectors. See details below for examples on the structure. A list of functions included in the SuperLearner package can be found with |

`verbose` |
logical; TRUE for printing progress during the computation (helpful for debugging). |

`family` |
Currently allows |

`method` |
A list (or a function to create a list) containing details on estimating the coefficients for the super learner and the model to combine the individual algorithms in the library. See |

`id` |
Optional cluster identification variable. For the cross-validation splits, |

`obsWeights` |
Optional observation weights variable. As with |

`control` |
A list of parameters to control the estimation process. Parameters include |

`split` |
A single value between 0 and 1 giving the fraction of the samples assigned to the training split. A value of 0.8 randomly assigns 80 percent of the samples to the training split and the other 20 percent to the validation split. Alternatively, `split` can be a numeric vector with the row numbers of |
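As a rough illustration of the two validation schemes, the base R sketch below draws a single 80/20 sample split and, for contrast, a 10-fold assignment. It does not call the package; the names `split_frac`, `train_rows`, `valid_rows`, and `folds` are made up for this example, and the package's internal row sampling may differ.

```r
set.seed(10)
n <- 100

# Sample split: one random partition, here 80 percent training rows
split_frac <- 0.8  # hypothetical name, plays the role of split = 0.8
train_rows <- sample.int(n, size = round(split_frac * n))
valid_rows <- setdiff(seq_len(n), train_rows)

# V-fold cross-validation: every row is held out exactly once
V <- 10
fold_id <- sample(rep(seq_len(V), length.out = n))
folds <- split(seq_len(n), fold_id)

length(train_rows)
length(valid_rows)
lengths(folds)
```

With a sample split, each learner is fit once on the training rows and scored once on the validation rows; with V folds, each learner is refit V times.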

`SuperLearner` fits the super learner prediction algorithm. The weights for each algorithm in `SL.library` are estimated, along with the fit of each algorithm.
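The weight estimation step can be pictured with a small base R sketch. It assumes a matrix `Z` of held-out predictions (one column per algorithm) and finds non-negative coefficients that minimize squared error, then rescales them to sum to one. The simulated data and all names here are hypothetical, and `optim()` with a lower bound of zero is used as a stand-in for a dedicated non-negative least squares solver; it is not the package's `method.NNLS` code.

```r
set.seed(1)
n <- 200
y <- 2 + rnorm(n)

# Hypothetical held-out predictions from three algorithms (columns of Z)
Z <- cbind(good  = y + rnorm(n, sd = 0.3),
           noisy = y + rnorm(n, sd = 1.5),
           mean  = rep(mean(y), n))

# Squared-error loss of a linear combination of the columns of Z
sse <- function(b) sum((y - Z %*% b)^2)

# Box-constrained minimization keeps every coefficient at or above zero
fit <- optim(par = rep(1 / ncol(Z), ncol(Z)), fn = sse,
             method = "L-BFGS-B", lower = 0)

# Rescale so the weights form a convex combination
w <- fit$par / sum(fit$par)
names(w) <- colnames(Z)
round(w, 3)
```

The more accurate column should receive most of the weight, which is the behavior the ensemble relies on.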

The library may also include prescreening algorithms. These algorithms first rank the variables in `X` based on either a univariate regression p-value or the `randomForest` variable importance. A subset of the variables in `X` is then selected based on a pre-defined cut-off, and the algorithms in `SL.library` are fit with this subset.
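A minimal sketch of p-value screening in base R follows. The function `screen_by_pvalue` and its arguments `cutoff` and `min_keep` are hypothetical names chosen for this example, and the correlation test stands in for whatever univariate regression the package's screening wrappers actually run.

```r
set.seed(42)
n <- 300
X <- data.frame(matrix(rnorm(n * 10), nrow = n))  # columns X1, ..., X10
Y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)

# Hypothetical screen: keep variables whose univariate p-value is at or
# below the cut-off, and always keep at least min_keep variables
screen_by_pvalue <- function(Y, X, cutoff = 0.1, min_keep = 2) {
  pvals <- vapply(X, function(x) cor.test(x, Y)$p.value, numeric(1))
  keep <- pvals <= cutoff
  if (sum(keep) < min_keep) {
    keep[rank(pvals) <= min_keep] <- TRUE
  }
  keep
}

keep <- screen_by_pvalue(Y, X)
names(keep)[keep]

# The prediction algorithm is then fit on the selected columns only
fit <- lm(Y ~ ., data = X[, keep, drop = FALSE])
```

The screen returns one logical value per column of `X`, so the same selection can be reapplied to new data before predicting.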

The SuperLearner package contains a few prediction and screening algorithm wrappers. The full list of wrappers can be viewed with `listWrappers()`. The design of the SuperLearner package is such that the user can easily add their own wrappers. We also maintain a website with additional examples of wrapper functions at https://github.com/ecpolley/SuperLearnerExtra.
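To make the idea of a user-written wrapper concrete, here is a small sketch. It assumes the convention followed by the package's built-in learners: a function that takes `Y`, `X`, `newX`, `family`, and `obsWeights` and returns a list with elements `pred` and `fit`, together with a `predict` method for the class of `fit`. The wrapper name `SL.median` is hypothetical, and the final lines only exercise the function on its own, outside the package.

```r
# Hypothetical wrapper: predicts the (unweighted) training median of Y
SL.median <- function(Y, X, newX, family, obsWeights, ...) {
  med <- median(Y)
  pred <- rep(med, nrow(newX))
  fit <- list(object = med)
  class(fit) <- "SL.median"
  list(pred = pred, fit = fit)
}

# predict method so the fitted object can score new data later
predict.SL.median <- function(object, newdata, ...) {
  rep(object$object, nrow(newdata))
}

# Stand-alone check that the wrapper returns the expected structure
X_demo <- data.frame(x = 1:5)
out <- SL.median(Y = c(1, 2, 3, 4, 100), X = X_demo, newX = X_demo,
                 family = gaussian(), obsWeights = rep(1, 5))
out$pred
predict(out$fit, newdata = data.frame(x = 1:2))
```

Under that assumption, the wrapper could be listed by name in `SL.library` alongside the built-in ones, for example `c("SL.glm", "SL.median")`.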

`call` |
The matched call. |

`libraryNames` |
A character vector with the names of the algorithms in the library. The format is 'predictionAlgorithm_screeningAlgorithm' with '_All' used to denote the prediction algorithm run on all variables in X. |

`SL.library` |
Returns |

`SL.predict` |
The predicted values from the super learner for the rows in |

`coef` |
Coefficients for the super learner. |

`library.predict` |
A matrix with the predicted values from each algorithm in |

`Z` |
The Z matrix (the cross-validated predicted values for each algorithm in |

`cvRisk` |
A numeric vector with the V-fold cross-validated risk estimate for each algorithm in |

`family` |
Returns the |

`fitLibrary` |
A list with the fitted objects for each algorithm in |

`varNames` |
A character vector with the names of the variables in |

`validRows` |
A list containing the row numbers for the V-fold cross-validation step. |

`method` |
A list with the method functions. |

`whichScreen` |
A logical matrix indicating which variables passed each screening algorithm. |

`control` |
The |

`split` |
The |

`errorsInCVLibrary` |
A logical vector indicating if any algorithms experienced an error within the CV step. |

`errorsInLibrary` |
A logical vector indicating if any algorithms experienced an error on the full data. |

Eric C Polley epolley@uchicago.edu

van der Laan, M. J., Polley, E. C. and Hubbard, A. E. (2008) Super Learner, *Statistical Applications in Genetics and Molecular Biology*, **6**, article 25.

```
## Not run:
library(SuperLearner)

## simulate data
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep = "")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
## test set
m <- 1000
newX <- matrix(rnorm(m * p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep = "")
newX <- data.frame(newX)
newY <- newX[, 1] + sqrt(abs(newX[, 2] * newX[, 3])) + newX[, 2] -
  newX[, 3] + rnorm(m)
# generate Library and run Super Learner
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam",
  "SL.polymars", "SL.mean")
test <- SampleSplitSuperLearner(Y = Y, X = X, newX = newX, SL.library = SL.library,
  verbose = TRUE, method = "method.NNLS")
test
# library with screening
SL.library <- list(c("SL.glmnet", "All"), c("SL.glm", "screen.randomForest",
  "All", "screen.SIS"), "SL.randomForest", c("SL.polymars", "All"), "SL.mean")
test <- SampleSplitSuperLearner(Y = Y, X = X, newX = newX, SL.library = SL.library,
  verbose = TRUE, method = "method.NNLS")
test
# binary outcome
set.seed(1)
N <- 200
X <- matrix(rnorm(N*10), N, 10)
X <- as.data.frame(X)
Y <- rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] +
  .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))
SL.library <- c("SL.glmnet", "SL.glm", "SL.knn", "SL.mean")
# least squares loss function
test.NNLS <- SampleSplitSuperLearner(Y = Y, X = X, SL.library = SL.library,
  verbose = TRUE, method = "method.NNLS", family = binomial())
test.NNLS
## End(Not run)
```
