cvrisk: Cross-Validation
In mboost: Model-Based Boosting


Cross-validated estimation of the empirical risk for hyper-parameter selection.

## S3 method for class 'mboost'
cvrisk(object, folds = cv(model.weights(object)),
       grid = 1:mstop(object),
       papply = mclapply,
       fun = NULL, corrected = TRUE, mc.preschedule = FALSE, ...)
cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

`object`	an object of class `mboost`.
`folds`	a weight matrix with number of rows equal to the number of observations. The number of columns corresponds to the number of cross-validation runs. Can be computed using function `cv` and defaults to 25 bootstrap samples.
`grid`	a vector of stopping parameters the empirical risk is to be evaluated for.
`papply`	(parallel) apply function, defaults to `mclapply`. Alternatively, `parLapply` can be used; this usually requires more setup (see the examples for details). To run `cvrisk` sequentially (i.e., not in parallel), use `lapply`.
`fun`	if `fun` is NULL, the out-of-sample risk is returned. `fun`, as a function of `object`, may extract any other characteristic of the cross-validated models. These are returned as is.
`corrected`	if `TRUE`, the corrected cross-validation scheme of Verweij and van Houwelingen (1993) is used in case of Cox models. Otherwise, the naive standard cross-validation scheme is used.
`mc.preschedule`	whether to preschedule tasks when they are parallelized using `mclapply` (default: `FALSE`). For details see `mclapply`.
`weights`	a numeric vector of weights for the model to be cross-validated.
`type`	character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented.
`B`	number of folds, by default 25 for `bootstrap` and `subsampling` and 10 for `kfold`.
`prob`	proportion of observations (a value between 0 and 1, default 0.5) to be included in the learning samples for subsampling.
`strata`	a factor of the same length as `weights` for stratification.
`...`	additional arguments passed to `mclapply`.
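
The `fun` argument can be illustrated with a small sketch. It assumes the `bodyfat` data from the TH.data package (also used in the examples on this page) and uses `coef` as the extractor; the choice of 5 folds is arbitrary. The result is returned as is, so it should be a list with one element per cross-validation run rather than a risk matrix.

```r
library("mboost")
data("bodyfat", package = "TH.data")

model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

set.seed(290875)
folds <- cv(model.weights(model), type = "kfold", B = 5)

## fun is applied to the model fitted in each cross-validation run;
## here it extracts the coefficients instead of the out-of-sample risk
cf <- cvrisk(model, folds = folds, papply = lapply,
             fun = function(object) coef(object))

length(cf)  # expected to match ncol(folds)
cf[[1]]     # coefficients from the first run
```

Comparing the elements of `cf` gives a rough impression of how stable the selected coefficients are across folds.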

The number of boosting iterations is a hyper-parameter of the boosting algorithms implemented in this package. This function computes honest, i.e., cross-validated, estimates of the empirical risk for different stopping parameters `mstop`. These estimates can be used to choose an appropriate number of boosting iterations.

Different forms of cross-validation can be applied, for example 10-fold cross-validation or bootstrapping. The weights (zero weights correspond to test cases) are defined via the folds matrix.
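
To make the layout of the folds matrix concrete, the following base R sketch builds a 10-fold weight matrix by hand. The sizes `n` and `k` are hypothetical, and the construction is only an illustration of the weight convention, not necessarily how `cv` builds its matrix internally.

```r
set.seed(1)
n <- 20  # hypothetical number of observations
k <- 10  # number of folds

## assign each observation to exactly one fold, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

## column j: weight 1 = learning sample, weight 0 = test case in run j
folds <- sapply(seq_len(k), function(j) as.numeric(fold_id != j))

dim(folds)           # n rows, k columns
colSums(folds == 0)  # number of test cases in each run
rowSums(folds == 0)  # how often each observation is held out
```

A matrix of this shape can be passed as `folds`, provided it has one row per observation of the fitted model.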

cvrisk runs in parallel on OSes where forking is possible (i.e., not on Windows) and multiple cores/processors are available. The scheduling can be changed by the corresponding arguments of mclapply (via the dot arguments).
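
As a sketch of passing scheduling options through the dot arguments, the call below forwards `mc.cores`, which is an argument of `parallel::mclapply`. The value 2 is an arbitrary assumption; on Windows, where forking is unavailable, the sketch falls back to a single core.

```r
library("mboost")
library("parallel")
data("bodyfat", package = "TH.data")

model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

## mclapply accepts mc.cores > 1 only where forking is possible
cores <- if (.Platform$OS.type == "windows") 1L else 2L

## mc.cores is forwarded to mclapply via "..."
set.seed(290875)
cvm <- cvrisk(model, mc.cores = cores)
mstop(cvm)
```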

The function cv can be used to build an appropriate weight matrix to be used with cvrisk. If strata is defined, sampling is performed in each stratum separately, thus preserving the distribution of the strata variable in each fold.
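
A minimal sketch of stratified folds, using made-up data: the grouping factor `grp` and the unit weights `w` are hypothetical and only serve to show the call.

```r
library("mboost")

## hypothetical data: 30 observations in two unbalanced groups
grp <- factor(rep(c("a", "b"), times = c(10, 20)))
w <- rep(1, length(grp))

set.seed(290875)
folds <- cv(w, type = "kfold", B = 5, strata = grp)

## held-out observations (zero weights) per group (rows) and fold (columns)
held_out <- apply(folds == 0, 2, function(out) table(grp[out]))
held_out
```

If the stratification works as described above, every column of `held_out` should show roughly the same ratio between the two groups as the full data.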

An object of class cvrisk (when fun is not specified), basically a matrix containing estimates of the empirical risk for a varying number of boosting iterations. plot and print methods are available, as well as an mstop method.
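
The returned object is typically used to set the stopping iteration of the fitted model. The sketch below assumes the `bodyfat` data from the TH.data package and the default 25 bootstrap folds; it uses the `mstop<-` replacement function of mboost to apply the selected value.

```r
library("mboost")
data("bodyfat", package = "TH.data")

model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

set.seed(290875)
cvm <- cvrisk(model, papply = lapply)

## stopping iteration with the smallest average out-of-sample risk
m_opt <- mstop(cvm)

## set the fitted model to that number of boosting iterations
mstop(model) <- m_opt
mstop(model)
```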

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2005), The design and analysis of benchmark experiments. Journal of Computational and Graphical Statistics, 14(3), 675–699.

Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012). The importance of knowing when to stop - a sequential stopping rule for component-wise gradient boosting. Methods of Information in Medicine, 51, 178–186.
DOI: http://dx.doi.org/10.3414/ME11-02-0030

Verweij and van Houwelingen (1993). Cross-validation in survival analysis. Statistics in Medicine, 12:2305–2314.

AIC.mboost for AIC-based selection of the stopping iteration. Use mstop to extract the optimal stopping iteration from a cvrisk object.

  library("mboost")
  data("bodyfat", package = "TH.data")

  ### fit linear model to data
  model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

  ### AIC-based selection of number of boosting iterations
  maic <- AIC(model)
  maic

  ### inspect coefficient path and AIC-based stopping criterion
  par(mai = par("mai") * c(1, 1, 1, 1.8))
  plot(model)
  abline(v = mstop(maic), col = "lightgray")

  ### 10-fold cross-validation
  cv10f <- cv(model.weights(model), type = "kfold")
  cvm <- cvrisk(model, folds = cv10f, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### 25 bootstrap iterations (manually)
  set.seed(290875)
  n <- nrow(bodyfat)
  bs25 <- rmultinom(25, n, rep(1, n)/n)
  cvm <- cvrisk(model, folds = bs25, papply = lapply)
  print(cvm)
  mstop(cvm)
  plot(cvm)

  ### same by default
  set.seed(290875)
  cvrisk(model, papply = lapply)

  ### 25 bootstrap iterations (using cv)
  set.seed(290875)
  bs25_2 <- cv(model.weights(model), type="bootstrap")
  all(bs25 == bs25_2)

  ### trees
  blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
  cvtree <- cvrisk(blackbox, papply = lapply)
  plot(cvtree)


  ### cvrisk in parallel modes:

  ## Not run: ## parallel::mclapply only runs in parallel on unix systems
    cvrisk(model)
  
## End(Not run)

  ## Not run: ## infrastructure needs to be set up in advance
    library("parallel")   # provides makeCluster, parLapply and stopCluster
    cl <- makeCluster(25) # e.g. to run cvrisk on 25 nodes
    myApply <- function(X, FUN, ...) {
      myFun <- function(...) {
          library("mboost") # load mboost on nodes
          FUN(...)
      }
      ## further set up steps as required
      parLapply(cl = cl, X, myFun, ...)
    }
    cvrisk(model, papply = myApply)
    stopCluster(cl)
  
## End(Not run)