View source: R/crossvalidation.R

cvrisk | R Documentation |

Cross-validated estimation of the empirical risk for hyper-parameter selection.

    ## S3 method for class 'mboost'
    cvrisk(object, folds = cv(model.weights(object)),
           grid = 0:mstop(object),
           papply = mclapply,
           fun = NULL, mc.preschedule = FALSE, ...)

    cv(weights, type = c("bootstrap", "kfold", "subsampling"),
       B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

    ## Plot cross-validation results
    ## S3 method for class 'cvrisk'
    plot(x, xlab = "Number of boosting iterations", ylab = attr(x, "risk"),
         ylim = range(x), main = attr(x, "type"), ...)

`object` |
an object of class |

`folds` |
a weight matrix with number of rows equal to the number
of observations. The number of columns corresponds to
the number of cross-validation runs. Can be computed
using function |

`grid` |
a vector of stopping parameters the empirical risk is to be evaluated for. |

`papply` |
(parallel) apply function, defaults to |

`fun` |
if |

`mc.preschedule` |
preschedule tasks if they are parallelized using |

`weights` |
a numeric vector of weights for the model to be cross-validated. |

`type` |
character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented. |

`B` |
number of folds, per default 25 for |

`prob` |
proportion (between 0 and 1) of observations to be included in the learning samples for subsampling. |

`strata` |
a factor of the same length as |

`x` |
an object of class |

`xlab, ylab` |
axis labels. |

`ylim` |
limits of y-axis. |

`main` |
main title of graphic. |

`...` |
additional arguments passed to |

The number of boosting iterations is a hyper-parameter of the
boosting algorithms implemented in this package. This function
computes honest, i.e., cross-validated, estimates of the empirical risk
for different stopping parameters `mstop`, which can be used to choose
an appropriate number of boosting iterations.

Different forms of cross-validation can be applied, for example
10-fold cross-validation or bootstrapping. The weights (zero weights
correspond to test cases) are defined via the `folds` matrix.
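To make the layout of such a weight matrix concrete, here is a minimal base-R sketch that builds one for k-fold cross-validation by hand. The helper `kfold_weights` is hypothetical and not part of mboost; it only illustrates the convention described above (one row per observation, one column per run, zero weight marks a test case).

```r
## Hypothetical helper, not part of mboost: build a 0/1 weight matrix
## for k-fold cross-validation over n observations.
kfold_weights <- function(n, k = 10) {
  ## assign each observation to one of k folds at random
  fold_id <- sample(rep(seq_len(k), length.out = n))
  ## column j has weight 0 for observations held out in run j, 1 otherwise
  sapply(seq_len(k), function(j) as.numeric(fold_id != j))
}

set.seed(1)
w <- kfold_weights(n = 20, k = 5)
dim(w)            # one row per observation, one column per run
colSums(w == 0)   # number of test cases in each run
rowSums(w == 0)   # how often each observation is a test case
```

A matrix of this shape could, in principle, be passed as `folds`; in practice `cv` is the intended way to create it.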

`cvrisk` runs in parallel on operating systems where forking is possible
(i.e., not on Windows) and multiple cores/processors are available.
The scheduling can be changed by the corresponding arguments of
`mclapply` (via the dot arguments).

The function `cv` can be used to build an appropriate
weight matrix to be used with `cvrisk`. If `strata` is defined,
sampling is performed in each stratum separately, thus preserving
the distribution of the `strata` variable in each fold.
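The effect of stratification can be sketched in base R as follows. This is an illustration of the idea only, under the assumption of k-fold splitting; `stratified_kfold` is a hypothetical helper and does not reproduce how `cv` is implemented internally.

```r
## Hypothetical helper, not mboost's implementation of cv():
## assign folds separately within each level of a stratification factor.
stratified_kfold <- function(strata, k = 5) {
  strata <- as.factor(strata)
  fold_id <- integer(length(strata))
  for (lev in levels(strata)) {
    idx <- which(strata == lev)
    ## spread the members of this stratum evenly over the k folds
    fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
  }
  ## zero weight marks the held-out (test) observations of each run
  sapply(seq_len(k), function(j) as.numeric(fold_id != j))
}

set.seed(2)
grp <- factor(rep(c("a", "b"), times = c(30, 10)))
w <- stratified_kfold(grp, k = 5)
## test cases per stratum (rows) and run (columns)
apply(w == 0, 2, function(held_out) table(grp[held_out]))
```

Because fold labels are drawn within each stratum, every run holds out the same share of each group, which is the property the `strata` argument is meant to guarantee.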

There exist various functions to display and work with
cross-validation results. One can `print` and `plot` (see above)
results and extract the optimal iteration via `mstop`.

An object of class `cvrisk` (when `fun` wasn't specified), basically a matrix
containing estimates of the empirical risk for a varying number
of boosting iterations. `plot` and `print` methods
are available as well as an `mstop` method.
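As a rough illustration of how an optimal stopping iteration can be read off such a matrix, the sketch below uses a toy risk matrix with made-up numbers. It assumes rows are cross-validation runs, columns are the values in `grid`, and the optimum is the grid value with the lowest risk averaged over runs; the actual `mstop` method may differ in detail.

```r
## Toy risk matrix: 3 runs (rows) by 6 candidate iteration counts (columns).
## The numbers are made up for illustration only.
risk <- rbind(
  c(5.0, 3.1, 2.4, 2.2, 2.5, 2.9),
  c(4.8, 3.0, 2.6, 2.1, 2.3, 2.8),
  c(5.2, 3.3, 2.5, 2.4, 2.6, 3.0)
)
colnames(risk) <- 0:5   # grid of candidate stopping iterations

mean_risk <- colMeans(risk)                       # average over runs
best <- as.numeric(names(which.min(mean_risk)))   # grid value with lowest mean risk
best
```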

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2005),
The design and analysis of benchmark experiments.
*Journal of Computational and Graphical Statistics*, **14**(3),
675–699.

Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012). The
importance of knowing when to stop - a sequential stopping rule for
component-wise gradient boosting. *Methods of Information in
Medicine*, **51**, 178–186.

DOI: 10.3414/ME11-02-0030

`AIC.mboost` for `AIC` based selection of the stopping iteration.
Use `mstop` to extract the optimal stopping iteration from a `cvrisk`
object.

    data("bodyfat", package = "TH.data")

    ### fit linear model to data
    model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)

    ### AIC-based selection of number of boosting iterations
    maic <- AIC(model)
    maic

    ### inspect coefficient path and AIC-based stopping criterion
    par(mai = par("mai") * c(1, 1, 1, 1.8))
    plot(model)
    abline(v = mstop(maic), col = "lightgray")

    ### 10-fold cross-validation
    cv10f <- cv(model.weights(model), type = "kfold")
    cvm <- cvrisk(model, folds = cv10f, papply = lapply)
    print(cvm)
    mstop(cvm)
    plot(cvm)

    ### 25 bootstrap iterations (manually)
    set.seed(290875)
    n <- nrow(bodyfat)
    bs25 <- rmultinom(25, n, rep(1, n)/n)
    cvm <- cvrisk(model, folds = bs25, papply = lapply)
    print(cvm)
    mstop(cvm)
    plot(cvm)

    ### same by default
    set.seed(290875)
    cvrisk(model, papply = lapply)

    ### 25 bootstrap iterations (using cv)
    set.seed(290875)
    bs25_2 <- cv(model.weights(model), type = "bootstrap")
    all(bs25 == bs25_2)

    ## Not run: 
    ############################################################
    ## Do not run this example automatically as it takes
    ## some time (~ 5 seconds depending on the system)

    ### trees
    blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
    cvtree <- cvrisk(blackbox, papply = lapply)
    plot(cvtree)

    ## End(Not run this automatically)

    ## End(Not run)

    ### cvrisk in parallel modes:

    ## Not run: 
    ## at least not automatically

    ## parallel::mclapply() which is used here for parallelization only runs
    ## on unix systems (here we use 2 cores)
    cvrisk(model, mc.cores = 2)

    ## infrastructure needs to be set up in advance
    cl <- makeCluster(25) # e.g. to run cvrisk on 25 nodes via PVM
    myApply <- function(X, FUN, ...) {
      myFun <- function(...) {
        library("mboost") # load mboost on nodes
        FUN(...)
      }
      ## further set up steps as required
      parLapply(cl = cl, X, myFun, ...)
    }
    cvrisk(model, papply = myApply)
    stopCluster(cl)

    ## End(Not run)
