View source: R/crossvalidation.R

cvrisk | R Documentation |

Cross-validated estimation of the empirical risk for hyper-parameter selection.

```
## S3 method for class 'mboost'
cvrisk(object, folds = cv(model.weights(object)),
       grid = 0:mstop(object),
       papply = mclapply,
       fun = NULL, mc.preschedule = FALSE, ...)

cv(weights, type = c("bootstrap", "kfold", "subsampling"),
   B = ifelse(type == "kfold", 10, 25), prob = 0.5, strata = NULL)

## Plot cross-validation results
## S3 method for class 'cvrisk'
plot(x,
     xlab = "Number of boosting iterations", ylab = attr(x, "risk"),
     ylim = range(x), main = attr(x, "type"), ...)
```

`object` |
an object of class `mboost`. |

`folds` |
a weight matrix with number of rows equal to the number
of observations. The number of columns corresponds to
the number of cross-validation runs. Can be computed
using function `cv`. |

`grid` |
a vector of stopping parameters the empirical risk is to be evaluated for. |

`papply` |
(parallel) apply function, defaults to `mclapply`. |

`fun` |
if |

`mc.preschedule` |
preschedule tasks if they are parallelized using `mclapply`. |

`weights` |
a numeric vector of weights for the model to be cross-validated. |

`type` |
character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation and subsampling are implemented. |

`B` |
number of folds, per default 25 for `bootstrap` and `subsampling` and 10 for `kfold`. |

`prob` |
fraction of observations (default 0.5) to be included in the learning samples for subsampling. |

`strata` |
a factor of the same length as `weights` defining the strata. |

`x` |
an object of class `cvrisk`. |

`xlab` , `ylab` |
axis labels. |

`ylim` |
limits of y-axis. |

`main` |
main title of graphic. |

`...` |
additional arguments passed to `mclapply` or `plot`. |

The number of boosting iterations is a hyper-parameter of the
boosting algorithms implemented in this package. Honest,
i.e., cross-validated, estimates of the empirical risk
for different stopping parameters `mstop` are computed by
this function, which can be used to choose an appropriate
number of boosting iterations.

Different forms of cross-validation can be applied, for example
10-fold cross-validation or bootstrapping. The weights (zero weights
correspond to test cases) are defined via the `folds` matrix.
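As a rough illustration of what such a weight matrix looks like, the following base R sketch builds a k-fold matrix by hand. The sizes `n` and `k` and the names `fold_id` and `w` are made up for this example; it shows the idea only and is not the code used by `cv`.

```r
## Toy setup: 12 observations, 4 folds (both numbers are arbitrary).
set.seed(1)
n <- 12
k <- 4
w <- rep(1, n)                       # hypothetical original model weights

## Randomly assign each observation to exactly one of the k folds.
fold_id <- sample(rep(seq_len(k), length.out = n))

## Column j holds the training weights for run j: observations in
## fold j get weight 0 (test cases), all others keep their weight.
folds <- sapply(seq_len(k), function(j) w * (fold_id != j))

dim(folds)             # n rows, k columns
rowSums(folds == 0)    # each observation is a test case exactly once
```

A matrix of this shape (one row per observation, one column per run) is what the `folds` argument expects.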

`cvrisk` runs in parallel on OSes where forking is possible
(i.e., not on Windows) and multiple cores/processors are available.
The scheduling can be changed by the corresponding arguments of
`mclapply` (via the dot arguments).
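One way to keep a script portable across operating systems is to wrap the apply function. The sketch below defines a hypothetical helper `safe_papply` (not part of mboost) with the same `(X, FUN, ...)` calling convention as `lapply`; the choice of two cores is an assumption for illustration, and the helper has not been checked against every way `cvrisk` may call `papply`.

```r
library(parallel)

## Hypothetical helper with the (X, FUN, ...) signature of lapply():
## forks via mclapply() where possible, otherwise runs sequentially.
safe_papply <- function(X, FUN, ...) {
  if (.Platform$OS.type == "windows") {
    lapply(X, FUN, ...)
  } else {
    mclapply(X, FUN, ..., mc.cores = 2L, mc.preschedule = FALSE)
  }
}

## Stand-alone check that the wrapper behaves like lapply().
res <- safe_papply(1:4, function(i) i^2)
unlist(res)
```

Such a wrapper could then presumably be supplied as `cvrisk(model, papply = safe_papply)`; because the number of cores is fixed inside the helper, `mc.cores` should not be passed a second time through the dot arguments.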

The function `cv` can be used to build an appropriate
weight matrix to be used with `cvrisk`. If `strata` is defined,
sampling is performed in each stratum separately, thus preserving
the distribution of the `strata` variable in each fold.
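The effect of stratified sampling can be sketched in base R as follows. The toy factor, the number of folds and the name `fold_id` are assumptions made for this example, and the loop only mimics the idea of sampling within each stratum; it is not the implementation inside `cv`.

```r
set.seed(2)
## Toy strata: 8 observations in stratum "a", 4 in stratum "b".
strata <- factor(rep(c("a", "b"), times = c(8, 4)))
k <- 4

## Assign fold ids within each stratum separately.
fold_id <- integer(length(strata))
for (s in levels(strata)) {
  idx <- which(strata == s)
  fold_id[idx] <- sample(rep(seq_len(k), length.out = length(idx)))
}

## Every fold ends up with the same mix: two "a" and one "b".
tab <- table(fold_id, strata)
tab
```

Without the per-stratum loop, a small stratum such as "b" could by chance be missing from some folds entirely.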

There exist various functions to display and work with
cross-validation results. One can `print` and `plot` (see above)
results and extract the optimal iteration via `mstop`.

An object of class `cvrisk` (when `fun` wasn't specified), basically a matrix
containing estimates of the empirical risk for a varying number
of boosting iterations. `plot` and `print` methods
are available as well as a `mstop` method.
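To make the structure of such a result concrete, the sketch below uses a small made-up risk matrix and picks the stopping iteration with the lowest average out-of-sample risk. It assumes that rows index cross-validation runs and columns index the values in `grid`; the numbers are invented, and the selection rule shown is an illustration of the idea rather than the exact code of the `mstop` method.

```r
## Hypothetical risk matrix: 3 runs (rows) x 5 grid values (columns).
grid <- 0:4
risk <- rbind(c(10, 6, 4, 5, 7),
              c(11, 7, 5, 4, 6),
              c( 9, 6, 3, 5, 8))
colnames(risk) <- grid

## Average the out-of-sample risk over runs and pick the minimiser.
mean_risk <- colMeans(risk)
best <- grid[which.min(mean_risk)]
best   # 2 for this toy matrix
```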

Torsten Hothorn, Friedrich Leisch, Achim Zeileis and Kurt Hornik (2006),
The design and analysis of benchmark experiments.
*Journal of Computational and Graphical Statistics*, **14**(3),
675–699.

Andreas Mayr, Benjamin Hofner, and Matthias Schmid (2012). The
importance of knowing when to stop - a sequential stopping rule for
component-wise gradient boosting. *Methods of Information in
Medicine*, **51**, 178–186.

DOI: 10.3414/ME11-02-0030

`AIC.mboost` for `AIC` based selection of the stopping iteration.
Use `mstop` to extract the optimal stopping iteration from a `cvrisk`
object.

```
library("mboost")
data("bodyfat", package = "TH.data")
### fit linear model to data
model <- glmboost(DEXfat ~ ., data = bodyfat, center = TRUE)
### AIC-based selection of number of boosting iterations
maic <- AIC(model)
maic
### inspect coefficient path and AIC-based stopping criterion
par(mai = par("mai") * c(1, 1, 1, 1.8))
plot(model)
abline(v = mstop(maic), col = "lightgray")
### 10-fold cross-validation
cv10f <- cv(model.weights(model), type = "kfold")
cvm <- cvrisk(model, folds = cv10f, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### 25 bootstrap iterations (manually)
set.seed(290875)
n <- nrow(bodyfat)
bs25 <- rmultinom(25, n, rep(1, n)/n)
cvm <- cvrisk(model, folds = bs25, papply = lapply)
print(cvm)
mstop(cvm)
plot(cvm)
### same by default
set.seed(290875)
cvrisk(model, papply = lapply)
### 25 bootstrap iterations (using cv)
set.seed(290875)
bs25_2 <- cv(model.weights(model), type = "bootstrap")
all(bs25 == bs25_2)
## Not run:
############################################################
## Do not run this example automatically as it takes
## some time (~ 5 seconds depending on the system)
### trees
blackbox <- blackboost(DEXfat ~ ., data = bodyfat)
cvtree <- cvrisk(blackbox, papply = lapply)
plot(cvtree)
## End(Not run this automatically)
## End(Not run)
### cvrisk in parallel modes:
## Not run:
## at least not automatically
## parallel::mclapply() which is used here for parallelization only runs
## on unix systems (here we use 2 cores)
cvrisk(model, mc.cores = 2)
## infrastructure needs to be set up in advance
library("parallel")   # for makeCluster(), parLapply(), stopCluster()
cl <- makeCluster(25) # e.g. to run cvrisk on 25 nodes via PVM
myApply <- function(X, FUN, ...) {
    myFun <- function(...) {
        library("mboost") # load mboost on nodes
        FUN(...)
    }
    ## further set up steps as required
    parLapply(cl = cl, X, myFun, ...)
}
cvrisk(model, papply = myApply)
stopCluster(cl)
## End(Not run)
```
