validateFDboost: Cross-Validation and Bootstrapping over Curves
In FDboost: Boosting Functional Regression Models

Description Usage Arguments Details Value Note See Also Examples

Function validateFDboost() is experimental. Cross-Validation and bootstrapping over curves to compute the empirical risk for hyper-parameter selection and to compute resampled coefficients and predictions for the models.

applyFolds(object, folds = cv(rep(1, length(unique(object$id))), type =
  "bootstrap"), grid = 1:mstop(object), fun = NULL, riskFun = NULL,
  numInt = object$numInt, papply = mclapply, mc.preschedule = FALSE,
  showProgress = TRUE, ...)

validateFDboost(object, response = NULL, folds = cv(rep(1,
  length(unique(object$id))), type = "bootstrap"), grid = 1:mstop(object),
  fun = NULL, getCoefCV = TRUE, riskopt = c("mean", "median"),
  mrdDelete = 0, refitSmoothOffset = TRUE, showProgress = TRUE, ...)

## S3 method for class 'FDboost'
cvrisk(object, folds = cvLong(id = object$id, weights =
  model.weights(object)), grid = 1:mstop(object), papply = mclapply,
  fun = NULL, corrected = TRUE, mc.preschedule = FALSE, ...)

cvLong(id, weights = rep(1, l = length(id)), type = c("bootstrap", "kfold",
  "subsampling", "curves"), B = ifelse(type == "kfold", 10, 25), prob = 0.5,
  strata = NULL)

cvMa(ydim, weights = rep(1, l = ydim[1] * ydim[2]), type = c("bootstrap",
  "kfold", "subsampling", "curves"), B = ifelse(type == "kfold", 10, 25),
  prob = 0.5, strata = NULL, ...)

`object`	fitted FDboost-object
`folds`	a weight matrix with number of rows equal to the number of observed trajectories.
`grid`	the grid over which the optimal number of boosting iterations (mstop) is searched.
`fun`	if `fun` is `NULL`, the out-of-bag risk is returned. `fun`, as a function of `object`, may extract any other characteristic of the cross-validated models. These are returned as is.
`riskFun`	only exists in `applyFolds`; allows to compute other risk functions than the risk of the family that was specified in object. Must be specified as function of arguments `(y, f, w = 1)`, where `y` is the observed response, `f` is the prediciton from the model and `w` is the weight. The risk function must return a scalar numeric value for vector valued imput.
`numInt`	only exists in `applyFolds`; the scheme for numerical integration, see `numInt` in `FDboost`.
`papply`	(parallel) apply function, defaults to `mclapply`, see `cvrisk` for details.
`mc.preschedule`	Defaults to `FALSE`. Preschedule tasks if are parallelized using `mclapply`? For details see `mclapply`.
`showProgress`	logical, defaults to `TRUE`.
`...`	further arguments passed to `mclapply`
`response`	optional, specify a response vector for the computation of the prediction errors. Defaults to `NULL` which means that the response of the fitted model is used.
`getCoefCV`	logical, defaults to `TRUE`. Should the coefficients and predictions be computed for all the models on the sampled data?
`riskopt`	how is the optimal stopping iteration determined. Defaults to the mean, but median is possible as well.
`mrdDelete`	Delete values that are `mrdDelete` percent smaller than the mean of the response. Defaults to 0 which means that only response values being 0 are not used in the calculation of the MRD (= mean relative deviation).
`refitSmoothOffset`	logical, should the offset be refitted in each learning sample? Defaults to `TRUE`. In `cvrisk` the offset of the original model fit in `object` is used in all folds.
`corrected`	see `cvrisk`.
`id`	the id-vector as integers 1, 2, ... specifying which observations belong to the same curve, deprecated in `cvMa()`.
`weights`	a numeric vector of (integration) weights, defaults to 1.
`type`	character argument for specifying the cross-validation method. Currently (stratified) bootstrap, k-fold cross-validation, subsampling and leaving-one-curve-out cross validation (i.e. jack knife on curves) are implemented.
`B`	number of folds, per default 25 for `bootstrap` and `subsampling` and 10 for `kfold`.
`prob`	percentage of observations to be included in the learning samples for subsampling.
`strata`	a factor of the same length as `weights` for stratification.
`ydim`	dimensions of response-matrix

The number of boosting iterations is an important hyper-parameter of boosting and can be chosen using the functions applyFolds, cvrisk.FDboost and validateFDboost as they compute honest, i.e., out-of-bag, estimates of the empirical risk for different numbers of boosting iterations. The weights (zero weights correspond to test cases) are defined via the folds matrix, see cvrisk in package mboost.

applyFolds is still EXPERIMENTAL! The function applyFolds is especially suited to models with functional response. It recomputes the model in each fold using FDboost. Thus, all parameters are recomputed, including the smooth offset (if present) and the identifiability constraints (if present, only relevant for bolsc, brandomc and bbsc). Note, that the function applyFolds expects folds that give weights per trajectory without considering integration weights.

The function validateFDboost is especially suited to models with functional response. Using the option refitSmoothOffset the offset is refitted on each fold. Note, that the function validateFDboost expects folds that give weights per trajectory without considering integration weights. The integration weights of object are used to compute the empirical risk as integral. The argument response can be useful in simulation studies where the true value of the response is known but for the model fit the response is used with noise.

The function cvrisk.FDboost is a wrapper for cvrisk in package mboost. It overrides the default for the folds, so that the folds are sampled on the level of curves (not on the level of single observations, which does not make sense for functional response). Note that the smooth offset and the computation of the identifiability constraints are not part of the refitting if cvrisk is used. Per default the integration weights of the model fit are used to compute the prediction errors (as the integration weights are part of the default folds). Note that in cvrisk the weights are rescaled to sum up to one.

The functions cvMa and cvLong can be used to build an appropriate weight matrix for functional response to be used with cvrisk as sampling is done on the level of curves. The probability for each trajectory to enter a fold is equal over all trajectories. The function cvMa takes the dimensions of the response matrix as input argument and thus can only be used for regularly observed response. The function cvLong takes the id variable and the weights as arguments and thus can be used for responses in long format that are potentially observed irregularly.

If strata is defined sampling is performed in each stratum separately thus preserving the distribution of the strata variable in each fold.

cvMa and cvLong return a matrix of weights to be used in cvrisk.

The functions applyFolds and cvrisk.FDboost return a cvrisk-object, which is a matrix of the computed out-of-bag risk. The matrix has the folds in rows and the number of boosting iteratins in columns. Furhtermore, the matrix has attributes including:

`risk`	name of the applied risk function
`call`	model call of the model object
`mstop`	gird of stopping iterations that is used
`type`	name for the type of folds

The function validateFDboost returns a validateFDboost-object, which is a named list containing:

`response`	the response
`yind`	the observation points of the response
`id`	the id variable of the response
`folds`	folds that were used
`grid`	grid of possible numbers of boosting iterations
`coefCV`	if `getCoefCV` is `TRUE` the estimated coefficient functions in the folds
`predCV`	if `getCoefCV` is `TRUE` the out-of-bag predicted values of the response
`oobpreds`	if the type of folds is curves the out-of-bag predictions for each trajectory
`oobrisk`	the out-of-bag risk
`oobriskMean`	the out-of-bag risk at the minimal mean risk
`oobmse`	the out-of-bag mean squared error (MSE)
`oobrelMSE`	the out-of-bag relative mean squared error (relMSE)
`oobmrd`	the out-of-bag mean relative deviation (MRD)
`oobrisk0`	the out-of-bag risk without consideration of integration weights
`oobmse0`	the out-of-bag mean squared error (MSE) without consideration of integration weights
`oobmrd0`	the out-of-bag mean relative deviation (MRD) without consideration of integration weights
`format`	one of "FDboostLong" or "FDboost" depending on the class of the object
`fun_ret`	list of what fun returns if fun was specified

Use argument mc.cores = 1L to set the numbers of cores that is used in parallel computation. On Windows only 1 core is possible, mc.cores = 1, which is the default.

cvrisk to perform cross-validation with scalar response.

Ytest <- matrix(rnorm(15), ncol = 3) # 5 trajectories, each with 3 observations
Ylong <- as.vector(Ytest)
## 4-folds for bootstrap for the response in long format without integration weights
cvMa(ydim = c(5,3), type = "bootstrap", B = 4)
cvLong(id = rep(1:5, times = 3), type = "bootstrap", B = 4)

if(require(fda)){
 ## load the data
 data("CanadianWeather", package = "fda")

 ## use data on a daily basis
 canada <- with(CanadianWeather,
                list(temp = t(dailyAv[ , , "Temperature.C"]),
                     l10precip = t(dailyAv[ , , "log10precip"]),
                     l10precip_mean = log(colMeans(dailyAv[ , , "Precipitation.mm"]), base = 10),
                     lat = coordinates[ , "N.latitude"],
                     lon = coordinates[ , "W.longitude"],
                     region = factor(region),
                     place = factor(place),
                     day = 1:365,  ## corresponds to t: evaluation points of the fun. response
                     day_s = 1:365))  ## corresponds to s: evaluation points of the fun. covariate

## center temperature curves per day
canada$tempRaw <- canada$temp
canada$temp <- scale(canada$temp, scale = FALSE)
rownames(canada$temp) <- NULL ## delete row-names

## fit the model
mod <- FDboost(l10precip ~ 1 + bolsc(region, df = 4) +
                 bsignal(temp, s = day_s, cyclic = TRUE, boundary.knots = c(0.5, 365.5)),
               timeformula = ~ bbs(day, cyclic = TRUE, boundary.knots = c(0.5, 365.5)),
               data = canada)
mod <- mod[75]

## Not run: 
  #### create folds for 3-fold bootstrap: one weight for each curve
  set.seed(123)
  folds_bs <- cv(weights = rep(1, mod$ydim[1]), type = "bootstrap", B = 3)

  ## compute out-of-bag risk on the 3 folds for 1 to 75 boosting iterations
  cvr <- applyFolds(mod, folds = folds_bs, grid = 1:75)

  ## compute out-of-bag risk and coefficient estimates on folds
  cvr2 <- validateFDboost(mod, folds = folds_bs, grid = 1:75)

  ## weights per observation point
  folds_bs_long <- folds_bs[rep(1:nrow(folds_bs), times = mod$ydim[2]), ]
  attr(folds_bs_long, "type") <- "3-fold bootstrap"
  ## compute out-of-bag risk on the 3 folds for 1 to 75 boosting iterations
  cvr3 <- cvrisk(mod, folds = folds_bs_long, grid = 1:75)

## End(Not run)

## Not run: 
  ## plot the out-of-bag risk
  par(mfrow = c(1,3))
  plot(cvr); legend("topright", lty=2, paste(mstop(cvr)))
  plot(cvr2)
  plot(cvr3); legend("topright", lty=2, paste(mstop(cvr3)))

## End(Not run)

## Not run: 
  ## plot the estimated coefficients per fold
  ## more meaningful for higher number of folds, e.g., B = 100
  par(mfrow = c(2,2))
  plotPredCoef(cvr2, terms = FALSE, which = 2)
  plotPredCoef(cvr2, terms = FALSE, which = 3)

## End(Not run)

## Not run: 
  ## compute out-of-bag risk and predictions for leaving-one-curve-out cross-validation
  cvr_jackknife <- validateFDboost(mod, folds = cvLong(unique(mod$id),
                                   type = "curves"), grid = 1:75)
  plot(cvr_jackknife)
  ## plot oob predictions per fold for 3rd effect
  plotPredCoef(cvr_jackknife, which = 3)
  ## plot coefficients per fold for 2nd effect
  plotPredCoef(cvr_jackknife, which = 2, terms = FALSE)

## End(Not run)

}