gbm.cverr: Tune gbm via cross-validation to find the best set of...
In patr1ckm/mvtboost: Tree Boosting for Multivariate Outcomes


This function serves two purposes. First, it computes the cross-validation error across a user-supplied grid of gbm metaparameters, so that the model can be tuned easily for a given problem. This process can be run in parallel on Linux-based machines. Second, it removes the burden of selecting a maximum number of trees for each set of metaparameters: the algorithm keeps adding trees until the best number of trees has been identified according to the cross-validation error. In the standard approach to gbm, by contrast, a maximum n.trees must also be tuned. This lets users avoid two problems associated with a poorly chosen n.trees: (1) specifying too few trees and therefore using a sub-optimal model, and (2) specifying far more trees than necessary, which makes gbm run far longer than needed.

gbm.cverr(x, y, distribution = "gaussian", cv.folds = 5, fit.best = T,
  nt.start = 1000, nt.inc = 1000, verbose = T, w = NULL,
  var.monotone = NULL, interaction.depth = 1, n.minobsinnode = 10,
  shrinkage = 0.001, bag.fraction = 0.5, n.cores = 1, max.time = NULL,
  seed = NULL)

`x`	An n x p matrix or data frame of predictors.
`y`	An n x 1 matrix or vector corresponding to the observed outcome.
`distribution`	The distribution to use when fitting each `gbm` model. For continuous outcomes, the available distributions are "gaussian" (squared error) and "laplace" (absolute loss). For dichotomous outcomes, the available distributions are "bernoulli" (logistic regression for 0-1 outcomes) and "adaboost" (the AdaBoost exponential loss for 0-1 outcomes). Finally, the "poisson" distribution is available for count outcomes.
`cv.folds`	Number of cross-validation folds to perform.
`fit.best`	Logical variable indicating whether or not the best set of metaparameters (estimated according to cross-validation error) will be utilized to fit and return a `gbm.fit` object to the complete data.
`nt.start`	Initial number of trees used to model y.
`nt.inc`	Number of trees incrementally added until the cross-validation error is minimized or until `max.time` is reached (see below).
`verbose`	If TRUE, then `gbm.cverr` will print status information to the console.
`w`	A vector of weights of the same length as y. NOTE: to evaluate the effect of different weight vectors, a list can be passed to w in which each element follows the structure described above.
`var.monotone`	An optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome. NOTE: to evaluate the effect of different monotonicity constraints, a list can be passed to var.monotone in which each element follows the structure described above.
`interaction.depth`	The maximum depth of variable interactions: 1 builds an additive model, 2 builds a model with up to two-way interactions, etc. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple interaction depths.
`n.minobsinnode`	The minimum number of observations (not total weights) in the terminal nodes of the trees. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple minimum node sizes.
`shrinkage`	A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple shrinkage penalties.
`bag.fraction`	The fraction of independent training observations (or patients) randomly selected to propose the next tree in the expansion. Depending on the obs.id vector, multiple training data rows may belong to a single 'patient'. This introduces randomness into the model fit. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple bag fractions.
`n.cores`	Number of cores that will be used to estimate cross-validation folds in parallel. Only available on Linux-based machines.
`max.time`	Maximum number of seconds that the model will continue adding trees for a given set of metaparameters. This optional argument allows users to find the best possible solution in scenarios characterized by limited computational resources.
`seed`	Seed that guarantees that `gbm.cverr` produces identical results across multiple runs. Calling `set.seed` before `gbm.cverr` does NOT ensure identical results if `bag.fraction < 1`.
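The interplay of `nt.start`, `nt.inc`, and `max.time` can be illustrated with a minimal base-R sketch. It is not the package's implementation: `toy_cv_error` and `search_n_trees` are hypothetical names, the error curve is simulated rather than produced by gbm, and the stopping rule (stop once the minimum lies before the last tree) is an assumption about how such a search might work.

```r
# Hypothetical stand-in for a cross-validated error curve (NOT produced by gbm):
# the error falls until tree 2300 and rises afterwards.
toy_cv_error <- function(n_trees) {
  trees <- seq_len(n_trees)
  1 + ((trees - 2300) / 2300)^2
}

# Hypothetical helper showing the roles of nt.start, nt.inc and max.time.
# Assumed stopping rule: stop once the minimum lies before the last tree.
search_n_trees <- function(cv_error, nt.start = 1000, nt.inc = 1000,
                           max.time = NULL) {
  started <- proc.time()[["elapsed"]]
  n_trees <- nt.start
  repeat {
    err  <- cv_error(n_trees)   # the toy recomputes the whole curve each pass
    best <- which.min(err)
    if (best < n_trees) {
      # The curve has turned upward, so the minimum has been located.
      return(list(n.trees = best, min.cv.error = err[best], timer.end = FALSE))
    }
    elapsed <- proc.time()[["elapsed"]] - started
    if (!is.null(max.time) && elapsed >= max.time) {
      # Time budget exhausted while the error was still decreasing.
      return(list(n.trees = best, min.cv.error = err[best], timer.end = TRUE))
    }
    n_trees <- n_trees + nt.inc  # add another batch of trees and re-check
  }
}

out   <- search_n_trees(toy_cv_error, nt.start = 1000, nt.inc = 1000)
timed <- search_n_trees(toy_cv_error, nt.start = 100, nt.inc = 100, max.time = 0)
```

In this toy run, `out` stops after the third batch because the minimum falls inside the 3000 trees evaluated, whereas `timed` returns after the first batch with `timer.end = TRUE`, which mirrors the case where the reported number of trees is only a lower bound.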

The main output of gbm.cverr is a data frame with rows corresponding to sets of metaparameters and columns corresponding to (1) the values defining each set of metaparameters, (2) the minimum cross-validation error corresponding to each row, and (3) the number of trees that yielded the minimum cross-validation error in each row. These results are intended to allow users to make informed decisions about the metaparameters passed to gbm when fitting the model that will be interpreted and/or used for prediction in the future.

Note that the metaparameter values passed to w, var.monotone, interaction.depth, n.minobsinnode, shrinkage, and bag.fraction will be fully crossed and evaluated.
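To get a feel for how quickly a fully crossed grid grows, the combinations can be enumerated ahead of time with base R's `expand.grid`. This is only an illustration of what "fully crossed" means; whether `gbm.cverr` builds its grid this way, and in which row order, is an assumption.

```r
# Enumerate every combination of the candidate metaparameter values.
# The values mirror the example at the end of this page.
grid <- expand.grid(
  interaction.depth = c(1, 5),
  n.minobsinnode    = c(5, 50),
  shrinkage         = 0.01,
  bag.fraction      = 0.5
)

# Number of metaparameter sets that would each need a cross-validated search.
n_sets <- nrow(grid)
print(grid)
```

Two interaction depths crossed with two minimum node sizes give four sets; each additional vector of length m multiplies the number of sets, and hence the run time, by roughly m.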

An object with 2-5 elements and a summary function. The elements of gbm.cverr.res are:

`gbm.fit`	If `fit.best` was `TRUE`, then this element is the `gbm.fit` object fit to `x` and `y` using the best set of metaparameters identified by `gbm.cverr`.
`w`	List of the optional weight vectors provided by the user. Will not be returned if `w` was left `NULL` when calling `gbm.cverr`.
`var.montone`	List of the optional monotonicity parameters provided by the user. Will not be returned if `var.monotone` was left `NULL` when calling `gbm.cverr`.
`cv.err`	A list with length corresponding to the number of metaparameter combinations that were evaluated by `gbm.cverr`. Each element is a vector quantifying the cross-validation error across all trees corresponding to the given set of metaparameters.
`res`	A data frame with ten columns and as many rows as there were unique combinations of metaparameters. This data frame is the basis of the summary function for `gbm.cverr.res` objects (see below), but it differs from the summary object in two ways: (1) it is not sorted by the minimum cross-validation error, but rather according to the order in which the metaparameters were passed to `gbm.cverr`, and (2) it contains two additional columns. The column `best.meta` is a dummy variable that indexes the best set of metaparameters, and `timer.end` is a dummy variable indicating whether the optimal number of trees was found for a given set of metaparameters (FALSE) or whether the user-specified maximum search time was reached before the cross-validation error was minimized (TRUE). If the timer ran out, then the optimal number of trees is likely underestimated. If this occurred for metaparameter set `k`, then to evaluate the extent to which the error was still decreasing when the timer ended, we recommend investigating a plot of `gbm.cverr.res$cv.err[[k]]`. The rest of the elements of `res` are discussed below.
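The check recommended for timed-out sets can be sketched on a mock object. Everything below is made up for illustration: `mock` only imitates the documented element names `cv.err`, `res`, and `timer.end`, and the error values are invented. A real `gbm.cverr` result contains more elements and columns.

```r
# Mock object imitating the documented shape of a gbm.cverr result.
# All numbers are invented; only the names come from the documentation.
mock <- list(
  cv.err = list(
    c(1.00, 0.80, 0.70, 0.72, 0.75),  # set 1: minimum reached at tree 3
    c(1.00, 0.90, 0.82, 0.76, 0.71)   # set 2: still decreasing at the end
  ),
  res = data.frame(timer.end = c(FALSE, TRUE))
)

# Indices of metaparameter sets whose search was cut short by the timer.
timed_out <- which(as.logical(mock$res$timer.end))

# For each timed-out set, test whether the lowest error sits on the last tree,
# i.e. whether the curve was still falling when the search stopped.
still_decreasing <- vapply(
  mock$cv.err[timed_out],
  function(err) which.min(err) == length(err),
  logical(1)
)
```

For a visual check, `plot(mock$cv.err[[k]], type = "l")` draws the error curve for set `k`. Note the double brackets: `cv.err` is a list, so `[[k]]` extracts the numeric vector itself.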

Calling summary(gbm.cverr.res) produces a data frame with rows corresponding to sets of metaparameters and columns that denote, for each row:

`min.cv.error`	Minimum cross-validation error resulting from the given set of metaparameters.
`w.index`	The index of the (optional) list of weight vectors corresponding to the given set of metaparameters. This will be omitted if a list of weights was not provided to `gbm.cverr` through the input parameter `w`.
`var.monotone.index`	The index of the (optional) list of monotonicity vectors corresponding to the given set of metaparameters. This will be omitted if a list of monotonicity vectors was not provided to `gbm.cverr` through the input parameter `var.monotone`.
`interaction.depth`	The interaction depth corresponding to the given set of metaparameters.
`n.minobsinnode`	Minimum number of observations in the terminal nodes of the trees for the given set of metaparameters.
`shrinkage`	The shrinkage parameter corresponding to the given set of metaparameters.
`bag.fraction`	The fraction of independent training observations randomly selected to propose the next tree corresponding to the given set of metaparameters.
`n.trees`	The optimum number of trees to utilize given the set of metaparameters denoted in the row. Note that entries in this column will be marked with '>=' if the boosting procedure was terminated because time ran out for this set of metaparameters, as determined by the user-specified `max.time` passed to `gbm.cverr`.

In the summary object and output, sets of metaparameters (rows) are ordered from best (top row) to worst (last row) in terms of the resulting cross-validation error.

Daniel B. McArtor (dmcartor@nd.edu)

library(mvtboost)
data(wellbeing)
y <- wellbeing[,25]
x <- wellbeing[,1:20]

mm <- gbm.cverr(x = x, y = y, 
               distribution = 'gaussian', 
               cv.folds = 2, 
               
               nt.start = 100, 
               nt.inc = 100, 
               max.time = 1, 
               
               seed = 12345,
               interaction.depth = c(1, 5), 
               shrinkage = 0.01,
               n.minobsinnode = c(5, 50), 
               verbose = TRUE)

summary(mm)

# Investigate gbm results based on the best set of metaparameters
mm$gbm.fit
summary(mm$gbm.fit)