gbm.cverr: Tune gbm via cross-validation to find the best set of...


View source: R/gbm_cverr.R

Description

This function serves two purposes. First, it computes the cross-validation error across a user-supplied grid of gbm metaparameters, so the model can be tuned for a given problem. This process can run in parallel on Linux-based machines. Second, it removes the need to choose a maximum number of trees for each set of metaparameters: the algorithm keeps adding trees until the number that minimizes the cross-validation error has been found. In the standard approach to gbm, by contrast, a maximum n.trees must also be tuned. Users thereby avoid two problems caused by a poorly chosen n.trees: (1) specifying too few trees and ending up with a sub-optimal model, and (2) specifying far more trees than necessary, making gbm run much longer than it needs to.
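The second purpose can be pictured with a small base-R sketch. This is not the package's code: cv_error_upto() is a hypothetical stand-in for "fit gbm with n trees and return the cross-validation error at every tree", and the stopping rule (stop once the minimum no longer sits at the last tree) is an assumption about one plausible criterion.

```r
# Hypothetical stand-in for the cross-validation error curve of a gbm fit
# with n trees: the error falls, bottoms out at tree 2500, then rises.
cv_error_upto <- function(n) {
  trees <- seq_len(n)
  (trees - 2500)^2 / 1e6 + 1
}

# Keep adding nt.inc trees until the minimum CV error is no longer at the
# final tree (assumed stopping rule) or until max.time seconds have elapsed.
search_n_trees <- function(nt.start = 1000, nt.inc = 1000, max.time = NULL) {
  start <- proc.time()[["elapsed"]]
  n <- nt.start
  repeat {
    err <- cv_error_upto(n)
    best <- which.min(err)
    timed.out <- !is.null(max.time) &&
      (proc.time()[["elapsed"]] - start) > max.time
    if (best < n || timed.out) break
    n <- n + nt.inc
  }
  # timer.end is TRUE only when time ran out before a minimum was bracketed
  list(n.trees = best, min.cv.error = err[best],
       timer.end = timed.out && best == n)
}

out <- search_n_trees(nt.start = 1000, nt.inc = 1000)
out$n.trees
```

With this toy curve the search grows the model to 3000 trees before the minimum at tree 2500 is bracketed, which is the behaviour that nt.start and nt.inc are meant to control.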

Usage

gbm.cverr(x, y, distribution = "gaussian", cv.folds = 5, fit.best = T,
  nt.start = 1000, nt.inc = 1000, verbose = T, w = NULL,
  var.monotone = NULL, interaction.depth = 1, n.minobsinnode = 10,
  shrinkage = 0.001, bag.fraction = 0.5, n.cores = 1, max.time = NULL,
  seed = NULL)

Arguments

x

An n x p matrix or data frame of predictors.

y

An n x 1 matrix or vector corresponding to the observed outcome.

distribution

The distribution to use when fitting each gbm model. For continuous outcomes, the available distributions are "gaussian" (squared error) and "laplace" (absolute loss). For dichotomous outcomes, the available distributions are "bernoulli" (logistic regression for 0-1 outcomes) and "adaboost" (the AdaBoost exponential loss for 0-1 outcomes). Finally, the "poisson" distribution is available for count outcomes.

cv.folds

Number of cross-validation folds to perform.

fit.best

Logical indicating whether the best set of metaparameters (according to cross-validation error) should be used to fit a gbm.fit object to the complete data and return it.

nt.start

Initial number of trees used to model y.

nt.inc

Number of trees incrementally added until the cross-validation error is minimized or until max.time is reached (see below).

verbose

If TRUE, then gbm.cverr will print status information to the console.

w

A vector of weights of the same length as y. NOTE: to evaluate the effect of different weight vectors, a list can be passed to w in which each element follows the structure described above.

var.monotone

An optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome. NOTE: to evaluate the effect of different monotonicity constraints, a list can be passed to var.monotone in which each element follows the structure described above.

interaction.depth

The maximum depth of variable interactions: 1 builds an additive model, 2 builds a model with up to two-way interactions, etc. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple interaction depths.

n.minobsinnode

The minimum number of observations (not total weights) in the terminal nodes of the trees. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple minimum node sizes.

shrinkage

A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple shrinkage penalties.

bag.fraction

The fraction of independent training observations (or patients) randomly selected to propose the next tree in the expansion. Depending on the obs.id vector, multiple training data rows may belong to a single 'patient'. This introduces randomness into the model fit. NOTE: Multiple values can be passed in a vector to evaluate the cross-validation error using multiple bag fractions.

n.cores

Number of cores that will be used to estimate cross-validation folds in parallel. Only available on Linux-based machines.

max.time

Maximum number of seconds that the model will continue adding trees for a given set of metaparameters. This optional argument allows users to find the best possible solution in scenarios characterized by limited computational resources.

seed

Seed that guarantees gbm.cverr produces identical results across multiple runs. Calling set.seed before gbm.cverr does NOT ensure identical results if bag.fraction < 1.
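As an illustration of how the grid-style arguments above might be assembled, the sketch below builds a list of weight vectors and a list of monotonicity vectors for simulated data. The data and the object names (w.list, vm.list, depths) are invented for the example, and the final call is left commented out because it assumes the mvtboost package is installed.

```r
# Simulated data, invented for illustration only
set.seed(1)
n <- 100
p <- 3
x <- data.frame(matrix(rnorm(n * p), n, p))
y <- x[[1]] + rnorm(n)

# Two candidate weight vectors, each the same length as y
w.list <- list(rep(1, n), ifelse(abs(y) > 2, 0.5, 1))

# Two candidate monotonicity settings, each with one entry per predictor
vm.list <- list(rep(0, p), c(1, 0, 0))

# Scalar metaparameters take a plain vector of candidate values
depths <- c(1, 2)

# Assumes mvtboost is installed; argument names follow the Usage section.
# fit <- gbm.cverr(x = x, y = y, w = w.list, var.monotone = vm.list,
#                  interaction.depth = depths, seed = 12345)
```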

Details

The main output of gbm.cverr is a data frame with rows corresponding to sets of metaparameters and columns corresponding to (1) the values defining each set of metaparameters, (2) the minimum cross-validation error corresponding to each row, and (3) the number of trees that yielded the minimum cross-validation error in each row. These results are intended to allow users to make informed decisions about the metaparameters passed to gbm when fitting the model that will be interpreted and/or used for prediction in the future.

Note that the metaparameter values passed to w, var.monotone, interaction.depth, n.minobsinnode, shrinkage, and bag.fraction will be fully crossed and evaluated.
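To see what full crossing implies for run time, the combinations can be enumerated with base R's expand.grid. This only illustrates the crossing; the ordering and internal representation used by gbm.cverr are not documented here and may differ.

```r
# Candidate values for four of the crossed metaparameters
grid <- expand.grid(
  interaction.depth = c(1, 5),
  n.minobsinnode    = c(5, 50),
  shrinkage         = 0.01,
  bag.fraction      = 0.5
)

nrow(grid)  # number of metaparameter sets: 2 * 2 * 1 * 1
grid
```

Each additional candidate value multiplies the number of sets, and every set is cross-validated, so the grid is worth counting before a long run.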

Value

An object with 2-5 elements and a summary function. The elements of gbm.cverr.res are:

gbm.fit

If fit.best was TRUE, then this element is the gbm.fit object fit to x and y using the best set of metaparameters identified by gbm.cverr.

w

List of the optional weight vectors provided by the user. Will not be returned if w was left NULL when calling gbm.cverr.

var.montone

List of the optional monotonicity parameters provided by the user. Will not be returned if var.monotone was left NULL when calling gbm.cverr.

cv.err

A list with length corresponding to the number of metaparameter combinations that were evaluated by gbm.cverr. Each element is a vector quantifying the cross-validation error across all trees corresponding to the given set of metaparameters.

res

A data frame with ten columns and as many rows as there were unique combinations of metaparameters. This data frame is the basis of the summary function for gbm.cverr.res objects (see below), but it differs from the summary object in two ways: (1) it is not sorted by minimum cross-validation error, but rather follows the order in which the metaparameters were passed to gbm.cverr, and (2) it contains two additional columns. The column best.meta is a dummy variable that flags the best set of metaparameters, and timer.end is a dummy variable indicating whether the optimal number of trees was found for a given set of metaparameters (FALSE) or the user-specified maximum search time was reached before the cross-validation error was minimized (TRUE). If the timer ran out, the optimal number of trees is likely underestimated. If this occurred for metaparameter set k, we recommend plotting gbm.cverr.res$cv.err[[k]] to evaluate the extent to which the error was still decreasing when the timer ended. The remaining columns of res are discussed below.
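A minimal sketch of that check follows. The object mock is fabricated to mimic the documented structure of cv.err (a list with one error vector per metaparameter set); it is not real gbm.cverr output, and the "minimum at the last tree" test is an assumed heuristic rather than something the package reports.

```r
# Mock object mimicking the documented structure (not real gbm.cverr output)
trees <- 1:300
mock <- list(
  cv.err = list(
    1 + exp(-trees / 50),        # set 1: error still decreasing at the end
    1 + (trees - 150)^2 / 1e4    # set 2: minimum reached at tree 150
  )
)

k <- 1
err.k <- mock$cv.err[[k]]  # [[k]] extracts the numeric vector itself

# If the minimum sits at the last tree, the error was still falling
still.decreasing <- which.min(err.k) == length(err.k)

plot(err.k, type = "l",
     xlab = "Number of trees", ylab = "Cross-validation error")
```

A curve that is still sloping downward at its right edge suggests rerunning that set with a larger max.time.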

Calling summary(gbm.cverr.res) produces a data frame with rows corresponding to sets of metaparameters and columns that denote, for each row:

min.cv.error

Minimum cross-validation error resulting from the given set of metaparameters.

w.index

The index of the (optional) list of weight vectors corresponding to the given set of metaparameters. This will be omitted if a list of weights was not provided to gbm.cverr through the input parameter w.

var.monotone.index

The index of the (optional) list of monotonicity vectors corresponding to the given set of metaparameters. This will be omitted if a list of monotonicity vectors was not provided to gbm.cverr through the input parameter var.monotone.

interaction.depth

The interaction depth corresponding to the given set of metaparameters.

n.minobsinnode

Minimum number of observations in the terminal nodes of the trees for the given set of metaparameters.

shrinkage

The shrinkage parameter corresponding to the given set of metaparameters.

bag.fraction

The fraction of independent training observations randomly selected to propose the next tree corresponding to the given set of metaparameters.

n.trees

The optimum number of trees to use given the set of metaparameters denoted in the row. Note that entries in this column will be marked with '>=' if the boosting procedure was terminated because time ran out for this set of metaparameters, as determined by the user-specified max.time passed to gbm.cverr.

In the summary object and output, sets of metaparameters (rows) are ordered from best (top row) to worst (last row) in terms of the resulting cross-validation error.
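The relationship between the unsorted res element and the sorted summary can be sketched with a mock data frame. The values are invented, only a few of the documented columns are shown, and whether summary() uses exactly this call internally is an assumption.

```r
# Mock of a few documented columns of res (values invented for illustration)
res <- data.frame(
  min.cv.error      = c(0.92, 0.85, 0.88),
  interaction.depth = c(1, 5, 5),
  n.minobsinnode    = c(5, 5, 50),
  n.trees           = c(400, 300, 200)
)

# Order rows from lowest to highest cross-validation error
sorted <- res[order(res$min.cv.error), ]
sorted
```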

Author(s)

Daniel B. McArtor (dmcartor@nd.edu)

Examples

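# Assumes the mvtboost package, which provides gbm.cverr and the wellbeing data
library(mvtboost)

data(wellbeing)
y <- wellbeing[, 25]    # outcome: column 25
x <- wellbeing[, 1:20]  # predictors: columns 1-20

# Cross two interaction depths with two minimum node sizes (four sets in total)
mm <- gbm.cverr(x = x, y = y,
                distribution = 'gaussian',
                cv.folds = 2,
                nt.start = 100,
                nt.inc = 100,
                max.time = 1,
                seed = 12345,
                interaction.depth = c(1, 5),
                shrinkage = 0.01,
                n.minobsinnode = c(5, 50),
                verbose = TRUE)

# Metaparameter sets ordered from best to worst
summary(mm)

# Investigate gbm results based on the best set of metaparameters
mm$gbm.fit
summary(mm$gbm.fit)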

patr1ckm/mvtboost documentation built on May 24, 2019, 8:21 p.m.