gbm.step: gbm step


Description

Function to assess the optimal number of boosting trees using k-fold cross-validation. This is an implementation of the cross-validation procedure described on page 215 of Hastie et al. (2001).

The data is divided into n.folds subsets (10 by default), with stratification by prevalence if required for presence/absence data. The function then fits a gbm model of increasing complexity along the sequence from n.trees to n.trees + (n.steps * step.size), calculating the residual deviance at each step along the way. After each fold is processed, the function calculates the average holdout residual deviance and its standard error, and then identifies the optimal number of trees as that at which the holdout deviance is minimised. Finally, it fits a model with this number of trees and returns it as a gbm model, along with additional information from the cross-validation selection process.
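The prevalence-stratified fold assignment described above can be sketched in base R as follows. This is a minimal illustration, not the code used inside gbm.step: the simulated response y, the helper name stratified_folds, and the seed are all assumptions made for the example.

```r
# Hypothetical helper: assign observations to folds separately within each
# response class, so that every fold gets a similar share of presences.
stratified_folds <- function(y, n.folds = 10) {
  fold <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    # recycle the fold labels 1..n.folds over the members of this class
    labels <- rep(seq_len(n.folds), length.out = length(idx))
    # shuffle the labels by permuting their positions
    fold[idx] <- labels[sample.int(length(idx))]
  }
  fold
}

set.seed(42)                      # arbitrary seed, for reproducibility only
y <- rbinom(200, 1, 0.3)          # simulated presence/absence response
fold <- stratified_folds(y, n.folds = 10)

# Cross-tabulate fold membership against the response class
table(fold, y)
```

Within each class, the fold sizes produced this way differ by at most one observation, which keeps prevalence roughly equal across folds.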

Usage

gbm.step(data, gbm.x, gbm.y, offset = NULL, fold.vector = NULL, tree.complexity = 1,
 learning.rate = 0.01, bag.fraction = 0.75, site.weights = rep(1, nrow(data)), 
 var.monotone = rep(0, length(gbm.x)), n.folds = 10, prev.stratify = TRUE, 
 family = "bernoulli", n.trees = 50, step.size = n.trees, max.trees = 10000,
 tolerance.method = "auto", tolerance = 0.001, plot.main = TRUE, plot.folds = FALSE,
 verbose = TRUE, silent = FALSE, keep.fold.models = FALSE, keep.fold.vector = FALSE, 
 keep.fold.fit = FALSE, ...)

Arguments

data

input data.frame

gbm.x

indices or names of predictor variables in data

gbm.y

index or name of response variable in data

offset

optional offset for the model; NULL (the default) means no offset is used

fold.vector

a fold vector to be read in for cross validation with offsets

tree.complexity

sets the complexity of individual trees

learning.rate

sets the weight applied to individual trees

bag.fraction

sets the proportion of observations used in selecting variables

site.weights

allows varying weighting for sites

var.monotone

restricts the fitted responses to individual predictors to be monotone

n.folds

number of folds

prev.stratify

Logical. stratify the folds by prevalence - only for presence/absence data

family

family - bernoulli (=binomial), poisson, laplace or gaussian

n.trees

number of initial trees to fit

step.size

number of trees to add at each cycle

max.trees

max number of trees to fit before stopping

tolerance.method

method to use in deciding to stop - "fixed" or "auto"

tolerance

tolerance value to use - if tolerance.method is "fixed" this is an absolute value; if "auto" it is a multiplier applied to the total mean deviance

plot.main

Logical. plot hold-out deviance curve

plot.folds

Logical. plot the individual folds as well

verbose

Logical. control amount of screen reporting

silent

Logical. allows running with no output (for use when simplifying a model)

keep.fold.models

Logical. keep the fold models from cross-validation

keep.fold.vector

Logical. allows the vector defining fold membership to be kept

keep.fold.fit

Logical. allows the predicted values for observations from cross-validation to be kept

...

allows for any additional plotting parameters
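As a worked illustration of the tolerance argument: with tolerance.method = "auto", the supplied value is treated as a multiplier of the total mean deviance. The sketch below reproduces only that arithmetic, using the total mean deviance printed in the example output on this page (1.0905). The variable names are illustrative and are not taken from the function's source.

```r
tolerance <- 0.001               # default value of the tolerance argument
tolerance.method <- "auto"       # default value of tolerance.method
total.mean.deviance <- 1.0905    # value printed in the example output

# "auto": scale the tolerance by the total mean deviance;
# "fixed": use the tolerance as an absolute value
effective.tolerance <- if (tolerance.method == "auto") {
  tolerance * total.mean.deviance
} else {
  tolerance
}

round(effective.tolerance, 4)
```

Rounded to four decimal places this gives 0.0011, which agrees with the tolerance reported in the example output.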

Value

object of S3 class gbm

Note

This and other boosted regression trees (BRT) functions in the dismo package do not work if you use only one predictor. There is an easy workaround: make a dummy variable with a constant value and then fit a model with two predictors, the one of interest and the dummy variable. The model fitting will ignore the dummy variable because it carries no useful information.
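A minimal sketch of this workaround is shown below. The data frame dat and the column names y, x1 and dummy are hypothetical and exist only for illustration. The gbm.step call is left commented out so that the snippet runs without the dismo and gbm packages installed.

```r
# Hypothetical data: one real predictor (x1) and a binary response (y)
set.seed(1)
dat <- data.frame(y = rbinom(100, 1, 0.4), x1 = rnorm(100))

# Add a constant dummy predictor; it carries no information
dat$dummy <- 1

# Predictor set now has two columns: the one of interest plus the dummy
predictors <- c("x1", "dummy")
str(dat[, predictors])

# The model could then be fitted with both columns as predictors, e.g.:
# m <- dismo::gbm.step(data = dat, gbm.x = predictors, gbm.y = "y",
#                      family = "bernoulli")
```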

Author(s)

John R. Leathwick and Jane Elith

References

Hastie, T., R. Tibshirani, and J.H. Friedman, 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Elith, J., J.R. Leathwick and T. Hastie, 2009. A working guide to boosted regression trees. Journal of Animal Ecology 77: 802-81

Examples

library(dismo)
data(Anguilla_train)
# reduce data set to speed things up a bit
Anguilla_train <- Anguilla_train[1:200, ]
# fit a presence/absence model: columns 3 to 14 are the predictors, column 2 is the response
angaus.tc5.lr01 <- gbm.step(data = Anguilla_train, gbm.x = 3:14, gbm.y = 2, family = "bernoulli",
       tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5)

Example output

Loading required package: raster
Loading required package: sp
Loading required namespace: gbm

 
 GBM STEP - version 2.9 
 
Performing cross-validation optimisation of a boosted regression tree model 
for Angaus and using a family of bernoulli 
Using 200 observations and 12 predictors 
creating 10 initial models of 50 trees 

 folds are stratified by prevalence 
total mean deviance =  1.0905 
tolerance is fixed at  0.0011 
ntrees resid. dev. 
50    0.9151 
now adding trees... 
100   0.8337 
150   0.7882 
200   0.7651 
250   0.7539 
300   0.7519 
350   0.7554 
400   0.7604 
450   0.7655 
500   0.7693 
550   0.7729 
600   0.7831 
650   0.7884 
700   0.7961 
750   0.8087 
800   0.8143 
850   0.8263 
900   0.8391 
950   0.8534 
1000   0.8601 
fitting final gbm model with a fixed number of 300 trees for Angaus

mean total deviance = 1.09 
mean residual deviance = 0.417 
 
estimated cv deviance = 0.752 ; se = 0.056 
 
training data correlation = 0.85 
cv correlation =  0.568 ; se = 0.053 
 
training data AUC score = 0.985 
cv AUC score = 0.871 ; se = 0.024 
 
elapsed time -  0.06 minutes 
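To make the selection rule concrete, the holdout deviance curve printed above can be entered by hand and its minimum located. This sketch only re-reads the printed numbers and does not call gbm.step. The vector names are chosen for illustration.

```r
# Tree counts and holdout residual deviances copied from the output above
n.trees.seq <- seq(50, 1000, by = 50)
resid.dev <- c(0.9151, 0.8337, 0.7882, 0.7651, 0.7539,
               0.7519, 0.7554, 0.7604, 0.7655, 0.7693,
               0.7729, 0.7831, 0.7884, 0.7961, 0.8087,
               0.8143, 0.8263, 0.8391, 0.8534, 0.8601)
stopifnot(length(n.trees.seq) == length(resid.dev))

# The optimal number of trees is where the holdout deviance is smallest
best <- which.min(resid.dev)
best.trees <- n.trees.seq[best]
best.trees                   # 300
round(resid.dev[best], 3)    # 0.752
```

Both values match the output: the final model is fitted with 300 trees, and the estimated cv deviance is reported as 0.752.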

dismo documentation built on May 2, 2019, 6:07 p.m.