Description

Function to assess the optimal number of boosting trees using k-fold cross-validation. This is an implementation of the cross-validation procedure described on page 215 of Hastie et al. (2001).

The data are divided into 10 subsets, stratified by prevalence if required for presence/absence data. The function then fits gbm models of increasing complexity along the sequence from n.trees to n.trees + (n.steps * step.size), calculating the residual deviance at each step. After each fold is processed, the function calculates the mean holdout residual deviance and its standard error, and identifies the optimal number of trees as that at which the holdout deviance is minimised. It then fits a model with this number of trees, returning it as a gbm object along with additional information from the cross-validation selection process.
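The stepping logic can be illustrated with the gbm package directly. The sketch below is a simplified, single-holdout version of the idea (gbm.step averages the holdout deviance over n.folds folds and adds a standard-error rule); the data frame d, the split, and all settings are invented for illustration, not dismo's implementation.

library(gbm)

set.seed(1)
d <- data.frame(y  = rbinom(200, 1, 0.4),
                x1 = rnorm(200),
                x2 = rnorm(200))
train <- d[1:150, ]
hold  <- d[151:200, ]

step.size <- 50
n.steps   <- 10
m <- gbm(y ~ x1 + x2, data = train, distribution = "bernoulli",
         n.trees = step.size, shrinkage = 0.01, bag.fraction = 0.75)
dev <- numeric(n.steps)
for (i in seq_len(n.steps)) {
  if (i > 1) m <- gbm.more(m, n.new.trees = step.size)   # grow the model by step.size trees
  p <- predict(m, hold, n.trees = i * step.size, type = "response")
  # Bernoulli residual deviance on the holdout observations
  dev[i] <- -2 * mean(hold$y * log(p) + (1 - hold$y) * log(1 - p))
}
which.min(dev) * step.size   # tree count minimising holdout deviance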
Usage

gbm.step(data, gbm.x, gbm.y, offset = NULL, fold.vector = NULL, tree.complexity = 1,
    learning.rate = 0.01, bag.fraction = 0.75, site.weights = rep(1, nrow(data)),
    var.monotone = rep(0, length(gbm.x)), n.folds = 10, prev.stratify = TRUE,
    family = "bernoulli", n.trees = 50, step.size = n.trees, max.trees = 10000,
    tolerance.method = "auto", tolerance = 0.001, plot.main = TRUE, plot.folds = FALSE,
    verbose = TRUE, silent = FALSE, keep.fold.models = FALSE, keep.fold.vector = FALSE,
    keep.fold.fit = FALSE, ...)
Arguments

data: input data.frame
gbm.x: indices or names of predictor variables in data
gbm.y: index or name of response variable in data
offset: optional offset term
fold.vector: a fold vector to be read in for cross-validation with offsets (see the sketch following this list)
tree.complexity: sets the complexity of individual trees
learning.rate: sets the weight applied to individual trees
bag.fraction: sets the proportion of observations used in selecting variables
site.weights: allows varying weighting for sites
var.monotone: restricts responses to individual predictors to monotone
n.folds: number of folds
prev.stratify: prevalence-stratify the folds - only for presence/absence data
family: family - bernoulli (= binomial), poisson, laplace or gaussian
n.trees: number of initial trees to fit
step.size: number of trees to add at each cycle
max.trees: maximum number of trees to fit before stopping
tolerance.method: method to use in deciding to stop - "fixed" or "auto"
tolerance: tolerance value to use - an absolute value if tolerance.method is "fixed", or a multiplier of the total mean deviance if "auto"
plot.main: Logical. plot the hold-out deviance curve
plot.folds: Logical. plot the individual folds as well
verbose: Logical. control the amount of screen reporting
silent: Logical. suppress all screen output, e.g. when simplifying models
keep.fold.models: Logical. keep the fold models from cross-validation
keep.fold.vector: Logical. keep the vector defining fold membership
keep.fold.fit: Logical. keep the cross-validation predicted values for observations
...: additional arguments passed on to plotting
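As a hedged illustration of two of the less self-explanatory arguments, the call below supplies a custom fold vector and a monotone constraint. The fold assignment, the choice of constrained predictor, and the tuning values are invented for illustration, and the data are assumed to be the Anguilla_train set used in the Examples.

# Custom fold membership: a repeating 1..5 assignment (illustrative only)
folds <- rep(1:5, length.out = nrow(Anguilla_train))
m <- gbm.step(data = Anguilla_train, gbm.x = 3:14, gbm.y = 2,
              family = "bernoulli", n.folds = 5, fold.vector = folds,
              var.monotone = c(1, rep(0, 11)),  # force an increasing response to the first predictor
              tree.complexity = 2, learning.rate = 0.005, silent = TRUE)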
Value

object of S3 class gbm
Note

This and other boosted regression tree (BRT) functions in the dismo package do not work if you use only one predictor. There is an easy workaround: make a dummy variable with a constant value and then fit a model with two predictors, the one of interest and the dummy variable, which will be ignored by the model fitting as it carries no useful information. A sketch of this workaround follows.
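A minimal sketch of the workaround, assuming the Anguilla_train data from the Examples and its SegSumT column as the single predictor of interest; any one predictor would do.

Anguilla_train$dummy <- 1   # constant column, carries no information
one.pred <- gbm.step(data = Anguilla_train,
                     gbm.x = c("SegSumT", "dummy"), gbm.y = 2,
                     family = "bernoulli", tree.complexity = 1,
                     learning.rate = 0.005)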
Author(s)

John R. Leathwick and Jane Elith
References

Hastie, T., R. Tibshirani and J.H. Friedman, 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, New York.

Elith, J., J.R. Leathwick and T. Hastie, 2008. A working guide to boosted regression trees. Journal of Animal Ecology 77: 802-813.
Examples

library(dismo)
data(Anguilla_train)

# reduce data set to speed things up a bit
Anguilla_train <- Anguilla_train[1:200, ]

angaus.tc5.lr01 <- gbm.step(data = Anguilla_train, gbm.x = 3:14, gbm.y = 2,
                            family = "bernoulli", tree.complexity = 5,
                            learning.rate = 0.01, bag.fraction = 0.5)

Example output:
Loading required package: raster
Loading required package: sp
Loading required namespace: gbm
GBM STEP - version 2.9
Performing cross-validation optimisation of a boosted regression tree model
for Angaus and using a family of bernoulli
Using 200 observations and 12 predictors
creating 10 initial models of 50 trees
folds are stratified by prevalence
total mean deviance = 1.0905
tolerance is fixed at 0.0011
ntrees resid. dev.
50 0.9151
now adding trees...
100 0.8337
150 0.7882
200 0.7651
250 0.7539
300 0.7519
350 0.7554
400 0.7604
450 0.7655
500 0.7693
550 0.7729
600 0.7831
650 0.7884
700 0.7961
750 0.8087
800 0.8143
850 0.8263
900 0.8391
950 0.8534
1000 0.8601
fitting final gbm model with a fixed number of 300 trees for Angaus
mean total deviance = 1.09
mean residual deviance = 0.417
estimated cv deviance = 0.752 ; se = 0.056
training data correlation = 0.85
cv correlation = 0.568 ; se = 0.053
training data AUC score = 0.985
cv AUC score = 0.871 ; se = 0.024
elapsed time - 0.06 minutes
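Once fitted, the selected tree count and cross-validation statistics can be inspected on the returned object. The component names below follow dismo's BRT conventions, but it is worth verifying them with names(angaus.tc5.lr01) on your installed version.

angaus.tc5.lr01$n.trees          # number of trees selected by cross-validation
angaus.tc5.lr01$cv.statistics    # holdout deviance, correlation and AUC estimates
summary(angaus.tc5.lr01)         # relative influence of each predictor (from gbm)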