gbmt: GBMT

View source: R/gbmt.r


GBMT

Description

Fits generalized boosted regression models using the new API. This function prepares the inputs, performing tasks such as creating cross-validation folds, before calling gbmt_fit, which invokes the underlying C++ to fit a generalized boosting model.
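
Only formula and data are required; every other argument in the Usage block below has a default. A minimal sketch, where df, y, x1 and x2 are hypothetical placeholders:

# fit with all defaults: Gaussian distribution, 2000 trees,
# half of the rows used for training
fit <- gbmt(y ~ x1 + x2, data = df)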

Usage

gbmt(
  formula,
  distribution = gbm_dist("Gaussian"),
  data,
  weights = rep(1, nrow(data)),
  offset = rep(0, nrow(data)),
  train_params = training_params(num_trees = 2000, interaction_depth = 3,
    min_num_obs_in_node = 10, shrinkage = 0.001, bag_fraction = 0.5, id =
    seq_len(nrow(data)), num_train = round(0.5 * nrow(data)), num_features = ncol(data) -
    1),
  var_monotone = NULL,
  var_names = NULL,
  cv_folds = 1,
  cv_class_stratify = FALSE,
  fold_id = NULL,
  keep_gbm_data = FALSE,
  par_details = getOption("gbm.parallel"),
  is_verbose = FALSE
)

Arguments

formula

a symbolic description of the model to be fit. The formula may include an offset term (e.g. y ~ offset(n) + x); see the first sketch after this argument list.

distribution

a GBMDist object, created by gbm_dist, specifying the distribution and any additional parameters needed. If not specified, the distribution will be guessed; see the first sketch after this argument list.

data

a data frame containing the variables in the model. By default, the variables are taken from the environment of the formula.

weights

optional vector of weights used in the fitting process. These weights must be positive but need not be normalized. By default they are set to 1 for each data row.

offset

optional vector specifying the model offset; must be positive. This defaults to a vector of 0's whose length equals the number of rows in data.

train_params

a GBMTrainParams object which specifies the parameters used in growing decision trees.

var_monotone

optional vector, the same length as the number of predictors, indicating the relationship each variable has with the outcome: +1 for a monotone increasing relationship, -1 for monotone decreasing, and 0 for no restriction.

var_names

a vector of strings containing the names of the predictor variables.

cv_folds

a positive integer specifying the number of folds to be used in cross-validation of the gbm fit. If cv_folds > 1, cross-validation is performed; the default is 1 (no cross-validation).

cv_class_stratify

a bool specifying whether or not to stratify the cross-validation folds by the response outcome. Currently only applies to the "Bernoulli" distribution; defaults to FALSE.

fold_id

an optional vector of values identifying which fold each observation is in. If supplied, cv_folds can be missing. Note: multiple rows of the same observation must have the same fold_id. See the second sketch after this argument list.

keep_gbm_data

a bool specifying whether or not the gbm_data object created in this method should be stored in the results.

par_details

details of the parallelization to use in the core algorithm, as created by gbmParallel; see the second sketch after this argument list.

is_verbose

if TRUE, gbmt will print out progress and performance of the fit.
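
As a sketch of the formula and distribution arguments above: an offset is supplied inside the formula, and distribution-specific parameters are passed through gbm_dist. Here df, y, n and x are hypothetical placeholders, and alpha as a "Quantile" parameter is an assumption to be checked against gbm_dist's help page.

# offset supplied via the formula, plus an explicit distribution
fit_pois <- gbmt(y ~ offset(n) + x,
                 data = df,
                 distribution = gbm_dist("Poisson"))

# distribution-specific parameters go through gbm_dist; alpha for
# "Quantile" is an assumption
fit_q25 <- gbmt(y ~ x,
                data = df,
                distribution = gbm_dist("Quantile", alpha = 0.25))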
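
A second sketch covering fold_id and par_details: the folds are built by hand, and gbmParallel supplies the parallelization details. num_threads as a gbmParallel argument is an assumption; check its help page.

# assign each row of the hypothetical df to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(df)))
fit_cv <- gbmt(y ~ ., data = df,
               fold_id = folds,   # cv_folds can then be omitted
               par_details = gbmParallel(num_threads = 2))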

Value

a GBMFit object.

Examples

## create some data
N <- 1000
X1 <- runif(N)
X2 <- runif(N)
X3 <- factor(sample(letters[1:4],N,replace=TRUE))
mu <- c(-1,0,1,2)[as.numeric(X3)]

p <- 1/(1+exp(-(sin(3*X1) - 4*X2 + mu)))
Y <- rbinom(N,1,p)

# random weights if you want to experiment with them
w <- rexp(N)
w <- N*w/sum(w)

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3)


# num_trees=3000 is preferable but takes longer to run
train_params <-
     training_params(num_trees = 3000,
                     shrinkage = 0.001,
                     bag_fraction = 0.5,
                     num_train = N/2,
                     id=seq_len(nrow(data)),
                     min_num_obs_in_node = 10,
                     interaction_depth = 3,
                     num_features = 3)


# for the example to run quickly, use num_trees=100
# (this overwrites the settings above)
train_params <-
     training_params(num_trees = 100,
                     shrinkage = 0.001,
                     bag_fraction = 0.5,
                     num_train = N/2,
                     id=seq_len(nrow(data)),
                     min_num_obs_in_node = 10,
                     interaction_depth = 3,
                     num_features = 3)
 
# fit initial model
gbm1 <- gbmt(Y~X1+X2+X3,                # formula
             data=data,                 # dataset
             weights=w,                 # observation weights
             var_monotone=c(0,0,0),     # -1: monotone decrease,
                                        # +1: monotone increase, 
                                        #  0: no monotone restrictions
             distribution=gbm_dist("Bernoulli"),
             train_params = train_params,
             cv_folds=5,                # do 5-fold cross-validation
             is_verbose = FALSE)           # don't print progress

# plot the performance
#   returns out-of-bag estimated best number of trees
best.iter.oob <- gbmt_performance(gbm1,method="OOB")  
plot(best.iter.oob)
print(best.iter.oob)

# returns 5-fold cv estimate of best number of trees
best.iter.cv <- gbmt_performance(gbm1,method="cv")   
plot(best.iter.cv)
print(best.iter.cv)

# returns test set estimate of best number of trees
best.iter.test <- gbmt_performance(gbm1,method="test") 
plot(best.iter.test)
print(best.iter.test)

best.iter <- best.iter.test
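
# With the number of trees selected, predictions follow. This sketch
# assumes the predict method for GBMFit objects takes the classic gbm
# arguments n.trees and type = "response"; the names may differ in
# this package.
p_hat <- predict(gbm1, newdata = data, n.trees = best.iter,
                 type = "response")
head(p_hat)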

# plot variable influence
summary(gbm1,num_trees=1)         # based on first tree
summary(gbm1,num_trees=best.iter) # based on estimated best number of trees

# create marginal plots
# plot variable X1,X2,X3 after "best" iterations
oldpar <- par(no.readonly = TRUE)
par(mfrow=c(1,3))
plot(gbm1,1,best.iter)
plot(gbm1,2,best.iter)
plot(gbm1,3,best.iter)
par(mfrow=c(1,1))
plot(gbm1,1:2,best.iter) # contour plot vars 1 & 2 after "best" num iterations
plot(gbm1,2:3,best.iter) # lattice plot vars 2 & 3 after "best" num iterations

# 3-way plot
plot(gbm1,1:3,best.iter)

# print the first and last trees
print(pretty_gbm_tree(gbm1,1))
print(pretty_gbm_tree(gbm1, gbm1$params$num_trees))
par(oldpar) # reset graphics options to previous settings
