library("gbm3")
The gbm
package implements gradient tree boosting for a wide variety of statistical models and loss functions; including those commonly used in statistics, such as the Cox model, but not usually associated with boosted models. When presented with data, the gbm
package offers the user the ability to build predictive models and explore the influence of different variables on the response, akin to a data mining or exploration task.
To get started a user must:
data.frame
.Once these steps have been completed, a user can fit their gbm
model by a call to gbmt
and subsequently: evaluate its performance, make predictions, fit additional trees and plot the relative influence of the various predictor variables
To begin, this vignette will work with simulated data where the response obeys a Bernoulli
distribution and is determined by 3 predictor variables. Every row of the data corresponds to an individual observation and to these observations random weights are assigned. These weights determine the importance of the associated observation in the fit. At this stage it is also assumed that the fit will be made directly to the response and no offset is supplied. If the offset vector for each observation was non-zero the gbm
fit would be to the observation response + offset
rather than the bare response. Currently the package supports responses which either consist of: numerics, factors (i. e. Bernoulli
) or are a survival object (more specifically a counting or right centered survival object) in the case of Cox proportional hazards models. NB: Predictor variables may have missing data (specified by NA
) but the responses must have no missing observations.
# create some binomial response data - N observations N <- 1000 X1 <- runif(N) X2 <- runif(N) X3 <- factor(sample(letters[1:4],N,replace=T)) mu <- c(-1,0,1,2)[as.numeric(X3)] p <- 1/(1+exp(-(sin(3*X1) - 4*X2 + mu))) Y <- rbinom(N,1,p) # response # Normalized weights for each observation w <- rexp(N) w <- N*w/sum(w) # Offset is set to 0 offset <- rep(0, N) # Collate data data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3)
The gbm
package has a number of distributions available to user. To view what distributions are currently available simply call available_distributions()
.
available_distributions()
From the list of available distributions a default GBMDist
object for each distribution can be created with using the function gbm_dist
and passing the desired distribution's name as follows:
# Create a default Bernoulli bern_dist <- gbm_dist("Bernoulli")
Certain distributions have distribution specific parameters, such as the number of degrees of freedom associated with the distribution. These parameters are passed as arguments to gbm_dist
and how this is done is described in another vignette entitled "model-specific-parameters".
Once the data has been imported and the distribution object created, the next task is to specify the
training parameters for the gbm
package. These parameters are passed to gbmt
in the form of a GBMTrainParams
object. This S3 object can be created using the training_params
constructor and passing it the appropriate data to the fields. The named parameters in the constructor are as follows:
num_trees
: the number of iterations to use in the fitting.interaction_depth
: the max number of terminal nodes in each tree - the larger this is the more likely overfitting is to occur - depths of ~ 3-10 are usually recommended.min_num_obs_in_node
: the minimum number of observations that a node must have to be considered in fitting a tree. The default of 10 observations which is usually adequate but if your dataset is small then this could be reduced.shrinkage
: acts as regularization for additional iterations - the smaller the shrinkage generally the better the performance of the fit. However, smaller shrinkage implies that the number of trees may need to be increased to achieve a certain performance. In practice the default of 0.001
usually suffices.bag_fraction
: the fraction of the training observations randomly sub-sampled to fit a tree in each iteration. This has the effect of reducing the variance of the boosted model. The default bag_fraction=0.5
is recommended.num_train
: the number of observations to be used in training the model. This defaults to the minimum value such that the terminal nodes of a fitted tree can be created.id
: a sequence of integers specifying the id of each observation in the data. This defaults to seq_len(num_train)
.num_features
: the number of features randomly selected, on each iteration, from the data when growing a tree. Once the parameters have been decided the appropriate object is created. For further details on how these parameters and what they represent, please view: The Elements of Statistical Learning by Friedman et al.
# Creating an appropriate training parameters object train_params <- training_params(num_trees = 2000, interaction_depth = 3, num_train=nrow(data), num_features = 3)
Before fitting our boosted model there are some other parameters to consider. The monotonicity of each predictor variable with the outcome variable can be specified with through the var_monotone
parameter. In this case the relationship is assumed to be arbitrary and the parameter is unspecified. The names of the predictor variables may also be specified via var_names
.
Other parameters relating to cross-validation of the model may be provided. Firstly, the number of folds to be used in cross-validation is set by cv_folds
, by default no cross-validation is done by gbmt
. An optional vector specifying what fold each observation belongs to can also be specified through fold_id
and if the distribution selected is Bernoulli
the cross-validation can be stratified across the response.
The core algorithm can be parallelized across numerous threads by passing the output of gbmParallel()
to the par_details
handle. In this example, more than one thread for processing is not required and so the default is used.
# Example of a gbmParallel with more threads par_detail <- gbmParallel(num_threads = 10) # Pass to par_details in gbmt
With the data defined, the distribution object created, the training parameters and additional parameters specified, it is now time to fit the model. This is done by passing these objects to gbmt
along with the model formula.
# Create a gbm fit fit <- gbmt(Y ~ X1 + X2 + X3, distribution=bern_dist, data=data, weights = w, offset=offset, train_params = train_params, cv_folds=5, keep_gbm_data=TRUE)
The keep_gbm_data
flag indicates that the outputted GBMFit
object will contain the data passed to the fit within a GBMData
object.
The process outlined above is slightly cumbersome. Thankfully, gbmt
has numerous default behaviours that are very useful. Firstly, the distribution will be "guessed" if none is provided - currently it can identify Bernoulli and Cox models, otherwise it defaults to Gaussian. The "weights" parameter defaults to be uniform across all the data rows and as the offset also defaults to 0 across all data rows. The training parameters also specify 2000 trees, an interaction depth of 3, identify one row per observation, use half the available data in training and bag half the training data as well as use all of the predictors in the fitting. Moreover, the weights are normalized by the routine if the distribution is not "Pairwise".
# A default fit to gaussian data # Create data N <- 1000 X1 <- runif(N) X2 <- 2*runif(N) X3 <- factor(sample(letters[1:4],N,replace=T)) X4 <- ordered(sample(letters[1:6],N,replace=T)) X5 <- factor(sample(letters[1:3],N,replace=T)) X6 <- 3*runif(N) mu <- c(-1,0,1,2)[as.numeric(X3)] SNR <- 10 # signal-to-noise ratio Y <- X1**1.5 + 2 * (X2**.5) + mu sigma <- sqrt(var(Y)/SNR) Y <- Y + rnorm(N,0,sigma) # create a bunch of missing values X1[sample(1:N,size=100)] <- NA X3[sample(1:N,size=300)] <- NA # Gaussian data gauss_data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) # Perform a cross-validated fit gauss_fit <- gbmt(Y ~ X1 + X2 + X3 + X4 + X5 + X6, data=gauss_data, cv_folds =5, keep_gbm_data = TRUE)
A fitted model contains a number of measures of its performance. Using these measures, the optimal iteration of a gbm
fit can be identified. The package offers three methods for identifying the optimal iteration of the fit, these are: 'test'
, 'cv'
and 'OOB'
. The 'test'
method identifies the optimal iteration using the performance of the fit on a set-aside independent test set, this is only available if the number of observations used to fit the model is less than the number of observations in the original data. The 'cv'
method determines the best iteration number using the cross-validation error provided and finally 'OOB'
selects the best iteration using the out-of-bag error estimate. Again this is only available if bag_fraction > 0
and it will under-estimate the optimal number of iterations in the fit. The optimal iteration number as well as a plot of the performance can then be obtained through a call to gbmt_performance
. In these performance plots the cross-validation error, test set error and out-of-bag improvement are represented by green, red and blue lines respectively. The error on the training set is shown as a black line and the optimal iteration is marked with a dashed blue line.
# Use gaussian fit and determine the performance # red line is testing error # green line is cv error # blue line is OOB error best_iter_test <- gbmt_performance(gauss_fit, method='test') best_iter_cv <- gbmt_performance(gauss_fit, method='cv') best_iter_oob <- gbmt_performance(gauss_fit, method='OOB')
After evaluating the performance it may be the case that the optimal number of iterations is the num_trees
parameter set. In this instance additional trees may want to be fitted to the original dataset; this will not alter the cross-validation fit but will update the test set and OOB errors.
# Fit additional trees to gaussian fit and recalculate optimal number of iterations gauss_fit_more <- gbm_more(gauss_fit, num_new_trees = 5000) # This is a large number of #extra trees! best_iter_test_more <- gbmt_performance(gauss_fit_more, method='test')
With the model fitted and the optimal number of iterations determined, the user can now make predictions on other datasets using the gbm
package's predict
functionality.
# Given additional data - of the same shape as our gaussian example X1 <- runif(N) X2 <- 2*runif(N) X3 <- factor(sample(letters[1:4],N,replace=TRUE)) X4 <- ordered(sample(letters[1:6],N,replace=TRUE)) X5 <- factor(sample(letters[1:3],N,replace=TRUE)) X6 <- 3*runif(N) mu <- c(-1,0,1,2)[as.numeric(X3)] Y <- X1**1.5 + 2 * (X2**.5) + mu Y <- Y + rnorm(N,0,sigma) data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6) # Then predictions can be made preds <- predict(gauss_fit_more, newdata=data2, n.trees=best_iter_test_more)
gbm
also offers a measure of the relative influence of each predictor in the fitted model. The influence of a variable $x_j$ is determined by the sum across all iterations of the improvements of a tree fit when split on that variable.
# Relative influence of predictors - both examples print(relative_influence(gauss_fit_more, best_iter_test_more)) print(relative_influence(fit, gbmt_performance(fit, method='cv')))
As well as calculating the relative influence from the fitted trees, gbm
also offers the ability to calculate the predictions of specific variables and make the appropriate marginal plots using plot
.
# Examine the marginal effects of the first two variables - gaussian example # Shows the fitted model predictions as a function of 1st two predictors # with all the others integrated out plot(gauss_fit_more, var_index = 1:2)
Finally there are some additional useful functions within the gbm
package to remember. A GBMFit
object can be summarised and printed using summary
and print
respectively. Each individually fitted tree can be "prettified" using pretty_gbm_tree
and the loss for a given distribution can be calculated using the loss
S3 method, a GBMDist
object and the appropriate data.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.