library("gbm3")

The gbm package implements gradient tree boosting for a wide variety of statistical models and loss functions, including some, such as the Cox model, that are commonly used in statistics but not usually associated with boosted models. When presented with data, the gbm package offers the user the ability to build predictive models and explore the influence of different variables on the response, akin to a data mining or exploration task.

To get started a user must:

1. load their data, with the response and predictor variables collected in a data.frame;
2. create a GBMDist object specifying the distribution of the intended fit;
3. define the training parameters for the fit via the training_params constructor.

Once these steps have been completed, a user can fit their gbm model with a call to gbmt and subsequently evaluate its performance, make predictions, fit additional trees and plot the relative influence of the various predictor variables.

Example dataset

To begin, this vignette works with simulated data where the response follows a Bernoulli distribution and is determined by 3 predictor variables. Every row of the data corresponds to an individual observation, and each observation is assigned a random weight. These weights determine the importance of the associated observation in the fit. At this stage it is also assumed that the fit is made directly to the response and no offset is supplied; if the offset vector were non-zero, the gbm fit would be to the observation's response + offset rather than the bare response. Currently the package supports responses that are numeric, factors (i.e. Bernoulli) or, in the case of Cox proportional hazards models, a survival object (more specifically a counting or right-censored survival object). NB: predictor variables may have missing data (specified by NA) but the responses must have no missing observations.

# create some binomial response data - N observations 
N <- 1000
X1 <- runif(N)
X2 <- runif(N)
X3 <- factor(sample(letters[1:4],N,replace=TRUE))
mu <- c(-1,0,1,2)[as.numeric(X3)]  # contribution of X3 to the linear predictor

p <- 1/(1+exp(-(sin(3*X1) - 4*X2 + mu)))
Y <- rbinom(N,1,p) # response

# Normalized weights for each observation
w <- rexp(N)
w <- N*w/sum(w)

# Offset is set to 0
offset <- rep(0, N)

# Collate data
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3)

Creating a gbm distribution object

The gbm package has a number of distributions available to the user. To view the distributions currently available, simply call available_distributions().

available_distributions()

From the list of available distributions, a default GBMDist object for each distribution can be created using the function gbm_dist and passing the desired distribution's name as follows:

# Create a default Bernoulli distribution object
bern_dist <- gbm_dist("Bernoulli")

Certain distributions have distribution-specific parameters, such as the number of degrees of freedom associated with the distribution. These parameters are passed as arguments to gbm_dist; how this is done is described in another vignette entitled "model-specific-parameters".
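For example, the degrees of freedom of the t-distribution could be set at construction time. The following is a sketch only: the "TDist" name comes from available_distributions(), while the df argument is an assumption to be checked against the "model-specific-parameters" vignette.

# Sketch: a t-distribution object with 4 degrees of freedom
# (the df argument name is assumed here)
t_dist <- gbm_dist("TDist", df = 4)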

Setting training parameters

Once the data has been imported and the distribution object created, the next task is to specify the training parameters for the gbm package. These parameters are passed to gbmt in the form of a GBMTrainParams object. This S3 object is created using the training_params constructor, passing the appropriate values to its fields. The named parameters in the constructor include: num_trees (the total number of trees to fit), interaction_depth (the depth of variable interactions allowed in each tree), id (identifying which rows belong to which observation), num_train (the number of observations used for training), bag_fraction (the fraction of the training data randomly selected to build each tree) and num_features (the number of predictors used in the fitting).

Once the parameters have been decided upon, the appropriate object is created. For further details on these parameters and what they represent, please view The Elements of Statistical Learning by Hastie, Tibshirani and Friedman.

# Creating an appropriate training parameters object
train_params <- training_params(num_trees = 2000, interaction_depth = 3, 
                                num_train=nrow(data), num_features = 3)

Additional variable and fitting parameters

Before fitting our boosted model there are some other parameters to consider. The monotonicity of each predictor variable with respect to the outcome can be specified through the var_monotone parameter; in this case the relationship is assumed to be arbitrary and the parameter is left unspecified. The names of the predictor variables may also be specified via var_names.
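As an illustration only (not used in the rest of this vignette), a fit that is monotone decreasing in X2 could be requested as sketched below, assuming the usual gbm convention of +1 for an increasing relationship, -1 for decreasing and 0 for no restriction.

# Illustrative sketch: constrain the fit to be monotone decreasing in X2,
# leaving X1 and X3 unconstrained (+1 increasing, -1 decreasing, 0 arbitrary)
monotone_fit <- gbmt(Y ~ X1 + X2 + X3, distribution = bern_dist, data = data,
                     train_params = train_params,
                     var_monotone = c(0, -1, 0))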

Cross Validation

Other parameters relating to cross-validation of the model may also be provided. Firstly, the number of folds to be used in cross-validation is set by cv_folds; by default no cross-validation is done by gbmt. An optional vector specifying which fold each observation belongs to can also be supplied through fold_id, and if the selected distribution is Bernoulli the cross-validation can be stratified across the response.
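For example, a custom fold assignment could be constructed as sketched below; the (commented-out) call assumes fold_id is passed directly to gbmt as described above.

# Sketch: assign each observation to one of 5 folds
fold_id <- sample(rep(1:5, length.out = nrow(data)))
# fold_fit <- gbmt(Y ~ X1 + X2 + X3, distribution = bern_dist, data = data,
#                  train_params = train_params, cv_folds = 5, fold_id = fold_id)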

Parallelisation

The core algorithm can be parallelized across multiple threads by passing the output of gbmParallel() to the par_details argument of gbmt. In this example more than one thread is not required, so the default is used.

# Example of a gbmParallel with more threads
par_detail <- gbmParallel(num_threads = 10) # Pass to par_details in gbmt

Putting it all together

With the data defined, the distribution object created, the training parameters and additional parameters specified, it is now time to fit the model. This is done by passing these objects to gbmt along with the model formula.

# Create a gbm fit
fit <- gbmt(Y ~ X1 + X2 + X3, distribution=bern_dist, data=data, weights = w, 
            offset=offset, train_params = train_params,
            cv_folds=5, keep_gbm_data=TRUE)

The keep_gbm_data flag indicates that the returned GBMFit object will contain the data passed to the fit, stored within a GBMData object.

Default behaviour

The process outlined above is slightly cumbersome. Thankfully, gbmt has a number of useful default behaviours. Firstly, the distribution will be "guessed" if none is provided: currently it can identify Bernoulli and Cox models, and otherwise defaults to Gaussian. The weights default to be uniform across all the data rows, and the offset defaults to 0 for every row. The default training parameters specify 2000 trees, an interaction depth of 3, one row per observation, half the available data used for training, a bag fraction of 0.5, and all of the predictors considered in the fitting. Moreover, the weights are normalized by the routine unless the distribution is "Pairwise".

# A default fit to gaussian data
# Create data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4],N,replace=TRUE))
X4 <- ordered(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1^1.5 + 2 * (X2^0.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# create a bunch of missing values
X1[sample(1:N,size=100)] <- NA
X3[sample(1:N,size=300)] <- NA

# Gaussian data
gauss_data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# Perform a cross-validated fit
gauss_fit <- gbmt(Y ~ X1 + X2 + X3 + X4 + X5 + X6, 
                  data=gauss_data, cv_folds =5, keep_gbm_data = TRUE)

Identifying the optimal iteration

A fitted model contains a number of measures of its performance. Using these measures, the optimal iteration of a gbm fit can be identified. The package offers three methods for identifying the optimal iteration of the fit: 'test', 'cv' and 'OOB'. The 'test' method identifies the optimal iteration using the performance of the fit on a set-aside independent test set; this is only available if the number of observations used to fit the model is less than the number of observations in the original data. The 'cv' method determines the best iteration number using the cross-validation error, and 'OOB' selects the best iteration using the out-of-bag error estimate; this is only available if bagging is performed (bag_fraction less than 1, so some observations fall out of the bag) and it will underestimate the optimal number of iterations in the fit. The optimal iteration number, as well as a plot of the performance, can then be obtained through a call to gbmt_performance. In these performance plots the cross-validation error, test set error and out-of-bag improvement are represented by green, red and blue lines respectively. The error on the training set is shown as a black line and the optimal iteration is marked with a dashed blue line.

# Use gaussian fit and determine the performance
# red line is testing error
# green line is cv error
# blue line is OOB error
best_iter_test <- gbmt_performance(gauss_fit, method='test')
best_iter_cv <- gbmt_performance(gauss_fit, method='cv')
best_iter_oob <- gbmt_performance(gauss_fit, method='OOB')

Fitting additional trees

After evaluating the performance it may turn out that the optimal number of iterations equals the num_trees parameter that was set, suggesting the fit has not yet converged. In this instance additional trees may be fitted to the original dataset using gbm_more; this will not alter the cross-validation fit but will update the test set and OOB errors.

# Fit additional trees to gaussian fit and recalculate optimal number of iterations
# Note: 5000 is a large number of extra trees!
gauss_fit_more <- gbm_more(gauss_fit, num_new_trees = 5000)
best_iter_test_more <- gbmt_performance(gauss_fit_more, method='test')

Predictions on other data

With the model fitted and the optimal number of iterations determined, the user can now make predictions on other datasets using the gbm package's predict functionality.

# Given additional data - of the same shape as our gaussian example
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- factor(sample(letters[1:4],N,replace=TRUE))
X4 <- ordered(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

Y <- X1^1.5 + 2 * (X2^0.5) + mu
Y <- Y + rnorm(N,0,sigma)

data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# Then predictions can be made
preds <- predict(gauss_fit_more, newdata=data2, n.trees=best_iter_test_more)
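As a quick check (not part of the original example), the predictions can be compared against the simulated responses in data2.

# Root-mean-square error of the predictions on the new data
sqrt(mean((data2$Y - preds)^2))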

Relative influence of predictors

gbm also offers a measure of the relative influence of each predictor in the fitted model. The influence of a variable $x_j$ is determined by summing, across all trees in the fit, the improvements obtained whenever a tree is split on that variable.
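In the notation of Friedman (2001) this corresponds, roughly, to

$$\hat{I}_j^2 = \sum_{t=1}^{T} \sum_{k \in \mathrm{splits}(t)} \hat{i}_k^{\,2}\, \mathbf{1}(v_k = j),$$

where the inner sum runs over the internal nodes of tree $t$, $\hat{i}_k^2$ is the empirical improvement from the split at node $k$, and $v_k$ is the variable split on at that node.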

# Relative influence of predictors - both examples
print(relative_influence(gauss_fit_more, best_iter_test_more))
print(relative_influence(fit, gbmt_performance(fit, method='cv')))

Plotting the marginal effects of selected variables

As well as calculating the relative influence from the fitted trees, gbm also offers the ability to compute the model's predictions as a function of selected variables and to make the corresponding marginal plots using plot.

# Examine the marginal effects of the first two variables - gaussian example
# Shows the fitted model predictions as a function of 1st two predictors
# with all the others integrated out
plot(gauss_fit_more, var_index = 1:2)

Additional useful functions

Finally, there are some additional useful functions within the gbm package to remember. A GBMFit object can be summarised and printed using summary and print respectively. Each individually fitted tree can be "prettified" using pretty_gbm_tree, and the loss for a given distribution can be calculated using the loss S3 method together with a GBMDist object and the appropriate data.
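A minimal sketch using the earlier Gaussian fit; only the fit object is passed so that each function's defaults are relied upon.

# Print and summarise the fitted model
print(gauss_fit_more)
summary(gauss_fit_more)

# Display the structure of a single fitted tree (the first by default;
# see ?pretty_gbm_tree for selecting a particular tree)
pretty_gbm_tree(gauss_fit_more)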


