library("gbm3")

A number of distributions provided by gbm have model specific parameters associated with them. All distributions have parameters associated with them such as the mean or variance; however, certain distributions require additional data to be defined fully. This additional data is referred to as "model specific parameters". This document describes how to correctly specify these parameters on construction of the associated GBMDist object as well as their default values.

Distributions with model specific parameters

There are 5 distributions within gbm which have additional parameters associated with them. These distributions are: CoxPH, Pairwise, Quantile, TDist and Tweedie.

Cox proportional hazards model

The Cox proportional hazards model has several model specific parameters associated with it. All of them are optional but play important roles in the boosting process.

# Create strata
strats <- sample(1:5, 100, replace=TRUE)

# Create CoxPH dist object
cox_dist <- gbm_dist(name="CoxPH", ties="breslow", 
                     strata=strats, prior_node_coeff_var=100)

Pairwise distribution

The "Pairwise" distribution implements ranking measures following the LamdaMART algorithm. Observations belong to groups, with all pairs of items with different labels but belonging to the same group are used for training. The distribution requires a character vector with the column names of the data that jointly indicate the group an observation belongs to. This character vector is passed to the group argument on construction. When training with a Pairwise distribution a number of information retrieval (IR) metrics are available whose utility is maximised by the tree growing algorithm. The metric parameter stores the selection and currently the IR metrics available are:

The default for group is "query" while metric defaults to "ndcg". If map or mrr are selected the response must be in ${0, 1}$. A cut-off in the ranking of items in a groups can be specified via max_rank, the default for this is 0 (all ranks taken into account) and is only applicable for "ndcg" and "mrr". Finally, the group_index or label can be specified directly - note this is optional and will be calculated by gbmt.

# Create pairwise grouped data
# create query groups, with an average size of 25 items each
N <- 1000
num.queries <- floor(N/25)
query <- sample(1:num.queries, N, replace=TRUE)

# X1 is a variable determined by query group only
query.level <- runif(num.queries)
X1 <- query.level[query]

# X2 varies with each item
X2 <- runif(N)

# X3 is uncorrelated with target
X3 <- runif(N)

# The target
Y <- X1 + X2

# Add some random noise to X2 that is correlated with
# queries, but uncorrelated with items

X2 <- X2 + scale(runif(num.queries))[query]

# Add some random noise to target
SNR <- 5 # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + runif(N, 0, sigma)

data <- data.frame(Y, query=query, X1, X2, X3)

# Create appropriate Pairwise object
pair_dist <- gbm_dist(name="Pairwise", group="query", max_rank=1, metric="ndcg")

Quantile

To perform quantile regression a QuantileGBMDist object must be passed to gbmt. The quantile to estimate is stored in the parameter alpha and this defaults to 0.25.

# Create a QuantileGBMDist object
quant_dist <- gbm_dist(name="Quantile", alpha=0.1)

TDist

The t-distribution requires its degrees of freedom (df) to be set. The default value for this is four but it can be specified on contruction of the associated GBMDist object.

# Creat a t-distribution object with 7 degrees of freedom
t_dist <- gbm_dist(name="TDist", df=7)

Tweedie

The tweedie distribution relates the variance of the response to its expectation via: $Var(Y) = E[Y]^p$, where p is the power of the distribution. This parameter is specified through the power named argument on calling gbm_dist and its default value is 1.5.

# Create a TweedieGBMDist object with a power of 2 - equivalent to a Gamma distribution
tweedie_dist <- gbm_dist(name="Tweedie", power=2)

Tweedie distributions include various more familiar distributions which can be accessed through setting the power parameter:

Note no Tweedie models exist for 0 < power < 1.



gbm-developers/gbm3 documentation built on April 28, 2024, 10:04 p.m.