library("gbm3")
A number of distributions provided by gbm have model specific parameters associated with them. All distributions have parameters such as the mean or variance; however, certain distributions require additional data to be fully defined. This additional data is referred to as "model specific parameters". This document describes how to specify these parameters correctly when constructing the associated GBMDist object, as well as their default values.
There are 5 distributions within gbm which have additional parameters associated with them. These distributions are: CoxPH, Pairwise, Quantile, TDist and Tweedie.
The Cox proportional hazards model has several model specific parameters associated with it. All of them are optional but play important roles in the boosting process.
strata: a vector of positive integers indicating which stratum each row of data belongs to. If there are multiple rows per observation then this should be reflected in the strata vector. If not specified, it is assumed that all training data are in one stratum and all test data are in another.
sorted: a vector specifying how the rows of data are ordered within their strata; the order within each stratum is the reverse order of the censored times or start times of the survival data. This vector is completely optional and will be calculated by gbmt if it is not supplied.
ties: a string specifying the method by which ties are broken. Currently the "breslow" and "efron" approximations are implemented, with the latter being the default.
prior_node_coeff_var: a double used to regularize the model predictions in gbm. It represents the prior on the number of events in the model. The predictions of the GBMFit are given by $\log(\text{number of events} / \text{expected number of events})$. Both the number of events in a dataset and the model's expected number of events could be $0$, leading to non-finite behaviour. The inverse of this parameter is added to both the numerator and the denominator of the log ratio to ensure the predictions are finite. The default value is $1000$, representing a base event number of $1/1000$ events irrespective of the measured or expected number of events.

# Create strata
strats <- sample(1:5, 100, replace=TRUE)

# Create CoxPH dist object
cox_dist <- gbm_dist(name="CoxPH", ties="breslow", strata=strats, prior_node_coeff_var=100)
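The effect of prior_node_coeff_var on the log ratio can be illustrated with a few lines of base R. This is only a sketch of the formula described above: the helper smoothed_log_ratio is a hypothetical name introduced for illustration and is not part of gbm.

```r
# Hypothetical helper (not part of gbm): log ratio of observed to expected
# events, with 1 / prior_node_coeff_var added to numerator and denominator.
smoothed_log_ratio <- function(events, expected, prior_node_coeff_var = 1000) {
  base <- 1 / prior_node_coeff_var
  log((events + base) / (expected + base))
}

# Without the prior, zero observed and zero expected events is not finite.
raw <- log(0 / 0)

# With the prior, the same case gives a finite prediction.
smoothed_zero <- smoothed_log_ratio(0, 0)

# Zero observed events against a positive expectation stays finite too.
smoothed_no_events <- smoothed_log_ratio(0, 2)

# For typical counts the prior barely changes the plain log ratio.
smoothed_typical <- smoothed_log_ratio(10, 8)
```

A smaller prior_node_coeff_var, such as the value of 100 used in the example above, adds a larger base event number and so regularizes more strongly.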
The "Pairwise" distribution implements ranking measures following the LambdaMART algorithm. Observations belong to groups, and all pairs of items that have different labels but belong to the same group are used for training. The distribution requires a character vector with the column names of the data that jointly indicate the group an observation belongs to. This character vector is passed to the group argument on construction. When training with a Pairwise distribution a number of information retrieval (IR) metrics are available whose utility is maximised by the tree growing algorithm. The metric parameter stores the selection and currently the IR metrics available are:
conc: fraction of concordant pairs
mrr: mean reciprocal rank of the highest-ranked positive instance
map: mean average precision
ndcg: normalized discounted cumulative gain
The default for group is "query" while metric defaults to "ndcg". If map or mrr is selected, the response must be in $\{0, 1\}$. A cut-off in the ranking of items in a group can be specified via max_rank; its default is 0 (all ranks taken into account) and it is only applicable to "ndcg" and "mrr". Finally, the group_index or label can be specified directly; this is optional and will be calculated by gbmt.
# Create pairwise grouped data
# create query groups, with an average size of 25 items each
N <- 1000
num.queries <- floor(N/25)
query <- sample(1:num.queries, N, replace=TRUE)

# X1 is a variable determined by query group only
query.level <- runif(num.queries)
X1 <- query.level[query]

# X2 varies with each item
X2 <- runif(N)

# X3 is uncorrelated with target
X3 <- runif(N)

# The target
Y <- X1 + X2

# Add some random noise to X2 that is correlated with
# queries, but uncorrelated with items
X2 <- X2 + scale(runif(num.queries))[query]

# Add some random noise to target
SNR <- 5 # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + runif(N, 0, sigma)

data <- data.frame(Y, query=query, X1, X2, X3)

# Create appropriate Pairwise object
pair_dist <- gbm_dist(name="Pairwise", group="query", max_rank=1, metric="ndcg")
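To make the role of max_rank more concrete, the sketch below computes NDCG for a single group in base R. It assumes a common textbook formulation (gain equal to the label, discount of 1 / log2(rank + 1)); the helper ndcg_at is hypothetical and the exact gain and discount used internally by gbm may differ.

```r
# Hypothetical helper (not part of gbm): NDCG for one group.
# max_rank = 0 means every rank is taken into account.
ndcg_at <- function(labels, scores, max_rank = 0) {
  k <- if (max_rank > 0) min(max_rank, length(labels)) else length(labels)
  discount <- 1 / log2(seq_len(k) + 1)
  # Labels in the order induced by the model scores, truncated at rank k
  ranked <- labels[order(scores, decreasing = TRUE)][seq_len(k)]
  # Best achievable ordering of the same labels
  ideal <- sort(labels, decreasing = TRUE)[seq_len(k)]
  ideal_dcg <- sum(ideal * discount)
  if (ideal_dcg == 0) return(0)
  sum(ranked * discount) / ideal_dcg
}

labels <- c(3, 1, 0, 2)
scores <- c(0.2, 0.9, 0.1, 0.8)  # the model ranks item 2 first, then 4, 1, 3

ndcg_all <- ndcg_at(labels, scores)                 # all ranks
ndcg_top1 <- ndcg_at(labels, scores, max_rank = 1)  # only the top item
ndcg_perfect <- ndcg_at(labels, labels)             # scores agree with labels
```

With a cut-off of 1 only the top-ranked item contributes, so the metric rewards placing the single most relevant item first and ignores the rest of the ordering.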
To perform quantile regression a QuantileGBMDist object must be passed to gbmt. The quantile to estimate is stored in the parameter alpha and this defaults to 0.25.
# Create a QuantileGBMDist object
quant_dist <- gbm_dist(name="Quantile", alpha=0.1)
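The meaning of alpha can be checked with the standard quantile ("pinball") loss, whose minimiser over constants is the alpha-quantile of the data. The base R sketch below is illustrative only; pinball_loss is a hypothetical helper and is not claimed to match the internals of gbm.

```r
# Hypothetical helper (not part of gbm): quantile (pinball) loss of a
# constant prediction q at quantile level alpha.
pinball_loss <- function(y, q, alpha) {
  resid <- y - q
  sum(ifelse(resid >= 0, alpha * resid, (alpha - 1) * resid))
}

y <- 1:100
alpha <- 0.1

# Evaluate the loss at each observed value and keep the best constant.
losses <- vapply(y, function(q) pinball_loss(y, q, alpha), numeric(1))
best <- y[which.min(losses)]
```

For these data the loss is flat between the 10th and 11th observations, so best lands on the lower 10% of the sample, as expected for alpha=0.1.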
The t-distribution requires its degrees of freedom (df) to be set. The default value for this is four but it can be specified on construction of the associated GBMDist object.
# Create a t-distribution object with 7 degrees of freedom
t_dist <- gbm_dist(name="TDist", df=7)
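As a rough guide to choosing df, fewer degrees of freedom give heavier tails, so large residuals are treated as less surprising. The base R sketch below compares the density five units from the centre for the default df of four, the df of seven used above, and a normal distribution. It relies only on the base functions dt and dnorm and says nothing about how gbm evaluates its loss.

```r
# Density at a point far from the centre under each distribution
x <- 5
tail_density <- c(
  t_df4  = dt(x, df = 4),   # default degrees of freedom
  t_df7  = dt(x, df = 7),   # value used in the example above
  normal = dnorm(x)         # limiting case as df grows
)

# Express each density relative to the normal one
tail_ratio <- tail_density / tail_density["normal"]
```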
The Tweedie distribution relates the variance of the response to its expectation via: $\mathrm{Var}(Y) = E[Y]^p$, where p is the power of the distribution. This parameter is specified through the power named argument on calling gbm_dist and its default value is 1.5.
# Create a TweedieGBMDist object with a power of 2 - equivalent to a Gamma distribution
tweedie_dist <- gbm_dist(name="Tweedie", power=2)
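Tweedie distributions include various more familiar distributions which can be accessed through setting the power parameter:
Normal distribution: power=0.
Poisson distribution: power=1.
Compound Poisson-Gamma distribution: 1 < power < 2.
Gamma distribution: power=2.
Positive stable distributions: 2 < power < 3 and power > 3.
Inverse Gaussian distribution: power=3.
Note no Tweedie models exist for 0 < power < 1.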
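The variance relation $\mathrm{Var}(Y) = E[Y]^p$ can be tabulated for a few powers in base R. This is a minimal sketch: tweedie_variance is a hypothetical helper, the dispersion is taken as 1 to match the relation as stated in this document, and the labels use the standard names for these special cases of the Tweedie family.

```r
# Hypothetical helper (not part of gbm): variance implied by the
# mean-variance relation Var(Y) = E[Y]^p, with unit dispersion.
tweedie_variance <- function(mu, power) mu^power

powers <- c(
  normal                 = 0,
  poisson                = 1,
  compound_poisson_gamma = 1.5,  # default power in gbm_dist
  gamma                  = 2,
  inverse_gaussian       = 3
)

# Variance at a mean of 2 for each power
variances <- vapply(powers, function(p) tweedie_variance(2, p), numeric(1))
```

At a mean of 2 the variance is constant for power=0, equal to the mean for power=1, and grows as the square and cube of the mean for power=2 and power=3.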