gbm.fit (R Documentation)
Fits generalized boosted regression models.
gbm.fit(
x,
y,
offset = NULL,
distribution = "bernoulli",
w = NULL,
var.monotone = NULL,
n.trees = 100,
interaction.depth = 1,
n.minobsinnode = 10,
shrinkage = 0.001,
bag.fraction = 0.5,
nTrain = NULL,
train.fraction = NULL,
mFeatures = NULL,
keep.data = TRUE,
verbose = TRUE,
var.names = NULL,
response.name = "y",
group = NULL,
tied.times.method = "efron",
prior.node.coeff.var = 1000,
strata = NA,
obs.id = 1:nrow(x)
)
gbm(
formula = formula(data),
distribution = "bernoulli",
data = list(),
weights,
subset = NULL,
offset = NULL,
var.monotone = NULL,
n.trees = 100,
interaction.depth = 1,
n.minobsinnode = 10,
shrinkage = 0.001,
bag.fraction = 0.5,
train.fraction = 1,
mFeatures = NULL,
cv.folds = 0,
keep.data = TRUE,
verbose = FALSE,
class.stratify.cv = NULL,
n.cores = NULL,
par.details = getOption("gbm.parallel"),
fold.id = NULL,
tied.times.method = "efron",
prior.node.coeff.var = 1000,
strata = NA,
obs.id = 1:nrow(data)
)
x, y: For gbm.fit: x is a data frame or matrix containing the predictor variables and y is a vector of outcomes. The number of rows in x must equal the length of y.

offset: an optional model offset.

distribution: either a character string specifying the name of the distribution to use or a list with a component "name" specifying the distribution and any additional parameters needed. Available distributions are "gaussian" (squared error), "laplace" (absolute loss), "tdist" (t-distribution loss), "bernoulli" (logistic regression for 0-1 outcomes), "huberized" (Huberized hinge loss for 0-1 outcomes), "adaboost" (the AdaBoost exponential loss for 0-1 outcomes), "poisson" (count outcomes), "coxph" (right-censored observations), "quantile", or "pairwise" (ranking measure using the LambdaMART algorithm). If quantile regression is specified, the quantile to estimate must be supplied as an additional component of the list. If "tdist" is specified, the default degrees of freedom is four; this can be controlled by supplying the degrees of freedom as an additional component of the list. If "pairwise" regression is specified, the list must also identify the grouping of instances and may optionally specify the ranking metric and the maximum rank to optimize. Note that splitting of instances into training and validation sets follows group boundaries and therefore only approximates the specified train.fraction ratio. Weights can be used in conjunction with pairwise metrics; however, it is assumed that they are constant for instances from the same group. For details and background on the algorithm, see e.g. Burges (2010).
w: For gbm.fit: an optional vector of weights used in the fitting process. The weights must be positive but do not need to be normalized.

var.monotone: an optional vector, the same length as the number of predictors, indicating which variables have a monotone increasing (+1), decreasing (-1), or arbitrary (0) relationship with the outcome.

n.trees: the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.

interaction.depth: the maximum depth of variable interactions: 1 builds an additive model, 2 builds a model with up to two-way interactions, etc.

n.minobsinnode: minimum number of observations (not total weights) in the terminal nodes of the trees.

shrinkage: a shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction.

bag.fraction: the fraction of independent training observations (or patients) randomly selected to propose the next tree in the expansion; depending on the obs.id vector, multiple training data rows may belong to a single 'patient'. This introduces randomness into the model fit. If bag.fraction is less than 1, running the same model twice will result in similar but different fits.
nTrain: an integer representing the number of unique patients on which to train (each patient may have multiple rows associated with them). This is the preferred way of specifying the size of the training set for gbm.fit; it is mutually exclusive with train.fraction.

train.fraction: the fraction of the data used for training; the first observations, up to this fraction of the data, are used to fit the model and the remainder is used for computing out-of-sample estimates of the loss function.

mFeatures: each node will be trained on a random subset of mFeatures features; a new random subset is drawn for each node, which adds variability to tree growth and reduces computation time.

keep.data: a logical variable indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm_more faster.

verbose: if TRUE, gbm will print out progress and performance indicators. If this option is left unspecified for gbm_more then it uses the setting from the original fit.
var.names: for gbm.fit: a vector of strings, of length equal to the number of columns of x, containing the names of the predictor variables.

response.name: for gbm.fit: a character string label for the response variable.

group: used only with distribution "pairwise"; identifies the column(s) of the data that jointly indicate the group an instance belongs to (typically a query in information-retrieval applications). For training, only pairs of instances from the same group and with different target labels can be considered.

tied.times.method: for the "coxph" distribution only: a string specifying the method of resolving ties in survival times; "efron" (the default) and "breslow" are available.

prior.node.coeff.var: optional double only used with the "coxph" distribution; a prior on the coefficient of variation associated with the hazard rate assigned to each terminal node.

strata: optional vector of integers (or factors) only used with the "coxph" distribution; each value indicates the stratum to which the corresponding row of the data belongs.

obs.id: optional vector of integers used to specify which rows of data belong to individual patients. Data is then bagged by patient id; the default sets each row of the data to belong to an individual patient.
formula: a symbolic description of the model to be fit. The formula may include an offset term (e.g. y ~ offset(n) + x). If keep.data = FALSE in the initial call to gbm then it is the user's responsibility to resupply the offset to gbm_more.
data: an optional data frame containing the variables in the model. By default the variables are taken from the environment of the formula, typically the environment from which gbm is called.

weights: an optional vector of weights to be used in the fitting process. The weights must be positive but do not need to be normalized. If keep.data = FALSE in the initial call to gbm then it is the user's responsibility to resupply the weights to gbm_more.

subset: an optional vector defining a subset of the data to be used in the fit.

cv.folds: number of cross-validation folds to perform. If cv.folds > 1 then gbm, in addition to the usual fit, will perform cross-validation and calculate an estimate of generalization error.

class.stratify.cv: whether the cross-validation should be stratified by class. This is only implemented for "bernoulli"; stratification helps avoid situations in which training sets do not contain all classes.

n.cores: number of cores to use for parallelization. Please use par.details instead; this argument is only maintained for backwards compatibility.

par.details: details of the parallelization to use in the core algorithm.

fold.id: an optional vector of values identifying what fold each observation is in. If supplied, cv.folds can be missing. Note: multiple observations belonging to the same patient must have the same fold id.
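When a distribution needs extra parameters, it is passed as a list rather than a character string. A minimal sketch of the list forms described above; the specific component values (the 0.25 quantile, the "query" grouping column, the "ndcg" metric, and the rank cutoff of 5) are illustrative assumptions, not defaults:

```r
# Quantile regression: estimate the 25th percentile (alpha is illustrative)
dist_quantile <- list(name = "quantile", alpha = 0.25)

# t-distribution loss with an explicit degrees-of-freedom component
dist_tdist <- list(name = "tdist", df = 4)

# Pairwise ranking: group instances by a hypothetical "query" column and
# optimize normalized discounted cumulative gain up to rank 5
dist_pairwise <- list(name = "pairwise",
                      group = "query",
                      metric = "ndcg",
                      max.rank = 5)
```

Any of these lists can then be supplied as the distribution argument in place of a string.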
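As noted for the formula argument, an offset term can be written directly into the formula. A base-R sketch showing that an offset() term is carried through the model frame (the names df, counts, exposure, and x1 are hypothetical); the same formula form can be passed to gbm:

```r
set.seed(2)
# Hypothetical count data with an exposure column
df <- data.frame(counts   = rpois(10, 5),
                 exposure = runif(10, 1, 2),
                 x1       = rnorm(10))
# The offset() term is recorded in the model frame, not treated as a predictor
mf  <- model.frame(counts ~ offset(log(exposure)) + x1, data = df)
off <- model.offset(mf)
```

Here off recovers log(df$exposure) for every row, which is what gbm would use as the model offset.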
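Because fold.id must be constant within a patient, one way to construct it is to assign folds at the patient level and then map the assignment back to rows. A sketch under assumed data (20 hypothetical patients with 3 rows each, 5 folds):

```r
set.seed(1)
obs.id   <- rep(1:20, each = 3)      # hypothetical patient id for each data row
patients <- unique(obs.id)
# Randomly assign each patient (not each row) to one of 5 folds
patient_fold <- sample(rep(1:5, length.out = length(patients)))
# Map the patient-level folds back onto the rows
fold.id <- patient_fold[match(obs.id, patients)]
# every row of a given patient now carries the same fold id
```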
See the gbm vignette for technical details.
This package implements the generalized boosted modeling framework. Boosting is the process of iteratively adding basis functions in a greedy fashion so that each additional basis function further reduces the selected loss function. This implementation closely follows Friedman's Gradient Boosting Machine (Friedman, 2001).
In addition to many of the features documented in the Gradient Boosting Machine, gbm offers additional features including the out-of-bag estimator for the optimal number of iterations, the ability to store and manipulate the resulting GBMFit object, and a variety of other loss functions that had not previously had associated boosting algorithms, including the Cox partial likelihood for censored data, the Poisson likelihood for count outcomes, and a gradient boosting implementation to minimize the AdaBoost exponential loss function.
gbm is a deprecated function that now acts as a front-end to gbmt_fit, using the familiar R modeling formulas. However, model.frame is very slow when there are many predictor variables, so power users with many variables should prefer gbm.fit over gbm. Note that gbmt and gbmt_fit are now the current APIs.
gbm and gbm.fit return a GBMFit object.
gbm.fit(): core fitting code, for experts only.
James Hickey, Greg Ridgeway gregridgeway@gmail.com
Quantile regression code developed by Brian Kriegler bk@stat.ucla.edu
t-distribution code developed by Harry Southworth and Daniel Edwards
Pairwise code developed by Stefan Schroedl schroedl@a9.com
Y. Freund and R.E. Schapire (1997) “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, 55(1):119-139.
G. Ridgeway (1999). “The state of boosting,” Computing Science and Statistics 31:172-181.
J.H. Friedman, T. Hastie, R. Tibshirani (2000). “Additive Logistic Regression: a Statistical View of Boosting,” Annals of Statistics 28(2):337-374.
J.H. Friedman (2001). “Greedy Function Approximation: A Gradient Boosting Machine,” Annals of Statistics 29(5):1189-1232.
J.H. Friedman (2002). “Stochastic Gradient Boosting,” Computational Statistics and Data Analysis 38(4):367-378.
B. Kriegler (2007). Cost-Sensitive Stochastic Gradient Boosting Within a Quantitative Regression Framework. PhD dissertation, UCLA Statistics.
C. Burges (2010). “From RankNet to LambdaRank to LambdaMART: An Overview,” Microsoft Research Technical Report MSR-TR-2010-82.
See also: gbmt, gbmt_fit, gbmt_performance, plot, predict.GBMFit, summary.GBMFit, pretty_gbm_tree, gbmParallel.
# A least squares regression example
# create some data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
# 0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
mFeatures = 3, # half of the features are considered at each node
n.minobsinnode = 10, # minimum total weight needed in each node
cv.folds = 3, # do 3-fold cross-validation
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE # don't print out progress
# , par.details=gbmParallel(num_threads=15) # option for gbm3 to parallelize
)
# check performance using an out-of-bag estimator
# OOB underestimates the optimal number of iterations
best_iter <- gbmt_performance(gbm1,method="OOB")
print(best_iter)
# check performance using a 50% heldout test set
best_iter <- gbmt_performance(gbm1,method="test")
print(best_iter)
# check performance using 3-fold cross-validation
best_iter <- gbmt_performance(gbm1,method="cv")
print(best_iter)
# plot the performance
# plot variable influence
summary(gbm1, num_trees=1) # based on the first tree
summary(gbm1, num_trees=best_iter) # based on the estimated best number of trees
# compactly print the first and last trees for curiosity
print(pretty_gbm_tree(gbm1,1))
print(pretty_gbm_tree(gbm1,gbm1$params$num_trees))
# make some new data
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE))
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
Y <- X1**1.5 + 2 * (X2**.5) + mu + rnorm(N,0,sigma)
data2 <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# predict on the new data using "best" number of trees
# f.predict generally will be on the canonical scale (logit,log,etc.)
f.predict <- predict(gbm1,data2,best_iter)
# least squares error
print(sum((data2$Y-f.predict)^2))
# create marginal plots
# plot variable X1,X2,X3 after "best" iterations
oldpar <- par(no.readonly = TRUE)
par(mfrow=c(1,3))
plot(gbm1,1,best_iter)
plot(gbm1,2,best_iter)
plot(gbm1,3,best_iter)
par(mfrow=c(1,1))
# contour plot of variables 1 and 2 after "best" iterations
plot(gbm1,1:2,best_iter)
# lattice plot of variables 2 and 3
plot(gbm1,2:3,best_iter)
# lattice plot of variables 3 and 4
plot(gbm1,3:4,best_iter)
# 3-way plots
plot(gbm1,c(1,2,6),best_iter,cont=20)
plot(gbm1,1:3,best_iter)
plot(gbm1,2:4,best_iter)
plot(gbm1,3:5,best_iter)
par(oldpar) # reset graphics options to previous settings
# do another 100 iterations
gbm2 <- gbm_more(gbm1,100,
is_verbose=FALSE) # stop printing detailed progress