#' Generalized Boosted Regression Modeling (GBM)
#'
#' Fits generalized boosted regression models. For technical details, see the
#' vignette: \code{utils::browseVignettes("gbm")}.
#'
#' \code{gbm.fit} provides the link between R and the C++ gbm engine.
#' \code{gbm} is a front-end to \code{gbm.fit} that uses the familiar R
#' modeling formulas. However, \code{\link[stats]{model.frame}} is very slow if
#' there are many predictor variables. For power users with many variables, use
#' \code{gbm.fit}; for general practice, \code{gbm} is preferable.
#'
#' @param formula A symbolic description of the model to be fit. The formula
#' may include an offset term (e.g., \code{y ~ offset(n) + x}). If
#' \code{keep.data = FALSE} in the initial call to \code{gbm} then it is the
#' user's responsibility to resupply the offset to \code{\link{gbm.more}}.
#'
#' @param distribution Either a character string specifying the name of the
#' distribution to use or a list with a component \code{name} specifying the
#' distribution and any additional parameters needed. If not specified,
#' \code{gbm} will try to guess: if the response has only 2 unique values,
#' bernoulli is assumed; otherwise, if the response is a factor, multinomial is
#' assumed; otherwise, if the response has class \code{"Surv"}, coxph is
#' assumed; otherwise, gaussian is assumed.
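#'
#' For instance (a minimal sketch, assuming a hypothetical data frame
#' \code{df} with a 0/1 response \code{y}), the following two calls specify
#' the same model:
#' \preformatted{
#' gbm(y ~ ., data = df)                              # bernoulli is guessed
#' gbm(y ~ ., data = df, distribution = "bernoulli")  # explicit
#' }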
#'
#' Currently available options are \code{"gaussian"} (squared error),
#' \code{"laplace"} (absolute loss), \code{"tdist"} (t-distribution loss),
#' \code{"bernoulli"} (logistic regression for 0-1 outcomes),
#' \code{"huberized"} (huberized hinge loss for 0-1 outcomes),
#' \code{"adaboost"} (the AdaBoost exponential loss for 0-1 outcomes),
#' \code{"poisson"} (count outcomes), \code{"coxph"} (right censored
#' observations), \code{"quantile"}, or \code{"pairwise"} (ranking measure
#' using the LambdaMart algorithm).
#'
#' If quantile regression is specified, \code{distribution} must be a list of
#' the form \code{list(name = "quantile", alpha = 0.25)} where \code{alpha} is
#' the quantile to estimate. The current version's quantile regression method
#' does not handle non-constant weights and will stop.
#'
#' If \code{"tdist"} is specified, the default degrees of freedom is 4 and
#' this can be controlled by specifying
#' \code{distribution = list(name = "tdist", df = DF)} where \code{DF} is your
#' chosen degrees of freedom.
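#'
#' For example (a minimal sketch, assuming a hypothetical data frame \code{df}
#' with response \code{y}):
#' \preformatted{
#' # 25th-percentile (quantile) regression
#' gbm(y ~ ., data = df, distribution = list(name = "quantile", alpha = 0.25))
#'
#' # t-distribution loss with 6 degrees of freedom
#' gbm(y ~ ., data = df, distribution = list(name = "tdist", df = 6))
#' }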
#'
#' If "pairwise" regression is specified, \code{distribution} must be a list of
#' the form \code{list(name="pairwise",group=...,metric=...,max.rank=...)}
#' (\code{metric} and \code{max.rank} are optional, see below). \code{group} is
#' a character vector with the column names of \code{data} that jointly
#' indicate the group an instance belongs to (typically a query in Information
#' Retrieval applications). For training, only pairs of instances from the same
#' group and with different target labels can be considered. \code{metric} is
#' the IR measure to use, one of
#' \describe{
#'   \item{\code{conc}}{Fraction of concordant pairs; for binary labels, this
#'   is equivalent to the Area under the ROC Curve.}
#'   \item{\code{mrr}}{Mean reciprocal rank of the highest-ranked positive
#'   instance.}
#'   \item{\code{map}}{Mean average precision, a generalization of \code{mrr}
#'   to multiple positive instances.}
#'   \item{\code{ndcg}}{Normalized discounted cumulative gain. The score is
#'   the weighted sum (DCG) of the user-supplied target values, weighted
#'   by log(rank+1), and normalized to the maximum achievable value. This
#'   is the default if the user did not specify a metric.}
#' }
#'
#' \code{ndcg} and \code{conc} allow arbitrary target values, while binary
#' targets \{0,1\} are expected for \code{map} and \code{mrr}. For \code{ndcg}
#' and \code{mrr}, a cut-off can be chosen using a positive integer parameter
#' \code{max.rank}. If left unspecified, all ranks are taken into account.
#'
#' Note that splitting of instances into training and validation sets follows
#' group boundaries and therefore only approximates the specified
#' \code{train.fraction} ratio (the same applies to cross-validation folds).
#' Internally, queries are randomly shuffled before training, to avoid bias.
#'
#' Weights can be used in conjunction with pairwise metrics, however it is
#' assumed that they are constant for instances from the same group.
#'
#' For details and background on the algorithm, see e.g. Burges (2010).
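#'
#' For example, a hypothetical query-ranking setup, where \code{data} contains
#' a query identifier column named \code{"query"} and \code{y} holds the
#' relevance labels, might be specified as:
#' \preformatted{
#' gbm(y ~ x1 + x2, data = data, distribution = list(name = "pairwise",
#'     group = "query", metric = "ndcg", max.rank = 5))
#' }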
#'
#' @param data an optional data frame containing the variables in the model. By
#' default the variables are taken from \code{environment(formula)}, typically
#' the environment from which \code{gbm} is called. If \code{keep.data=TRUE} in
#' the initial call to \code{gbm} then \code{gbm} stores a copy with the
#' object. If \code{keep.data = FALSE}, then it is the user's responsibility to
#' resupply the same dataset to subsequent calls to \code{\link{gbm.more}}.
#'
#' @param weights an optional vector of weights to be used in the fitting
#' process. The weights must be positive but need not be normalized. If
#' \code{keep.data=FALSE} in the initial call to \code{gbm} then it is the
#' user's responsibility to resupply the weights to \code{\link{gbm.more}}.
#'
#' @param var.monotone an optional vector, the same length as the number of
#' predictors, indicating which variables have a monotone increasing (+1),
#' decreasing (-1), or arbitrary (0) relationship with the outcome.
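#' For example, with three predictors where the first is constrained to a
#' monotone increasing relationship with the outcome and the third to a
#' monotone decreasing one (a hypothetical illustration), one would set
#' \code{var.monotone = c(1, 0, -1)}.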
#'
#' @param n.trees Integer specifying the total number of trees to fit. This is
#' equivalent to the number of iterations and the number of basis functions in
#' the additive expansion. Default is 100.
#'
#' @param interaction.depth Integer specifying the maximum depth of each tree
#' (i.e., the highest level of variable interactions allowed). A value of 1
#' implies an additive model, a value of 2 implies a model with up to 2-way
#' interactions, etc. Default is 1.
#'
#' @param n.minobsinnode Integer specifying the minimum number of observations
#' in the terminal nodes of the trees. Note that this is the actual number of
#' observations, not the total weight. Default is 10.
#'
#' @param shrinkage a shrinkage parameter applied to each tree in the
#' expansion. Also known as the learning rate or step-size reduction; 0.001 to
#' 0.1 usually work, but a smaller learning rate typically requires more trees.
#' Default is 0.1.
#'
#' @param bag.fraction the fraction of the training set observations randomly
#' selected to propose the next tree in the expansion. This introduces
#' randomness into the model fit. If \code{bag.fraction} < 1, then running the
#' same model twice will result in similar but different fits. \code{gbm} uses
#' the R random number generator so \code{set.seed} can ensure that the model
#' can be reconstructed. Preferably, the user can save the returned
#' \code{\link{gbm.object}} using \code{\link{save}}. Default is 0.5.
#'
#' @param train.fraction The first \code{train.fraction * nrows(data)}
#' observations are used to fit the \code{gbm} and the remainder are used for
#' computing out-of-sample estimates of the loss function. Default is 1.
#'
#' @param cv.folds Number of cross-validation folds to perform. If
#' \code{cv.folds > 1}, then, in addition to the usual fit, \code{gbm} will
#' perform cross-validation and calculate an estimate of generalization error,
#' which is returned in \code{cv.error}.
#'
#' @param keep.data a logical variable indicating whether to keep the data and
#' an index of the data stored with the object. Keeping the data and index
#' makes subsequent calls to \code{\link{gbm.more}} faster at the cost of
#' storing an extra copy of the dataset.
#'
#' @param verbose Logical indicating whether or not to print out progress and
#' performance indicators (\code{TRUE}). If this option is left unspecified for
#' \code{gbm.more}, then it uses \code{verbose} from \code{object}. Default is
#' \code{FALSE}.
#'
#' @param class.stratify.cv Logical indicating whether or not the
#' cross-validation should be stratified by class. Defaults to \code{TRUE} for
#' \code{distribution = "multinomial"} and is only implemented for
#' \code{"multinomial"} and \code{"bernoulli"}. The purpose of stratifying the
#' cross-validation is to help avoid situations in which training sets do
#' not contain all classes.
#'
#' @param n.cores The number of CPU cores to use. The cross-validation loop
#' will attempt to send different CV folds off to different cores. If
#' \code{n.cores} is not specified by the user, it is guessed using the
#' \code{detectCores} function in the \code{parallel} package. Note that the
#' documentation for \code{detectCores} makes clear that it is not failsafe and
#' could return a spurious number of available cores.
#'
#' @return A \code{\link{gbm.object}} object.
#'
#' @details
#' This package implements the generalized boosted modeling framework. Boosting
#' is the process of iteratively adding basis functions in a greedy fashion so
#' that each additional basis function further reduces the selected loss
#' function. This implementation closely follows Friedman's Gradient Boosting
#' Machine (Friedman, 2001).
#'
#' In addition to many of the features documented in the Gradient Boosting
#' Machine, \code{gbm} offers additional features including the out-of-bag
#' estimator for the optimal number of iterations, the ability to store and
#' manipulate the resulting \code{gbm} object, and a variety of other loss
#' functions that had not previously had associated boosting algorithms,
#' including the Cox partial likelihood for censored data, the poisson
#' likelihood for count outcomes, and a gradient boosting implementation to
#' minimize the AdaBoost exponential loss function. The gbm package is no
#' longer under active development; consider
#' \url{https://github.com/gbm-developers/gbm3} for the latest version.
#'
#' @author Greg Ridgeway \email{gregridgeway@@gmail.com}
#'
#' Quantile regression code developed by Brian Kriegler
#' \email{bk@@stat.ucla.edu}
#'
#' t-distribution and multinomial code developed by Harry Southworth and
#' Daniel Edwards
#'
#' Pairwise code developed by Stefan Schroedl \email{schroedl@@a9.com}
#'
#' @seealso \code{\link{gbm.object}}, \code{\link{gbm.perf}},
#' \code{\link{plot.gbm}}, \code{\link{predict.gbm}}, \code{\link{summary.gbm}},
#' and \code{\link{pretty.gbm.tree}}.
#'
#' @references
#' Y. Freund and R.E. Schapire (1997) \dQuote{A decision-theoretic
#' generalization of on-line learning and an application to boosting,}
#' \emph{Journal of Computer and System Sciences,} 55(1):119-139.
#'
#' G. Ridgeway (1999). \dQuote{The state of boosting,} \emph{Computing Science
#' and Statistics} 31:172-181.
#'
#' J.H. Friedman, T. Hastie, R. Tibshirani (2000). \dQuote{Additive Logistic
#' Regression: a Statistical View of Boosting,} \emph{Annals of Statistics}
#' 28(2):337-374.
#'
#' J.H. Friedman (2001). \dQuote{Greedy Function Approximation: A Gradient
#' Boosting Machine,} \emph{Annals of Statistics} 29(5):1189-1232.
#'
#' J.H. Friedman (2002). \dQuote{Stochastic Gradient Boosting,}
#' \emph{Computational Statistics and Data Analysis} 38(4):367-378.
#'
#' B. Kriegler (2007). Cost-Sensitive Stochastic Gradient Boosting Within a
#' Quantitative Regression Framework. Ph.D. Dissertation. University of
#' California at Los Angeles, Los Angeles, CA, USA. Advisor(s) Richard A. Berk.
#' \url{https://dl.acm.org/doi/book/10.5555/1354603}.
#'
#' C. Burges (2010). \dQuote{From RankNet to LambdaRank to LambdaMART: An
#' Overview,} Microsoft Research Technical Report MSR-TR-2010-82.
#'
#' @export
#'
#' @examples
#' #
#' # A least squares regression example
#' #
#'
#' # Simulate data
#' set.seed(101) # for reproducibility
#' N <- 1000
#' X1 <- runif(N)
#' X2 <- 2 * runif(N)
#' X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
#' X4 <- factor(sample(letters[1:6], N, replace = TRUE))
#' X5 <- factor(sample(letters[1:3], N, replace = TRUE))
#' X6 <- 3 * runif(N)
#' mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
#' SNR <- 10 # signal-to-noise ratio
#' Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu
#' sigma <- sqrt(var(Y) / SNR)
#' Y <- Y + rnorm(N, 0, sigma)
#' X1[sample(1:N, size = 500)] <- NA # introduce some missing values
#' X4[sample(1:N, size = 300)] <- NA # introduce some missing values
#' data <- data.frame(Y, X1, X2, X3, X4, X5, X6)
#'
#' # Fit a GBM
#' set.seed(102) # for reproducibility
#' gbm1 <- gbm(Y ~ ., data = data, var.monotone = c(0, 0, 0, 0, 0, 0),
#'             distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
#'             interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
#'             n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
#'             verbose = FALSE, n.cores = 1)
#'
#' # Check performance using the out-of-bag (OOB) error; the OOB error typically
#' # underestimates the optimal number of iterations
#' best.iter <- gbm.perf(gbm1, method = "OOB")
#' print(best.iter)
#'
#' # Check performance using the 50% heldout test set
#' best.iter <- gbm.perf(gbm1, method = "test")
#' print(best.iter)
#'
#' # Check performance using 5-fold cross-validation
#' best.iter <- gbm.perf(gbm1, method = "cv")
#' print(best.iter)
#'
#' # Plot relative influence of each variable
#' par(mfrow = c(1, 2))
#' summary(gbm1, n.trees = 1) # using first tree
#' summary(gbm1, n.trees = best.iter) # using estimated best number of trees
#'
#' # Compactly print the first and last trees for curiosity
#' print(pretty.gbm.tree(gbm1, i.tree = 1))
#' print(pretty.gbm.tree(gbm1, i.tree = gbm1$n.trees))
#'
#' # Simulate new data
#' set.seed(103) # for reproducibility
#' N <- 1000
#' X1 <- runif(N)
#' X2 <- 2 * runif(N)
#' X3 <- ordered(sample(letters[1:4], N, replace = TRUE))
#' X4 <- factor(sample(letters[1:6], N, replace = TRUE))
#' X5 <- factor(sample(letters[1:3], N, replace = TRUE))
#' X6 <- 3 * runif(N)
#' mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
#' Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu + rnorm(N, 0, sigma)
#' data2 <- data.frame(Y, X1, X2, X3, X4, X5, X6)
#'
#' # Predict on the new data using the "best" number of trees; by default,
#' # predictions will be on the link scale
#' Yhat <- predict(gbm1, newdata = data2, n.trees = best.iter, type = "link")
#'
#' # least squares error
#' print(sum((data2$Y - Yhat)^2))
#'
#' # Construct univariate partial dependence plots
#' plot(gbm1, i.var = 1, n.trees = best.iter)
#' plot(gbm1, i.var = 2, n.trees = best.iter)
#' plot(gbm1, i.var = "X3", n.trees = best.iter) # can use index or name
#'
#' # Construct bivariate partial dependence plots
#' plot(gbm1, i.var = 1:2, n.trees = best.iter)
#' plot(gbm1, i.var = c("X2", "X3"), n.trees = best.iter)
#' plot(gbm1, i.var = 3:4, n.trees = best.iter)
#'
#' # Construct trivariate partial dependence plots
#' plot(gbm1, i.var = c(1, 2, 6), n.trees = best.iter,
#'      continuous.resolution = 20)
#' plot(gbm1, i.var = 1:3, n.trees = best.iter)
#' plot(gbm1, i.var = 2:4, n.trees = best.iter)
#' plot(gbm1, i.var = 3:5, n.trees = best.iter)
#'
#' # Add more (i.e., 100) boosting iterations to the ensemble
#' gbm2 <- gbm.more(gbm1, n.new.trees = 100, verbose = FALSE)
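#'
#' # Re-estimate the optimal number of iterations for the extended ensemble
#' # using the held-out test set
#' best.iter <- gbm.perf(gbm2, method = "test")
#' print(best.iter)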
gbm <- function(formula = formula(data), distribution = "bernoulli",
                data = list(), weights, var.monotone = NULL, n.trees = 100,
                interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.1,
                bag.fraction = 0.5, train.fraction = 1.0, cv.folds = 0,
                keep.data = TRUE, verbose = FALSE, class.stratify.cv = NULL,
                n.cores = NULL) {
  # Match the call to gbm
  mcall <- match.call()
  # Verbose output?
  lVerbose <- if (!is.logical(verbose)) {
    FALSE
  } else {
    verbose
  }
  # Construct model frame, terms object, weights, and offset
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "weights", "offset"), names(mf), 0)
  mf <- mf[c(1, m)]
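  # Keep missing values (gbm's trees handle NAs in the predictors) and
  # evaluate the call as a model.frame in the parent frame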
  mf$drop.unused.levels <- TRUE
  mf$na.action <- na.pass
  mf[[1]] <- as.name("model.frame")
  m <- mf
  mf <- eval(mf, parent.frame())
  Terms <- attr(mf, "terms")
  w <- model.weights(mf)
  offset <- model.offset(mf)
  y <- model.response(mf) # extract response values
  # Determine and check response distribution
  if (missing(distribution)) {
    # y <- data[, all.vars(formula)[1L], drop = TRUE]
    distribution <- guessDist(y)
  }
  if (is.character(distribution)) {
    distribution <- list(name = distribution)
  }
  if (!is.element(distribution$name, getAvailableDistributions())) {
    stop("Distribution ", distribution$name, " is not supported.")
  }
  if (distribution$name == "multinomial") {
    warning("Setting `distribution = \"multinomial\"` is ill-advised as it is ",
            "currently broken. It exists only for backwards compatibility. ",
            "Use at your own risk.", call. = FALSE)
  }
  # Construct data frame of predictor values
  var.names <- attributes(Terms)$term.labels
  x <- model.frame(terms(reformulate(var.names)), data = data,
                   na.action = na.pass)
  # Extract response name as a character string
  response.name <- as.character(formula[[2L]])
  # Stratify cross-validation by class (only for bernoulli and multinomial)
  class.stratify.cv <- getStratify(class.stratify.cv, d = distribution)
  # Groups (for pairwise distribution only)
  group <- NULL
  num.groups <- 0
  # Determine number of training instances
  if (distribution$name != "pairwise") {
    # Number of training instances
    nTrain <- floor(train.fraction * nrow(x))
  } else {
    # Sampling is by group, so we need to calculate them here
    distribution.group <- distribution[["group"]]
    if (is.null(distribution.group)) {
      stop(paste("For pairwise regression, `distribution` must be a list of",
                 "the form `list(name = \"pairwise\", group = c(\"date\",",
                 "\"session\", \"category\", \"keywords\"))`."))
    }
    # Check if group names are valid
    i <- match(distribution.group, colnames(data))
    if (any(is.na(i))) {
      stop("Group column does not occur in data: ",
           distribution.group[is.na(i)], ".")
    }
    # Construct group index
    group <- factor(
      do.call(paste, c(data[, distribution.group, drop = FALSE], sep = ":"))
    )
    # Check that weights are constant across groups
    if ((!missing(weights)) && (!is.null(weights))) {
      w.min <- tapply(w, INDEX = group, FUN = min)
      w.max <- tapply(w, INDEX = group, FUN = max)
      if (any(w.min != w.max)) {
        stop("For `distribution = \"pairwise\"`, all instances for the same ",
             "group must have the same weight.")
      }
      w <- w * length(w.min) / sum(w.min) # normalize across groups
    }
    # Shuffle groups to remove bias when split into train/test sets and/or CV
    # folds
    perm.levels <- levels(group)[sample(1:nlevels(group))]
    group <- factor(group, levels = perm.levels)
    # The C function expects instances to be sorted by group and descending by
    # target
    ord.group <- order(group, -y)
    group <- group[ord.group]
    y <- y[ord.group]
    x <- x[ord.group, , drop = FALSE]
    w <- w[ord.group]
    # Split into train and validation sets at group boundary
    num.groups.train <- max(1, round(train.fraction * nlevels(group)))
    # Include all groups up to the num.groups.train
    nTrain <- max(which(group == levels(group)[num.groups.train]))
    Misc <- group
  }
  # Set up for k-fold cross-validation
  cv.error <- NULL
  # FIXME: Is there a better way to handle this?
  if (cv.folds == 1) {
    cv.folds <- 0 # o/w, an uninformative error is thrown
  }
  if (cv.folds > 1) {
    cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain,
                              n.cores = n.cores,
                              class.stratify.cv = class.stratify.cv,
                              data = data, x = x, y = y, offset = offset,
                              distribution = distribution, w = w,
                              var.monotone = var.monotone, n.trees = n.trees,
                              interaction.depth = interaction.depth,
                              n.minobsinnode = n.minobsinnode,
                              shrinkage = shrinkage,
                              bag.fraction = bag.fraction,
                              var.names = var.names,
                              response.name = response.name, group = group)
    cv.error <- cv.results$error
    p <- cv.results$predictions
  }
  # Fit a GBM
  gbm.obj <- gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
                     w = w, var.monotone = var.monotone, n.trees = n.trees,
                     interaction.depth = interaction.depth,
                     n.minobsinnode = n.minobsinnode, shrinkage = shrinkage,
                     bag.fraction = bag.fraction, nTrain = nTrain,
                     keep.data = keep.data, verbose = lVerbose,
                     var.names = var.names, response.name = response.name,
                     group = group)
  # Attach further components
  gbm.obj$train.fraction <- train.fraction
  gbm.obj$Terms <- Terms
  gbm.obj$cv.error <- cv.error
  gbm.obj$cv.folds <- cv.folds
  gbm.obj$call <- mcall
  gbm.obj$m <- m
  if (cv.folds > 1) { # FIXME: Was previously `cv.folds > 0`?
    gbm.obj$cv.fitted <- p
  }
  if (distribution$name == "pairwise") {
    # Data has been reordered according to queries. We need to permute the
    # fitted values so that they correspond to the original order.
    gbm.obj$ord.group <- ord.group
    gbm.obj$fit <- gbm.obj$fit[order(ord.group)]
  }
  # Return "gbm" object
  gbm.obj
}