# R/gbm.R


#' Generalized Boosted Regression Modeling (GBM)
#' 
#' Fits generalized boosted regression models. For technical details, see the 
#' vignette: \code{utils::browseVignettes("gbm")}.
#' 
#' \code{gbm.fit} provides the link between R and the C++ gbm engine.
#' \code{gbm} is a front-end to \code{gbm.fit} that uses the familiar R
#' modeling formulas. However, \code{\link[stats]{model.frame}} is very slow if
#' there are many predictor variables. For power users with many variables, use
#' \code{gbm.fit}; for general practice, \code{gbm} is preferable.
#' 
#' @param formula A symbolic description of the model to be fit. The formula
#' may include an offset term (e.g., \code{y ~ offset(n) + x}). If
#' \code{keep.data = FALSE} in the initial call to \code{gbm} then it is the 
#' user's responsibility to resupply the offset to \code{\link{gbm.more}}.
#'   
#' @param distribution Either a character string specifying the name of the
#' distribution to use or a list with a component \code{name} specifying the
#' distribution and any additional parameters needed. If not specified,
#' \code{gbm} will try to guess: if the response has only 2 unique values,
#' \code{"bernoulli"} is assumed; otherwise, if the response is a factor,
#' \code{"multinomial"} is assumed; otherwise, if the response has class
#' \code{"Surv"}, \code{"coxph"} is assumed; otherwise, \code{"gaussian"} is
#' assumed.
#' 
#' Currently available options are \code{"gaussian"} (squared error),
#' \code{"laplace"} (absolute loss), \code{"tdist"} (t-distribution loss),
#' \code{"bernoulli"} (logistic regression for 0-1 outcomes), 
#' \code{"huberized"} (huberized hinge loss for 0-1 outcomes),
#' \code{"adaboost"} (the AdaBoost exponential loss for 0-1 outcomes),
#' \code{"poisson"} (count outcomes), \code{"coxph"} (right censored 
#' observations), \code{"quantile"}, or \code{"pairwise"} (ranking measure 
#' using the LambdaMart algorithm).
#' 
#' If quantile regression is specified, \code{distribution} must be a list of
#' the form \code{list(name = "quantile", alpha = 0.25)} where \code{alpha} is 
#' the quantile to estimate. The current version's quantile regression method 
#' does not handle non-constant weights and will stop.
#' 
#' If \code{"tdist"} is specified, the default degrees of freedom is 4 and 
#' this can be controlled by specifying 
#' \code{distribution = list(name = "tdist", df = DF)} where \code{DF} is your 
#' chosen degrees of freedom.
#' 
#' If \code{"pairwise"} regression is specified, \code{distribution} must be a
#' list of the form
#' \code{list(name = "pairwise", group = ..., metric = ..., max.rank = ...)}
#' (\code{metric} and \code{max.rank} are optional, see below). \code{group} is
#' a character vector with the column names of \code{data} that jointly
#' indicate the group an instance belongs to (typically a query in Information
#' Retrieval applications). For training, only pairs of instances from the same
#' group and with different target labels can be considered. \code{metric} is
#' the IR measure to use, one of 
#' \describe{ 
#'   \item{\code{conc}}{Fraction of concordant pairs; for binary labels, this 
#'         is equivalent to the Area under the ROC Curve.}
#'   \item{\code{mrr}}{Mean reciprocal rank of the highest-ranked positive 
#'         instance.}
#'   \item{\code{map}}{Mean average precision, a generalization of \code{mrr} 
#'         to multiple positive instances.}
#'   \item{\code{ndcg}}{Normalized discounted cumulative gain. The score is 
#'         the weighted sum (DCG) of the user-supplied target values, weighted 
#'         by log(rank+1), and normalized to the maximum achievable value. This 
#'         is the default if the user did not specify a metric.} 
#' }
#' 
#' \code{ndcg} and \code{conc} allow arbitrary target values, while binary
#' targets \{0,1\} are expected for \code{map} and \code{mrr}. For \code{ndcg}
#' and \code{mrr}, a cut-off can be chosen using a positive integer parameter
#' \code{max.rank}. If left unspecified, all ranks are taken into account.
#' 
#' Note that splitting of instances into training and validation sets follows
#' group boundaries and therefore only approximates the specified
#' \code{train.fraction} ratio (the same applies to cross-validation folds).
#' Internally, queries are randomly shuffled before training, to avoid bias.
#' 
#' Weights can be used in conjunction with pairwise metrics; however, it is
#' assumed that they are constant for instances from the same group.
#' 
#' For details and background on the algorithm, see e.g. Burges (2010).
#' 
#' @param data an optional data frame containing the variables in the model. By
#' default the variables are taken from \code{environment(formula)}, typically
#' the environment from which \code{gbm} is called. If \code{keep.data=TRUE} in
#' the initial call to \code{gbm} then \code{gbm} stores a copy with the
#' object. If \code{keep.data = FALSE} then it is the user's responsibility to
#' resupply the same dataset to subsequent calls to \code{\link{gbm.more}}.
#' 
#' @param weights an optional vector of weights to be used in the fitting
#' process. Must be positive but do not need to be normalized. If
#' \code{keep.data=FALSE} in the initial call to \code{gbm} then it is the
#' user's responsibility to resupply the weights to \code{\link{gbm.more}}.
#' 
#' @param var.monotone an optional vector, the same length as the number of
#' predictors, indicating which variables have a monotone increasing (+1),
#' decreasing (-1), or arbitrary (0) relationship with the outcome.
#' 
#' @param n.trees Integer specifying the total number of trees to fit. This is 
#' equivalent to the number of iterations and the number of basis functions in 
#' the additive expansion. Default is 100.
#' 
#' @param interaction.depth Integer specifying the maximum depth of each tree 
#' (i.e., the highest level of variable interactions allowed). A value of 1 
#' implies an additive model, a value of 2 implies a model with up to 2-way 
#' interactions, etc. Default is 1.
#' 
#' @param n.minobsinnode Integer specifying the minimum number of observations 
#' in the terminal nodes of the trees. Note that this is the actual number of 
#' observations, not the total weight. Default is 10.
#' 
#' @param shrinkage a shrinkage parameter applied to each tree in the
#' expansion. Also known as the learning rate or step-size reduction; 0.001 to 
#' 0.1 usually work, but a smaller learning rate typically requires more trees.
#' Default is 0.1.
#' 
#' @param bag.fraction the fraction of the training set observations randomly
#' selected to propose the next tree in the expansion. This introduces
#' randomness into the model fit. If \code{bag.fraction} < 1 then running the
#' same model twice will result in similar but different fits. \code{gbm} uses
#' the R random number generator so \code{set.seed} can ensure that the model
#' can be reconstructed. Preferably, the user can save the returned
#' \code{\link{gbm.object}} using \code{\link{save}}. Default is 0.5.
#' 
#' @param train.fraction The first \code{train.fraction * nrow(data)}
#' observations are used to fit the \code{gbm} and the remainder are used for
#' computing out-of-sample estimates of the loss function.
#' 
#' @param cv.folds Number of cross-validation folds to perform. If
#' \code{cv.folds} > 1 then \code{gbm}, in addition to the usual fit, will
#' perform cross-validation and calculate an estimate of generalization error,
#' which is returned in \code{cv.error}.
#' 
#' @param keep.data a logical variable indicating whether to keep the data and
#' an index of the data stored with the object. Keeping the data and index
#' makes subsequent calls to \code{\link{gbm.more}} faster at the cost of
#' storing an extra copy of the dataset.
#' 
#' @param verbose Logical indicating whether or not to print out progress and 
#' performance indicators (\code{TRUE}). If this option is left unspecified for 
#' \code{gbm.more}, then it uses \code{verbose} from \code{object}. Default is
#' \code{FALSE}.
#' 
#' @param class.stratify.cv Logical indicating whether or not the 
#' cross-validation should be stratified by class. Defaults to \code{TRUE} for
#' \code{distribution = "multinomial"} and is only implemented for
#' \code{"multinomial"} and \code{"bernoulli"}. The purpose of stratifying the
#' cross-validation is to help avoid situations in which training sets do
#' not contain all classes.
#' 
#' @param n.cores The number of CPU cores to use. The cross-validation loop
#' will attempt to send different CV folds off to different cores. If
#' \code{n.cores} is not specified by the user, it is guessed using the
#' \code{detectCores} function in the \code{parallel} package. Note that the
#' documentation for \code{detectCores} makes clear that it is not failsafe and
#' could return a spurious number of available cores.
#' 
#' @return A \code{\link{gbm.object}} object.
#' 
#' @details 
#' This package implements the generalized boosted modeling framework. Boosting
#' is the process of iteratively adding basis functions in a greedy fashion so
#' that each additional basis function further reduces the selected loss
#' function. This implementation closely follows Friedman's Gradient Boosting
#' Machine (Friedman, 2001).
#' 
#' In addition to many of the features documented in the Gradient Boosting
#' Machine, \code{gbm} offers additional features including the out-of-bag
#' estimator for the optimal number of iterations, the ability to store and
#' manipulate the resulting \code{gbm} object, and a variety of other loss
#' functions that had not previously had associated boosting algorithms,
#' including the Cox partial likelihood for censored data, the Poisson
#' likelihood for count outcomes, and a gradient boosting implementation to
#' minimize the AdaBoost exponential loss function. This gbm package is no
#' longer under active development; consider
#' \url{https://github.com/gbm-developers/gbm3} for the latest version.
#' 
#' @author Greg Ridgeway \email{gregridgeway@@gmail.com}
#' 
#' Quantile regression code developed by Brian Kriegler
#' \email{bk@@stat.ucla.edu}
#' 
#' t-distribution and multinomial code developed by Harry Southworth and
#' Daniel Edwards
#' 
#' Pairwise code developed by Stefan Schroedl \email{schroedl@@a9.com}
#' 
#' @seealso \code{\link{gbm.object}}, \code{\link{gbm.perf}}, 
#' \code{\link{plot.gbm}}, \code{\link{predict.gbm}}, \code{\link{summary.gbm}}, 
#' and \code{\link{pretty.gbm.tree}}.
#' 
#' @references 
#' Y. Freund and R.E. Schapire (1997) \dQuote{A decision-theoretic
#' generalization of on-line learning and an application to boosting,}
#' \emph{Journal of Computer and System Sciences,} 55(1):119-139.
#' 
#' G. Ridgeway (1999). \dQuote{The state of boosting,} \emph{Computing Science
#' and Statistics} 31:172-181.
#' 
#' J.H. Friedman, T. Hastie, R. Tibshirani (2000). \dQuote{Additive Logistic
#' Regression: a Statistical View of Boosting,} \emph{Annals of Statistics}
#' 28(2):337-374.
#' 
#' J.H. Friedman (2001). \dQuote{Greedy Function Approximation: A Gradient
#' Boosting Machine,} \emph{Annals of Statistics} 29(5):1189-1232.
#' 
#' J.H. Friedman (2002). \dQuote{Stochastic Gradient Boosting,}
#' \emph{Computational Statistics and Data Analysis} 38(4):367-378.
#' 
#' B. Kriegler (2007). Cost-Sensitive Stochastic Gradient Boosting Within a 
#' Quantitative Regression Framework. Ph.D. Dissertation. University of 
#' California at Los Angeles, Los Angeles, CA, USA. Advisor(s) Richard A. Berk. 
#' \url{https://dl.acm.org/doi/book/10.5555/1354603}.
#' 
#' C. Burges (2010). \dQuote{From RankNet to LambdaRank to LambdaMART: An
#' Overview,} Microsoft Research Technical Report MSR-TR-2010-82.
#'
#' @export
#' 
#' @examples
#' #
#' # A least squares regression example 
#' #
#' 
#' # Simulate data
#' set.seed(101)  # for reproducibility
#' N <- 1000
#' X1 <- runif(N)
#' X2 <- 2 * runif(N)
#' X3 <- ordered(sample(letters[1:4], N, replace = TRUE), levels = letters[4:1])
#' X4 <- factor(sample(letters[1:6], N, replace = TRUE))
#' X5 <- factor(sample(letters[1:3], N, replace = TRUE))
#' X6 <- 3 * runif(N) 
#' mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
#' SNR <- 10  # signal-to-noise ratio
#' Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu
#' sigma <- sqrt(var(Y) / SNR)
#' Y <- Y + rnorm(N, 0, sigma)
#' X1[sample(1:N, size = 500)] <- NA  # introduce some missing values
#' X4[sample(1:N, size = 300)] <- NA  # introduce some missing values
#' data <- data.frame(Y, X1, X2, X3, X4, X5, X6)
#' 
#' # Fit a GBM
#' set.seed(102)  # for reproducibility
#' gbm1 <- gbm(Y ~ ., data = data, var.monotone = c(0, 0, 0, 0, 0, 0),
#'             distribution = "gaussian", n.trees = 100, shrinkage = 0.1,             
#'             interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,  
#'             n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE, 
#'             verbose = FALSE, n.cores = 1)  
#' 
#' # Check performance using the out-of-bag (OOB) error; the OOB error typically
#' # underestimates the optimal number of iterations
#' best.iter <- gbm.perf(gbm1, method = "OOB")
#' print(best.iter)
#' 
#' # Check performance using the 50% heldout test set
#' best.iter <- gbm.perf(gbm1, method = "test")
#' print(best.iter)
#' 
#' # Check performance using 5-fold cross-validation
#' best.iter <- gbm.perf(gbm1, method = "cv")
#' print(best.iter)
#' 
#' # Plot relative influence of each variable
#' par(mfrow = c(1, 2))
#' summary(gbm1, n.trees = 1)          # using first tree
#' summary(gbm1, n.trees = best.iter)  # using estimated best number of trees
#' 
#' # Compactly print the first and last trees for curiosity
#' print(pretty.gbm.tree(gbm1, i.tree = 1))
#' print(pretty.gbm.tree(gbm1, i.tree = gbm1$n.trees))
#' 
#' # Simulate new data
#' set.seed(103)  # for reproducibility
#' N <- 1000
#' X1 <- runif(N)
#' X2 <- 2 * runif(N)
#' X3 <- ordered(sample(letters[1:4], N, replace = TRUE))
#' X4 <- factor(sample(letters[1:6], N, replace = TRUE))
#' X5 <- factor(sample(letters[1:3], N, replace = TRUE))
#' X6 <- 3 * runif(N) 
#' mu <- c(-1, 0, 1, 2)[as.numeric(X3)]
#' Y <- X1 ^ 1.5 + 2 * (X2 ^ 0.5) + mu + rnorm(N, 0, sigma)
#' data2 <- data.frame(Y, X1, X2, X3, X4, X5, X6)
#' 
#' # Predict on the new data using the "best" number of trees; by default,
#' # predictions will be on the link scale
#' Yhat <- predict(gbm1, newdata = data2, n.trees = best.iter, type = "link")
#' 
#' # least squares error
#' print(sum((data2$Y - Yhat)^2))
#' 
#' # Construct univariate partial dependence plots
#' plot(gbm1, i.var = 1, n.trees = best.iter)
#' plot(gbm1, i.var = 2, n.trees = best.iter)
#' plot(gbm1, i.var = "X3", n.trees = best.iter)  # can use index or name
#' 
#' # Construct bivariate partial dependence plots
#' plot(gbm1, i.var = 1:2, n.trees = best.iter)
#' plot(gbm1, i.var = c("X2", "X3"), n.trees = best.iter)
#' plot(gbm1, i.var = 3:4, n.trees = best.iter)
#' 
#' # Construct trivariate partial dependence plots
#' plot(gbm1, i.var = c(1, 2, 6), n.trees = best.iter, 
#'      continuous.resolution = 20)
#' plot(gbm1, i.var = 1:3, n.trees = best.iter)
#' plot(gbm1, i.var = 2:4, n.trees = best.iter)
#' plot(gbm1, i.var = 3:5, n.trees = best.iter)
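#' 
#' # A hedged sketch (not run): supplying distribution as a list, e.g. quantile
#' # regression at the 25th percentile or t-distribution loss with 6 degrees of
#' # freedom; the tuning values here are illustrative only
#' \dontrun{
#' gbm.q25 <- gbm(Y ~ ., data = data,
#'                distribution = list(name = "quantile", alpha = 0.25),
#'                n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
#' gbm.t6 <- gbm(Y ~ ., data = data,
#'               distribution = list(name = "tdist", df = 6),
#'               n.trees = 100, interaction.depth = 3, shrinkage = 0.1)
#' 
#' # An offset term can be written directly into the formula; df.counts and its
#' # columns (counts, exposure, x1, x2) are hypothetical
#' gbm.pois <- gbm(counts ~ offset(log(exposure)) + x1 + x2, data = df.counts,
#'                 distribution = "poisson", n.trees = 100)
#' }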
#' 
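#' # A hedged sketch (not run): a pairwise (LambdaMART) ranking fit; ranking.data,
#' # query, relevance, and the feature columns are hypothetical and only
#' # illustrate the list form of distribution for grouped ranking problems
#' \dontrun{
#' gbm.rank <- gbm(relevance ~ feature1 + feature2 + feature3,
#'                 data = ranking.data,
#'                 distribution = list(name = "pairwise", group = "query",
#'                                     metric = "ndcg", max.rank = 10),
#'                 n.trees = 500, shrinkage = 0.05, interaction.depth = 3)
#' }
#' 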
#' # Add more (i.e., 100) boosting iterations to the ensemble
#' gbm2 <- gbm.more(gbm1, n.new.trees = 100, verbose = FALSE)
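#' 
#' # The extended ensemble now holds 100 + 100 = 200 trees; a short follow-up
#' # sketch recomputing test-set predictions with all 200 trees
#' Yhat2 <- predict(gbm2, newdata = data2, n.trees = 200)
#' print(sum((data2$Y - Yhat2)^2))  # compare with the earlier least squares error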
gbm <- function(formula = formula(data), distribution = "bernoulli", 
                data = list(), weights, var.monotone = NULL, n.trees = 100,
                interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.1,
                bag.fraction = 0.5, train.fraction = 1.0, cv.folds = 0,
                keep.data = TRUE, verbose = FALSE, class.stratify.cv = NULL,
                n.cores = NULL) {
  
  # Match the call to gbm
  mcall <- match.call()
  
  # Verbose output?
  lVerbose <- if (!is.logical(verbose)) { 
    FALSE 
  } else { 
    verbose 
  }
  
  # Construct model frame, terms object, weights, and offset
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "weights", "offset"), names(mf), 0)
  mf <- mf[c(1, m)]
  mf$drop.unused.levels <- TRUE
  mf$na.action <- na.pass
  mf[[1]] <- as.name("model.frame")
  m <- mf
  mf <- eval(mf, parent.frame())
  Terms <- attr(mf, "terms")
  w <- model.weights(mf)
  offset <- model.offset(mf)
  y <- model.response(mf)  # extract response values
  
  # Determine and check response distribution
  if (missing(distribution)) {
    # y <- data[, all.vars(formula)[1L], drop = TRUE]
    distribution <- guessDist(y)
  }
  if (is.character(distribution)) { 
    distribution <- list(name = distribution) 
  } 
  if (!is.element(distribution$name, getAvailableDistributions())) {
    stop("Distribution ", distribution$name, " is not supported.")
  }
  if (distribution$name == "multinomial") {
    warning("Setting `distribution = \"multinomial\"` is ill-advised as it is ",
            "currently broken. It exists only for backwards compatibility. ",
            "Use at your own risk.", call. = FALSE)
  }
  
  # Construct data frame of predictor values
  var.names <- attributes(Terms)$term.labels
  x <- model.frame(terms(reformulate(var.names)), data = data, 
                   na.action = na.pass)
  
  # Extract response name as a character string
  response.name <- as.character(formula[[2L]])
  
  # Stratify cross-validation by class (only for bernoulli and multinomial)
  class.stratify.cv <- getStratify(class.stratify.cv, d = distribution)
  
  # Groups (for pairwise distribution only)
  group <- NULL
  num.groups <- 0
  
  # Determine number of training instances
  if (distribution$name != "pairwise") {
    
    # Number of training instances
    nTrain <- floor(train.fraction * nrow(x))
    
  } else {
    
    # Sampling is by group, so we need to calculate them here
    distribution.group <- distribution[["group"]]
    if (is.null(distribution.group)) {
      stop(paste("For pairwise regression, `distribution` must be a list of",
                 "the form `list(name = \"pairwise\", group = c(\"date\",",
                 "\"session\", \"category\", \"keywords\"))`."))
    }
    
    # Check if group names are valid
    i <- match(distribution.group, colnames(data))
    if (any(is.na(i))) {
      stop("Group column does not occur in data: ", 
           distribution.group[is.na(i)], ".")
    }
    
    # Construct group index
    group <- factor(
      do.call(paste, c(data[, distribution.group, drop = FALSE], sep = ":"))
    )
    
    # Check that weights are constant across groups
    if ((!missing(weights)) && (!is.null(weights))) {
      w.min <- tapply(w, INDEX = group, FUN = min)
      w.max <- tapply(w, INDEX = group, FUN = max)
      if (any(w.min != w.max)) {
        stop("For `distribution = \"pairwise\"`, all instances for the same ",
             "group must have the same weight.")
      }
      w <- w * length(w.min) / sum(w.min)  # normalize across groups
    }
    
    # Shuffle groups to remove bias when split into train/test sets and/or CV 
    # folds
    perm.levels  <- levels(group)[sample(1:nlevels(group))]
    group <- factor(group, levels = perm.levels)
    
    # The C function expects instances to be sorted by group and descending by 
    # target
    ord.group <- order(group, -y)
    group <- group[ord.group]
    y <- y[ord.group]
    x <- x[ord.group, , drop = FALSE]
    w <- w[ord.group]
    
    # Split into train and validation sets at group boundary
    num.groups.train <- max(1, round(train.fraction * nlevels(group)))
    
    # Include all groups up to the num.groups.train
    nTrain <- max(which(group == levels(group)[num.groups.train]))
    Misc <- group
    
  }

  # Set up for k-fold cross-validation
  cv.error <- NULL
  # FIXME: Is there a better way to handle this?
  if (cv.folds == 1) {
    cv.folds <- 0  # o/w, an uninformative error is thrown
  }
  if(cv.folds > 1) {
    cv.results <- gbmCrossVal(cv.folds = cv.folds, nTrain = nTrain, 
                              n.cores = n.cores, 
                              class.stratify.cv = class.stratify.cv, 
                              data = data, x = x, y = y, offset = offset, 
                              distribution = distribution, w = w, 
                              var.monotone = var.monotone, n.trees = n.trees, 
                              interaction.depth = interaction.depth, 
                              n.minobsinnode = n.minobsinnode,
                              shrinkage = shrinkage, 
                              bag.fraction = bag.fraction, 
                              var.names = var.names, 
                              response.name = response.name, group = group)
    cv.error <- cv.results$error
    p <- cv.results$predictions
  }
  
  # Fit a GBM
  gbm.obj <- gbm.fit(x = x, y = y, offset = offset, distribution = distribution,
                     w = w, var.monotone = var.monotone, n.trees = n.trees,
                     interaction.depth = interaction.depth, 
                     n.minobsinnode = n.minobsinnode, shrinkage = shrinkage,
                     bag.fraction = bag.fraction, nTrain = nTrain,
                     keep.data = keep.data, verbose = lVerbose,
                     var.names = var.names, response.name = response.name,
                     group = group)
  
  # Attach further components
  gbm.obj$train.fraction <- train.fraction
  gbm.obj$Terms <- Terms
  gbm.obj$cv.error <- cv.error
  gbm.obj$cv.folds <- cv.folds
  gbm.obj$call <- mcall
  gbm.obj$m <- m
  if (cv.folds > 1) {  # FIXME: Was previously `cv.folds > 0`?
    gbm.obj$cv.fitted <- p 
  }
  if (distribution$name == "pairwise") {
    # Data has been reordered according to queries. We need to permute the 
    # fitted values so that they correspond to the original order.
    gbm.obj$ord.group <- ord.group
    gbm.obj$fit <- gbm.obj$fit[order(ord.group)]
  }
  
  # Return "gbm" object
  gbm.obj
  
}