R/Lorenz-bootstrap.R

#' Bootstrap for the (penalized) Lorenz regression
#'
#' \code{Lorenz.boot} performs bootstrap estimation for the vector of coefficients of the single-index model, the explained Gini coefficient, and the Lorenz-\eqn{R^2}. In the penalized case, it also provides a selection method.
#'
#' @param object An object of class \code{"LR"} or \code{"PLR"}, i.e., the output of a call to \code{\link{Lorenz.Reg}}.
#' @param R An integer specifying the number of bootstrap replicates.
#' @param boot_out_only A logical value indicating whether the function should return only the raw bootstrap output. This advanced feature can help save computation time in specific use cases. See Details.
#' @param store_LC A logical determining whether the ordinates of the explained Lorenz curve should be stored for each bootstrap sample. The default is \code{FALSE} since storing them may produce large objects. If set to \code{TRUE}, the ordinates are stored and plots of the explained Lorenz curve include confidence bands; see \code{\link{plot.LR}} and \code{\link{plot.PLR}}.
#' @param show_progress A logical. If \code{TRUE} (default), a progress bar is displayed during bootstrap computation. Set to \code{FALSE} to disable it. Progress is not shown when parallel computing is used.
#' @param ... Additional arguments passed to either the bootstrap function \code{\link[boot]{boot}} from the \pkg{boot} package or the underlying fit functions (\code{\link{Lorenz.GA}}, \code{\link{Lorenz.FABS}}, or \code{\link{Lorenz.SCADFABS}}). By default, the fit function uses the same parameters as in the original call to \code{Lorenz.Reg}, but these can be overridden by explicitly passing them in \code{...}.
#'
#' @return An object of class \code{c("LR_boot", "LR")} or \code{c("PLR_boot", "PLR")}, depending on whether a non-penalized or penalized regression was fitted.
#'
#' The methods \code{\link{confint.LR}} and \code{\link{confint.PLR}} can be used on objects of class \code{"LR_boot"} or \code{"PLR_boot"} to construct confidence intervals for the model parameters.
#'
#' In both the penalized and non-penalized cases, the returned object is a list containing:
#' \describe{
#'    \item{\code{theta}}{The estimated vector of parameters. In the penalized case, this is a matrix where each row corresponds to a different selection method (e.g., BIC, bootstrap, cross-validation).}
#'    \item{\code{Gi.expl}}{The estimated explained Gini coefficient. In the penalized case, this is a vector, where each element corresponds to a different selection method.}
#'    \item{\code{LR2}}{The Lorenz-\eqn{R^2} of the regression. In the penalized case, this is a vector, where each element corresponds to a different selection method.}
#'    \item{\code{boot_out}}{An object of class \code{"boot"} containing the raw bootstrap output.}
#' }
#' For the penalized Lorenz regression, the list additionally contains:
#' \describe{
#'    \item{\code{path}}{See \code{\link{Lorenz.Reg}} for the original path. The out-of-bag (OOB) score is added.}
#'    \item{\code{lambda.idx}}{A vector indicating the index of the optimal lambda obtained by each selection method.}
#'    \item{\code{grid.idx}}{A vector indicating the index of the optimal grid parameter obtained by each selection method.}
#' }
#' Note: In the penalized case, the returned object may have additional classes such as \code{"PLR_cv"} if cross-validation was performed and used for selection.
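#'
#' As a quick illustration (using the \code{PLR_boot} object created in the examples below), these components can be inspected directly:
#' \preformatted{
#' PLR_boot$lambda.idx       # index of the optimal lambda for each selection method
#' PLR_boot$grid.idx         # index of the optimal grid parameter for each selection method
#' class(PLR_boot$boot_out)  # "boot": the raw bootstrap output
#' }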
#'
#' @details
#' The function supports parallel computing in two ways (see the sketch after this list):
#' \enumerate{
#'   \item Using the built-in parallelization options of \code{\link[boot]{boot}}, which can be controlled via the \code{...} arguments such as \code{parallel}, \code{ncpus}, and \code{cl}.
#'   \item Running multiple independent instances of \code{Lorenz.boot()}, each handling a subset of the bootstrap samples. In this case, setting \code{boot_out_only = TRUE} ensures that the function only returns the raw bootstrap results. These results can be merged using \code{\link{Lorenz.boot.combine}}.
#' }
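#'
#' A minimal sketch of both approaches (the number of cores and the exact call to \code{\link{Lorenz.boot.combine}} are assumptions; see the respective help pages):
#' \preformatted{
#' ## (1) Parallelization handled by boot() ("multicore" on Unix-alikes; "snow" otherwise)
#' PLR_boot <- Lorenz.boot(PLR, R = 200, parallel = "multicore", ncpus = 2)
#'
#' ## (2) Independent runs merged afterwards
#' part1 <- Lorenz.boot(PLR, R = 100, boot_out_only = TRUE)
#' part2 <- Lorenz.boot(PLR, R = 100, boot_out_only = TRUE)
#' PLR_boot <- Lorenz.boot.combine(list(part1, part2))
#' }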
#'
#' \strong{Handling of additional arguments (\code{...}):}
#' The function accepts two types of arguments through \code{...} (see the sketch after this list):
#' \itemize{
#'   \item Arguments for \code{\link[boot]{boot}}, used to control the bootstrap procedure.
#'   \item Arguments for the underlying fit functions (\code{\link{Lorenz.GA}}, \code{\link{Lorenz.FABS}}, or \code{\link{Lorenz.SCADFABS}}). By default, the function retrieves these parameters from the original \code{Lorenz.Reg} call. However, users can override them by explicitly specifying new values in \code{...}.
#' }
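#'
#' For instance, both types can be mixed in a single call. In the sketch below, \code{parallel} and \code{ncpus} are passed to \code{boot}, while \code{popSize} is passed to \code{Lorenz.GA} (this particular override is merely illustrative):
#' \preformatted{
#' NPLR_boot <- Lorenz.boot(NPLR, R = 50,
#'                          parallel = "multicore", ncpus = 2,  # bootstrap arguments
#'                          popSize = 60)                       # fit-function argument
#' }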
#'
#' @seealso \code{\link{Lorenz.Reg}}, \code{\link{Lorenz.GA}}, \code{\link{Lorenz.SCADFABS}}, \code{\link{Lorenz.FABS}}, \code{\link{PLR.CV}}, \code{\link[boot]{boot}}
#'
#' @section References:
#' Heuchenne, C. and A. Jacquemain (2022). Inference for monotone single-index conditional means: A Lorenz regression approach. \emph{Computational Statistics & Data Analysis 167(C)}.
#'
#' Jacquemain, A., C. Heuchenne, and E. Pircalabelu (2024). A penalized bootstrap estimation procedure for the explained Gini coefficient. \emph{Electronic Journal of Statistics 18(1), 247-300}.
#'
#' @examples
#' \dontshow{
#' utils::example(Lorenz.Reg, echo = FALSE)
#' }
#' # Non-penalized regression example (not run due to execution time)
#' \dontrun{
#' set.seed(123)
#' NPLR_boot <- Lorenz.boot(NPLR, R = 30)
#' confint(NPLR_boot) # Confidence intervals
#' summary(NPLR_boot)
#' }
#'
#' # Penalized regression example:
#' set.seed(123)
#' PLR_boot <- Lorenz.boot(PLR, R = 20)
#' print(PLR_boot)
#' summary(PLR_boot)
#' coef(PLR_boot, pars.idx = "Boot")
#' predict(PLR_boot, pars.idx = "Boot")
#' plot(PLR_boot)
#' plot(PLR_boot, type = "diagnostic")
#'
#' # Confidence intervals for different selection methods:
#' confint(PLR_boot, pars.idx = "BIC")  # Using BIC-selected tuning parameters
#' confint(PLR_boot, pars.idx = "Boot") # Using bootstrap-selected tuning parameters
#'
#' @importFrom boot boot
#' @importFrom stats setNames
#' @importFrom utils modifyList
#' @importFrom progress progress_bar
#'
#' @export

Lorenz.boot <- function(object, R, boot_out_only = FALSE, store_LC = FALSE, show_progress = TRUE, ...){

  # 0. Checks ----
  if(!inherits(object,c("LR","PLR"))) stop("object must be the output of a (penalized) Lorenz regression.")

  if(inherits(object,"PLR")){
    method <- "PLR"
  }else{
    method <- "LR"
  }

  args <- list(...)

  object$store_LC <- store_LC

  # 1. Arguments of the bootstrap ----
  data <- cbind(object$y, object$x)
  boot_args <- args[names(args) %in% names(formals(boot))]

  if(object$penalty == "none"){
    fit_fun <- Lorenz.GA
  }else{
    fit_fun <- PLR.fit
    PLR_args <- list("penalty"=object$penalty, "grid.arg"=object$grid.arg,
                     "grid.value"=object$grid.value, "lambda.list"=object$lambda.list)
  }
  fit_formals <- switch(object$penalty,
                        "none" = names(formals(fit_fun)),
                        "LASSO" = names(formals(Lorenz.FABS)),
                        "SCAD" = names(formals(Lorenz.SCADFABS)))
  fit_args <- modifyList(object$fit_args, args[names(args) %in% fit_formals])
  if(object$penalty != "none") fit_args <- c(fit_args, PLR_args)
  if(object$penalty == "none") fit_args$parallel.GA <- FALSE

  # 2. statistic in boot() ----
  boot.f <- function(data, indices, prog){
    # Construction similar to the "Boot" function in library "car".
    # We want to avoid recomputation on the original sample
    first <- all(indices == seq(length(indices)))
    if(first){
      result <- object
    }else{
      x.boot <- data[indices,-1,drop=FALSE]
      y.boot <- data[indices,1]
      if(!is.null(object$weights)){
        w.boot <- object$weights[indices]
      }else{
        w.boot <- NULL
      }
      if(method == "PLR"){
        x.oob <- data[-unique(indices),-1,drop=FALSE]
        y.oob <- data[-unique(indices),1]
        if(!is.null(object$weights)){
          w.oob <- object$weights[-unique(indices)]
        }else{
          w.oob <- NULL
        }
      }
      boot.LR <- do.call(fit_fun, c(list(y = y.boot, x = x.boot, weights = w.boot), fit_args))
      if(method == "PLR"){
        # With penalized regression, the algorithm may stop earlier than on the original sample.
        # The bootstrap paths would then be shorter and the resulting objects would not have the same size.
        compare.paths <- function(path.long,path.short){
          lth.diff <- ncol(path.long) - ncol(path.short)
          if(lth.diff > 0) path.short <- cbind(path.short,replicate(lth.diff,path.short[,ncol(path.short)]))
          return(path.short)
        }
        boot.LR$path <- lapply(1:length(object$path),function(i)compare.paths(object$path[[i]],boot.LR$path[[i]]))
        # Computation of the OOB score
        theta.boot <- lapply(boot.LR$path,function(x)x[(nrow(x)-ncol(object$x)+1):nrow(x),])
        OOB.score <- PLR.scores(y.oob,x.oob,w.oob,theta.boot)
      }
      result <- boot.LR
    }
    # All objects that require bootstrapping are stacked in a vector
    if(method == "LR"){
      boot.vec <- c("Gi.expl"=result$Gi.expl,
                    "LR2"=result$LR2,
                    result$theta)
    }else{
      Gi.vec <- unlist(sapply(result$path,function(x)x["Explained Gini",]))
      LR2.vec <- unlist(sapply(result$path,function(x)x["Lorenz-R2",]))
      if(first){
        OOB.vec <- rep(0,length(Gi.vec))
      }else{
        OOB.vec <- unlist(OOB.score)
      }
      boot.vec <- c(Gi.vec,LR2.vec,OOB.vec)
    }
    # If store_LC, we also add LC ordinates
    if (store_LC){
      pi <- seq(from = 0, to = 1, length.out = 100)
      if (method == "LR"){
        if (first){
          LC.vec <- Lorenz.curve(y = object$y, x = object$x%*%result$theta, weights = object$weights, ties.method = "mean")(pi)
        }else{
          LC.vec <- Lorenz.curve(y = y.boot, x = x.boot%*%result$theta, weights = w.boot, ties.method = "mean")(pi)
        }
      }else if (method == "PLR"){
        if (first){
          theta.list <- lapply(object$path,function(x)x[(nrow(x)-ncol(object$x)+1):nrow(x),])
          LC.list <- lapply(theta.list, function(theta.mat)apply(theta.mat,2,
                                                                 function(theta.vec)Lorenz.curve(y = object$y,
                                                                                                 x = object$x%*%theta.vec,
                                                                                                 weights = object$weights,
                                                                                                 ties.method = "mean")(pi)))
        }else{
          LC.list <- lapply(theta.boot, function(theta.mat)apply(theta.mat,2,
                                                                 function(theta.vec)Lorenz.curve(y = y.boot,
                                                                                                 x = x.boot%*%theta.vec,
                                                                                                 weights = w.boot,
                                                                                                 ties.method = "mean")(pi)))
        }
        LC.vec <- unlist(lapply(LC.list, function(x) as.vector(x)))
      }
      boot.vec <- c(boot.vec, LC.vec)
    }
    # Display progress
    if(show_progress){
      if(first){
        prog$tick(0)
      }else{
        prog$tick()
      }
    }
    # Return the vector of bootstrapped quantities
    return(boot.vec)

  }

  # 3. boot() ----
  if(show_progress){
    pb <- progress_bar$new(total = R)
  }else{
    pb <- NULL
  }
  boot_out <- do.call(boot, c(list(data = data, statistic = boot.f, R = R, prog = pb), boot_args))
  boot_out <- boot_out[!(names(boot_out) %in% c("statistic", "data", "call"))]
  class(boot_out) <- "boot"
  object$boot_out <- boot_out

  if(!boot_out_only){

    # 4. PLR specifics ----
    if(method == "PLR"){

      # Indices to retrieve info on all bootstrap elements
      path.sizes <- sapply(object$path,ncol)    # Number of lambda values for each grid_param
      path.size <- sum(path.sizes)              # Number of (lambda,grid_param) combinations
      lth.path <- length(path.sizes)            # Number of grid_param values

      # the OOB score is the mean of the OOB scores across the bootstrap samples
      idx_OOB <- (2*path.size + 1) : (3*path.size)
      OOB_matrix <- boot_out$t[,idx_OOB]
      OOB_total <- colMeans(OOB_matrix)
      # Adding OOB score to the path
      idx <- lapply(1:lth.path,function(i)(cumsum(path.sizes)-path.sizes+1)[i]:cumsum(path.sizes)[i])
      val.OOB <- lapply(idx,function(i)OOB_total[i])
      lth.theta <- ncol(object$x)
      lth <- nrow(object$path[[1]]) # Same for all anyway (what changes is ncol)
      for (i in 1:lth.path){
        path.tmp <- rbind(object$path[[i]][1:(lth-lth.theta),],
                          "OOB score" = val.OOB[[i]])
        object$path[[i]] <- rbind(path.tmp,
                                  object$path[[i]][(lth-lth.theta+1):lth,])
      }
      lth <- lth + 1
      # Best pair (grid,lambda) in terms of OOB score
      path.wl <- unlist(sapply(path.sizes,function(x)1:x))
      path.wt <- rep(1:lth.path,times=path.sizes)
      wl <- path.wl[which.max(OOB_total)]
      wt <- path.wt[which.max(OOB_total)]
      object$grid.idx <- c(object$grid.idx,"Boot"=wt)
      object$lambda.idx <- c(object$lambda.idx,"Boot"=wl)

    }

    # 5. Class ----
    if(method == "LR"){
      class(object) <- c("LR_boot",class(object))
    }else{
      lth.class <- length(class(object))
      if(lth.class==1){
        class(object) <- c("PLR_boot",class(object))
      }else{
        # "PLR_boot" must come immediately before "PLR" in the class vector
        class(object) <- c(class(object)[-lth.class],"PLR_boot","PLR")
      }
    }

  }

  return(object)

}
