R/VariableSelection.R
In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

Documented in coef.BranchGLMVS plot.BranchGLMVS predict.BranchGLMVS print.BranchGLMVS VariableSelection VariableSelection.BranchGLM VariableSelection.formula

#' Variable Selection for GLMs 
#' @description Performs forward selection, multiple different variants of backward 
#' elimination, and efficient best subset variable selection with information 
#' criterion for generalized linear models (GLMs). Best subset selection is 
#' performed with branch and bound algorithms to greatly speed up the process 
#' and backward elimination can be performed with bounding algorithms to speed it up.
#' @param object a formula or a `BranchGLM` object.
#' @param ... further arguments.
#' @param data a data.frame, list or environment (or object coercible by 
#' [as.data.frame] to a data.frame), containing the variables in formula. 
#' Neither a matrix nor an array will be accepted. If not found in `data`, the 
#' variables are taken from `environment(formula)`, typically the environment from 
#' which `VariableSelection` is called.
#' @param family the distribution used to model the data, one of "gaussian", 
#' "gamma", "binomial", or "poisson". A `family` object may also be supplied for 
#' one of the accepted distributions.
#' @param link the link used to link the mean structure to the linear predictors. One of
#' "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". This only needs to be 
#' supplied if a string is supplied for `family`. If a [family] object is supplied 
#' for the `family` argument, then the link function is taken from that [family] 
#' object.
#' @param offset the offset vector, by default the zero vector is used.
#' @param method one of "Fisher", "BFGS", or "LBFGS". Fisher's scoring is recommended
#' for forward selection and the branch and bound algorithms since they will typically 
#' fit many models with a small number of covariates.
#' @param type one of "forward", "backward", "fast backward", "double backward", "fast double backward", 
#' "branch and bound", "backward branch and bound", or "switch branch and bound" 
#' to indicate the type of variable selection to perform. The default value is 
#' "switch branch and bound". See more about these algorithms in details
#' @param metric the metric used to choose the best models, the default is "AIC", 
#' but "BIC" and "HQIC" are also available. AIC is the Akaike information criterion, 
#' BIC is the Bayesian information criterion, and HQIC is the Hannan-Quinn information 
#' criterion. 
#' @param bestmodels a positive integer to indicate the number of the best models to 
#' find according to the chosen metric or NULL. If this is NULL, then cutoff is 
#' used instead. This is only used for the branch and bound algorithms.
#' @param cutoff a non-negative number which indicates that the function 
#' should return all models that have a metric value within cutoff of the 
#' best metric value or NULL. Only one of this or bestmodels should be specified and 
#' when both are NULL a cutoff of 0 is used. This is only used for the branch 
#' and bound algorithms.
#' @param keep a character vector of names to denote variables that must be in the models.
#' @param keepintercept a logical value to indicate whether to keep the intercept in 
#' all models, only used if an intercept is included in the formula.
#' @param maxsize a positive integer to denote the maximum number of variables to 
#' consider in a single model, the default is the total number of variables. 
#' This number adds onto any variables specified in keep. This argument only works 
#' for `type = "forward"` and `type = "branch and bound"`. This argument is now 
#' deprecated.
#' @param grads a positive integer to denote the number of gradients used to 
#' approximate the inverse information with, only for `method = "LBFGS"`.
#' @param parallel a logical value to indicate if parallelization should be used.
#' @param nthreads a positive integer to denote the number of threads used with OpenMP, 
#' only used if `parallel = TRUE`.
#' @param tol a positive number to denote the tolerance used to determine model convergence.
#' @param maxit a positive integer to denote the maximum number of iterations performed. 
#' The default for Fisher's scoring is 50 and for the other methods the default is 200.
#' @param showprogress a logical value to indicate whether to show progress updates 
#' for branch and bound algorithms.
#' @param contrasts see `contrasts.arg` of `model.matrix.default`.
#' @seealso [plot.BranchGLMVS], [coef.BranchGLMVS], [predict.BranchGLMVS], 
#' [summary.BranchGLMVS]
#' @details 
#' 
#' ## Variable Selection Details
#' The supplied formula or the formula from the fitted model is 
#' treated as the upper model. The variables specified in keep along with an 
#' intercept (if included in formula and keepintercept = TRUE) is the lower model. 
#' Factor variables are either kept in their entirety or entirely removed and 
#' interaction terms are properly handled. All observations that have any missing 
#' values in the upper model are removed.
#' 
#' ## Stepwise Methods
#' There are 5 different stepwise variable selection algorithms that are available. 
#' These are forward selection, backward elimination, fast backward elimination, 
#' double backward elimination, and fast double backward elimination. All of these 
#' are heuristic algorithms, so the best model found by them may not be the optimal 
#' model. 
#' 
#' ### Backward Elimination
#' 
#' Fast backward elimination should give the same results as backward elimination, but it makes 
#' use of the bounding techniques used by the branch and bound algorithms to make 
#' it faster. Fast backward elimination can give slightly different results than 
#' backward elimination if the GLM solver has difficulties fitting some of the 
#' larger models.
#' 
#' ### Double Backward Elimination
#' 
#' Double backward elimination and fast double backward elimination are a variant of 
#' backward elimination where up to 2 variables can be removed in one step instead of 
#' just 1. This typically results in higher quality models, but can also be much 
#' slower. The bounding algorithm used in fast double backward elimination makes it 
#' much faster.
#' 
#' ## Branch and Bound Methods
#' The branch and bound algorithm is an efficient algorithm used to find the optimal 
#' models. The backward branch and bound algorithm is very similar to 
#' the branch and bound algorithm, except it tends to be faster when the best models 
#' contain most of the variables. The switch branch and bound algorithm is a 
#' combination of the two algorithms and is typically the fastest of the 3 branch and 
#' bound algorithms. All of the branch and bound algorithms are guaranteed to find 
#' the optimal models (up to numerical precision).
#' 
#' ## GLM Fitting
#' 
#' Fisher's scoring is recommended for branch and bound selection and forward selection.
#' L-BFGS may be faster for the backward elimination and double backward elimination 
#' algorithms, especially when there are many variables.
#' 
#' @return A `BranchGLMVS` object which is a list with the following components
#' \item{`initmodel`}{ the `BranchGLM` object corresponding to the upper model}
#' \item{`numchecked`}{ number of models fit}
#' \item{`names`}{ character vector of the names of the predictor variables}
#' \item{`order`}{ the order the variables were added to the model or removed from the model, this is only included for the stepwise algorithms}
#' \item{`type`}{ type of variable selection employed}
#' \item{`optType`}{ whether the type specified used a heuristic or exact algorithm}
#' \item{`metric`}{ metric used to select the best models}
#' \item{`bestmodels`}{ numeric matrix used to describe the best models for the branch and bound algorithms 
#' or a numeric matrix describing the models along the path taken for stepwise algorithms}
#' \item{`bestmetrics`}{ numeric vector with the best metrics found in the search for the branch and bound algorithms 
#' or a numeric vector with the metric values along the path taken for stepwise algorithms}
#' \item{`beta`}{ numeric matrix of beta coefficients for the models in bestmodels}
#' \item{`cutoff`}{ the cutoff that was used, this is set to -1 if bestmodels was used instead or if 
#' a stepwise algorithm was used}
#' \item{`keep`}{ vector of which variables were kept through the selection process}
#' \item{`keepintercept`}{ a boolean value denoting whether to keep the intercept through the selection process or not}
#' @name VariableSelection
#' 
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' VS
#' 
#' ## Plotting the BIC of the best models
#' plot(VS, type = "b")
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(VS, which = 1)
#' FinalModel
#' 
#' # Now doing it in parallel (although it isn't necessary for this dataset)
#' parVS <- VariableSelection(Fit, type = "branch and bound", parallel = TRUE, 
#' metric = "BIC", bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(parVS, which = 1)
#' FinalModel
#' 
#' # Using a formula
#' formVS <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity", metric = "BIC", type = "branch and bound", bestmodels = 10, 
#' showprogress = FALSE)
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(formVS, which = 1)
#' FinalModel
#' 
#' # Using the keep argument
#' keepVS <- VariableSelection(Fit, type = "branch and bound", 
#' keep = c("Species", "Petal.Width"), metric = "BIC", bestmodels = 4, 
#' showprogress = FALSE)
#' keepVS
#' 
#' ## Getting the coefficients from the fourth best model according to BIC when 
#' ## keeping Petal.Width and Species in every model
#' FinalModel <- coef(keepVS, which = 4)
#' FinalModel
#' 
#' # Treating categorical variable beta parameters separately
#' ## This function automatically groups together parameters from a categorical variable
#' ## to avoid this, you need to create the indicator variables yourself
#' x <- model.matrix(Sepal.Length ~ ., data = iris)
#' Sepal.Length <- iris$Sepal.Length
#' Data <- cbind.data.frame(Sepal.Length, x[, -1])
#' VSCat <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity", metric = "BIC", bestmodels = 10, showprogress = FALSE)
#' VSCat
#' 
#' ## Plotting results
#' plot(VSCat, cex.names = 0.75)
#' 
#' @export
#' 
VariableSelection <- function(object, ...) {
  UseMethod("VariableSelection")
}

#'@rdname VariableSelection
#'@export

VariableSelection.formula <- function(object, data = NULL, family, link, offset = NULL,
                                      method = "Fisher", type = "switch branch and bound", 
                                      metric = "AIC",
                                      bestmodels = NULL, cutoff = NULL, 
                                      keep = NULL, keepintercept = TRUE, maxsize = NULL,
                                      grads = 10, parallel = FALSE, 
                                      nthreads = 8, tol = 1e-6, maxit = NULL,
                                      contrasts = NULL,
                                      showprogress = TRUE, ...){
  ### Performing variable selection
  ### Getting BranchGLM object
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("data", "family", "link", "offset",
               "method", "grads", "tol", "maxit", "contrasts"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf[[1L]] <- quote(BranchGLM)
  mf$fit <- FALSE
  mf$formula <- object
  fit <- eval(mf, parent.frame())
  
  VariableSelection(fit, type = type, metric = metric, 
                    bestmodels = bestmodels, cutoff = cutoff,
                    keep = keep, keepintercept = keepintercept, 
                    maxsize = maxsize, parallel = parallel, 
                    nthreads = nthreads,
                    showprogress = showprogress, ...)
}

#'@rdname VariableSelection
#'@export

VariableSelection.BranchGLM <- function(object, type = "switch branch and bound", 
                                        metric = "AIC",
                                        bestmodels = NULL, cutoff = NULL, 
                                        keep = NULL, keepintercept = TRUE, maxsize = NULL,
                                        parallel = FALSE, nthreads = 8,
                                        showprogress = TRUE, ...){
  ## converting metric to upper and type to lower
  type <- tolower(type)
  metric <- toupper(metric)
  
  ## Checking if supplied BranchGLM object has x and data
  if(is.null(object$x)){
    stop("the supplied model must have an x component")
  }else if(nrow(object$x) == 0){
    stop("the design matrix in object has 0 rows")
  }
  ## Checking if supplied BranchGLM object has y
  if(is.null(object$y)){
    stop("the supplied model must have a y component")
  }else if(length(object$y) == 0){
    stop("the y component in object has 0 rows")
  }
    
  ## Validating supplied arguments
  if(length(nthreads) != 1 || !is.numeric(nthreads) || is.na(nthreads) || nthreads <= 0){
    stop("nthreads must be a positive integer")
  }
  if(length(parallel) != 1 || !is.logical(parallel) || is.na(parallel)){
    stop("parallel must be either TRUE or FALSE")
  }
  
  ### Checking showprogress
  if(length(showprogress) != 1 || !is.logical(showprogress)){
    stop("showprogress must be a logical value")
  }
  
  ### Checking metric
  if(length(metric) > 1 || !is.character(metric)){
    stop("metric must be one of 'AIC','BIC', or 'HQIC'")
  }else if(!(metric %in% c("AIC", "BIC", "HQIC"))){
    stop("metric must be one of 'AIC','BIC', or 'HQIC'")
  }
  
  ### Checking type
  if(length(type) != 1 || !is.character(type)){
    stop("type not supported, please see documentation for valid types")
  }
  
  ### Checking bestmodels
  if(is.null(bestmodels)){
    
  }else if(length(bestmodels) != 1 || !is.numeric(bestmodels) || 
     bestmodels <= 0 || bestmodels != as.integer(bestmodels)){
    stop("bestmodels must be a positive integer")
  }else if(!is.null(cutoff) && !is.null(bestmodels)){
    stop("only one of bestmodels or cutoff can be specified")
  }
  
  ### Checking cutoff
  if(is.null(cutoff)){
    
  }else if(length(cutoff) != 1 || !is.numeric(cutoff) || cutoff < 0){
    stop("cutoff must be a non-negative number")
  }
  
  if(is.null(cutoff)){
    if(is.null(bestmodels)){
      cutoff <- 0
      bestmodels <- 1
    }else{
      cutoff <- -1
    }
  }else if(is.null(bestmodels)){
    bestmodels <- 1
  }
  
  indices <- attr(object$x, "assign")  
  counts <- table(indices)
  interactions <- attr(object$terms, "factors")[-1L, ]
  
  ## Removing rows with all zeros
  if(is.matrix(interactions)){
    interactions <- interactions[apply(interactions, 1, function(x){sum(x) > 0}),]
  }else{
    ### This only happens when only 1 variable is included
    interactions <- matrix(1, nrow = 1, ncol = 1)
  }
  
  ## Checking for intercept
  if(colnames(object$x)[1] == "(Intercept)"){
    intercept <- TRUE
    interactions <- rbind(0, interactions)
    interactions <- cbind(0, interactions)
  }else{
    intercept <- FALSE
    indices <- indices - 1
  }

  if(is.null(maxsize)){
    maxsize <- length(counts)
  }else if(length(maxsize) != 1 || !is.numeric(maxsize) || maxsize <= 0){
    warning("maxsize argument is deprecated")
    stop("maxsize must be a positive integer specifying the max size of the models") 
  }else{
    warning("maxsize argument is deprecated")
  }
  
  ## Setting starting model and saving keep1 for later use since keep is modified
  ### Checking keep
  CurNames <- colnames(attributes(terms(object$formula, data = object$data))$factors)
  if(!is.character(keep) && !is.null(keep)){
    stop("keep must be a character vector or NULL")
  }else if(!is.null(keep) && !all(keep %in% CurNames)){
    keep <- keep[!(keep %in% CurNames)]
    stop(paste0("the following elements were found in keep, but are not variable names: ",
                paste0(keep, collapse = ", ")))
  }
  ### Checking keepintercept
  if(length(keepintercept) != 1 || !is.logical(keepintercept)){
    stop("keepintercept must be a logical value")
  }
  
  keep1 <- keep
  if(!intercept){
    # Changing keepintercept to FALSE since there is no intercept
    keepintercept <- FALSE
  }
  if(is.null(keep) && type != "backward"){
    keep <- rep(0, length(counts))
    if(intercept && keepintercept){
      keep[1] <- -1
    }
  }else if(is.null(keep) && type == "backward"){
    keep <- rep(1, length(counts))
    if(intercept && keepintercept){
      keep[1] <- -1
    }
  }else{
    keep <- (CurNames %in% keep) * -1
    if(type == "backward"){
      keep[keep == 0] <- 1
    }
    if(intercept && keepintercept){
      keep <- c(-1, keep) 
    }else if(intercept){
      keep <- c(0, keep)
    }
  }
  
  ## Checking for parallel
  if(!parallel){
    nthreads <- 1
  }
  
  ## Getting penalties
  if(metric == "AIC"){
    pen <- as.vector(counts) * 2
    penalty <- 2
  }else if(metric == "BIC"){
    pen <- as.vector(counts) * log(nrow(object$x))
    penalty <- log(nrow(object$x))
  }else if(metric == "HQIC"){
    pen <- as.vector(counts) * 2 * log(log(nrow(object$x)))
    penalty <- 2 * log(log(nrow(object$x)))
  }
  
  ## Performing variable selection
  if(type == "forward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("forward selection only finds 1 final model")
    }
    df <- ForwardCpp(object$x, object$y, object$offset, indices, counts, 
                     interactions, object$method, object$grads, object$link, 
                     object$family, nthreads, object$tol, object$maxit, keep, 
                     maxsize, pen)
    optType <- "heuristic"
  }else if(type == "backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("backward elimination only finds 1 final model")
    }
    df <- BackwardCpp(object$x, object$y, object$offset, indices, counts, 
                      interactions, object$method, object$grads,
                      object$link, object$family, nthreads, object$tol, object$maxit, 
                      keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "fast backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("fast backward elimination only finds 1 final model")
    }
    df <- FastBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                          interactions, object$method, object$grads,
                          object$link, object$family, nthreads, object$tol, object$maxit, 
                          keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "double backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("double backward elimination only finds 1 final model")
    }
    df <- DoubleBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                            interactions, object$method, object$grads,
                            object$link, object$family, nthreads, object$tol, object$maxit, 
                            keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "fast double backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("fast double backward elimination only finds 1 final model")
    }
    df <- FastDoubleBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                                interactions, object$method, object$grads,
                                object$link, object$family, nthreads, object$tol, object$maxit, 
                                keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "branch and bound"){
    df <- BranchAndBoundCpp(object$x, object$y, object$offset, indices, counts, 
                            interactions, object$method, object$grads,
                            object$link, object$family, nthreads,
                            object$tol, object$maxit, keep, maxsize,
                            pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else if(type == "backward branch and bound"){
    df <- BackwardBranchAndBoundCpp(object$x, object$y, object$offset, indices, 
                                    counts, interactions, object$method, object$grads,
                                    object$link, object$family, nthreads, object$tol, 
                                    object$maxit, keep, 
                                    pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else if(type == "switch branch and bound"){
    df <- SwitchBranchAndBoundCpp(object$x, object$y, object$offset, indices, counts, 
                                  interactions, object$method, object$grads,
                                  object$link, object$family, nthreads, 
                                  object$tol, object$maxit, keep, 
                                  pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else{
    stop("type not supported, please see documentation for valid types")
  }
  
  # Creating coefficient names
  names <- object$names
  if(intercept){
    names <- c("(Intercept)", names)
  }
  
  if(optType == "heuristic"){
    # Checking for infinite best metric value
    if(all(is.infinite(df$bestmetrics))){
      stop("no models were found that had an invertible fisher information")
    }
    
    # Adding penalty to gaussian and gamma families
    if(object$family %in% c("gaussian", "gamma")){
      df$bestmetrics <- df$bestmetrics + penalty
    }
    if(!intercept){
      df$order <- df$order + 1
    }
    if(is.matrix(df$order)){
      df$order <- apply(df$order, 1, function(x){
        if(x[1] > 0){
          mystr <- object$names[x[1]]
          if(x[2] > 0){
            mystr <- paste0(mystr, " and ", object$names[x[2]])
          }
          return(mystr)
        }else{
          return(NA_character_)
        }
      })
      df$order <- df$order[!is.na(df$order)]
    }else{
      df$order <- df$order[df$order > 0]
      df$order <- object$names[df$order]
    }
    
    # Making betas and bestmodels
    modelOrder <- rev(which(!is.infinite(df$bestmetrics)))
    bestmetrics <- df$bestmetrics[modelOrder]
    betas <- df$betas[, modelOrder, drop = FALSE]
    bestmodels <- df$bestmodel[, modelOrder, drop = FALSE]
    rownames(bestmodels) <- names
    rownames(betas) <- colnames(object$x)
    
    FinalList <- list("numchecked" = df$numchecked,
                      "order" = df$order,
                      "type" = type, 
                      "metric" = metric,
                      "bestmodels" = bestmodels,
                      "bestmetrics" = bestmetrics,
                      "beta" = betas,
                      "names" = names, 
                      "initmodel" = object,
                      "cutoff" = -1,
                      "keep" = keep1, 
                      "keepintercept" = keepintercept, 
                      "optType" = optType)
  }else{
    # Adding penalty to gaussian and gamma families
    if(object$family %in% c("gaussian", "gamma")){
      df$bestmetrics <- df$bestmetrics + penalty
    }
    # Checking for infinite best metric values
    if(all(is.infinite(df$bestmetrics))){
      stop("no models were found that had an invertible fisher information")
    }
    
    # Only returning best models that have a finite metric value
    newInd <- colSums(df$bestmodels != 0) != 0
    bestInd <- is.finite(df$bestmetrics)
    bestInd <- (newInd + bestInd) == 2
    bestmodels <- df$bestmodels[, bestInd, drop = FALSE]
    
    # Only returning best models that are not the null model
    bestmodels <- sapply(1:length(keep), function(i){
      ind <- which((indices + 1) == i)
      temp <- bestmodels[ind, , drop = FALSE]
      apply(temp, 2, function(x)all(x != 0) * (keep[i] + 0.5) * 2)
    })
    
    
    if(is.vector(bestmodels)){
      bestmodels <- matrix(bestmodels, ncol = 1)
    }else{
      bestmodels <- t(bestmodels)
    }
    beta <- df$bestmodels[, bestInd, drop = FALSE]
    rownames(bestmodels) <- names
    rownames(beta) <- colnames(object$x)
    FinalList <- list("numchecked" = df$numchecked,
                      "type" = type, 
                      "metric" = metric,
                      "bestmodels" = bestmodels,
                      "bestmetrics" = df$bestmetrics[bestInd],
                      "beta" = beta,
                      "names" = names, 
                      "initmodel" = object, 
                      "cutoff" = cutoff, 
                      "keep" = keep1,
                      "keepintercept" = keepintercept, 
                      "optType" = optType)
  }
  structure(FinalList, class = "BranchGLMVS")
}

#' @rdname plot.summary.BranchGLMVS 
#' @export
plot.BranchGLMVS <- function(x, ptype = "both", marnames = 7, addLines = TRUE, 
                             type = "b", horiz = FALSE,
                             cex.names = 1, cex.lab = 1, 
                             cex.axis = 1, cex.legend = 1,
                             cols = c("deepskyblue", "indianred", "forestgreen"), 
                             ...){
  plot(summary(x), ptype = ptype, marnames = marnames, 
       addLines = addLines, type = type, horiz = horiz, 
       cex.names = cex.names, cex.lab = cex.lab, 
       cex.axis = cex.axis, cex.legend = cex.legend,
       cols = cols, ...)
}

#' Extract Coefficients from BranchGLMVS or summary.BranchGLMVS Objects
#' @description Extracts beta coefficients from BranchGLMVS or summary.BranchGLMVS objects.
#' @param object a `BranchGLMVS` or `summary.BranchGLMVS` object.
#' @param which a numeric vector of indices or "all" to indicate which models to 
#' get coefficients from, the default is 1 which is used for the best model. For 
#' the branch and bound algorithms the number k is used for the kth best model 
#' and for the stepwise algorithms the number k is used for the model that is k - 1 steps 
#' away from the final model.
#' @param ... ignored.
#' @return A numeric matrix with the corresponding coefficient estimates.
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, 
#' family = "gaussian", link = "identity")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting coefficients from best model
#' coef(VS, which = 1)
#' 
#' ## Getting coefficients from all best models
#' coef(VS, which = "all")
#' 
#' @export
coef.BranchGLMVS <- function(object, which = 1, ...){
  ## Checking which
  if(!is.numeric(which) && is.character(which) && length(which) == 1){
    if(tolower(which) == "all"){
      which <- 1:NCOL(object$beta)
    }
    else{
      stop("which must be a numeric vector or 'all'")
    }
  }else if(!is.numeric(which)){
    stop("which must be a numeric vector or 'all'")
  }else if(any(which < 1)){
    stop("integers provided in which must be positive")
  }else if(any(which > NCOL(object$bestmodels))){
    stop("integers provided in which must be less than or equal to the number of best models")
  }
  
  ## Getting coefficients from all models in which
  allcoefs <- object$beta[, which, drop = FALSE]
  rownames(allcoefs) <- colnames(object$initmodel$x)
  
  ## Adding column names to identify each model
  colnames(allcoefs) <- paste0("Model", which)
  return(allcoefs)
}

#' Predict Method for BranchGLMVS or summary.BranchGLMVS Objects
#' @description Obtains predictions from BranchGLMVS or summary.BranchGLMVS objects.
#' @param object a `BranchGLMVS` or `summary.BranchGLMVS` object.
#' @param which a positive integer to indicate which model to get predictions from, 
#' the default is the best model.
#' @param which a positive integer to indicate which model to get predictions from, 
#' the default is 1 which is used for the best model. For the branch and bound 
#' algorithms the number k is used for the kth best model and for the stepwise algorithms 
#' the number k is used for the model that is k - 1 steps away from the final model.
#' @param ... further arguments passed to [predict.BranchGLM].
#' @seealso [predict.BranchGLM]
#' @return A numeric vector of predictions.
#' @export
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, 
#' family = "gamma", link = "log")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting predictions from best model
#' predict(VS, which = 1)
#' 
#' ## Getting linear predictors from 5th best model
#' predict(VS, which = 5, type = "linpreds")
#' 
predict.BranchGLMVS <- function(object, which = 1, ...){
  ## Checking which
  if(!is.numeric(which) || length(which) != 1){
    stop("which must be a positive integer")
  }
  
  ### Getting BranchGLM object
  myfit <- object$initmodel
  myfit$coefficients[, 1] <- coef(object, which = which)
  
  ### Getting predictions
  predict(myfit, ...)
}

#' Print Method for BranchGLMVS Objects
#' @description Print method for BranchGLMVS objects.
#' @param x a `BranchGLMVS` object.
#' @param digits number of digits to display.
#' @param ... further arguments passed to other methods.
#' @return The supplied `BranchGLMVS` object.
#' @export

print.BranchGLMVS <- function(x, digits = 2, ...){
  cat("Variable Selection Info:\n")
  cat(paste0(rep("-", 24), collapse = ""))
  cat("\n")
  if(!(x$type %in% c("backward", "fast backward", "double backward", "fast double backward"))){
    cat(paste0("Variables were selected using ", x$type, " selection with ", x$metric, "\n"))
  }else{
    cat(paste0("Variables were selected using ", x$type, " elimination with ", x$metric, "\n"))
  }
  
  if(x$optType == "exact"){
    if(x$cutoff >= 0){
      if(length(x$bestmetrics) == 1){
        cat(paste0("Found 1 model within ", round(x$cutoff, digits), " " , x$metric, 
                   " of the best ", x$metric, "(", round(x$bestmetrics[1], digits = digits), ")\n"))
      }else{
        cat(paste0("Found ", length(x$bestmetrics), " models within ", 
                   round(x$cutoff, digits), " " , x$metric, 
                   " of the best ", x$metric, "(", round(x$bestmetrics[1], digits = digits), ")\n"))
      }
    }else{
      if(length(x$bestmetrics) == 1){
        cat(paste0("Found the top model with ", x$metric, " = ", round(x$bestmetrics[1], digits = digits), "\n"))
      }else{
        cat(paste0("The range of ", x$metric, " values for the top ", length(x$bestmetrics), 
                   " models is (", round(x$bestmetrics[1], digits = digits), 
                   ", ", round(x$bestmetrics[length(x$bestmetrics)], digits = digits), ")\n"))
      }
    }
  }else{
    cat(paste0("The top model found had ", x$metric, " = ", round(x$bestmetrics[1], digits = digits), "\n"))
  }
  cat(paste0("Number of models fit: ", x$numchecked))
  cat("\n")
  if(!is.null(x$keep) || x$keepintercept){
    temp <- x$keep
    if(x$keepintercept){
      temp <- c("(Intercept)", temp)
    }
    cat("Variables that were kept in each model: ", paste0(temp, collapse = ", "))
  }
  cat("\n")
  if(length(x$order) == 0){
    if(x$type == "forward"){
      cat("No variables were added to the model")
    }else if(x$type %in% c("backward", "fast backward", "double backward", "fast double backward")){
      cat("No variables were removed from the model")
    }
  }else if(x$type == "forward" ){
    cat("Order the variables were added to the model:\n")
  }else if(x$type %in% c("backward", "fast backward", "double backward", "fast double backward")){
    cat("Order the variables were removed from the model:\n")
  }
  cat("\n")
  if(length(x$order) > 0){
    for(i in 1:length(x$order)){
      cat(paste0(i, "). ", x$order[i], "\n"))
    }
  }
  
  invisible(x)
}
Any scripts or data that you put into this service are public.
BranchGLM documentation built on Sept. 11, 2024, 5:09 p.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
BranchGLM
Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

R/VariableSelection.R
In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

Defines functions print.BranchGLMVS predict.BranchGLMVS coef.BranchGLMVS plot.BranchGLMVS VariableSelection.BranchGLM VariableSelection.formula VariableSelection

Documented in coef.BranchGLMVS plot.BranchGLMVS predict.BranchGLMVS print.BranchGLMVS VariableSelection VariableSelection.BranchGLM VariableSelection.formula

Try the BranchGLM package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

BranchGLM Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

R/VariableSelection.R In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

Defines functions print.BranchGLMVS predict.BranchGLMVS coef.BranchGLMVS plot.BranchGLMVS VariableSelection.BranchGLM VariableSelection.formula VariableSelection

Documented in coef.BranchGLMVS plot.BranchGLMVS predict.BranchGLMVS print.BranchGLMVS VariableSelection VariableSelection.BranchGLM VariableSelection.formula

Try the BranchGLM package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

BranchGLM
Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

R/VariableSelection.R
In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms