R/VariableSelection.R

#' Variable Selection for GLMs 
#' @description Performs forward selection, several variants of backward 
#' elimination, and efficient best subset selection with information 
#' criteria for generalized linear models (GLMs). Best subset selection is 
#' performed with branch and bound algorithms to greatly speed up the search, 
#' and backward elimination can be accelerated with bounding algorithms.
#' @param object a formula or a `BranchGLM` object.
#' @param ... further arguments.
#' @param data a data.frame, list or environment (or object coercible by 
#' [as.data.frame] to a data.frame), containing the variables in formula. 
#' Neither a matrix nor an array will be accepted. If not found in `data`, the 
#' variables are taken from `environment(formula)`, typically the environment from 
#' which `VariableSelection` is called.
#' @param family the distribution used to model the data, one of "gaussian", 
#' "gamma", "binomial", or "poisson". A `family` object may also be supplied for 
#' one of the accepted distributions.
#' @param link the link used to link the mean structure to the linear predictors. One of
#' "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". This only needs to be 
#' supplied if a string is supplied for `family`. If a [family] object is supplied 
#' for the `family` argument, then the link function is taken from that [family] 
#' object.
#' @param offset the offset vector, by default the zero vector is used.
#' @param method one of "Fisher", "BFGS", or "LBFGS". Fisher's scoring is recommended
#' for forward selection and the branch and bound algorithms since they will typically 
#' fit many models with a small number of covariates.
#' @param type one of "forward", "backward", "fast backward", "double backward", "fast double backward", 
#' "branch and bound", "backward branch and bound", or "switch branch and bound" 
#' to indicate the type of variable selection to perform. The default value is 
#' "switch branch and bound". See Details for more information about these algorithms.
#' @param metric the metric used to choose the best models, the default is "AIC", 
#' but "BIC" and "HQIC" are also available. AIC is the Akaike information criterion, 
#' BIC is the Bayesian information criterion, and HQIC is the Hannan-Quinn information 
#' criterion. 
#' @param bestmodels a positive integer to indicate the number of best models to 
#' find according to the chosen metric, or NULL. If this is NULL, then cutoff is 
#' used instead. This is only used by the branch and bound algorithms.
#' @param cutoff a non-negative number or NULL. If a number is supplied, all models 
#' with a metric value within cutoff of the best metric value are returned. Only 
#' one of cutoff or bestmodels should be specified, and when both are NULL a 
#' cutoff of 0 is used. This is only used by the branch and bound algorithms.
#' @param keep a character vector of names to denote variables that must be in the models.
#' @param keepintercept a logical value to indicate whether to keep the intercept in 
#' all models, only used if an intercept is included in the formula.
#' @param maxsize a positive integer to denote the maximum number of variables to 
#' consider in a single model, the default is the total number of variables. 
#' This number adds onto any variables specified in keep. This argument only works 
#' for `type = "forward"` and `type = "branch and bound"` and is now deprecated.
#' @param grads a positive integer to denote the number of gradients used to 
#' approximate the inverse information, only used for `method = "LBFGS"`.
#' @param parallel a logical value to indicate if parallelization should be used.
#' @param nthreads a positive integer to denote the number of threads used with OpenMP, 
#' only used if `parallel = TRUE`.
#' @param tol a positive number to denote the tolerance used to determine model convergence.
#' @param maxit a positive integer to denote the maximum number of iterations performed. 
#' The default for Fisher's scoring is 50 and for the other methods the default is 200.
#' @param showprogress a logical value to indicate whether to show progress updates 
#' for branch and bound algorithms.
#' @param contrasts see `contrasts.arg` of `model.matrix.default`.
#' @seealso [plot.BranchGLMVS], [coef.BranchGLMVS], [predict.BranchGLMVS], 
#' [summary.BranchGLMVS]
#' @details 
#' 
#' ## Variable Selection Details
#' The supplied formula or the formula from the fitted model is 
#' treated as the upper model. The variables specified in keep, along with an 
#' intercept (if included in formula and keepintercept = TRUE), form the lower model. 
#' Factor variables are either kept in their entirety or removed entirely, and 
#' interaction terms are properly handled. All observations that have any missing 
#' values in the upper model are removed.
#' 
#' ## Stepwise Methods
#' There are 5 different stepwise variable selection algorithms available: 
#' forward selection, backward elimination, fast backward elimination, 
#' double backward elimination, and fast double backward elimination. All of these 
#' are heuristic algorithms, so the best model found by them may not be the optimal 
#' model. 
#' 
#' ### Backward Elimination
#' 
#' Fast backward elimination should give the same results as backward elimination, but it makes 
#' use of the bounding techniques from the branch and bound algorithms to speed up 
#' the search. However, fast backward elimination can give slightly different results than 
#' backward elimination if the GLM solver has difficulties fitting some of the 
#' larger models.
#' 
#' ### Double Backward Elimination
#' 
#' Double backward elimination and fast double backward elimination are variants of 
#' backward elimination in which up to 2 variables can be removed in one step instead of 
#' just 1. This typically results in higher quality models, but can also be much 
#' slower. The bounding algorithm used in fast double backward elimination makes it 
#' much faster.
#' 
#' ## Branch and Bound Methods
#' The branch and bound algorithm is an efficient method for finding the optimal 
#' models. The backward branch and bound algorithm is very similar to 
#' the branch and bound algorithm, except it tends to be faster when the best models 
#' contain most of the variables. The switch branch and bound algorithm is a 
#' combination of the two algorithms and is typically the fastest of the 3 branch and 
#' bound algorithms. All of the branch and bound algorithms are guaranteed to find 
#' the optimal models (up to numerical precision).
#' 
#' ## GLM Fitting
#' 
#' Fisher's scoring is recommended for branch and bound selection and forward selection.
#' L-BFGS may be faster for the backward elimination and double backward elimination 
#' algorithms, especially when there are many variables.
#' 
#' @return A `BranchGLMVS` object which is a list with the following components
#' \item{`initmodel`}{ the `BranchGLM` object corresponding to the upper model}
#' \item{`numchecked`}{ number of models fit}
#' \item{`names`}{ character vector of the names of the predictor variables}
#' \item{`order`}{ the order in which the variables were added to or removed from the model, this is only included for the stepwise algorithms}
#' \item{`type`}{ type of variable selection employed}
#' \item{`optType`}{ whether the specified type corresponds to a heuristic or exact algorithm}
#' \item{`metric`}{ metric used to select the best models}
#' \item{`bestmodels`}{ numeric matrix used to describe the best models for the branch and bound algorithms 
#' or a numeric matrix describing the models along the path taken for stepwise algorithms}
#' \item{`bestmetrics`}{ numeric vector with the best metrics found in the search for the branch and bound algorithms 
#' or a numeric vector with the metric values along the path taken for stepwise algorithms}
#' \item{`beta`}{ numeric matrix of beta coefficients for the models in bestmodels}
#' \item{`cutoff`}{ the cutoff that was used, this is set to -1 if bestmodels was used instead or if 
#' a stepwise algorithm was used}
#' \item{`keep`}{ vector of which variables were kept through the selection process}
#' \item{`keepintercept`}{ a logical value denoting whether the intercept was kept through the selection process}
#' @name VariableSelection
#' 
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' VS
#' 
#' ## Plotting the BIC of the best models
#' plot(VS, type = "b")
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(VS, which = 1)
#' FinalModel
#' 
#' # Now doing it in parallel (although it isn't necessary for this dataset)
#' parVS <- VariableSelection(Fit, type = "branch and bound", parallel = TRUE, 
#' metric = "BIC", bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(parVS, which = 1)
#' FinalModel
#' 
#' # Using a formula
#' formVS <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity", metric = "BIC", type = "branch and bound", bestmodels = 10, 
#' showprogress = FALSE)
#' 
#' ## Getting the coefficients of the best model according to BIC
#' FinalModel <- coef(formVS, which = 1)
#' FinalModel
#' 
#' # Using the keep argument
#' keepVS <- VariableSelection(Fit, type = "branch and bound", 
#' keep = c("Species", "Petal.Width"), metric = "BIC", bestmodels = 4, 
#' showprogress = FALSE)
#' keepVS
#' 
#' ## Getting the coefficients from the fourth best model according to BIC when 
#' ## keeping Petal.Width and Species in every model
#' FinalModel <- coef(keepVS, which = 4)
#' FinalModel
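#' 
#' # Using the cutoff argument instead of bestmodels (a brief sketch reusing Fit); 
#' # this returns all models with a BIC within 2 of the best BIC
#' cutoffVS <- VariableSelection(Fit, type = "branch and bound", cutoff = 2, 
#' metric = "BIC", showprogress = FALSE)
#' cutoffVS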
#' 
#' # Treating categorical variable beta parameters separately
#' ## This function automatically groups together parameters from a categorical variable; 
#' ## to avoid this, you need to create the indicator variables yourself
#' x <- model.matrix(Sepal.Length ~ ., data = iris)
#' Sepal.Length <- iris$Sepal.Length
#' Data <- cbind.data.frame(Sepal.Length, x[, -1])
#' VSCat <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian", 
#' link = "identity", metric = "BIC", bestmodels = 10, showprogress = FALSE)
#' VSCat
#' 
#' ## Plotting results
#' plot(VSCat, cex.names = 0.75)
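#' 
#' # A brief sketch of one of the heuristic stepwise methods using the original Fit; 
#' # fast backward elimination should typically match backward elimination, but faster
#' stepVS <- VariableSelection(Fit, type = "fast backward", metric = "BIC", 
#' showprogress = FALSE)
#' stepVS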
#' 
#' @export
#' 
VariableSelection <- function(object, ...) {
  UseMethod("VariableSelection")
}

#'@rdname VariableSelection
#'@export

VariableSelection.formula <- function(object, data = NULL, family, link, offset = NULL,
                                      method = "Fisher", type = "switch branch and bound", 
                                      metric = "AIC",
                                      bestmodels = NULL, cutoff = NULL, 
                                      keep = NULL, keepintercept = TRUE, maxsize = NULL,
                                      grads = 10, parallel = FALSE, 
                                      nthreads = 8, tol = 1e-6, maxit = NULL,
                                      contrasts = NULL,
                                      showprogress = TRUE, ...){
  ### Performing variable selection
  ### Getting BranchGLM object
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("data", "family", "link", "offset",
               "method", "grads", "tol", "maxit", "contrasts"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf[[1L]] <- quote(BranchGLM)
  mf$fit <- FALSE
  mf$formula <- object
  fit <- eval(mf, parent.frame())
  
  VariableSelection(fit, type = type, metric = metric, 
                    bestmodels = bestmodels, cutoff = cutoff,
                    keep = keep, keepintercept = keepintercept, 
                    maxsize = maxsize, parallel = parallel, 
                    nthreads = nthreads,
                    showprogress = showprogress, ...)
}

#'@rdname VariableSelection
#'@export

VariableSelection.BranchGLM <- function(object, type = "switch branch and bound", 
                                        metric = "AIC",
                                        bestmodels = NULL, cutoff = NULL, 
                                        keep = NULL, keepintercept = TRUE, maxsize = NULL,
                                        parallel = FALSE, nthreads = 8,
                                        showprogress = TRUE, ...){
  ## converting metric to upper and type to lower
  type <- tolower(type)
  metric <- toupper(metric)
  
  ## Checking if supplied BranchGLM object has x and data
  if(is.null(object$x)){
    stop("the supplied model must have an x component")
  }else if(nrow(object$x) == 0){
    stop("the design matrix in object has 0 rows")
  }
  ## Checking if supplied BranchGLM object has y
  if(is.null(object$y)){
    stop("the supplied model must have a y component")
  }else if(length(object$y) == 0){
    stop("the y component in object has 0 rows")
  }
    
  ## Validating supplied arguments
  if(length(nthreads) != 1 || !is.numeric(nthreads) || is.na(nthreads) || nthreads <= 0){
    stop("nthreads must be a positive integer")
  }
  if(length(parallel) != 1 || !is.logical(parallel) || is.na(parallel)){
    stop("parallel must be either TRUE or FALSE")
  }
  
  ### Checking showprogress
  if(length(showprogress) != 1 || !is.logical(showprogress)){
    stop("showprogress must be a logical value")
  }
  
  ### Checking metric
  if(length(metric) != 1 || !is.character(metric)){
    stop("metric must be one of 'AIC', 'BIC', or 'HQIC'")
  }else if(!(metric %in% c("AIC", "BIC", "HQIC"))){
    stop("metric must be one of 'AIC', 'BIC', or 'HQIC'")
  }
  
  ### Checking type
  if(length(type) != 1 || !is.character(type)){
    stop("type not supported, please see documentation for valid types")
  }
  
  ### Checking bestmodels
  if(is.null(bestmodels)){
    
  }else if(length(bestmodels) != 1 || !is.numeric(bestmodels) || 
     bestmodels <= 0 || bestmodels != as.integer(bestmodels)){
    stop("bestmodels must be a positive integer")
  }else if(!is.null(cutoff) && !is.null(bestmodels)){
    stop("only one of bestmodels or cutoff can be specified")
  }
  
  ### Checking cutoff
  if(is.null(cutoff)){
    
  }else if(length(cutoff) != 1 || !is.numeric(cutoff) || cutoff < 0){
    stop("cutoff must be a non-negative number")
  }
  
  if(is.null(cutoff)){
    if(is.null(bestmodels)){
      cutoff <- 0
      bestmodels <- 1
    }else{
      cutoff <- -1
    }
  }else if(is.null(bestmodels)){
    bestmodels <- 1
  }
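  ## At this point exactly one stopping rule is active for the branch and bound 
  ## algorithms: cutoff == -1 means the top `bestmodels` models are requested, 
  ## while cutoff >= 0 (with bestmodels = 1) means all models within cutoff of 
  ## the best metric value are requested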
  
  indices <- attr(object$x, "assign")  
  counts <- table(indices)
  interactions <- attr(object$terms, "factors")[-1L, ]
  
  ## Removing rows with all zeros
  if(is.matrix(interactions)){
    interactions <- interactions[apply(interactions, 1, function(x){sum(x) > 0}),]
  }else{
    ### This only happens when only 1 variable is included
    interactions <- matrix(1, nrow = 1, ncol = 1)
  }
  
  ## Checking for intercept
  if(colnames(object$x)[1] == "(Intercept)"){
    intercept <- TRUE
    interactions <- rbind(0, interactions)
    interactions <- cbind(0, interactions)
  }else{
    intercept <- FALSE
    indices <- indices - 1
  }
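  ## (the assign attribute is 0 for an intercept column and 1, 2, ... for the other 
  ## terms, so when there is no intercept the indices are shifted down so that the 
  ## first term is still indexed by 0)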

  if(is.null(maxsize)){
    maxsize <- length(counts)
  }else if(length(maxsize) != 1 || !is.numeric(maxsize) || maxsize <= 0){
    warning("maxsize argument is deprecated")
    stop("maxsize must be a positive integer specifying the max size of the models") 
  }else{
    warning("maxsize argument is deprecated")
  }
  
  ## Setting starting model and saving keep1 for later use since keep is modified
  ### Checking keep
  CurNames <- colnames(attributes(terms(object$formula, data = object$data))$factors)
  if(!is.character(keep) && !is.null(keep)){
    stop("keep must be a character vector or NULL")
  }else if(!is.null(keep) && !all(keep %in% CurNames)){
    keep <- keep[!(keep %in% CurNames)]
    stop(paste0("the following elements were found in keep, but are not variable names: ",
                paste0(keep, collapse = ", ")))
  }
  ### Checking keepintercept
  if(length(keepintercept) != 1 || !is.logical(keepintercept)){
    stop("keepintercept must be a logical value")
  }
  
  keep1 <- keep
  if(!intercept){
    # Changing keepintercept to FALSE since there is no intercept
    keepintercept <- FALSE
  }
  if(is.null(keep) && type != "backward"){
    keep <- rep(0, length(counts))
    if(intercept && keepintercept){
      keep[1] <- -1
    }
  }else if(is.null(keep) && type == "backward"){
    keep <- rep(1, length(counts))
    if(intercept && keepintercept){
      keep[1] <- -1
    }
  }else{
    keep <- (CurNames %in% keep) * -1
    if(type == "backward"){
      keep[keep == 0] <- 1
    }
    if(intercept && keepintercept){
      keep <- c(-1, keep) 
    }else if(intercept){
      keep <- c(0, keep)
    }
  }
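  ## keep is now coded per term: -1 for variables forced into every model, 1 for 
  ## variables in the starting model that may be removed (backward-type methods), 
  ## and 0 for variables not in the starting model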
  
  ## Checking for parallel
  if(!parallel){
    nthreads <- 1
  }
  
  ## Getting penalties
  if(metric == "AIC"){
    pen <- as.vector(counts) * 2
    penalty <- 2
  }else if(metric == "BIC"){
    pen <- as.vector(counts) * log(nrow(object$x))
    penalty <- log(nrow(object$x))
  }else if(metric == "HQIC"){
    pen <- as.vector(counts) * 2 * log(log(nrow(object$x)))
    penalty <- 2 * log(log(nrow(object$x)))
  }
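  ## For example, with metric = "BIC" a term contributing k columns to the design 
  ## matrix adds k * log(n) to the penalty, where n = nrow(object$x)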
  
  ## Performing variable selection
  if(type == "forward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("forward selection only finds 1 final model")
    }
    df <- ForwardCpp(object$x, object$y, object$offset, indices, counts, 
                     interactions, object$method, object$grads, object$link, 
                     object$family, nthreads, object$tol, object$maxit, keep, 
                     maxsize, pen)
    optType <- "heuristic"
  }else if(type == "backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("backward elimination only finds 1 final model")
    }
    df <- BackwardCpp(object$x, object$y, object$offset, indices, counts, 
                      interactions, object$method, object$grads,
                      object$link, object$family, nthreads, object$tol, object$maxit, 
                      keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "fast backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("fast backward elimination only finds 1 final model")
    }
    df <- FastBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                          interactions, object$method, object$grads,
                          object$link, object$family, nthreads, object$tol, object$maxit, 
                          keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "double backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("double backward elimination only finds 1 final model")
    }
    df <- DoubleBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                            interactions, object$method, object$grads,
                            object$link, object$family, nthreads, object$tol, object$maxit, 
                            keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "fast double backward"){
    if(bestmodels > 1 || cutoff > 0){
      warning("fast double backward elimination only finds 1 final model")
    }
    df <- FastDoubleBackwardCpp(object$x, object$y, object$offset, indices, counts, 
                                interactions, object$method, object$grads,
                                object$link, object$family, nthreads, object$tol, object$maxit, 
                                keep, length(counts), pen)
    optType <- "heuristic"
  }else if(type == "branch and bound"){
    df <- BranchAndBoundCpp(object$x, object$y, object$offset, indices, counts, 
                            interactions, object$method, object$grads,
                            object$link, object$family, nthreads,
                            object$tol, object$maxit, keep, maxsize,
                            pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else if(type == "backward branch and bound"){
    df <- BackwardBranchAndBoundCpp(object$x, object$y, object$offset, indices, 
                                    counts, interactions, object$method, object$grads,
                                    object$link, object$family, nthreads, object$tol, 
                                    object$maxit, keep, 
                                    pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else if(type == "switch branch and bound"){
    df <- SwitchBranchAndBoundCpp(object$x, object$y, object$offset, indices, counts, 
                                  interactions, object$method, object$grads,
                                  object$link, object$family, nthreads, 
                                  object$tol, object$maxit, keep, 
                                  pen, showprogress, bestmodels, cutoff)
    optType <- "exact"
  }else{
    stop("type not supported, please see documentation for valid types")
  }
  
  # Creating coefficient names
  names <- object$names
  if(intercept){
    names <- c("(Intercept)", names)
  }
  
  if(optType == "heuristic"){
    # Checking for infinite best metric value
    if(all(is.infinite(df$bestmetrics))){
      stop("no models were found that had an invertible fisher information")
    }
    
    # Adding penalty to gaussian and gamma families
    if(object$family %in% c("gaussian", "gamma")){
      df$bestmetrics <- df$bestmetrics + penalty
    }
    if(!intercept){
      df$order <- df$order + 1
    }
    if(is.matrix(df$order)){
      df$order <- apply(df$order, 1, function(x){
        if(x[1] > 0){
          mystr <- object$names[x[1]]
          if(x[2] > 0){
            mystr <- paste0(mystr, " and ", object$names[x[2]])
          }
          return(mystr)
        }else{
          return(NA_character_)
        }
      })
      df$order <- df$order[!is.na(df$order)]
    }else{
      df$order <- df$order[df$order > 0]
      df$order <- object$names[df$order]
    }
    
    # Making betas and bestmodels
    modelOrder <- rev(which(!is.infinite(df$bestmetrics)))
    bestmetrics <- df$bestmetrics[modelOrder]
    betas <- df$betas[, modelOrder, drop = FALSE]
    bestmodels <- df$bestmodel[, modelOrder, drop = FALSE]
    rownames(bestmodels) <- names
    rownames(betas) <- colnames(object$x)
    
    FinalList <- list("numchecked" = df$numchecked,
                      "order" = df$order,
                      "type" = type, 
                      "metric" = metric,
                      "bestmodels" = bestmodels,
                      "bestmetrics" = bestmetrics,
                      "beta" = betas,
                      "names" = names, 
                      "initmodel" = object,
                      "cutoff" = -1,
                      "keep" = keep1, 
                      "keepintercept" = keepintercept, 
                      "optType" = optType)
  }else{
    # Adding penalty to gaussian and gamma families
    if(object$family %in% c("gaussian", "gamma")){
      df$bestmetrics <- df$bestmetrics + penalty
    }
    # Checking for infinite best metric values
    if(all(is.infinite(df$bestmetrics))){
      stop("no models were found that had an invertible fisher information")
    }
    
    # Only returning best models that have a finite metric value
    newInd <- colSums(df$bestmodels != 0) != 0
    bestInd <- is.finite(df$bestmetrics)
    bestInd <- (newInd + bestInd) == 2
    bestmodels <- df$bestmodels[, bestInd, drop = FALSE]
    
    # Converting bestmodels from the coefficient level to the term level
    bestmodels <- sapply(1:length(keep), function(i){
      ind <- which((indices + 1) == i)
      temp <- bestmodels[ind, , drop = FALSE]
      apply(temp, 2, function(x)all(x != 0) * (keep[i] + 0.5) * 2)
    })
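    ## Each entry is now -1 for a kept variable, 1 for an included variable, and 
    ## 0 for an excluded variable, since all(x != 0) is 0 or 1 and 
    ## (keep[i] + 0.5) * 2 maps the keep codes -1/0 to -1/1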
    
    
    if(is.vector(bestmodels)){
      bestmodels <- matrix(bestmodels, ncol = 1)
    }else{
      bestmodels <- t(bestmodels)
    }
    beta <- df$bestmodels[, bestInd, drop = FALSE]
    rownames(bestmodels) <- names
    rownames(beta) <- colnames(object$x)
    FinalList <- list("numchecked" = df$numchecked,
                      "type" = type, 
                      "metric" = metric,
                      "bestmodels" = bestmodels,
                      "bestmetrics" = df$bestmetrics[bestInd],
                      "beta" = beta,
                      "names" = names, 
                      "initmodel" = object, 
                      "cutoff" = cutoff, 
                      "keep" = keep1,
                      "keepintercept" = keepintercept, 
                      "optType" = optType)
  }
  structure(FinalList, class = "BranchGLMVS")
}

#' @rdname plot.summary.BranchGLMVS 
#' @export
plot.BranchGLMVS <- function(x, ptype = "both", marnames = 7, addLines = TRUE, 
                             type = "b", horiz = FALSE,
                             cex.names = 1, cex.lab = 1, 
                             cex.axis = 1, cex.legend = 1,
                             cols = c("deepskyblue", "indianred", "forestgreen"), 
                             ...){
  plot(summary(x), ptype = ptype, marnames = marnames, 
       addLines = addLines, type = type, horiz = horiz, 
       cex.names = cex.names, cex.lab = cex.lab, 
       cex.axis = cex.axis, cex.legend = cex.legend,
       cols = cols, ...)
}

#' Extract Coefficients from BranchGLMVS or summary.BranchGLMVS Objects
#' @description Extracts beta coefficients from BranchGLMVS or summary.BranchGLMVS objects.
#' @param object a `BranchGLMVS` or `summary.BranchGLMVS` object.
#' @param which a numeric vector of indices or "all" to indicate which models to 
#' get coefficients from, the default is 1 which is used for the best model. For 
#' the branch and bound algorithms the number k is used for the kth best model 
#' and for the stepwise algorithms the number k is used for the model that is k - 1 steps 
#' away from the final model.
#' @param ... ignored.
#' @return A numeric matrix with the corresponding coefficient estimates.
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, 
#' family = "gaussian", link = "identity")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting coefficients from best model
#' coef(VS, which = 1)
#' 
#' ## Getting coefficients from all best models
#' coef(VS, which = "all")
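#' 
#' ## A brief sketch: which also accepts a numeric vector of model indices, 
#' ## so the top 3 models can be extracted at once
#' coef(VS, which = 1:3)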
#' 
#' @export
coef.BranchGLMVS <- function(object, which = 1, ...){
  ## Checking which
  if(!is.numeric(which) && is.character(which) && length(which) == 1){
    if(tolower(which) == "all"){
      which <- 1:NCOL(object$beta)
    }else{
      stop("which must be a numeric vector or 'all'")
    }
  }else if(!is.numeric(which)){
    stop("which must be a numeric vector or 'all'")
  }else if(any(which < 1)){
    stop("integers provided in which must be positive")
  }else if(any(which > NCOL(object$bestmodels))){
    stop("integers provided in which must be less than or equal to the number of best models")
  }
  
  ## Getting coefficients from all models in which
  allcoefs <- object$beta[, which, drop = FALSE]
  rownames(allcoefs) <- colnames(object$initmodel$x)
  
  ## Adding column names to identify each model
  colnames(allcoefs) <- paste0("Model", which)
  return(allcoefs)
}

#' Predict Method for BranchGLMVS or summary.BranchGLMVS Objects
#' @description Obtains predictions from BranchGLMVS or summary.BranchGLMVS objects.
#' @param object a `BranchGLMVS` or `summary.BranchGLMVS` object.
#' @param which a positive integer to indicate which model to get predictions from, 
#' the default is 1 which is used for the best model. For the branch and bound 
#' algorithms the number k is used for the kth best model and for the stepwise algorithms 
#' the number k is used for the model that is k - 1 steps away from the final model.
#' @param ... further arguments passed to [predict.BranchGLM].
#' @seealso [predict.BranchGLM]
#' @return A numeric vector of predictions.
#' @export
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, 
#' family = "gamma", link = "log")
#' 
#' # Doing branch and bound selection 
#' VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC", 
#' bestmodels = 10, showprogress = FALSE)
#' 
#' ## Getting predictions from best model
#' predict(VS, which = 1)
#' 
#' ## Getting linear predictors from 5th best model
#' predict(VS, which = 5, type = "linpreds")
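#' 
#' ## A brief sketch comparing predictions from the best and 2nd best models
#' head(cbind(best = predict(VS, which = 1), second = predict(VS, which = 2)))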
#' 
predict.BranchGLMVS <- function(object, which = 1, ...){
  ## Checking which
  if(!is.numeric(which) || length(which) != 1){
    stop("which must be a positive integer")
  }
  
  ### Getting BranchGLM object
  myfit <- object$initmodel
  myfit$coefficients[, 1] <- coef(object, which = which)
  
  ### Getting predictions
  predict(myfit, ...)
}

#' Print Method for BranchGLMVS Objects
#' @description Print method for BranchGLMVS objects.
#' @param x a `BranchGLMVS` object.
#' @param digits number of digits to display.
#' @param ... further arguments passed to other methods.
#' @return The supplied `BranchGLMVS` object.
#' @export

print.BranchGLMVS <- function(x, digits = 2, ...){
  cat("Variable Selection Info:\n")
  cat(paste0(rep("-", 24), collapse = ""))
  cat("\n")
  if(!(x$type %in% c("backward", "fast backward", "double backward", "fast double backward"))){
    cat(paste0("Variables were selected using ", x$type, " selection with ", x$metric, "\n"))
  }else{
    cat(paste0("Variables were selected using ", x$type, " elimination with ", x$metric, "\n"))
  }
  
  if(x$optType == "exact"){
    if(x$cutoff >= 0){
      if(length(x$bestmetrics) == 1){
        cat(paste0("Found 1 model within ", round(x$cutoff, digits), " " , x$metric, 
                   " of the best ", x$metric, "(", round(x$bestmetrics[1], digits = digits), ")\n"))
      }else{
        cat(paste0("Found ", length(x$bestmetrics), " models within ", 
                   round(x$cutoff, digits), " " , x$metric, 
                   " of the best ", x$metric, "(", round(x$bestmetrics[1], digits = digits), ")\n"))
      }
    }else{
      if(length(x$bestmetrics) == 1){
        cat(paste0("Found the top model with ", x$metric, " = ", round(x$bestmetrics[1], digits = digits), "\n"))
      }else{
        cat(paste0("The range of ", x$metric, " values for the top ", length(x$bestmetrics), 
                   " models is (", round(x$bestmetrics[1], digits = digits), 
                   ", ", round(x$bestmetrics[length(x$bestmetrics)], digits = digits), ")\n"))
      }
    }
  }else{
    cat(paste0("The top model found had ", x$metric, " = ", round(x$bestmetrics[1], digits = digits), "\n"))
  }
  cat(paste0("Number of models fit: ", x$numchecked))
  cat("\n")
  if(!is.null(x$keep) || x$keepintercept){
    temp <- x$keep
    if(x$keepintercept){
      temp <- c("(Intercept)", temp)
    }
    cat("Variables that were kept in each model: ", paste0(temp, collapse = ", "))
  }
  cat("\n")
  if(length(x$order) == 0){
    if(x$type == "forward"){
      cat("No variables were added to the model")
    }else if(x$type %in% c("backward", "fast backward", "double backward", "fast double backward")){
      cat("No variables were removed from the model")
    }
  }else if(x$type == "forward" ){
    cat("Order the variables were added to the model:\n")
  }else if(x$type %in% c("backward", "fast backward", "double backward", "fast double backward")){
    cat("Order the variables were removed from the model:\n")
  }
  cat("\n")
  if(length(x$order) > 0){
    for(i in 1:length(x$order)){
      cat(paste0(i, "). ", x$order[i], "\n"))
    }
  }
  
  invisible(x)
}
