R/BranchGLM.R

#' Fits GLMs
#' @param formula a formula for the model.
#' @param data a dataframe that contains the response and predictor variables.
#' @param family distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson".
#' @param link link used to link mean structure to linear predictors. One of
#' "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log".
#' @param offset offset vector, by default the zero vector is used.
#' @param method one of "Fisher", "BFGS", or "LBFGS". BFGS and L-BFGS are 
#' quasi-Newton methods that are typically faster than Fisher's scoring when
#' there are many covariates (at least 50).
#' @param grads number of gradients used to approximate the inverse information, only used for \code{method = "LBFGS"}.
#' @param parallel whether or not to make use of parallelization via OpenMP.
#' @param nthreads number of threads used with OpenMP, only used if \code{parallel = TRUE}.
#' @param tol tolerance used to determine model convergence.
#' @param maxit maximum number of iterations performed. The default for 
#' Fisher's scoring is 50 and for the other methods the default is 200.
#' @param init initial values for the betas, if not specified then they are automatically 
#' selected.
#' @param fit a logical value to indicate whether or not to fit the model. Setting 
#' this to FALSE means that no coefficient matrix or variance-covariance 
#' matrix is returned.
#' @param keepData whether or not to store a copy of the data and design matrix; the default 
#' is TRUE. If this is FALSE, then the results cannot be used inside of \code{VariableSelection}.
#' @param keepY whether or not to store a copy of y; the default is TRUE. If 
#' this is FALSE, then the binomial GLM helper functions may not work and the results 
#' cannot be used inside of \code{VariableSelection}.
#' @param contrasts see \code{contrasts.arg} of \code{model.matrix.default}.
#' @param x design matrix used for the fit, must be numeric.
#' @param y outcome vector, must be numeric.
#' @return \code{BranchGLM} returns a \code{BranchGLM} object which is a list with the following components
#' \item{\code{coefficients}}{ a matrix with the coefficient estimates, SEs, Wald test statistics, and p-values}
#' \item{\code{iterations}}{ number of iterations it took the algorithm to converge; if the algorithm failed to converge, then this is -1}
#' \item{\code{dispersion}}{ the value of the dispersion parameter}
#' \item{\code{logLik}}{ the log-likelihood of the fitted model}
#' \item{\code{vcov}}{ the variance-covariance matrix of the fitted model}
#' \item{\code{resDev}}{ the residual deviance of the fitted model}
#' \item{\code{AIC}}{ the AIC of the fitted model}
#' \item{\code{preds}}{ predictions from the fitted model}
#' \item{\code{linPreds}}{ linear predictors from the fitted model}
#' \item{\code{tol}}{ tolerance used to fit the model}
#' \item{\code{maxit}}{ maximum number of iterations used to fit the model}
#' \item{\code{formula}}{ formula used to fit the model}
#' \item{\code{method}}{ iterative method used to fit the model}
#' \item{\code{grads}}{ number of gradients used to approximate inverse information for L-BFGS}
#' \item{\code{y}}{ y vector used in the model, not included if \code{keepY = FALSE}}
#' \item{\code{x}}{ design matrix used to fit the model, not included if \code{keepData = FALSE}}
#' \item{\code{offset}}{ offset vector in the model, not included if \code{keepData = FALSE}}
#' \item{\code{data}}{ original dataframe supplied to the function, not included if \code{keepData = FALSE}}
#' \item{\code{mf}}{ the model frame, not included if \code{keepData = FALSE}}
#' \item{\code{numobs}}{ number of observations in the design matrix}
#' \item{\code{names}}{ names of the variables}
#' \item{\code{yname}}{ name of y variable}
#' \item{\code{parallel}}{ whether parallelization was employed to speed up the model fitting process}
#' \item{\code{missing}}{ number of missing values removed from the original dataset}
#' \item{\code{link}}{ link function used to model the data}
#' \item{\code{family}}{ family used to model the data}
#' \item{\code{ylevel}}{ the levels of y, only included for binomial glms}
#' \item{\code{xlev}}{ the levels of the factors in the dataset}
#' \item{\code{terms}}{the terms object used}
#' 
#' \code{BranchGLM.fit} returns a list with the following components
#' \item{\code{coefficients}}{ a matrix with the coefficient estimates, SEs, Wald test statistics, and p-values}
#' \item{\code{iterations}}{ number of iterations it took the algorithm to converge; if the algorithm failed to converge, then this is -1}
#' \item{\code{dispersion}}{ the value of the dispersion parameter}
#' \item{\code{logLik}}{ the log-likelihood of the fitted model}
#' \item{\code{vcov}}{ the variance-covariance matrix of the fitted model}
#' \item{\code{resDev}}{ the residual deviance of the fitted model}
#' \item{\code{AIC}}{ the AIC of the fitted model}
#' \item{\code{preds}}{ predictions from the fitted model}
#' \item{\code{linPreds}}{ linear predictors from the fitted model}
#' \item{\code{tol}}{ tolerance used to fit the model}
#' \item{\code{maxit}}{ maximum number of iterations used to fit the model}
#' 
#' @description Fits generalized linear models via RcppArmadillo. Also has the 
#' ability to fit the models with parallelization via OpenMP.
#' @details Can use BFGS, L-BFGS, or Fisher's scoring to fit the GLM. BFGS and L-BFGS are 
#' typically faster than Fisher's scoring when there are at least 50 covariates 
#' and Fisher's scoring is typically best when there are fewer than 50 covariates.
#' This function does not currently support the use of weights. In the special 
#' case of gaussian regression with identity link, the \code{method} argument is ignored
#' and the normal equations are solved directly.
#' 
#' The models are fit in C++ by using Rcpp and RcppArmadillo. In order to help 
#' convergence, each of the methods makes use of a backtracking line-search using 
#' the strong Wolfe conditions to find an adequate step size. There are also 
#' two conditions used to control convergence: the first is whether there is a 
#' sufficient decrease in the negative log-likelihood, and the other is whether 
#' the norm of the score is sufficiently small. The 
#' \code{tol} argument controls both of these criteria. If the algorithm fails to 
#' converge, then \code{iterations} will be -1.
#' 
#' All observations with any missing values are removed before model fitting. 
#' 
#' The dispersion parameter for gamma regression is estimated via maximum likelihood, 
#' in a manner very similar to the \code{gamma.dispersion} function from the MASS package.
#' 
#' \code{BranchGLM.fit} can be faster than calling \code{BranchGLM} if the 
#' x matrix and y vector are already available, but doesn't return as much information.
#' The object returned by \code{BranchGLM.fit} is not of class \code{BranchGLM}, so 
#' none of the methods for \code{BranchGLM} objects, such as \code{predict} or 
#' \code{VariableSelection}, can be used on it.
#' 
#' @examples
#' Data <- iris
#' ### Using BranchGLM
#' BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
#' 
#' ### Using BranchGLM.fit
#' x <- model.matrix(Sepal.Length ~ ., data = Data)
#' y <- Data$Sepal.Length
#' BranchGLM.fit(x, y, family = "gaussian", link = "identity")
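#' 
#' ### Additional illustrative calls using other documented arguments
#' ### (argument values here are chosen only for demonstration)
#' ## Gamma regression fit with L-BFGS
#' BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log", 
#'           method = "LBFGS", grads = 5)
#' 
#' ## Setting up the model object without fitting the coefficients
#' Fit0 <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", 
#'                   link = "identity", fit = FALSE)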
#' @export

BranchGLM <- function(formula, data, family, link, offset = NULL, 
                    method = "Fisher", grads = 10,
                    parallel = FALSE, nthreads = 8, 
                    tol = 1e-6, maxit = NULL, init = NULL, fit = TRUE, 
                    contrasts = NULL, keepData = TRUE,
                    keepY = TRUE){
  
  ### Validating supplied arguments
  if(!is(formula, "formula")){
    stop("formula must be a valid formula")
  }
  if(!is.data.frame(data)){
    stop("data must be a data frame")
  }
  if(length(method) != 1 || !(method %in% c("Fisher", "BFGS", "LBFGS"))){
    stop("method must be exactly one of 'Fisher', 'BFGS', or 'LBFGS'")
  }
  if(!family %in% c("gaussian", "binomial", "poisson", "gamma")){
    stop("family must be one of 'gaussian', 'binomial', 'gamma', or 'poisson'")
  }
  if(!link %in% c("logit", "probit", "cloglog", "log", "identity", "inverse", "sqrt")){
    stop("link must be one of 'logit', 'probit', 'cloglog', 'log', 'inverse', 'sqrt', or 'identity'")
  }
  if(length(grads) != 1 || !is.numeric(grads) || as.integer(grads) <= 0){
    stop("grads must be a positive integer")
  }
  
  ### Evaluating arguments
  mf <- match.call(expand.dots = FALSE)
  m <- match(c("formula", "data", "offset"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf$drop.unused.levels <- TRUE
  mf$na.action <- "na.omit"
  mf[[1L]] <- quote(model.frame)
  mf <- eval(mf, parent.frame())
  
  ## Getting data objects
  y <- model.response(mf, "any")
  offset <- as.vector(model.offset(mf))
  x <- model.matrix(attr(mf, "terms"), mf, contrasts)
  
  if(is.null(offset)){
    offset <- rep(0, length(y))
  }
  
  
  ## Checking y variable for binomial family
  if(family == "binomial"){
    if(!(link %in% c("cloglog", "log", "logit", "probit"))){
      stop("valid link functions for binomial regression are 'cloglog', 'log', 'logit', and 'probit'")
    }else if(is.factor(y) && (nlevels(y) == 2)){
      ylevel <- levels(y)
      y <- as.numeric(y == ylevel[2])
    }else if(is.numeric(y) && all(y %in% c(0, 1))){
      ylevel <- c(0, 1)
    }else if(is.logical(y)){
      ylevel <- c(FALSE, TRUE)
      y <- y * 1
    }else{
      stop("response variable for binomial regression must be numeric with only 
      0s and 1s, a two-level factor, or a logical vector")
    }
  }
  
  ## Getting maxit
  if(is.null(maxit)){
    if(method == "Fisher"){
      maxit <- 50
    }else{
      maxit <- 200
    }
  }
  
  ### Using BranchGLM.fit to fit GLM
  if(fit){
    df <- BranchGLM.fit(x, y, family, link, offset, method, grads, parallel, nthreads, 
                        init, maxit, tol)
  
    row.names(df$coefficients) <- colnames(x)
  }else{
    df <- list("coefficients" = NULL)
  }
  df$formula <- formula
  
  df$method <- method
  
  if(keepY){
    df$y <- y
  }
  
  df$numobs <- nrow(x)
  
  if(keepData){
    df$data <- data
    df$x <- x
    df$mf <- mf
    df$offset <- offset
  }
  df$names <- attributes(terms(formula, data = data))$factors |>
              colnames()
  
  df$yname <- attributes(terms(formula, data = data))$variables[-1] |>
              as.character()
  
  df$yname <- df$yname[attributes(terms(formula, data = data))$response]
  
  df$parallel <- parallel
  
  df$missing <- nrow(data) - nrow(x)
  
  df$link <- link
  
  df$contrasts <- contrasts
  
  df$family <- family
  
  df$terms <- attr(mf, "terms")
  
  df$xlev <- .getXlevels(df$terms, mf)
  
  df$grads <- grads
  
  df$tol <- tol
  
  df$maxit <- maxit
  
  if(family == "binomial"){
    df$ylevel <- ylevel
  }
  if((family == "gaussian" || family == "gamma") && fit){
    colnames(df$coefficients)[3] <- "t"
  }
  
  # Setting names for vcov
  if(fit){
    rownames(df$vcov) <- colnames(df$vcov) <- colnames(x)
  }
  
  structure(df, class = "BranchGLM")
}

#' @rdname BranchGLM
#' @export
BranchGLM.fit <- function(x, y, family, link, offset = NULL,
                          method = "Fisher", grads = 10,
                          parallel = FALSE, nthreads = 8, init = NULL,  
                          maxit = NULL, tol = 1e-6){
  
  ## Performing a few checks
  if(!is.matrix(x) || !is.numeric(x)){
    stop("x must be a numeric matrix")
  }else if(!is.numeric(y)){
    stop("y must be numeric")
  }else if(nrow(x) != length(y)){
    stop("the number of rows in x must be the same as the length of y")
  }
  
  ## Getting initial values
  if(is.null(init)){
    init <- rep(0, ncol(x))
    GetInit <- TRUE
  }else if(!is.numeric(init) || length(init) != ncol(x)){
    stop("init must be null or a numeric vector with length equal to the number of betas")
  }else{
    GetInit <- FALSE
  }
  
  ## Getting maxit
  if(is.null(maxit)){
    if(method == "Fisher"){
      maxit <- 50
    }else{
      maxit <- 200
    }
  }
  
  ## Checking y variable and link function for each family
  if(family == "binomial"){
    if(!(link %in% c("cloglog", "log", "logit", "probit"))){
      stop("valid link functions for binomial regression are 'cloglog', 'log', 'logit', and 'probit'")
    }else if(!all(y %in% c(0, 1))){
      stop("for binomial regression y must be a vector of 0s and 1s")
    }
  }else if(family == "poisson"){
    if(!(link %in% c("identity", "log", "sqrt"))){
      stop("valid link functions for poisson regression are 'identity', 'log', and 'sqrt'")
    }else if(!is.numeric(y) || any(y < 0)){
      stop("response variable for poisson regression must be a numeric vector of non-negative integers")
    }else if(any(as.integer(y)!= y)){
      stop("response variable for poisson regression must be a numeric vector of non-negative integers")
    }
  }else if(family == "gaussian"){
    if(!(link %in% c("inverse", "identity", "log", "sqrt"))){
      stop("valid link functions for gaussian regression are 'identity', 'inverse', 'log', and 'sqrt'")
    }else if(!is.numeric(y)){
      stop("response variable for gaussian regression must be numeric")
    }else if(link == "log" && any(y <= 0)){
      stop("gaussian regression with log link must have positive response values")
    }else if(link == "inverse" && any(y == 0)){
      stop("gaussian regression with inverse link must have non-zero response values")
    }else if(link == "sqrt" && any(y < 0)){
      stop("gaussian regression with sqrt link must have non-negative response values")
    }
  }else if(family == "gamma"){
    if(!(link %in% c("inverse", "identity", "log", "sqrt"))){
      stop("valid link functions for gamma regression are 'identity', 'inverse', 'log', and 'sqrt'")
    }else if(!is.numeric(y) || any(y <= 0)){
      stop("response variable for gamma regression must be positive")
    }
  }
  
  ## Getting offset
  if(is.null(offset)){
    offset <- rep(0, length(y))
  }else if(length(offset) != length(y)){
    stop("offset must be the same length as y")
  }
  
  if(length(parallel) != 1 || !is.logical(parallel)){
    stop("parallel must be either TRUE or FALSE")
  }else if(parallel){
    df <- BranchGLMfit(x, y, offset, init, method, grads, link, family, nthreads, 
                       tol, maxit, GetInit) 
  }else{
    df <- BranchGLMfit(x, y, offset, init, method, grads, link, family, 1, tol, maxit, 
                       GetInit) 
  }
  
  df$tol <- tol
  df$maxit <- maxit
  
  return(df)
}


#' Extract Log-Likelihood
#' @param object a \code{BranchGLM} object.
#' @param ... further arguments passed to or from other methods.
#' @return An object of class \code{logLik} which is a number corresponding to 
#' the log-likelihood with the following attributes: "df" (degrees of freedom) 
#' and "nobs" (number of observations).
#' @export

logLik.BranchGLM <- function(object, ...){
  df <- length(coef(object))
  if(object$family == "gaussian" || object$family == "gamma"){
    df <- df + 1
  }
  val <- object$logLik
  attr(val, "nobs") <- length(object$y) 
  attr(val, "df") <- df
  class(val) <- "logLik"
  return(val)
}

#' Extract covariance matrix
#' @param object a \code{BranchGLM} object.
#' @param ... further arguments passed to or from other methods.
#' @return A numeric matrix which is the covariance matrix of the beta coefficients.
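#' @examples
#' ### Illustrative example (gaussian fit to the iris data, as in the BranchGLM examples)
#' Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
#' vcov(Fit)
#' 
#' ## Standard errors of the coefficient estimates
#' sqrt(diag(vcov(Fit)))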
#' @export
vcov.BranchGLM <- function(object, ...){
  return(object$vcov)
}

#' Extract Coefficients
#' @param object a \code{BranchGLM} object.
#' @param ... further arguments passed to or from other methods.
#' @return A named vector with the corresponding coefficient estimates.
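#' @examples
#' ### Illustrative example (gaussian fit to the iris data, as in the BranchGLM examples)
#' Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
#' coef(Fit)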
#' @export

coef.BranchGLM <- function(object, ...){
  coefs <- object$coefficients[,1]
  names(coefs) <- row.names(object$coefficients)
  return(coefs)
}

#' Predict Method for BranchGLM Objects
#' @param object a \code{BranchGLM} object.
#' @param newdata a dataframe; if not specified, then the data the model was fit on is used.
#' @param type one of "linpreds" or "response"; if not specified, then "response" is used.
#' @param ... further arguments passed to or from other methods.
#' @details linpreds corresponds to the linear predictors and response is on the scale of the response variable.
#' Offset variables are ignored for predictions on new data.
#' @description Gets predictions from a \code{BranchGLM} object.
#' @return A numeric vector of predictions.
#' @examples
#' Data <- iris
#' Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")
#' predict(Fit)
#' ### Example with new data
#' predict(Fit, newdata = iris[1:20,])
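#' ### Getting the linear predictors instead of the fitted means
#' predict(Fit, type = "linpreds")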
#' @export

predict.BranchGLM <- function(object, newdata = NULL, type = "response", ...){
  
  if(!is.null(newdata) && !is(newdata,"data.frame")){
    stop("newdata argument must be a dataframe or NULL")
  }
  
  if(length(type) != 1 ){
    stop("type must have a length of 1")
  }else if(!(type %in% c("linpreds", "response"))){
    stop("type argument must be either 'linpreds' or 'response'")
  }
  
  if(is.null(newdata)){
    if(type == "linpreds"){
      object$linPreds
    }else if(type == "response"){
      object$preds
    }
  }else{
    myterms <- delete.response(terms(object))
    m <- model.frame(myterms, newdata, na.action = "na.omit",
                     xlev = object$xlev)
    x <- model.matrix(myterms, m, contrasts = object$contrasts)
    
    if(ncol(x) != length(coef(object))){
      stop("could not find all predictor variables in newdata")
    }else if(type == "linpreds"){
      drop(x %*% coef(object)) |> unname()
    }else if(type == "response"){
      GetPreds(drop(x %*% coef(object)) |> unname(), object$link)
    }
  }
}

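## Internal helper (not exported): maps linear predictors to the response scale 
## for the supplied link; the final else branch corresponds to the "sqrt" link.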
GetPreds <- function(XBeta, Link){
  if(Link == "log"){
    exp(XBeta)
  }
  else if(Link == "logit"){
    1 / (1 + exp(-XBeta))
  }
  else if(Link == "probit"){
    pnorm(XBeta)
  }
  else if(Link == "cloglog"){
    exp(-exp(XBeta))
  }
  else if(Link == "inverse"){
    - 1 / (XBeta)
  }
  else if(Link == "identity"){
    XBeta
  }
  else{
    XBeta^2
  }
}

#' Print Method for BranchGLM
#' @param x a \code{BranchGLM} object.
#' @param coefdigits number of digits to display for the coefficients table.
#' @param digits number of digits to display for the information printed after the table.
#' @param ... further arguments passed to or from other methods.
#' @return The supplied \code{BranchGLM} object.
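#' @examples
#' ### Illustrative example (gaussian fit to the iris data, as in the BranchGLM examples)
#' Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian", link = "identity")
#' print(Fit, coefdigits = 6)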
#' @export

print.BranchGLM <- function(x, coefdigits = 4, digits = 2, ...){
  if(length(coefdigits)!= 1  || !is.numeric(coefdigits) || coefdigits < 0){
    stop("coefdigits must be a non-negative number")
  }
  if(length(digits)!= 1  || !is.numeric(digits) || digits < 0){
    stop("coefdigits must be a non-negative number")
  }
  cat(paste0("Results from ", x$family, " regression with ", x$link, 
             " link function \nUsing the formula ", deparse1(x$formula), "\n\n"))
  
  printCoefmat(signif(x$coefficients, digits = coefdigits), signif.stars = TRUE, P.values = TRUE, 
               has.Pvalue = TRUE)
  
  cat(paste0("\nDispersion parameter taken to be ", round(x$dispersion, coefdigits)))
  cat(paste0("\n", x$numobs, " observations used to fit model\n(", x$missing, 
             " observations removed due to missingness)\n"))
  cat(paste0("\nResidual Deviance: ", round(x$resDev, digits = digits), " on ",
             x$numobs - nrow(x$coefficients), " degrees of freedom"))
  cat(paste0("\nAIC: ", round(x$AIC, digits = digits)))
  if(x$family != "gaussian" || x$link != "identity"){
    if(x$method == "Fisher"){
      method <- "Fisher's scoring"
    }else if(x$method == "LBFGS"){
      method <- "L-BFGS"
    }else{
      method <- "BFGS"
    }
    if(x$iterations == 1){
      cat(paste0("\nAlgorithm converged in 1 iteration using ", method, "\n"))
    }else if(x$iterations > 1){
      cat(paste0("\nAlgorithm converged in ", x$iterations, " iterations using ", method, "\n"))
    }else{
      cat("\nAlgorithm failed to converge\n")
    }
  }else{
    cat("\n")
  }
  
  if(x$parallel){
    cat("Parallel computation was used to speed up model fitting process")
  }
  invisible(x)
}
