BranchGLM: Fits GLMs
In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

Description

Fits generalized linear models (GLMs) via RcppArmadillo with the ability to perform some computation in parallel with OpenMP.

Usage

BranchGLM(
  formula,
  data = NULL,
  family,
  link,
  offset = NULL,
  method = "Fisher",
  grads = 10,
  parallel = FALSE,
  nthreads = 8,
  tol = 1e-06,
  maxit = NULL,
  init = NULL,
  fit = TRUE,
  contrasts = NULL,
  keepData = TRUE,
  keepY = TRUE
)

BranchGLM.fit(
  x,
  y,
  family,
  link,
  offset = NULL,
  method = "Fisher",
  grads = 10,
  parallel = FALSE,
  nthreads = 8,
  init = NULL,
  maxit = NULL,
  tol = 1e-06
)

Arguments

`formula`	a formula for the model.
`data`	an optional data.frame, list or environment (or object coercible by as.data.frame to a data.frame), containing the variables in formula. Neither a matrix nor an array will be accepted. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `BranchGLM` is called.
`family`	the distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson". A `family` object may also be supplied for one of the accepted distributions.
`link`	the link used to link the mean structure to the linear predictors. One of "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". The accepted links depend on the specified family; see Details. This only needs to be supplied if a string is supplied for `family`. If a `family` object is supplied for the `family` argument, then the link function is taken from that `family` object.
`offset`	the offset vector. By default the zero vector is used.
`method`	one of "Fisher", "BFGS", or "LBFGS". BFGS and L-BFGS are quasi-Newton methods which are typically faster than Fisher's scoring when there are many covariates (at least 50).
`grads`	a positive integer to denote the number of gradients used to approximate the inverse information. Only used when `method = "LBFGS"`.
`parallel`	a logical value to indicate if parallelization should be used.
`nthreads`	a positive integer to denote the number of threads used with OpenMP, only used if `parallel = TRUE`.
`tol`	a positive number to denote the tolerance used to determine model convergence.
`maxit`	a positive integer to denote the maximum number of iterations performed. The default for Fisher's scoring is 50 and for the other methods the default is 200.
`init`	a numeric vector of initial values for the betas. If not specified, they are automatically selected via linear regression with the transformation specified by the link function. This is ignored for linear regression models.
`fit`	a logical value to indicate whether to fit the model or not.
`contrasts`	see `contrasts.arg` of `model.matrix.default`.
`keepData`	a logical value to indicate whether or not to store a copy of data and the design matrix, the default is TRUE. If this is FALSE, then the results from this cannot be used inside of `VariableSelection`.
`keepY`	a logical value to indicate whether or not to store a copy of y, the default is TRUE. If this is FALSE, then the binomial GLM helper functions may not work and this cannot be used inside of `VariableSelection`.
`x`	design matrix used for the fit, must be numeric.
`y`	outcome vector, must be numeric.
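
As an illustration of the `family` and `link` arguments described above, the following sketch fits the same logistic regression twice, once with strings and once with a `family` object. It assumes the `ToothGrowth` dataset from the datasets package and that the BranchGLM package is installed; the object names `fit_str` and `fit_obj` are only for illustration.

```r
library(BranchGLM)

# Family and link supplied as strings
fit_str <- BranchGLM(supp ~ ., data = ToothGrowth,
                     family = "binomial", link = "logit")

# Same model with a family object; the link is taken from the object,
# so the link argument is not supplied
fit_obj <- BranchGLM(supp ~ ., data = ToothGrowth,
                     family = binomial(link = "logit"))

# The two fits are expected to agree
all.equal(fit_str$coefficients, fit_obj$coefficients)
```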

Details

Fitting

The GLM can be fit with BFGS, L-BFGS, or Fisher's scoring. BFGS and L-BFGS are typically faster than Fisher's scoring when there are at least 50 covariates, and Fisher's scoring is typically best when there are fewer than 50 covariates. This function does not currently support the use of weights. In the special case of gaussian regression with the identity link, the method argument is ignored and the normal equations are solved directly.
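
For the gaussian case with identity link, solving the normal equations can be sketched in base R as follows. This is only an illustration of the idea using `solve` and `crossprod`, not the package's C++ code; `beta_ne` and `beta_lm` are names chosen for this example.

```r
# Design matrix and response for the iris linear regression
x <- model.matrix(Sepal.Length ~ ., data = iris)
y <- iris$Sepal.Length

# Solve the normal equations (X'X) beta = X'y directly
beta_ne <- drop(solve(crossprod(x), crossprod(x, y)))

# Reference fit from lm(), which uses a QR decomposition instead
beta_lm <- coef(lm(Sepal.Length ~ ., data = iris))

# Compare the two sets of estimates
all.equal(unname(beta_ne), unname(beta_lm))
```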

The models are fit in C++ by using Rcpp and RcppArmadillo. To help convergence, each of the methods uses a backtracking line search with the strong Wolfe conditions to find an adequate step size. Three conditions are used to determine convergence: whether there is a sufficient decrease in the negative log-likelihood, whether the l2-norm of the score is sufficiently small, and whether the change in each of the beta coefficients is sufficiently small. The tol argument controls all of these criteria. If the algorithm fails to converge, then iterations will be -1.
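
The fitting loop can be sketched in base R for Poisson regression with a log link. This is a simplified illustration and not the package's implementation: the function name `fisher_poisson` is hypothetical, plain step halving stands in for the strong Wolfe line search, the starting values are cruder than the defaults described for `init`, and stopping as soon as any one of the three criteria holds is an assumption.

```r
# Illustrative Fisher scoring for Poisson regression with a log link
fisher_poisson <- function(x, y, tol = 1e-6, maxit = 50) {
  # Negative log-likelihood, dropping the constant term sum(lgamma(y + 1))
  negll <- function(b) {
    eta <- drop(x %*% b)
    sum(exp(eta) - y * eta)
  }
  # Crude starting values; assumes the first column of x is the intercept
  beta <- rep(0, ncol(x))
  beta[1] <- log(mean(y))
  for (iter in seq_len(maxit)) {
    mu    <- exp(drop(x %*% beta))
    score <- crossprod(x, y - mu)        # gradient of the log-likelihood
    info  <- crossprod(x, x * mu)        # Fisher information X'WX
    dir   <- drop(solve(info, score))    # Fisher scoring direction
    # Backtracking: halve the step until the objective does not increase
    old  <- negll(beta)
    step <- 1
    while (negll(beta + step * dir) > old && step > 1e-8) step <- step / 2
    new_beta <- beta + step * dir
    # The three convergence checks, all controlled by tol
    small_decrease <- abs(old - negll(new_beta)) < tol
    small_score    <- sqrt(sum(score^2)) < tol
    small_change   <- max(abs(new_beta - beta)) < tol
    beta <- new_beta
    if (small_decrease || small_score || small_change) {
      return(list(coefficients = beta, iterations = iter))
    }
  }
  # Mirror the convention of reporting -1 when convergence fails
  list(coefficients = beta, iterations = -1)
}

# Usage on the warpbreaks data, compared against stats::glm
x <- model.matrix(breaks ~ wool + tension, data = warpbreaks)
y <- warpbreaks$breaks
fit <- fisher_poisson(x, y)
ref <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)
cbind(sketch = fit$coefficients, glm = coef(ref))
```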

All observations with any missing values are removed before model fitting.

BranchGLM.fit can be faster than calling BranchGLM if the x matrix and y vector are already available, but it doesn't return as much information. The object returned by BranchGLM.fit is not of class BranchGLM, so none of the methods for BranchGLM objects, such as predict or VariableSelection, can be used with it.

Dispersion Parameter

The dispersion parameter for binomial and Poisson regression is always fixed to be 1. For gaussian and gamma regression, the MLE of the dispersion parameter is used for the calculation of the log-likelihood and the Pearson estimator of the dispersion parameter is used for the calculation of standard errors for the coefficient estimates.
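
The difference between the two estimators is easiest to see for gaussian regression with identity link, where both have closed forms. The sketch below uses the standard textbook definitions in base R; that the package's `dispersion` values coincide with these is an assumption. For gamma regression the MLE has no closed form and is not shown.

```r
fit <- lm(Sepal.Length ~ ., data = iris)

rss <- sum(residuals(fit)^2)   # residual sum of squares
n   <- nrow(iris)              # number of observations
p   <- length(coef(fit))       # number of estimated coefficients

# Maximum likelihood estimate of the dispersion (error variance)
disp_mle <- rss / n

# Pearson estimator: squared Pearson residuals over residual degrees of freedom
disp_pearson <- rss / (n - p)

c(MLE = disp_mle, Pearson = disp_pearson)
```

The Pearson estimator divides by the residual degrees of freedom rather than the sample size, so it is always the larger of the two.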

Families and Links

The binomial family accepts "cloglog", "log", "logit", and "probit" as possible link functions. The gamma and gaussian families accept "identity", "inverse", "log", and "sqrt" as possible link functions. The Poisson family accepts "identity", "log", and "sqrt" as possible link functions.
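
The accepted combinations can be written as a small lookup table. The list `accepted_links` and the helper `is_accepted_link` are hypothetical names for this sketch and are not part of the package.

```r
# Accepted links per family, transcribed from the paragraph above
accepted_links <- list(
  binomial = c("cloglog", "log", "logit", "probit"),
  gamma    = c("identity", "inverse", "log", "sqrt"),
  gaussian = c("identity", "inverse", "log", "sqrt"),
  poisson  = c("identity", "log", "sqrt")
)

# Hypothetical helper: TRUE if the family/link pair is listed above
is_accepted_link <- function(family, link) {
  family %in% names(accepted_links) && link %in% accepted_links[[family]]
}

is_accepted_link("poisson", "log")    # TRUE
is_accepted_link("poisson", "logit")  # FALSE
```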

Value

BranchGLM returns a BranchGLM object, which is a list with the following components:

`coefficients`	a data.frame with the coefficient estimates, SEs, Wald test statistics, and p-values
`iterations`	number of iterations it took the algorithm to converge; if the algorithm failed to converge, this is -1
`dispersion`	a vector of length 2 with the MLE of the dispersion parameter first and the Pearson estimator of the dispersion parameter second
`logLik`	the log-likelihood of the fitted model
`vcov`	the variance-covariance matrix of the fitted model
`resDev`	the residual deviance of the fitted model
`AIC`	the AIC of the fitted model
`preds`	predictions from the fitted model
`linpreds`	linear predictors from the fitted model
`residuals`	a numeric vector with the Pearson residuals
`variance`	a numeric vector with the variance evaluated at the final coefficient estimates
`tol`	tolerance used to fit the model
`maxit`	maximum number of iterations used to fit the model
`formula`	formula used to fit the model
`method`	iterative method used to fit the model
`grads`	number of gradients used to approximate inverse information for L-BFGS
`y`	y vector used in the model, not included if `keepY = FALSE`
`x`	design matrix used to fit the model, not included if `keepData = FALSE`
`rownames`	rownames taken from the design matrix
`offset`	offset vector in the model, not included if `keepData = FALSE`
`fulloffset`	supplied offset vector, not included if `keepData = FALSE`
`data`	original `data` argument supplied to the function, not included if `keepData = FALSE`
`mf`	the model frame, not included if `keepData = FALSE`
`numobs`	number of observations in the design matrix
`names`	names of the predictor variables
`yname`	name of y variable
`parallel`	whether parallelization was employed to speed up model fitting process
`missing`	number of missing values removed from the original dataset
`link`	link function used to model the data
`family`	family used to model the data
`ylevel`	the levels of y, only included for binomial glms
`xlev`	the levels of the factors in the dataset
`terms`	the terms object used

BranchGLM.fit returns a list with the following components:

`coefficients`	a data.frame with the coefficient estimates, SEs, Wald test statistics, and p-values
`iterations`	number of iterations it took the algorithm to converge; if the algorithm failed to converge, this is -1
`dispersion`	a vector of length 2 with the MLE of the dispersion parameter first and the Pearson estimator of the dispersion parameter second
`logLik`	the log-likelihood of the fitted model
`vcov`	the variance-covariance matrix of the fitted model
`resDev`	the residual deviance of the fitted model
`AIC`	the AIC of the fitted model
`preds`	predictions from the fitted model
`linpreds`	linear predictors from the fitted model
`residuals`	a numeric vector with the Pearson residuals
`variance`	a numeric vector with the variance evaluated at the final coefficient estimates
`tol`	tolerance used to fit the model
`maxit`	maximum number of iterations used to fit the model
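
Since a failed fit is signalled through `iterations` rather than an error, it can be worth checking that component before using the estimates. A minimal sketch, assuming the BranchGLM package is installed and using the gamma model from the Examples:

```r
library(BranchGLM)

fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gamma", link = "log")

if (fit$iterations == -1) {
  # Non-convergence is reported as -1
  warning("Model did not converge; consider increasing maxit or changing method")
} else {
  print(fit$coefficients)  # estimates, SEs, Wald test statistics, p-values
  print(fit$dispersion)    # MLE first, Pearson estimator second
}
```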

References

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.

Examples

Data <- iris

# Linear regression
## Using BranchGLM
BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")

## Using BranchGLM.fit
x <- model.matrix(Sepal.Length ~ ., data = Data)
y <- Data$Sepal.Length
BranchGLM.fit(x, y, family = "gaussian", link = "identity")

# Gamma regression
## Using BranchGLM
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log")

### init
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL)

### method
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL,
          method = "LBFGS")

### offset
BranchGLM(Sepal.Length ~ ., data = Data, family = "gamma", link = "log",
          init = rep(0, 6), maxit = 50, tol = 1e-6, contrasts = NULL,
          offset = Data$Sepal.Width)

## Using BranchGLM.fit
x <- model.matrix(Sepal.Length ~ ., data = Data)
y <- Data$Sepal.Length
BranchGLM.fit(x, y, family = "gamma", link = "log", init = rep(0, 6),
              maxit = 50, tol = 1e-6, offset = Data$Sepal.Width)

BranchGLM documentation built on Sept. 28, 2024, 9:07 a.m.