VariableSelection: Variable Selection for GLMs

VariableSelection    R Documentation

Variable Selection for GLMs

Description

Performs forward selection, backward elimination, and efficient best subsets variable selection for generalized linear models, using an information criterion to compare models. Best subsets selection uses branch and bound algorithms, which can greatly reduce the number of models that must be fit.

Usage

VariableSelection(object, ...)

## S3 method for class 'formula'
VariableSelection(
  object,
  data,
  family,
  link,
  offset = NULL,
  method = "Fisher",
  type = "branch and bound",
  metric = "AIC",
  bestmodels = 1,
  cutoff = 0,
  keep = NULL,
  keepintercept = TRUE,
  maxsize = NULL,
  grads = 10,
  parallel = FALSE,
  nthreads = 8,
  tol = 1e-06,
  maxit = NULL,
  contrasts = NULL,
  showprogress = TRUE,
  ...
)

## S3 method for class 'BranchGLM'
VariableSelection(
  object,
  type = "branch and bound",
  metric = "AIC",
  bestmodels = 1,
  cutoff = 0,
  keep = NULL,
  keepintercept = TRUE,
  maxsize = NULL,
  parallel = FALSE,
  nthreads = 8,
  showprogress = TRUE,
  ...
)

Arguments

object

a formula or a BranchGLM object.

...

further arguments passed to other methods.

data

a data frame containing the response and predictor variables.

family

distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson".

link

link function relating the mean to the linear predictor. One of "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log".

offset

offset vector; by default the zero vector is used.

method

one of "Fisher", "BFGS", or "LBFGS". Fisher's scoring is recommended for forward selection and branch and bound methods since they will typically fit many models with a small number of covariates.

type

one of "forward", "backward", "branch and bound", "backward branch and bound", or "switch branch and bound" to indicate the type of variable selection to perform. The default value is "branch and bound". The branch and bound methods are guaranteed to find the best models according to the metric while "forward" and "backward" are heuristic approaches that may not find the optimal model.

metric

metric used to choose the best models. The default is "AIC", but "BIC" and "HQIC" are also available. AIC is the Akaike information criterion, BIC is the Bayesian information criterion, and HQIC is the Hannan-Quinn information criterion.

bestmodels

number of best models to find according to the chosen metric; the default is 1. This is only used for the branch and bound methods.

cutoff

a non-negative number; the function returns all models whose metric value is within cutoff of the best metric value. The default value is 0. Only one of cutoff or bestmodels should be specified. This is only used for the branch and bound methods.

keep

vector of names to denote variables that must be in the models.

keepintercept

whether to keep the intercept in all models, only used if an intercept is included in the formula.

maxsize

maximum number of variables to consider in a single model; the default is the total number of variables. This limit is in addition to any variables specified in keep. This argument is only used for type = "forward" and type = "branch and bound".

grads

number of gradients used to approximate the inverse information matrix; only used for method = "LBFGS".

parallel

one of TRUE or FALSE to indicate whether parallelization should be used.

nthreads

number of threads used with OpenMP, only used if parallel = TRUE.

tol

tolerance used to determine model convergence when fitting GLMs.

maxit

maximum number of iterations performed when fitting GLMs. The default for Fisher's scoring is 50 and for the other methods the default is 200.

contrasts

see contrasts.arg of model.matrix.default.

showprogress

whether to show progress updates for branch and bound methods.

Details

The supplied formula or the formula from the fitted model is treated as the upper model. The variables specified in keep, along with an intercept (if included in the formula and keepintercept = TRUE), form the lower model. Factor variables are either kept in their entirety or entirely removed.
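To make the upper and lower models concrete, the following base R sketch lists the candidate terms for the iris formula used in the Examples. It does not call BranchGLM. The names upper_terms, keep, and free are illustrative only, and the sketch assumes that candidate variables correspond to the formula's term labels, so a factor such as Species counts as one variable even though it expands to several design-matrix columns.

```r
# Upper model: every term in the supplied formula.
upper <- Sepal.Length ~ .
upper_terms <- attr(terms(upper, data = iris), "term.labels")

# Lower model: the kept variables (the intercept is also kept in this sketch).
keep <- "Petal.Width"

# Terms the search is free to include or exclude.
free <- setdiff(upper_terms, keep)

# Species is a single term but contributes two dummy columns.
species_cols <- ncol(model.matrix(~ Species, data = iris)) - 1

# Number of candidate models between the lower and upper models.
n_candidates <- 2^length(free)

print(upper_terms)
print(free)
print(n_candidates)
```

With three free terms there are 2^3 = 8 candidate models; each additional free term doubles that count, which is why exhaustive search becomes expensive quickly.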

The branch and bound method uses an efficient branch and bound algorithm to find the optimal models. It is guaranteed to find the best models according to the metric, can be much faster than an exhaustive search, and can be made faster still with parallel computation. The backward branch and bound method is very similar, except that it tends to be faster when the best models contain most of the variables. The switch branch and bound method combines the two and is typically the fastest of the three branch and bound methods.
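As a point of comparison, the sketch below runs the exhaustive search that branch and bound is meant to avoid, using only base R and stats::glm. It is not BranchGLM's algorithm. It also checks one generic bound that a branch and bound search can exploit: for any model nested between a lower and an upper model, the log-likelihood cannot exceed that of the upper model and the penalty cannot be smaller than that of the lower model. Whether BranchGLM uses exactly this bound is an assumption; the helper fit_subset and all variable names are hypothetical.

```r
# Illustrative exhaustive search in base R; this is not BranchGLM's algorithm.
candidate_terms <- c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")
n <- nrow(iris)

# Fit one candidate Gaussian GLM and return its BIC, log-likelihood, and
# parameter count.
fit_subset <- function(vars) {
  rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
  fit <- glm(as.formula(paste("Sepal.Length ~", rhs)), data = iris,
             family = gaussian(link = "identity"))
  ll <- logLik(fit)
  c(BIC = BIC(fit), logLik = as.numeric(ll), df = attr(ll, "df"))
}

# Enumerate all 2^4 = 16 subsets of the candidate terms.
masks <- expand.grid(rep(list(c(FALSE, TRUE)), length(candidate_terms)))
subsets <- lapply(seq_len(nrow(masks)),
                  function(i) candidate_terms[unlist(masks[i, ])])
results <- t(vapply(subsets, fit_subset, numeric(3)))

# The best model is the one with the smallest BIC.
best <- which.min(results[, "BIC"])
best_terms <- subsets[[best]]

# Lower bound on the BIC of every model between the empty and the full model:
# best achievable fit (full model) combined with the smallest penalty (empty model).
sizes <- lengths(subsets)
upper_ll <- results[sizes == length(candidate_terms), "logLik"]
lower_df <- results[sizes == 0, "df"]
bic_bound <- -2 * upper_ll + log(n) * lower_df

print(best_terms)
print(all(results[, "BIC"] >= bic_bound))
```

In a branch and bound search, a bound of this kind is computed for a whole group of models at once; if it is already worse than the best metric found so far, none of the models in that group need to be fit.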

Fisher's scoring is recommended for branch and bound selection and forward selection. L-BFGS may be faster for backward elimination, especially when there are many variables.
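For readers unfamiliar with Fisher's scoring, here is a minimal base R sketch for a Poisson model with a log link, checked against stats::glm. It is a textbook version, not BranchGLM's implementation. The function fisher_scoring is hypothetical, the tol and maxit defaults simply mirror the values documented above, and the stopping rule (largest coefficient change below tol) is an assumption.

```r
# Textbook Fisher scoring for Poisson regression with a log link.
# Assumes the first column of X is the intercept.
fisher_scoring <- function(X, y, tol = 1e-06, maxit = 50) {
  # Start at the intercept-only fit.
  beta <- c(log(mean(y)), rep(0, ncol(X) - 1))
  for (iter in seq_len(maxit)) {
    mu <- as.vector(exp(X %*% beta))   # fitted means
    score <- crossprod(X, y - mu)      # gradient of the log-likelihood
    info <- crossprod(X, X * mu)       # expected information X' W X, W = diag(mu)
    step <- as.vector(solve(info, score))
    beta <- beta + step
    # Assumed stopping rule: largest coefficient change below tol.
    if (max(abs(step)) < tol) break
  }
  list(coefficients = beta, iterations = iter)
}

X <- model.matrix(breaks ~ wool + tension, data = warpbreaks)
res <- fisher_scoring(X, warpbreaks$breaks)

# Reference fit from stats::glm for comparison.
ref <- glm(breaks ~ wool + tension, data = warpbreaks,
           family = poisson(link = "log"))
print(cbind(sketch = res$coefficients, glm = unname(coef(ref))))
```

Each iteration solves a linear system whose size equals the number of coefficients, which is cheap for the small models that forward selection and branch and bound typically fit.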

All observations that have any missing values in the upper model are removed.
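This matters because information criteria are only comparable across models fit to the same observations. The base R sketch below illustrates the effect with model.frame. It does not call BranchGLM, and the missing values are introduced artificially for illustration.

```r
# Introduce two artificial missing values in one predictor.
dat <- iris
dat$Petal.Width[c(3, 10)] <- NA

# Rows with a missing value in any upper-model variable are dropped up front.
mf <- model.frame(Sepal.Length ~ ., data = dat, na.action = na.omit)
n_used <- nrow(mf)

# A smaller model without Petal.Width, fit on its own, would keep every row.
n_sub <- nrow(model.frame(Sepal.Length ~ Sepal.Width, data = dat,
                          na.action = na.omit))

print(c(upper_model_rows = n_used, submodel_rows = n_sub))
```

Because rows are removed based on the upper model, every candidate model is compared on the same 148 rows here, including candidates that exclude Petal.Width.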

Value

A BranchGLMVS object, which is a list with the following components:

initmodel

the supplied BranchGLM object, or a fake BranchGLM object if a formula is supplied

numchecked

number of models fit

names

character vector of the names of the predictor variables

order

the order in which the variables were added to or removed from the model; this is not included for branch and bound selection

type

type of variable selection employed

metric

metric used to select best models

bestmodels

numeric matrix used to describe the best models

bestmetrics

numeric vector with the best metrics found in the search

cutoff

the supplied cutoff

keep

vector of which variables are kept through the selection process

Examples

Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian", link = "identity")

# Doing branch and bound selection
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
VS

## Getting summary of the process
Summ <- summary(VS)
Summ

## Plotting the BIC of the best models
plot(Summ, type = "b")

## Getting the best model according to BIC
FinalModel <- fit(Summ, which = 1)
FinalModel

# Now doing it in parallel (although it isn't necessary for this dataset)
parVS <- VariableSelection(Fit, type = "branch and bound", parallel = TRUE,
                           metric = "BIC", bestmodels = 10,
                           showprogress = FALSE)

## Getting the best model according to BIC
FinalModel <- fit(parVS, which = 1)
FinalModel

# Using a formula
formVS <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian",
                            link = "identity", metric = "BIC",
                            type = "branch and bound", bestmodels = 10,
                            showprogress = FALSE)

## Getting the best model according to BIC
FinalModel <- fit(formVS, which = 1)
FinalModel

# Using the keep argument
keepVS <- VariableSelection(Fit, type = "branch and bound", keep = "Petal.Width",
                            metric = "BIC", bestmodels = 5,
                            showprogress = FALSE)
keepVS

## Getting the fifth best model according to BIC when keeping Petal.Width in every model
FinalModel <- fit(keepVS, which = 5)
FinalModel


BranchGLM documentation built on Aug. 31, 2023, 5:17 p.m.