VariableSelection: Variable Selection for GLMs
In BranchGLM: Efficient Best Subset Selection for GLMs via Branch and Bound Algorithms

VariableSelection

R Documentation

Variable Selection for GLMs

Description

Performs forward selection, several variants of backward elimination, and efficient best subset selection for generalized linear models (GLMs), using an information criterion to compare models. Best subset selection uses branch and bound algorithms to greatly speed up the search, and backward elimination can optionally use bounding algorithms to speed it up.

Usage

VariableSelection(object, ...)

## S3 method for class 'formula'
VariableSelection(
  object,
  data = NULL,
  family,
  link,
  offset = NULL,
  method = "Fisher",
  type = "switch branch and bound",
  metric = "AIC",
  bestmodels = NULL,
  cutoff = NULL,
  keep = NULL,
  keepintercept = TRUE,
  maxsize = NULL,
  grads = 10,
  parallel = FALSE,
  nthreads = 8,
  tol = 1e-06,
  maxit = NULL,
  contrasts = NULL,
  showprogress = TRUE,
  ...
)

## S3 method for class 'BranchGLM'
VariableSelection(
  object,
  type = "switch branch and bound",
  metric = "AIC",
  bestmodels = NULL,
  cutoff = NULL,
  keep = NULL,
  keepintercept = TRUE,
  maxsize = NULL,
  parallel = FALSE,
  nthreads = 8,
  showprogress = TRUE,
  ...
)

Arguments

`object`	a formula or a `BranchGLM` object.
`...`	further arguments.
`data`	a data.frame, list or environment (or object coercible by as.data.frame to a data.frame), containing the variables in formula. Neither a matrix nor an array will be accepted. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `VariableSelection` is called.
`family`	the distribution used to model the data, one of "gaussian", "gamma", "binomial", or "poisson". A `family` object may also be supplied for one of the accepted distributions.
`link`	the link used to link the mean structure to the linear predictors. One of "identity", "logit", "probit", "cloglog", "sqrt", "inverse", or "log". This only needs to be supplied if a string is supplied for `family`. If a family object is supplied for the `family` argument, then the link function is taken from that family object.
`offset`	the offset vector; by default the zero vector is used.
`method`	one of "Fisher", "BFGS", or "LBFGS". Fisher's scoring is recommended for forward selection and the branch and bound algorithms since they will typically fit many models with a small number of covariates.
`type`	one of "forward", "backward", "fast backward", "double backward", "fast double backward", "branch and bound", "backward branch and bound", or "switch branch and bound" to indicate the type of variable selection to perform. The default value is "switch branch and bound". See the Details section for more about these algorithms.
`metric`	the metric used to choose the best models. The default is "AIC", but "BIC" and "HQIC" are also available. AIC is the Akaike information criterion, BIC is the Bayesian information criterion, and HQIC is the Hannan-Quinn information criterion.
`bestmodels`	a positive integer to indicate the number of the best models to find according to the chosen metric or NULL. If this is NULL, then cutoff is used instead. This is only used for the branch and bound algorithms.
`cutoff`	a non-negative number which indicates that the function should return all models that have a metric value within cutoff of the best metric value or NULL. Only one of this or bestmodels should be specified and when both are NULL a cutoff of 0 is used. This is only used for the branch and bound algorithms.
`keep`	a character vector of names to denote variables that must be in the models.
`keepintercept`	a logical value to indicate whether to keep the intercept in all models; only used if an intercept is included in the formula.
`maxsize`	a positive integer to denote the maximum number of variables to consider in a single model; the default is the total number of variables. This number adds onto any variables specified in keep. This argument only works for `type = "forward"` and `type = "branch and bound"`. This argument is now deprecated.
`grads`	a positive integer to denote the number of gradients used to approximate the inverse information with, only for `method = "LBFGS"`.
`parallel`	a logical value to indicate if parallelization should be used.
`nthreads`	a positive integer to denote the number of threads used with OpenMP, only used if `parallel = TRUE`.
`tol`	a positive number to denote the tolerance used to determine model convergence.
`maxit`	a positive integer to denote the maximum number of iterations performed. The default for Fisher's scoring is 50 and for the other methods the default is 200.
`contrasts`	see `contrasts.arg` of `model.matrix.default`.
`showprogress`	a logical value to indicate whether to show progress updates for branch and bound algorithms.
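
The difference between `bestmodels` and `cutoff` can be illustrated with a minimal sketch on the iris data used in the Examples section. The values 3 and 2 are arbitrary choices for illustration, not recommendations.

```r
library(BranchGLM)

# Upper model: every other iris column as a predictor
Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian",
                 link = "identity")

# Ask for a fixed number of models (here the 3 best according to BIC)
topVS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                           bestmodels = 3, showprogress = FALSE)

# Ask instead for every model whose BIC is within 2 of the best BIC
cutVS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                           cutoff = 2, showprogress = FALSE)

# Metric values of the models returned by each request
topVS$bestmetrics
cutVS$bestmetrics
```

With `cutoff`, the number of returned models is not known in advance; it depends on how many models fall within the requested distance of the best one.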

Details

Variable Selection Details

The supplied formula or the formula from the fitted model is treated as the upper model. The variables specified in keep, along with an intercept (if included in formula and keepintercept = TRUE), form the lower model. Factor variables are either kept in their entirety or entirely removed, and interaction terms are properly handled. All observations that have any missing values in the upper model are removed.

Stepwise Methods

Five stepwise variable selection algorithms are available: forward selection, backward elimination, fast backward elimination, double backward elimination, and fast double backward elimination. All of these are heuristic algorithms, so the best model found by them may not be the optimal model.
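
As a rough check of the heuristic nature of the stepwise methods, the sketch below runs each stepwise type on the iris data and compares the best AIC reached with the one found by an exact branch and bound search. The dataset and the choice of AIC are assumptions made for illustration; on a problem this small the stepwise methods may well reach the optimum.

```r
library(BranchGLM)

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian",
                 link = "identity")

# The five stepwise types accepted by the `type` argument
stepTypes <- c("forward", "backward", "fast backward",
               "double backward", "fast double backward")

# Best (lowest) AIC along the path taken by each stepwise algorithm
stepBest <- sapply(stepTypes, function(tp) {
  vs <- VariableSelection(Fit, type = tp, metric = "AIC", showprogress = FALSE)
  min(vs$bestmetrics)
})

# Exact search for reference
exactVS <- VariableSelection(Fit, type = "branch and bound", metric = "AIC",
                             showprogress = FALSE)
exactBest <- min(exactVS$bestmetrics)

# Gap between each heuristic result and the exact optimum (0 means they agree)
stepBest - exactBest
```

A positive gap for a given type would mean that the heuristic stopped at a model that is not the best one under the chosen metric.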

Backward Elimination

Fast backward elimination should give the same results as backward elimination, but it uses the bounding techniques from the branch and bound algorithms to run faster. The two can give slightly different results if the GLM solver has difficulty fitting some of the larger models.

Double Backward Elimination

Double backward elimination and fast double backward elimination are variants of backward elimination in which up to 2 variables can be removed in one step instead of just 1. This typically results in higher quality models, but can also be much slower. The bounding algorithm used in fast double backward elimination makes it much faster.

Branch and Bound Methods

The branch and bound algorithm is an efficient algorithm used to find the optimal models. The backward branch and bound algorithm is very similar to the branch and bound algorithm, except it tends to be faster when the best models contain most of the variables. The switch branch and bound algorithm is a combination of the two algorithms and is typically the fastest of the 3 branch and bound algorithms. All of the branch and bound algorithms are guaranteed to find the optimal models (up to numerical precision).
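
The following sketch, again assuming the iris data and BIC purely for illustration, runs the three branch and bound types and compares the best metric value each one finds along with the number of models each had to fit.

```r
library(BranchGLM)

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian",
                 link = "identity")

bbTypes <- c("branch and bound", "backward branch and bound",
             "switch branch and bound")

# Run the same search with each of the three exact algorithms
results <- lapply(bbTypes, function(tp) {
  VariableSelection(Fit, type = tp, metric = "BIC", showprogress = FALSE)
})
names(results) <- bbTypes

# Best BIC found by each algorithm; these should agree up to numerical precision
bestBIC <- sapply(results, function(vs) min(vs$bestmetrics))
bestBIC

# Number of models fit by each algorithm
sapply(results, function(vs) vs$numchecked)
```

With only four candidate variables the differences in `numchecked` are small; any speed advantage of one type over another would only become visible on larger problems.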

GLM Fitting

Fisher's scoring is recommended for branch and bound selection and forward selection. L-BFGS may be faster for the backward elimination and double backward elimination algorithms, especially when there are many variables.
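
A minimal sketch of choosing the fitting method is shown below. The `method` argument is only available in the formula method, so a formula is passed directly. The iris data has too few variables to show any speed difference; the example only illustrates the calls.

```r
library(BranchGLM)

# Backward elimination with L-BFGS as the GLM fitting method
lbfgsVS <- VariableSelection(Sepal.Length ~ ., data = iris,
                             family = "gaussian", link = "identity",
                             method = "LBFGS", grads = 10,
                             type = "backward", metric = "AIC",
                             showprogress = FALSE)

# The same search with Fisher's scoring, for comparison
fisherVS <- VariableSelection(Sepal.Length ~ ., data = iris,
                              family = "gaussian", link = "identity",
                              method = "Fisher",
                              type = "backward", metric = "AIC",
                              showprogress = FALSE)

# Best AIC reached with each fitting method
min(lbfgsVS$bestmetrics)
min(fisherVS$bestmetrics)
```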

Value

A BranchGLMVS object, which is a list with the following components:

`initmodel`	the `BranchGLM` object corresponding to the upper model
`numchecked`	number of models fit
`names`	character vector of the names of the predictor variables
`order`	the order in which the variables were added to or removed from the model; this is only included for the stepwise algorithms
`type`	type of variable selection employed
`optType`	whether the type specified used a heuristic or exact algorithm
`metric`	metric used to select the best models
`bestmodels`	numeric matrix used to describe the best models for the branch and bound algorithms or a numeric matrix describing the models along the path taken for stepwise algorithms
`bestmetrics`	numeric vector with the best metrics found in the search for the branch and bound algorithms or a numeric vector with the metric values along the path taken for stepwise algorithms
`beta`	numeric matrix of beta coefficients for the models in bestmodels
`cutoff`	the cutoff that was used; this is set to -1 if bestmodels was used instead or if a stepwise algorithm was used
`keep`	vector of which variables were kept through the selection process
`keepintercept`	a logical value denoting whether the intercept was kept through the selection process
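
Because the returned object is a list, its components can be inspected directly with `$`. The sketch below assumes the iris fit from the Examples section and an arbitrary `bestmodels = 3`.

```r
library(BranchGLM)

Fit <- BranchGLM(Sepal.Length ~ ., data = iris, family = "gaussian",
                 link = "identity")
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 3, showprogress = FALSE)

class(VS)       # class of the returned object
VS$type         # type of variable selection employed
VS$metric       # metric used to select the best models
VS$numchecked   # number of models fit
VS$names        # names of the predictor variables
VS$bestmetrics  # metric values of the best models
dim(VS$beta)    # dimensions of the coefficient matrix for the best models
```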

Examples

library(BranchGLM)

Data <- iris
Fit <- BranchGLM(Sepal.Length ~ ., data = Data, family = "gaussian",
                 link = "identity")

# Doing branch and bound selection 
VS <- VariableSelection(Fit, type = "branch and bound", metric = "BIC",
                        bestmodels = 10, showprogress = FALSE)
VS

## Plotting the BIC of the best models
plot(VS, type = "b")

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(VS, which = 1)
FinalModel

# Now doing it in parallel (although it isn't necessary for this dataset)
parVS <- VariableSelection(Fit, type = "branch and bound", parallel = TRUE,
                           metric = "BIC", bestmodels = 10,
                           showprogress = FALSE)

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(parVS, which = 1)
FinalModel

# Using a formula
formVS <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian",
                            link = "identity", metric = "BIC",
                            type = "branch and bound", bestmodels = 10,
                            showprogress = FALSE)

## Getting the coefficients of the best model according to BIC
FinalModel <- coef(formVS, which = 1)
FinalModel

# Using the keep argument
keepVS <- VariableSelection(Fit, type = "branch and bound",
                            keep = c("Species", "Petal.Width"), metric = "BIC",
                            bestmodels = 4, showprogress = FALSE)
keepVS

## Getting the coefficients from the fourth best model according to BIC when 
## keeping Petal.Width and Species in every model
FinalModel <- coef(keepVS, which = 4)
FinalModel

# Treating the beta parameters of a categorical variable separately
## VariableSelection automatically groups together the parameters of a
## categorical variable; to avoid this, create the indicator variables yourself
x <- model.matrix(Sepal.Length ~ ., data = iris)
Sepal.Length <- iris$Sepal.Length
Data <- cbind.data.frame(Sepal.Length, x[, -1])
VSCat <- VariableSelection(Sepal.Length ~ ., data = Data, family = "gaussian",
                           link = "identity", metric = "BIC", bestmodels = 10,
                           showprogress = FALSE)
VSCat

## Plotting results
plot(VSCat, cex.names = 0.75)

BranchGLM documentation built on Sept. 28, 2024, 9:07 a.m.
