beset_glm: Best Subset Selection for Generalized Linear Models

Description

beset_glm performs best subset selection using repeated cross-validation to find the optimal number of predictors for several families of generalized linear models.

Usage

beset_glm(
  form,
  data,
  family = "gaussian",
  link = NULL,
  p_max = 10,
  force_in = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  contrasts = NULL,
  offset = NULL,
  weights = NULL,
  start = NULL,
  etastart = NULL,
  mustart = NULL,
  epsilon = 1e-08,
  maxit = 25,
  skinny = FALSE,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)

beset_lm(
  form,
  data,
  p_max = 10,
  force_in = NULL,
  weights = NULL,
  contrasts = NULL,
  offset = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)

Arguments

form

A model formula.

data

Either a data_partition object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training and cross-validation.

family

Character string naming the error distribution to be used in the model. Available families are listed under 'List of available families and link functions'.

link

(Optional) character string naming the link function to be used in the model. Available links and their defaults differ by family and are listed under 'List of available families and link functions'.

p_max

Maximum number of predictors to attempt to fit. Default is 10.

force_in

(Optional) character vector containing the names of any predictor variables that should be included in every model. (Note that if there is an intercept, it is forced into every model by default.)

nest_cv

Logical value indicating whether to perform nested cross-validation. If nest_cv = TRUE, the cross-validation used to select the best model is nested within a cross-validation used to estimate prediction error on a new sample, thus providing an estimate of test error that is free from potential selection bias. Because this multiplies compute time by the number of folds, the default is FALSE. Note that setting this parameter to TRUE will provide more informative summary output regarding the uncertainty in the selection procedure itself, i.e., how often a given model is chosen as "best" according to the given criteria, and is necessary in order for the returned objects to work with certain beset methods, such as compare and importance.

n_folds

Integer indicating the number of folds to use for cross-validation.

n_reps

Integer indicating the number of times cross-validation should be repeated (with different randomized fold assignments).

seed

Integer used to seed the random number generator when assigning observations to folds.

contrasts

(Optional) list. See the contrasts.arg of model.matrix.default.

offset

(Optional) vector of length nobs specifying an a priori known component that will be added to the linear predictor before applying the link function. Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is NULL.

weights

(Optional) numeric vector of prior weights placed on the observations during model fitting. Default is NULL.

start

(Optional) starting values for the parameters in the linear predictor.

etastart

(Optional) starting values for the linear predictor.

mustart

(Optional) starting values for the vector of means.

epsilon

Numeric value of positive convergence tolerance ε; the iterations converge when |dev - dev_{old}|/(|dev| + 0.1) < ε. Default is 1e-8.

maxit

Integer giving the maximal number of IWLS iterations. Default is 25.

skinny

Logical value indicating whether or not to return a "skinny" model. If FALSE (the default), the return object will include a copy of the model terms, data, contrasts, and a record of the xlevels of the factors used in fitting. If these features are not needed, setting skinny = TRUE will prevent these copies from being made.

n_cores

Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by detectCores. Set to 1 to disable parallel processing.

parallel_type

(Optional) character string indicating the type of parallel operation to be used, either "fork" or "sock". If omitted and n_cores > 1, the default is "sock" for Windows and otherwise either "fork" or "sock" depending on which process is being run.

cl

(Optional) parallel or snow cluster for use if parallel_type = "sock". If not supplied, a cluster on the local machine is automatically created.
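
As an illustration of the offset argument described above, the following sketch fits a Poisson rate model with log exposure as the offset. It uses stats::glm directly on simulated data; the variable names (exposure, x, y) and the simulated coefficients are made up for the example, and it is only an assumption that beset_glm handles the offset the same way as glm.

```r
# Simulated count data where each observation has a different exposure time.
set.seed(42)
n <- 200
exposure <- runif(n, min = 0.5, max = 5)
x <- rnorm(n)

# True model: expected count = exposure * exp(0.3 + 0.5 * x)
y <- rpois(n, lambda = exposure * exp(0.3 + 0.5 * x))

# log(exposure) enters the linear predictor with a fixed coefficient of 1,
# so the remaining coefficients describe the event rate per unit exposure.
fit <- glm(y ~ x, family = poisson(link = "log"), offset = log(exposure))
print(coef(fit))
```

Because the offset is a known quantity, no coefficient is estimated for it; only the intercept and the slope for x appear in the output.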

Details

beset_glm performs best subset selection for generalized linear models, fitting a separate model for each possible combination of predictors (all models that contain exactly 1 predictor, all models that contain exactly 2 predictors, and so forth). For each number of predictors, beset_glm first picks the model with the best fit and then estimates how well this model predicts new data using k-fold cross-validation (how well, on average, a model trained using k - 1 folds predicts the left-out fold).
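
The two-step procedure described above can be sketched in base R. This is a minimal illustration on the swiss data with a Gaussian family, not the package's implementation: the helper cv_mse and the use of deviance as the measure of fit are assumptions made for the example, and it uses a single unstratified round of cross-validation rather than repeated, stratified folds.

```r
# Illustrative sketch only; not the beset implementation.
data(swiss)
response <- "Fertility"
predictors <- setdiff(names(swiss), response)

# Step 1: for each subset size, fit every combination of predictors and
# keep the combination with the lowest deviance.
best_by_size <- lapply(seq_along(predictors), function(p) {
  combos <- combn(predictors, p, simplify = FALSE)
  devs <- vapply(combos, function(vars) {
    deviance(glm(reformulate(vars, response = response),
                 data = swiss, family = gaussian()))
  }, numeric(1))
  combos[[which.min(devs)]]
})

# Step 2: estimate how well each size's best model predicts new data,
# training on k - 1 folds and predicting the left-out fold.
cv_mse <- function(vars, data, k = 10, seed = 42) {
  set.seed(seed)
  fold <- sample(rep_len(seq_len(k), nrow(data)))
  form <- reformulate(vars, response = response)
  sq_err <- numeric(nrow(data))
  for (i in seq_len(k)) {
    fit <- glm(form, data = data[fold != i, ], family = gaussian())
    pred <- predict(fit, newdata = data[fold == i, ], type = "response")
    sq_err[fold == i] <- (data[[response]][fold == i] - pred)^2
  }
  mean(sq_err)
}

cv_results <- data.frame(
  n_pred = seq_along(best_by_size),
  form = vapply(best_by_size, paste, character(1), collapse = " + "),
  cv_mse = vapply(best_by_size, cv_mse, numeric(1), data = swiss)
)
print(cv_results)
```

With 5 candidate predictors this fits 31 models in step 1; the count doubles with every added predictor, which is why p_max caps the search.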

Value

A "beset_glm" object with the following components:

stats

a list with three data frames:

fit

statistics for every possible combination of predictors:

n_pred

the total number of predictors in model; note that the number of predictors for a factor variable corresponds to the number of factor levels minus 1

form

formula for model

aic

-2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2

dev

twice the difference between the log-likelihoods of the saturated and fitted models, multiplied by the scale parameter

mae

mean absolute error

mce

mean cross entropy, estimated as -log-likelihood/N, where N is the number of observations

mse

mean squared error

r2

R-squared, calculated as 1 - deviance/null deviance

cv

a data frame containing cross-validation statistics for the best model for each n_pred listed in fit. Each metric is computed using predict_metrics, with models fit to n_folds - 1 folds and predictions made on the left-out fold. Each metric is followed by its standard error. The data frame is otherwise the same as that documented for fit, except AIC is omitted.

test

if test_data is provided, a data frame containing prediction metrics for the best model for each n_pred listed in fit as applied to the test_data.

fold_assignments

list giving the row indices for the holdout observations for each fold and/or repetition of cross-validation

n_folds

number of folds used in cross-validation

n_reps

number of repetitions used in cross-validation

family

name of error distribution used in the model

link

name of link function used in the model

terms

the terms object used

data

the data argument

offset

the offset vector used

contrasts

(where relevant) the contrasts used

xlevels

(where relevant) a record of the levels of the factors used in fitting
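
The statistics listed under fit follow standard definitions, so they can be reproduced from an ordinary stats::glm fit. The sketch below does this for one Gaussian model on the swiss data; the model and variable names are chosen only for illustration, and it is an assumption that beset computes each quantity exactly this way.

```r
# Illustrative sketch: recompute the documented fit statistics by hand.
fit <- glm(Fertility ~ Education + Catholic, data = swiss,
           family = gaussian())

n    <- nobs(fit)
ll   <- as.numeric(logLik(fit))   # log-likelihood of the fitted model
npar <- attr(logLik(fit), "df")   # number of estimated parameters
res  <- swiss$Fertility - fitted(fit)

fit_stats <- c(
  aic = -2 * ll + 2 * npar,                   # k = 2
  dev = deviance(fit),
  mae = mean(abs(res)),
  mce = -ll / n,                              # mean cross entropy
  mse = mean(res^2),
  r2  = 1 - deviance(fit) / fit$null.deviance
)
print(round(fit_stats, 3))
```

For a Gaussian model with identity link, the r2 computed this way equals the usual R-squared from lm, because the deviance is the residual sum of squares and the null deviance is the total sum of squares.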

Cross-validation details

beset_glm randomly partitions the data set into n_folds * n_reps folds within strata (factor levels for factor outcomes, percentile-based groups for numeric outcomes). This ensures that the folds will be matched in terms of the outcome's frequency distribution. beset_glm also ensures the reproducibility of your analysis by requiring a seed to the random number generator as one of its arguments.
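
A stratified, repeated fold assignment of the kind described above might look like the following sketch. The function make_stratified_folds is hypothetical and not part of beset; in particular, the use of quartiles for numeric outcomes and the fold naming scheme are assumptions, since the documentation does not state which percentiles or names the package uses. The sketch also assumes the outcome has no missing values.

```r
# Hypothetical helper; not part of the beset package.
make_stratified_folds <- function(y, n_folds = 10, n_reps = 10, seed = 42) {
  set.seed(seed)
  # Factor outcomes: strata are the factor levels.
  # Numeric outcomes: strata are quantile-based bins (quartiles here).
  strata <- if (is.factor(y)) {
    y
  } else {
    breaks <- unique(quantile(y, probs = seq(0, 1, by = 0.25)))
    cut(y, breaks = breaks, include.lowest = TRUE)
  }
  folds <- list()
  for (r in seq_len(n_reps)) {
    fold_id <- integer(length(y))
    for (idx in split(seq_along(y), strata)) {
      # Deal fold labels out within the stratum, in random order, so each
      # fold receives a similar share of every stratum.
      labels <- rep_len(sample.int(n_folds), length(idx))
      fold_id[idx] <- labels[sample.int(length(labels))]
    }
    for (f in seq_len(n_folds)) {
      # Store the row indices of the holdout observations for this fold.
      folds[[sprintf("Fold%02d.Rep%02d", f, r)]] <- which(fold_id == f)
    }
  }
  folds
}

folds <- make_stratified_folds(swiss$Fertility, n_folds = 5, n_reps = 2)
print(lengths(folds))
```

Calling the function again with the same seed returns identical folds, which is the reproducibility property the seed argument is meant to provide.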

List of available families and link functions

"gaussian"

The Gaussian family accepts the links "identity" (default), "log", and "inverse".

"binomial"

The binomial family accepts the links "logit" (default), "probit", "cauchit", "log", and "cloglog" (complementary log-log).

"poisson"

The Poisson family accepts the links "log" (default), "sqrt", and "identity".

"negbin"

The negative binomial family accepts the links "log" (default), "sqrt", and "identity".
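
The family and link combinations listed above can be written as a small lookup table. The sketch below uses a hypothetical helper, resolve_link, that returns the default link when none is given and rejects links that a family does not accept. It is not how beset validates its arguments, only an illustration of the table; the resulting names can be passed to the corresponding stats family constructors.

```r
# Accepted links per family, with the default listed first.
available_links <- list(
  gaussian = c("identity", "log", "inverse"),
  binomial = c("logit", "probit", "cauchit", "log", "cloglog"),
  poisson  = c("log", "sqrt", "identity"),
  negbin   = c("log", "sqrt", "identity")
)

# Hypothetical helper; not part of the beset package.
resolve_link <- function(family, link = NULL) {
  family <- match.arg(family, names(available_links))
  if (is.null(link)) {
    return(available_links[[family]][1])
  }
  # match.arg() raises an error if the link is not accepted by the family.
  match.arg(link, available_links[[family]])
}

resolve_link("poisson")              # default link for the Poisson family
resolve_link("binomial", "cauchit")  # an explicitly requested link

# Build the matching stats family object for a binomial model.
fam <- do.call("binomial", list(link = resolve_link("binomial", "cauchit")))
print(fam)
```

The "negbin" family has no constructor in the stats package; negative binomial models are fit with glm.nb from MASS, which is listed under See Also.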

Warnings

  1. beset_glm handles missing data by performing listwise deletion. No other options for handling missing data are provided at this time. The user is encouraged to deal with missing values prior to running this function.

  2. beset_glm is intended for use with additive models only. An exhaustive search over the space of possible interactions and/or non-linear effects is computationally prohibitive, but I hope to offer a greedy search option in the future. In the meantime and in general, I would recommend the MARS technique if you require this feature.

  3. beset_glm is best suited for searching over a small number of predictors (fewer than 10). For a large number of predictors (more than 20), beset_elnet is recommended instead. However, note that beset_elnet only works with a more restricted set of distributions.

See Also

glm, set.seed, glm.nb

Examples

subset1 <- beset_glm(Fertility ~ ., data = swiss)
summary(subset1)

# Force variables to be included in model
subset2 <- beset_glm(Fertility ~ ., data = swiss,
                     force_in = c("Agriculture", "Examination"))
summary(subset2)

# Use nested cross-validation to evaluate error in selection
subset3 <- beset_glm(Fertility ~ ., data = swiss, nest_cv = TRUE)
summary(subset3)

jashu/beset documentation built on April 20, 2023, 5:28 a.m.