beset_glm: Best Subset Selection for Generalized Linear Models

beset_glm

R Documentation

Description

beset_glm performs best subset selection using repeated cross-validation to find the optimal number of predictors for several families of generalized linear models.

Usage

beset_glm(
  form,
  data,
  family = "gaussian",
  link = NULL,
  p_max = 10,
  force_in = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  contrasts = NULL,
  offset = NULL,
  weights = NULL,
  start = NULL,
  etastart = NULL,
  mustart = NULL,
  epsilon = 1e-08,
  maxit = 25,
  skinny = FALSE,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)

beset_lm(
  form,
  data,
  p_max = 10,
  force_in = NULL,
  weights = NULL,
  contrasts = NULL,
  offset = NULL,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  n_cores = NULL,
  parallel_type = NULL,
  cl = NULL
)

Arguments

`form`	A model `formula`.
`data`	Either a `data_partition` object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training and cross-validation.
`family`	Character string naming the error distribution to be used in the model. Available families are listed under 'List of available families and link functions'.
`link`	(Optional) character string naming the link function to be used in the model. Available links and their defaults differ by `family` and are listed under 'List of available families and link functions'.
`p_max`	Maximum number of predictors to attempt to fit. Default is 10.
`force_in`	(Optional) character vector containing the names of any predictor variables that should be included in every model. (Note that if there is an intercept, it is forced into every model by default.)
`nest_cv`	`Logical` value indicating whether to perform nested cross-validation. If `nest_cv = TRUE`, the cross-validation used to select the best model is nested within a cross-validation used to estimate prediction error on a new sample, thus providing an estimate of test error that is free from potential selection bias. Because this multiplies compute time by a factor equal to the number of folds, the default is `FALSE`. Note that setting this parameter to `TRUE` provides more informative summary output regarding the uncertainty in the selection procedure itself, i.e., how often a given model is chosen as "best" according to the given criteria, and is necessary for the returned objects to work with certain `beset` methods, such as `compare` and `importance`.
`n_folds`	`Integer` indicating the number of folds to use for cross-validation.
`n_reps`	`Integer` indicating the number of times cross-validation should be repeated (with different randomized fold assignments).
`seed`	`Integer` used to seed the random number generator when assigning observations to folds.
`contrasts`	(Optional) `list`. See the `contrasts.arg` of `model.matrix.default`.
`offset`	(Optional) `vector` of length `nobs` specifying an a priori known component that will be added to the linear predictor before applying the link function. Useful for the "`poisson`" family (e.g. log of exposure time), or for refining a model by starting at a current fit. Default is `NULL`.
`weights`	(Optional) `numeric vector` of prior weights placed on the observations during model fitting. Default is `NULL`.
`start`	(Optional) starting values for the parameters in the linear predictor.
`etastart`	(Optional) starting values for the linear predictor.
`mustart`	(Optional) starting values for the vector of means.
`epsilon`	`Numeric` value of positive convergence tolerance ε; the iterations converge when `|dev - dev_old| / (|dev| + 0.1) < ε`. Default is `1e-8`.
`maxit`	`Integer` giving the maximal number of IWLS iterations. Default is 25.
`skinny`	`Logical` value indicating whether or not to return a "skinny" model. If `FALSE` (the default), the return object will include a copy of the model `terms`, `data`, `contrasts`, and a record of the `xlevels` of the factors used in fitting. If these features are not needed, setting `skinny = TRUE` will prevent these copies from being made.
`n_cores`	Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by `detectCores`. Set to 1 to disable parallel processing.
`parallel_type`	(Optional) character string indicating the type of parallel operation to be used, either `"fork"` or `"sock"`. If omitted and `n_cores > 1`, the default is `"sock"` for Windows and otherwise either `"fork"` or `"sock"` depending on which process is being run.
`cl`	(Optional) `parallel` or `snow` cluster for use if `parallel_type = "sock"`. If not supplied, a cluster on the local machine is automatically created.
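
The descriptions of `offset`, `epsilon`, and `maxit` above read like those of the corresponding arguments to `glm` and `glm.control` in base R. The following is a minimal sketch of that behavior using `stats::glm` directly, on the assumption that beset_glm treats these arguments the same way; the simulated data and the variable names (`exposure`, `x`, `y`, `eta`) are illustrative only.

```r
# Base-R sketch: how offset, epsilon, and maxit behave in stats::glm().
# Assumption: beset_glm handles these arguments the same way.
set.seed(42)
n        <- 200
exposure <- runif(n, 1, 10)                        # e.g. observation time
x        <- rnorm(n)
y        <- rpois(n, lambda = exposure * exp(0.2 + 0.5 * x))

fit <- glm(y ~ x, family = poisson(), offset = log(exposure),
           control = glm.control(epsilon = 1e-8, maxit = 25))

# The offset enters the linear predictor with a fixed coefficient of 1,
# so it is added to X %*% beta rather than estimated.
eta <- as.vector(model.matrix(fit) %*% coef(fit)) + log(exposure)

fit$converged   # did IWLS meet the epsilon tolerance within maxit steps?
fit$iter        # number of IWLS iterations actually used
```

Tightening `epsilon` or lowering `maxit` trades precision against the number of IWLS iterations; `fit$converged` reports whether the tolerance was met.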

Details

beset_glm performs best subset selection for generalized linear models, fitting a separate model for each possible combination of predictors (all models that contain exactly 1 predictor, all models that contain exactly 2 predictors, and so forth). For each number of predictors, beset_glm first picks the model with the best fit and then estimates how well this model predicts new data using k-fold cross-validation (how well, on average, a model trained using k - 1 folds predicts the left-out fold).
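
The two-step procedure described above can be sketched in base R as follows. This is not the beset implementation: `lm` stands in for the GLM fitter, folds are assigned purely at random (no stratification or repetition), and the object names (`best_by_size`, `cv_mse`, `results`, `best_size`) are illustrative.

```r
# Base-R sketch of best subset selection followed by k-fold cross-validation.
data(swiss)
response   <- "Fertility"
predictors <- setdiff(names(swiss), response)
n_folds    <- 5

set.seed(42)
folds <- sample(rep(seq_len(n_folds), length.out = nrow(swiss)))

# Step 1: for each subset size, keep the subset with the best fit
# (lowest residual sum of squares on the full training data).
best_by_size <- lapply(seq_along(predictors), function(p) {
  subsets <- combn(predictors, p, simplify = FALSE)
  rss <- vapply(subsets, function(vars) {
    deviance(lm(reformulate(vars, response), data = swiss))
  }, numeric(1))
  subsets[[which.min(rss)]]
})

# Step 2: estimate how well each size's best subset predicts held-out folds.
cv_mse <- vapply(best_by_size, function(vars) {
  form <- reformulate(vars, response)
  fold_mse <- vapply(seq_len(n_folds), function(k) {
    fit  <- lm(form, data = swiss[folds != k, ])
    pred <- predict(fit, newdata = swiss[folds == k, ])
    mean((swiss[[response]][folds == k] - pred)^2)
  }, numeric(1))
  mean(fold_mse)
}, numeric(1))

results <- data.frame(
  n_pred = seq_along(predictors),
  form   = vapply(best_by_size, function(v) {
    paste(response, "~", paste(v, collapse = " + "))
  }, character(1)),
  cv_mse = cv_mse
)

# Number of predictors with the lowest cross-validated error.
best_size <- results$n_pred[which.min(results$cv_mse)]
```

Selecting within each size on training fit and comparing across sizes on cross-validated error keeps the number of cross-validated models equal to the number of subset sizes rather than the number of subsets.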

Value

A "beset_glm" object with the following components:

stats

a list with three data frames:

fit

statistics for every possible combination of predictors:

n_pred: the total number of predictors in model; note that the number of predictors for a factor variable corresponds to the number of factor levels minus 1
form: formula for model
aic: -2*log-likelihood + k*npar, where npar represents the number of parameters in the fitted model, and k = 2
dev: twice the difference between the log-likelihoods of the saturated and fitted models, multiplied by the scale parameter
mae: mean absolute error
mce: mean cross entropy, estimated as -log-likelihood/N, where N is the number of observations
mse: mean squared error
r2: R-squared, calculated as 1 - deviance/null deviance
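
As an illustration of the definitions above, the same quantities can be computed for a single Gaussian model in base R. This is a sketch of the formulas only, not of how beset computes them internally; `fit_stats` and the other object names are illustrative.

```r
# Base-R sketch of the fit statistics for one Gaussian GLM.
fit <- glm(Fertility ~ Education + Catholic, data = swiss,
           family = gaussian())

y    <- swiss$Fertility
mu   <- fitted(fit)
n    <- length(y)
ll   <- as.numeric(logLik(fit))
npar <- attr(logLik(fit), "df")   # coefficients plus the dispersion parameter

fit_stats <- c(
  aic = -2 * ll + 2 * npar,                     # k = 2
  dev = deviance(fit),                          # residual deviance
  mae = mean(abs(y - mu)),                      # mean absolute error
  mce = -ll / n,                                # mean cross entropy
  mse = mean((y - mu)^2),                       # mean squared error
  r2  = 1 - deviance(fit) / fit$null.deviance   # deviance-based R-squared
)
```

For the Gaussian family with identity link, the deviance is the residual sum of squares, so `mse` equals `dev / n` and `r2` matches the usual R-squared from `lm`.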

cv

a data frame containing cross-validation statistics for the best model for each n_pred listed in fit. Each metric is computed using predict_metrics, with models fit to k - 1 folds and predictions made on the left-out fold. Each metric is followed by its standard error. The data frame is otherwise the same as that documented for fit, except that AIC is omitted.

test

if test data are provided (that is, data is a data_partition object), a data frame containing prediction metrics for the best model for each n_pred listed in fit, as applied to the test data.

fold_assignments

list giving the row indices for the holdout observations for each fold and/or repetition of cross-validation

n_folds

number of folds used in cross-validation

n_reps

number of repetitions used in cross-validation

family

name of error distribution used in the model

link

name of link function used in the model

terms

the terms object used

data

the data argument

offset

the offset vector used

contrasts

(where relevant) the contrasts used

xlevels

(where relevant) a record of the levels of the factors used in fitting

Cross-validation details

beset_glm randomly partitions the data set into n_folds * n_reps folds (n_folds folds in each of n_reps repetitions) within strata (factor levels for factor outcomes, percentile-based groups for numeric outcomes). This ensures that the folds are matched in terms of the outcome's frequency distribution. beset_glm also ensures the reproducibility of your analysis by requiring a seed for the random number generator as one of its arguments.
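
A minimal base-R sketch of stratified, repeated fold assignment is given below. The helper `make_folds` is hypothetical and is not part of beset; in particular, the use of quartile groups for numeric outcomes is an assumption, since the exact percentile grouping is not stated here.

```r
# Hypothetical helper: stratified fold assignment, repeated n_reps times.
make_folds <- function(y, n_folds = 10, n_reps = 10, seed = 42) {
  set.seed(seed)
  strata <- if (is.factor(y)) {
    y
  } else {
    # Assumption: quartile groups stand in for "percentile-based groups".
    breaks <- unique(quantile(y, probs = seq(0, 1, by = 0.25)))
    cut(y, breaks = breaks, include.lowest = TRUE)
  }
  reps <- lapply(seq_len(n_reps), function(r) {
    fold_id <- integer(length(y))
    n_dealt <- 0
    for (idx in split(seq_along(y), strata)) {
      # Shuffle the stratum, then deal its rows across folds in turn,
      # continuing from where the previous stratum stopped.
      shuffled <- idx[sample.int(length(idx))]
      fold_id[shuffled] <- (n_dealt + seq_along(idx) - 1) %% n_folds + 1
      n_dealt <- n_dealt + length(idx)
    }
    # Row indices of the holdout observations for each fold.
    split(seq_along(y), factor(fold_id, levels = seq_len(n_folds)))
  })
  names(reps) <- paste0("Rep", seq_len(n_reps))
  reps
}

folds <- make_folds(swiss$Fertility, n_folds = 5, n_reps = 2)
```

Because the seed is set inside the helper, calling it twice with the same arguments returns identical fold assignments.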

List of available families and link functions

"gaussian": The Gaussian family accepts the links "identity" (default), "log", and "inverse".
"binomial": The binomial family accepts the links "logit" (default), "probit", "cauchit", "log", and "cloglog" (complementary log-log).
"poisson": The Poisson family accepts the links "log" (default), "sqrt", and "identity".
"negbin": The negative binomial family accepts the links "log" (default), "sqrt", and "identity".

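For readers who want to reproduce a single fit outside of beset, the family and link strings listed above can be mapped to base-R family objects along the following lines. The helper `make_family` is hypothetical and is not part of beset; "negbin" is left out because the base `stats` package has no negative binomial family.

```r
# Hypothetical helper: translate family/link strings into stats family objects.
make_family <- function(family = "gaussian", link = NULL) {
  links <- list(
    gaussian = c("identity", "log", "inverse"),
    binomial = c("logit", "probit", "cauchit", "log", "cloglog"),
    poisson  = c("log", "sqrt", "identity")
  )
  if (!family %in% names(links)) {
    stop("unsupported family: ", family)
  }
  # The first link listed for each family is its default.
  if (is.null(link)) link <- links[[family]][1]
  if (!link %in% links[[family]]) {
    stop("link '", link, "' is not available for family '", family, "'")
  }
  switch(family,
    gaussian = gaussian(link = link),
    binomial = binomial(link = link),
    poisson  = poisson(link = link)
  )
}

fam <- make_family("binomial", link = "probit")
fit <- glm(am ~ wt, data = mtcars, family = fam)
```

Validating the link against the per-family list before calling the family constructor gives a clearer error message than the one raised by an unsupported combination.
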
Warnings

beset_glm handles missing data by performing listwise deletion. No other options for handling missing data are provided at this time. The user is encouraged to deal with missing values prior to running this function.
beset_glm is intended for use with additive models only. An exhaustive search over the space of possible interactions and/or non-linear effects is computationally prohibitive, but I hope to offer a greedy search option in the future. In the meantime and in general, I would recommend the MARS technique if you require this feature.
beset_glm is best suited for searching over a small number of predictors (fewer than 10). For a large number of predictors (more than 20), beset_elnet is recommended instead. However, note that beset_elnet only works with a more restricted set of distributions.

Examples

subset1 <- beset_glm(Fertility ~ ., data = swiss)
summary(subset1)

# Force variables to be included in model
subset2 <- beset_glm(Fertility ~ ., data = swiss,
                     force_in = c("Agriculture", "Examination"))
summary(subset2)

# Use nested cross-validation to evaluate error in selection
subset3 <- beset_glm(Fertility ~ ., data = swiss, nest_cv = TRUE)
summary(subset3)

jashu/beset documentation built on April 20, 2023, 5:28 a.m.