beset_elnet: Beset GLM with Elasticnet Regularization
In jashu/beset: Best Subset Predictive Modeling

beset_elnet

R Documentation

Description

beset_elnet is a wrapper around glmnet for fitting generalized linear models via penalized maximum likelihood. It provides automated data preprocessing and selects both the elastic-net mixing parameter (alpha) and the regularization parameter (lambda) through repeated k-fold cross-validation.

Usage

beset_elnet(
  form,
  data,
  family = "gaussian",
  alpha = c(0.01, 0.5, 0.99),
  n_lambda = 100,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  remove_collinear_columns = FALSE,
  skinny = FALSE,
  standardize = TRUE,
  epsilon = 1e-07,
  maxit = 1e+05,
  lambda_min_ratio = NULL,
  force_in = NULL,
  contrasts = NULL,
  offset = NULL,
  weights = NULL,
  parallel_type = NULL,
  n_cores = NULL,
  cl = NULL
)

Arguments

`form`	A model `formula`.
`data`	Either a `data_partition` object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training only.
`family`	`Character` string naming the error distribution to be used in the model. Currently supported options are `"gaussian"` (default), `"binomial"`, and `"poisson"`.
`alpha`	`Numeric` vector of alpha values between 0 and 1 to use as tuning parameters. `alpha = 0` results in ridge regression, and `alpha = 1` results in lasso regression. Values in between result in a mixture of L1 and L2 penalties. (Values closer to 0 weight the L2 penalty more heavily, and values closer to 1 weight the L1 penalty more heavily.) The default is to try three alpha values: 0.01 (emphasis toward ridge penalty), 0.99 (emphasis toward lasso penalty), and 0.5 (equal mixture of L1 and L2).
`n_lambda`	Number of lambdas to be used in a search. Defaults to `100`.
`nest_cv`	`Logical` value indicating whether or not to perform a nested cross-validation that isolates the cross-validation used for tuning `alpha` and `lambda` from the cross-validation used to estimate prediction error. Setting to `TRUE` will increase run time considerably (by a factor equal to the number of folds), but is useful for estimating uncertainty in the tuning procedure. Defaults to `FALSE`.
`n_folds`	`Integer` indicating the number of folds to use for cross-validation.
`n_reps`	`Integer` indicating the number of times cross-validation should be repeated (with different randomized fold assignments).
`seed`	`Integer` used to seed the random number generator when assigning observations to folds.
`remove_collinear_columns`	`Logical` value indicating whether to remove some of the dependent columns when the predictors contain linearly dependent columns. Defaults to `FALSE`.
`skinny`	`Logical` value indicating whether or not to return a "skinny" model. If `FALSE` (the default), the return object will include a copy of the model `terms`, `data`, `contrasts`, and a record of the `xlevels` of the factors used in fitting. This information will be necessary if you apply `predict.beset` to new data. If this feature is not needed, setting `skinny = TRUE` will prevent these copies from being made.
`standardize`	Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is `standardize = TRUE`. If variables are in the same units already, you might not wish to standardize.
`epsilon`	Convergence threshold for coordinate descent.
`maxit`	Maximum number of passes over the data for all lambda values.
`lambda_min_ratio`	(Optional) minimum `lambda` used in `lambda` search, specified as a ratio of `lambda_max` (the smallest `lambda` that drives all coefficients to zero). Default if omitted: if the number of observations is greater than the number of variables, then `lambda_min_ratio` is set to 0.0001; if the number of observations is less than the number of variables, then `lambda_min_ratio` is set to 0.01.
`force_in`	(Optional) character vector containing the names of any predictor variables that should be included in every model. (Note that if there is an intercept, it is forced into every model by default.)
`contrasts`	Optional `list`. See the `contrasts.arg` of `model.matrix.default`.
`offset`	(Optional) vector of length equal to the number of observations that is included in the linear predictor. Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit.
`weights`	(Optional) `numeric` vector of observation weights of length equal to the number of cases.
`parallel_type`	(Optional) character string indicating the type of parallel operation to be used, either `"fork"` or `"sock"`. If omitted and `n_cores > 1`, the default is `"sock"` for Windows and otherwise either `"fork"` or `"sock"` depending on which process is being run.
`n_cores`	Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by `detectCores`. Set to 1 to disable parallel processing.
`cl`	(Optional) `parallel` or `snow` cluster for use if `parallel_type = "sock"`. If not supplied, a cluster on the local machine is automatically created.
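
To make the roles of `alpha`, `n_lambda`, and `lambda_min_ratio` concrete, the following base-R sketch evaluates the elastic-net penalty in the form documented for glmnet and builds a log-spaced grid of `lambda` values. The helper names `elnet_penalty` and `lambda_grid` are hypothetical and are not part of beset or glmnet; the grid construction is an assumption about how the search is laid out, not a copy of the package's code.

```r
# Elastic-net penalty in the glmnet parameterization (illustrative helper):
#   lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
elnet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -1, 0)
ridge <- elnet_penalty(beta, alpha = 0, lambda = 1)    # pure L2 penalty
lasso <- elnet_penalty(beta, alpha = 1, lambda = 1)    # pure L1 penalty
mixed <- elnet_penalty(beta, alpha = 0.5, lambda = 1)  # equal mixture

# Assumed layout of the lambda search: n_lambda values, log-spaced from
# lambda_max down to lambda_max * lambda_min_ratio (hypothetical helper).
lambda_grid <- function(lambda_max, n_lambda = 100, lambda_min_ratio = 1e-4) {
  exp(seq(log(lambda_max), log(lambda_max * lambda_min_ratio),
          length.out = n_lambda))
}

grid <- lambda_grid(lambda_max = 2)
range(grid)
```

With the default `alpha = c(0.01, 0.5, 0.99)`, a grid like this would be searched once per alpha value, so the number of candidate models grows as `length(alpha) * n_lambda`.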

Value

A "beset_elnet" object, or a "nested" object inheriting from class "beset_elnet", with the following components:

For "beset_elnet" objects:

stats

a list with three data frames:

fit

alpha: value of L1-L2 mixing parameter
lambda: value of shrinkage parameter
auc: area under curve (binomial models only)
mae: mean absolute error (not given for binomial models)
mce: mean cross entropy, estimated as -log-likelihood/N, where N is the number of observations
mse: mean squared error
rsq: R-squared, calculated as 1 - deviance/null deviance
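
The definitions above can be checked by hand on an ordinary logistic regression. The sketch below uses base R's `glm` and the built-in `mtcars` data purely for illustration; it is an assumption that beset computes these statistics on the same scale, and the AUC is obtained from the rank-sum (Mann-Whitney) identity rather than from any beset function.

```r
# Unpenalized logistic regression, used only to illustrate the metric definitions
fit <- glm(am ~ wt, data = mtcars, family = binomial)
y <- mtcars$am
p <- fitted(fit)   # predicted probabilities
n <- length(y)

# Mean cross entropy: negative log-likelihood divided by N
mce <- -mean(dbinom(y, size = 1, prob = p, log = TRUE))

# Mean squared error of the predicted probabilities
mse <- mean((y - p)^2)

# R-squared as 1 - deviance / null deviance (deviance explained)
rsq <- 1 - deviance(fit) / fit$null.deviance

# Area under the ROC curve via the rank-sum identity
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

round(c(mce = mce, mse = mse, rsq = rsq, auc = auc), 3)
```

For a 0/1 response the deviance equals minus twice the log-likelihood, so `mce` here is also `deviance(fit) / (2 * n)`.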

cv

a data frame containing cross-validation statistics for each alpha and lambda listed in fit. If run with nest_cv = TRUE, this will correspond to the inner cross-validation used to select alpha and lambda. Each metric consists of the following list:

mean: mean of the metric calculated on the aggregate holdout folds for each repetition and averaged across repetitions
btwn_fold_se: the variability between all holdout folds, given as a standard error
btwn_rep_range: after aggregating over all hold-out folds within each repetition, the variability between repetitions, given as a min-max range
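
One way to read these three summaries is sketched below with a small repeated k-fold cross-validation of a linear model in base R. The data set, the model, and the exact standard-error formula (standard deviation of the per-fold metrics divided by the square root of the number of folds) are assumptions made for illustration; the package may compute `btwn_fold_se` differently.

```r
set.seed(42)
n <- nrow(mtcars)
n_folds <- 4
n_reps <- 3

fold_mse <- matrix(NA_real_, nrow = n_reps, ncol = n_folds)  # one MSE per holdout fold
rep_mse <- numeric(n_reps)                                   # one MSE per repetition

for (r in seq_len(n_reps)) {
  # New random fold assignment for each repetition
  fold_id <- sample(rep(seq_len(n_folds), length.out = n))
  sq_err <- numeric(n)
  for (k in seq_len(n_folds)) {
    test <- fold_id == k
    fit <- lm(mpg ~ wt, data = mtcars[!test, ])
    sq_err[test] <- (mtcars$mpg[test] - predict(fit, mtcars[test, ]))^2
    fold_mse[r, k] <- mean(sq_err[test])
  }
  # Metric on the aggregate of all holdout folds within this repetition
  rep_mse[r] <- mean(sq_err)
}

cv_summary <- list(
  mean = mean(rep_mse),                                           # averaged across repetitions
  btwn_fold_se = sd(as.vector(fold_mse)) / sqrt(length(fold_mse)),  # assumed SE formula
  btwn_rep_range = range(rep_mse)                                 # min-max across repetitions
)
cv_summary
```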

test

if a data_partition is provided, or if run with nest_cv = TRUE, a data frame containing prediction metrics for each alpha and lambda listed in fit as applied to the independent test data or outer cross-validation holdout data

glmnet_parameters

a list of all parameters that were passed to glmnet

For "nested" objects:

beset_elnet: a list of "beset_elnet" objects, one for each train-test partition of the outer cross-validation procedure, each consisting of all of the elements listed above
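
The structure of a nested cross-validation can be sketched in base R as follows. To keep the example self-contained, the tuning parameter is the polynomial degree of an `lm` fit, which stands in for the `alpha` and `lambda` pair tuned by beset_elnet; the helper `cv_mse` and all object names are hypothetical. Each outer training set gets its own inner cross-validation, which is why run time grows by roughly the number of outer folds.

```r
set.seed(42)

# Inner CV: mean squared holdout error for one candidate tuning value
cv_mse <- function(data, degree, n_folds) {
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(data)))
  errs <- numeric(nrow(data))
  for (k in seq_len(n_folds)) {
    test <- fold_id == k
    fit <- lm(mpg ~ poly(wt, degree), data = data[!test, ])
    errs[test] <- (data$mpg[test] - predict(fit, data[test, ]))^2
  }
  mean(errs)
}

degrees <- 1:3   # candidate tuning values (stand-in for the alpha/lambda grid)
n_outer <- 4
outer_id <- sample(rep(seq_len(n_outer), length.out = nrow(mtcars)))

# Outer CV: tune on the training part only, then score on the untouched holdout
outer_results <- lapply(seq_len(n_outer), function(k) {
  train <- mtcars[outer_id != k, ]
  test <- mtcars[outer_id == k, ]
  inner <- vapply(degrees, function(d) cv_mse(train, d, n_folds = 4), numeric(1))
  best <- degrees[which.min(inner)]
  fit <- lm(mpg ~ poly(wt, best), data = train)
  list(best_degree = best,
       test_mse = mean((test$mpg - predict(fit, test))^2))
})

# Tuning choices may differ between outer folds; that spread reflects tuning uncertainty
sapply(outer_results, `[[`, "best_degree")
mean(sapply(outer_results, `[[`, "test_mse"))
```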

For both "nested" and unnested "beset_elnet" objects:

fold_assignments: list giving the row indices for the holdout observations for each fold and/or repetition of cross-validation
n_folds: number of folds used in cross-validation
n_reps: number of repetitions used in cross-validation
family: name of the error distribution used in the model
terms: the terms object used
data: the data argument
offset: the offset vector used
contrasts: (where relevant) the contrasts used
xlevels: (where relevant) a record of the levels of the factors used in fitting
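
As a rough picture of what `fold_assignments` holds, the sketch below builds a list of holdout row indices for every fold of every repetition from a fixed seed. The function `make_folds` and the element names are hypothetical, and simple random assignment is an assumption; the package may use a different scheme (for example, stratifying on the response).

```r
# Hypothetical helper: holdout row indices for each fold and repetition
make_folds <- function(n, n_folds = 10, n_reps = 10, seed = 42) {
  set.seed(seed)
  folds <- list()
  for (r in seq_len(n_reps)) {
    # Shuffle balanced fold labels so fold sizes differ by at most one
    fold_id <- sample(rep(seq_len(n_folds), length.out = n))
    for (k in seq_len(n_folds)) {
      folds[[sprintf("Fold%02d.Rep%02d", k, r)]] <- which(fold_id == k)
    }
  }
  folds
}

folds <- make_folds(n = 50, n_folds = 10, n_reps = 2)
length(folds)        # one element per fold per repetition
folds[["Fold01.Rep01"]]
```

Within one repetition the holdout sets partition the rows, and reusing the same `seed` reproduces the same assignments.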

Examples


data("prostate", package = "beset")

# Regularized logistic regression, with 10 X 10 unnested cross-validation
elnet1 <- beset_elnet(tumor ~ ., data = prostate, family = "binomial")
summary(elnet1)
plot(elnet1)

# Include independent test set in addition to cross-validation
data <- partition(prostate, y = "tumor")
elnet2 <- beset_elnet(tumor ~ ., data = data, family = "binomial")
summary(elnet2)
# Plot deviance explained
plot(elnet2, "rsq")

# Use nested cross-validation
elnet3 <- beset_elnet(tumor ~ ., data = prostate, family = "binomial",
                      nest_cv = TRUE)
# Turn off 1SE rule and use minima of CV tuning curve to select penalty
summary(elnet3, oneSE = FALSE)
# Plot AUC stat
plot(elnet3, "auc")

# Force a variable into the model (do not penalize coefficient)
elnet4 <- beset_elnet(tumor ~ ., data = data, family = "binomial",
                      force_in = "race")
summary(elnet4)

jashu/beset documentation built on April 20, 2023, 5:28 a.m.