beset_elnet: Beset GLM with Elasticnet Regularization


beset_elnet R Documentation


Description

beset_elnet is a wrapper to glmnet for fitting generalized linear models via penalized maximum likelihood. It provides automated data preprocessing and selects both the elastic-net mixing parameter (alpha) and the regularization parameter (lambda) through repeated k-fold cross-validation.

Usage

beset_elnet(
  form,
  data,
  family = "gaussian",
  alpha = c(0.01, 0.5, 0.99),
  n_lambda = 100,
  nest_cv = FALSE,
  n_folds = 10,
  n_reps = 10,
  seed = 42,
  remove_collinear_columns = FALSE,
  skinny = FALSE,
  standardize = TRUE,
  epsilon = 1e-07,
  maxit = 1e+05,
  lambda_min_ratio = NULL,
  force_in = NULL,
  contrasts = NULL,
  offset = NULL,
  weights = NULL,
  parallel_type = NULL,
  n_cores = NULL,
  cl = NULL
)

Arguments

form

A model formula.

data

Either a data_partition object containing data sets to be used for both model training and testing, or a single data frame that will be used for model training only.

family

Character string naming the error distribution to be used in the model. Currently supported options are "gaussian" (default), "binomial", and "poisson".

alpha

Numeric vector of alpha values between 0 and 1 to use as tuning parameters. alpha = 0 results in ridge regression, and alpha = 1 results in lasso regression. Values in between result in a mixture of L1 and L2 penalties. (Values closer to 0 weight the L2 penalty more heavily, and values closer to 1 weight the L1 penalty more heavily.) The default is to try three alpha values: 0.01 (emphasis toward ridge penalty), 0.99 (emphasis toward lasso penalty), and 0.5 (equal mixture of L1 and L2).
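As an illustration of how alpha mixes the two penalties, the following sketch computes the elastic-net penalty in the form used by glmnet, lambda * ((1 - alpha)/2 * sum(beta^2) + alpha * sum(abs(beta))). The function name elnet_penalty and the coefficient values are hypothetical and are not part of beset.

```r
# Elastic-net penalty as parameterized by glmnet (illustrative helper only).
elnet_penalty <- function(beta, alpha, lambda) {
  ridge_part <- (1 - alpha) / 2 * sum(beta^2)  # L2 component
  lasso_part <- alpha * sum(abs(beta))         # L1 component
  lambda * (ridge_part + lasso_part)
}

# Hypothetical coefficient vector, evaluated at the three default alphas.
beta <- c(0.5, -1.2, 0, 2)
penalties <- sapply(c(0.01, 0.5, 0.99), elnet_penalty,
                    beta = beta, lambda = 0.1)
print(penalties)
```

With alpha = 1 the penalty reduces to lambda times the L1 norm (lasso), and with alpha = 0 it reduces to lambda times half the squared L2 norm (ridge).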

n_lambda

Number of lambdas to be used in a search. Defaults to 100.

nest_cv

Logical value indicating whether or not to perform a nested cross-validation that isolates the cross-validation used for tuning alpha and lambda from the cross-validation used to estimate prediction error. Setting to TRUE will increase run time considerably (by a factor equal to the number of folds), but is useful for estimating uncertainty in the tuning procedure. Defaults to FALSE.

n_folds

Integer indicating the number of folds to use for cross-validation.

n_reps

Integer indicating the number of times cross-validation should be repeated (with different randomized fold assignments).

seed

Integer used to seed the random number generator when assigning observations to folds.
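To make the roles of n_folds, n_reps, and seed concrete, here is a minimal base-R sketch of repeated k-fold assignment. It is not the package's internal code: the helper make_folds and the sample size of 97 are assumptions chosen for illustration.

```r
# Hypothetical helper: assign n_obs rows to n_folds folds, n_reps times.
make_folds <- function(n_obs, n_folds = 10, n_reps = 10, seed = 42) {
  set.seed(seed)  # same seed gives the same fold assignments
  lapply(seq_len(n_reps), function(r) {
    # Shuffle a balanced vector of fold labels, one label per row.
    fold_id <- sample(rep(seq_len(n_folds), length.out = n_obs))
    # Return the holdout row indices for each fold.
    split(seq_len(n_obs), fold_id)
  })
}

folds <- make_folds(n_obs = 97)
length(folds)       # one element per repetition
length(folds[[1]])  # one element per fold
```

Each repetition reshuffles the fold labels, so every observation is held out exactly once per repetition but in different company each time.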

remove_collinear_columns

Logical value indicating whether or not to remove some of the dependent columns when predictors are linearly dependent. Defaults to FALSE.

skinny

Logical value indicating whether or not to return a "skinny" model. If FALSE (the default), the return object will include a copy of the model terms, data, contrasts, and a record of the xlevels of the factors used in fitting. This information will be necessary if you apply predict.beset to new data. If this feature is not needed, setting skinny = TRUE will prevent these copies from being made.

standardize

Logical flag for x variable standardization, prior to fitting the model sequence. The coefficients are always returned on the original scale. Default is standardize = TRUE. If variables are in the same units already, you might not wish to standardize.

epsilon

Convergence threshold for coordinate descent.

maxit

Maximum number of passes over the data for all lambda values.

lambda_min_ratio

(Optional) minimum lambda used in lambda search, specified as a ratio of lambda_max (the smallest lambda that drives all coefficients to zero). Default if omitted: if the number of observations is greater than the number of variables, then lambda_min_ratio is set to 0.0001; if the number of observations is less than the number of variables, then lambda_min_ratio is set to 0.01.
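The default rule and the resulting search grid can be sketched as follows. This assumes the log-spaced lambda sequence that glmnet constructs between lambda_max and lambda_max * lambda_min_ratio. The helper names and the lambda_max value are hypothetical, and treating the tie case (observations equal to variables) like the first case is an assumption.

```r
# Hypothetical helper mirroring the documented default.
default_lambda_min_ratio <- function(n_obs, n_vars) {
  if (n_obs < n_vars) 1e-2 else 1e-4
}

# Log-spaced sequence from lambda_max down to lambda_max * ratio.
lambda_path <- function(lambda_max, n_lambda = 100, lambda_min_ratio) {
  exp(seq(log(lambda_max), log(lambda_max * lambda_min_ratio),
          length.out = n_lambda))
}

ratio <- default_lambda_min_ratio(n_obs = 97, n_vars = 8)
lambdas <- lambda_path(lambda_max = 0.8, n_lambda = 100,
                       lambda_min_ratio = ratio)
range(lambdas)
```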

force_in

(Optional) character vector containing the names of any predictor variables that should be included in every model. (Note that if there is an intercept, it is forced into every model by default.)
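One common way to keep a predictor in every model with glmnet is to give it a penalty factor of 0 through glmnet's penalty.factor argument. Whether beset_elnet implements force_in exactly this way is an assumption; the sketch below only builds such a vector, and all predictor names other than "race" are hypothetical.

```r
# Hypothetical predictor names; "race" is the variable to force in.
predictors <- c("age", "race", "psa", "volume")
force_in <- "race"

# 0 = unpenalized (always retained), 1 = penalized as usual.
penalty_factor <- as.numeric(!(predictors %in% force_in))
names(penalty_factor) <- predictors
print(penalty_factor)
```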

contrasts

(Optional) list. See the contrasts.arg argument of model.matrix.default.

offset

(Optional) vector of length equal to the number of observations that is included in the linear predictor. Useful for the "poisson" family (e.g. log of exposure time), or for refining a model by starting at a current fit.

weights

(Optional) numeric vector of observation weights of length equal to the number of cases.

parallel_type

(Optional) character string indicating the type of parallel operation to be used, either "fork" or "sock". If omitted and n_cores > 1, the default is "sock" for Windows and otherwise either "fork" or "sock" depending on which process is being run.

n_cores

Integer value indicating the number of workers to run in parallel during subset search and cross-validation. By default, this will be set to one fewer than the maximum number of physical cores you have available, as indicated by detectCores. Set to 1 to disable parallel processing.
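The documented default can be approximated with the parallel package as shown below. This is a sketch rather than the package's own code: detectCores can return NA, and how logical = FALSE is interpreted depends on the platform, so the fallback to a single worker is an assumption.

```r
library(parallel)

# Count physical cores; may be NA if the count cannot be determined.
physical <- detectCores(logical = FALSE)

# Leave one core free, but never drop below one worker.
n_cores <- if (is.na(physical)) 1L else max(1L, physical - 1L)
print(n_cores)
```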

cl

(Optional) parallel or snow cluster for use if parallel_type = "sock". If not supplied, a cluster on the local machine is automatically created.

Value

A "beset_elnet" object, or a "nested" object inheriting class "beset_elnet", with the following components:

For "beset_elnet" objects:
stats

a list with three data frames:

fit
alpha

value of L1-L2 mixing parameter

lambda

value of shrinkage parameter

auc

area under curve (binomial models only)

mae

mean absolute error (not given for binomial models)

mce

mean cross entropy, estimated as -log-likelihood/N, where N is the number of observations

mse

mean squared error

rsq

R-squared, calculated as 1 - deviance/null deviance

cv

a data frame containing cross-validation statistics for each alpha and lambda listed in fit. If run with nest_cv = TRUE, this will correspond to the inner cross-validation used to select alpha and lambda. Each metric consists of the following list:

mean

mean of the metric calculated on the aggregate holdout folds for each repetition and averaged across repetitions

btwn_fold_se

the variability between all holdout folds, given as a standard error

btwn_rep_range

after aggregating over all hold-out folds within each repetition, the variability between repetitions, given as a min-max range

test

if a data_partition is provided, or if run with nest_cv = TRUE, a data frame containing prediction metrics for each alpha and lambda listed in fit as applied to the independent test data or outer cross-validation holdout data

glmnet_parameters

a list of all parameters that were passed to glmnet
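The mce and rsq statistics described above can be illustrated for a binary outcome. In this sketch the outcome vector y and the predicted probabilities p are made-up values, and mean_cross_entropy is a hypothetical helper. For 0/1 outcomes the binomial deviance equals 2 * N * mce, so the deviance ratio reduces to a ratio of cross entropies.

```r
# Made-up binary outcomes and predicted probabilities.
y <- c(1, 0, 1, 1, 0, 0, 1, 0)
p <- c(0.9, 0.2, 0.7, 0.6, 0.3, 0.1, 0.8, 0.4)

# Mean cross entropy: negative log-likelihood divided by N.
mean_cross_entropy <- function(y, p) {
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

mce <- mean_cross_entropy(y, p)

# Null model predicts the observed event rate for every case.
null_mce <- mean_cross_entropy(y, rep(mean(y), length(y)))

# R-squared as 1 - deviance / null deviance.
rsq <- 1 - mce / null_mce
print(c(mce = mce, rsq = rsq))
```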

For "nested" objects:
beset_elnet

a list of "beset_elnet" objects, one for each train-test partition of the outer cross-validation procedure, each consisting of all of the elements listed above

For both "nested" and unnested "beset_elnet" objects:
fold_assignments

list giving the row indices for the holdout observations for each fold and/or repetition of cross-validation

n_folds

number of folds used in cross-validation

n_reps

number of repetitions used in cross-validation

family

name of the error distribution used in the model

terms

the terms object used

data

the data argument

offset

the offset vector used

contrasts

(where relevant) the contrasts used

xlevels

(where relevant) a record of the levels of the factors used in fitting

See Also

glmnet

Examples


data("prostate", package = "beset")

# Regularized logistic regression, with 10 X 10 unnested cross-validation
elnet1 <- beset_elnet(tumor ~ ., data = prostate, family = "binomial")
summary(elnet1)
plot(elnet1)

# Include independent test set in addition to cross-validation
data <- partition(prostate, y = "tumor")
elnet2 <- beset_elnet(tumor ~ ., data = data, family = "binomial")
summary(elnet2)
# Plot deviance explained
plot(elnet2, "rsq")

# Use nested cross-validation
elnet3 <- beset_elnet(tumor ~ ., data = prostate, family = "binomial",
                      nest_cv = TRUE)
# Turn off 1SE rule and use minima of CV tuning curve to select penalty
summary(elnet3, oneSE = FALSE)
# Plot AUC stat
plot(elnet3, "auc")

# Force a variable into the model (do not penalize coefficient)
elnet4 <- beset_elnet(tumor ~ ., data = data, family = "binomial",
                      force_in = "race")
summary(elnet4)


jashu/beset documentation built on April 20, 2023, 5:28 a.m.