select_auxiliary_variables_lasso_cv: Select Auxiliary Variables via LASSO with Cross-Validation...

View source: R/select_auxiliary_variables_lasso_cv.R

select_auxiliary_variables_lasso_cvR Documentation

Select Auxiliary Variables via LASSO with Cross-Validation (Binary and Continuous Outcomes)

Description

This function performs LASSO-penalized regression (logistic regression for binary outcomes or linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables. It allows for the inclusion of all two-way interactions among the auxiliary variables and the option to force certain variables to remain in the model through the use of zero penalty factors.

Usage

select_auxiliary_variables_lasso_cv(
  df,
  outcome_vars,
  auxiliary_vars,
  must_have_vars = NULL,
  check_twoway_int = TRUE,
  nfolds = 5,
  verbose = TRUE,
  standardize = TRUE,
  return_models = FALSE,
  parallel = FALSE
)

Arguments

df

A data frame containing the data for modeling.

outcome_vars

Character vector of outcome variable names to model. These can be either binary or continuous outcomes. Each must exist in df and have at least two unique values (after factor conversion for binary outcomes).

auxiliary_vars

Character vector of auxiliary variable names to be used as predictors.

must_have_vars

Optional character vector of variable names that must be included in the model (penalty factor 0). If interactions are included, any interaction containing a must-have variable is also assigned zero penalty. The variables in must_have_vars should refer to either individual variables or the main effect part of interaction terms.

check_twoway_int

Logical; include all two-way interactions among auxiliary variables. Defaults to TRUE.

nfolds

Number of folds for cross-validation. Defaults to 5.

verbose

Logical; print progress messages. Defaults to TRUE.

standardize

Logical; standardize predictors before fitting. Defaults to TRUE.

return_models

Logical; return fitted cv.glmnet objects. Defaults to FALSE.

parallel

Logical; run cross-validation in parallel (requires doParallel). Defaults to FALSE.

Details

The function supports both binary and continuous outcomes. For binary outcomes, logistic regression is used, and for continuous outcomes, linear regression is used. The function outputs a list with the selected variables across outcomes, the associated lambda values, the goodness-of-fit statistics, and optionally the fitted models and interaction terms.

The function supports two types of outcome variables:

  • Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.

  • Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.

For factor variables in auxiliary_vars, dummy variables are created to represent each level of the factor. If a factor variable is specified in must_have_vars, its dummy variables will be included in the model, ensuring that any interactions containing those variables are also forced into the model.

Value

An object of class "select_auxiliary_variables_lasso_cv" with the following components:

selected_variables

Character vector of variables selected across all outcome models. This includes the main effect variables and any interaction terms.

by_outcome

Named list of character vectors, each containing the selected variables for each outcome.

selected_lambdas

Named numeric vector of lambda values (specifically, lambda.min) for each outcome.

penalty_factors

Named numeric vector with penalty factors (0 for must-keep, 1 otherwise).

models

List of cv.glmnet objects per outcome if return_models = TRUE, otherwise an empty list.

goodness_of_fit

Named list per outcome with cross-validation metrics (cv_error, cv_error_sd) and full data metrics (deviance_explained for binary outcomes, auc, accuracy, brier_score, rss, mse, r_squared, raw_coefs).

interaction_metadata

List containing metadata on interaction terms, main effects in interactions, and the full formula used.

Examples

## ------------------------------------------------------------
## Example 1: Binary + continuous outcomes, with interactions
##             and must-have variables (factor expanded to dummies)
## ------------------------------------------------------------
set.seed(123)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

## Generate outcomes with some signal in x1, x2 and group, plus an interaction
eta_bin <- -0.5 + 1.2 * x2 - 0.8 * (group == "C") + 0.5 * x1 * x2
p <- 1 / (1 + exp(-eta_bin))
y_bin <- rbinom(n, 1, p)
y_cont <- 1.5 * x1 - 2 * (group == "B") + 0.7 * x1 * x2 + rnorm(n, sd = 0.7)

df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, group = group)

res1 <- select_auxiliary_variables_lasso_cv(
  df = df,
  outcome_vars = c("y_bin", "y_cont"),
  auxiliary_vars = c("x1", "x2", "group"),
  must_have_vars = c("x1", "group"), # 'group' (factor) expands to its dummies
  check_twoway_int = TRUE,
  nfolds = 3,
  verbose = FALSE,
  standardize = TRUE,
  return_models = FALSE
)

## Inspect selections and metadata
res1$selected_variables
res1$by_outcome
res1$selected_lambdas
names(which(res1$penalty_factors == 0)) # must-keep terms (incl. factor dummies & interactions)
res1$interaction_metadata$full_formula

## ------------------------------------------------------------
## Example 2: Single continuous outcome, main effects only
## ------------------------------------------------------------
set.seed(456)
n2 <- 120
a <- rnorm(n2)
b <- rnorm(n2)
f <- factor(sample(c("a", "b"), n2, replace = TRUE))
y <- 2 * a - 1 * (f == "b") + rnorm(n2, sd = 1)

toy <- data.frame(y = y, a = a, b = b, f = f)

res2 <- select_auxiliary_variables_lasso_cv(
  df = toy,
  outcome_vars = "y",
  auxiliary_vars = c("a", "b", "f"),
  check_twoway_int = FALSE, # main effects only
  nfolds = 3,
  verbose = FALSE
)

res2$selected_variables
res2$selected_lambdas
res2$goodness_of_fit$y


auxvecLASSO documentation built on Aug. 28, 2025, 9:09 a.m.