select_auxiliary_variables_lasso_cv: Select Auxiliary Variables via LASSO with Cross-Validation...
In auxvecLASSO: LASSO Auxiliary Variable Selection and Auxiliary Vector Diagnostics

Select Auxiliary Variables via LASSO with Cross-Validation (Binary and Continuous Outcomes)

Description

This function performs LASSO-penalized regression (logistic regression for binary outcomes, linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables. It can include all two-way interactions among the auxiliary variables, and it can force chosen variables to remain in the model by assigning them a penalty factor of zero.

Usage

select_auxiliary_variables_lasso_cv(
  df,
  outcome_vars,
  auxiliary_vars,
  must_have_vars = NULL,
  check_twoway_int = TRUE,
  nfolds = 5,
  verbose = TRUE,
  standardize = TRUE,
  return_models = FALSE,
  parallel = FALSE
)

Arguments

`df`	A data frame containing the data for modeling.
`outcome_vars`	Character vector of outcome variable names to model. These can be either binary or continuous outcomes. Each must exist in `df` and have at least two unique values (after factor conversion for binary outcomes).
`auxiliary_vars`	Character vector of auxiliary variable names to be used as predictors.
`must_have_vars`	Optional character vector of variable names that must be included in the model (penalty factor 0). If interactions are included, any interaction containing a must-have variable is also assigned zero penalty. The variables in `must_have_vars` should refer to either individual variables or the main effect part of interaction terms.
`check_twoway_int`	Logical; include all two-way interactions among auxiliary variables. Defaults to `TRUE`.
`nfolds`	Number of folds for cross-validation. Defaults to 5.
`verbose`	Logical; print progress messages. Defaults to `TRUE`.
`standardize`	Logical; standardize predictors before fitting. Defaults to `TRUE`.
`return_models`	Logical; return fitted `cv.glmnet` objects. Defaults to `FALSE`.
`parallel`	Logical; run cross-validation in parallel (requires doParallel). Defaults to `FALSE`.
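
As a rough illustration of how these arguments could map onto glmnet, the sketch below calls `glmnet::cv.glmnet` directly with `nfolds`, `standardize`, `parallel`, and a `penalty.factor` vector. The mapping is an assumption, not the package's actual code; the toy data and the names `pf`, `cv_fit`, and `selected` are invented for the example, and the block only runs if glmnet is installed.

```r
# Sketch only: assumes glmnet is installed; not the package's own code.
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  n <- 100
  x <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  y <- rbinom(n, 1, plogis(0.8 * x[, "x1"]))

  # Penalty factor 0 exempts a column from the LASSO penalty (a "must-have").
  pf <- c(x1 = 0, x2 = 1, x3 = 1)

  cv_fit <- glmnet::cv.glmnet(
    x, y,
    family = "binomial",   # "gaussian" would be used for a continuous outcome
    nfolds = 5,
    standardize = TRUE,
    penalty.factor = pf,
    parallel = FALSE       # TRUE needs a registered parallel backend
  )

  # Nonzero coefficients at lambda.min, excluding the intercept.
  cf <- as.matrix(coef(cv_fit, s = "lambda.min"))
  selected <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
  print(selected)
}
```

Because `x1` carries no penalty, it stays in the model at every lambda, which is the behavior `must_have_vars` is described as providing.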

Details

The function outputs a list with the selected variables across outcomes, the associated lambda values, the goodness-of-fit statistics, and optionally the fitted models and interaction terms.

The function supports two types of outcome variables:

Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.
Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.
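
The distinction between the two outcome types can be sketched with a small helper. `guess_family()` is a hypothetical function written for this illustration; the rule the package actually uses to classify an outcome is not documented here, and treating a numeric vector with exactly two unique values as binary is an assumption.

```r
# Hypothetical helper: decide which glmnet family an outcome would need.
guess_family <- function(y) {
  y <- y[!is.na(y)]                      # missing values are dropped first
  n_unique <- length(unique(y))
  if (n_unique < 2) {
    stop("outcome needs at least two unique values")
  }
  categorical <- is.factor(y) || is.character(y) || is.logical(y)
  if (categorical || n_unique == 2) {
    if (n_unique != 2) {
      stop("binary outcome must have exactly two levels")
    }
    return("binomial")                   # LASSO logistic regression
  }
  if (is.numeric(y)) {
    return("gaussian")                   # LASSO linear regression
  }
  stop("unsupported outcome type")
}

set.seed(1)
guess_family(c(0, 1, 1, NA, 0))          # two values after removing NA
guess_family(factor(c("yes", "no", "yes")))
guess_family(rnorm(20))
```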

For factor variables in `auxiliary_vars`, dummy variables are created to represent each level of the factor. If a factor variable is listed in `must_have_vars`, all of its dummy variables are kept in the model, and any interactions containing those dummy variables are forced into the model as well.
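
The dummy expansion and the zero-penalty rule can be sketched in base R. The code below is an illustration under stated assumptions, not the package's implementation: it uses `model.matrix()` with default treatment contrasts (which omits the reference level of each factor), and the names `d`, `form`, `term_vars`, and `penalty_factors` are made up for the example.

```r
set.seed(1)
d <- data.frame(
  x1 = rnorm(6),
  x2 = rnorm(6),
  group = factor(c("A", "B", "C", "A", "B", "C"))
)
must_have <- c("x1", "group")

# Main effects plus all two-way interactions.
form <- ~ (x1 + x2 + group)^2
mm <- model.matrix(form, data = d)

# 'factors' records which variables enter each term;
# 'assign' records which term each design-matrix column belongs to.
term_vars <- attr(terms(form), "factors")
col_term <- attr(mm, "assign")

# Drop the intercept column (term index 0).
keep <- col_term > 0
X <- mm[, keep, drop = FALSE]
col_term <- col_term[keep]

# A column gets penalty 0 if its term involves any must-have variable.
touches_must_have <- vapply(
  col_term,
  function(k) any(term_vars[must_have, k] > 0),
  logical(1)
)
penalty_factors <- setNames(ifelse(touches_must_have, 0, 1), colnames(X))
penalty_factors
```

Mapping columns back to variables through the `assign` attribute avoids matching on column-name prefixes, which would confuse names such as `x1` and `x10`.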

Value

An object of class "select_auxiliary_variables_lasso_cv" with the following components:

selected_variables: Character vector of variables selected across all outcome models. This includes the main effect variables and any interaction terms.
by_outcome: Named list of character vectors, each containing the selected variables for each outcome.
selected_lambdas: Named numeric vector of lambda values (specifically, lambda.min) for each outcome.
penalty_factors: Named numeric vector with penalty factors (0 for must-keep, 1 otherwise).
models: List of cv.glmnet objects per outcome if return_models = TRUE, otherwise an empty list.
goodness_of_fit: Named list per outcome with cross-validation metrics (cv_error, cv_error_sd) and full data metrics (deviance_explained for binary outcomes, auc, accuracy, brier_score, rss, mse, r_squared, raw_coefs).
interaction_metadata: List containing metadata on interaction terms, main effects in interactions, and the full formula used.
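
For orientation, several of the full-data metrics named under goodness_of_fit have standard textbook definitions, sketched below in base R. The helpers `binary_metrics()` and `continuous_metrics()` are hypothetical, and the 0.5 classification threshold is an assumption; the package may compute these quantities differently.

```r
# Hypothetical helpers using textbook definitions of the metrics.
binary_metrics <- function(y, p, threshold = 0.5) {
  stopifnot(all(y %in% c(0, 1)), length(y) == length(p))
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  r <- rank(p)  # average ranks, so ties are handled
  # AUC via the Mann-Whitney rank-sum identity.
  auc <- (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  list(
    auc = auc,
    accuracy = mean((p >= threshold) == (y == 1)),
    brier_score = mean((p - y)^2)
  )
}

continuous_metrics <- function(y, yhat) {
  stopifnot(length(y) == length(yhat))
  rss <- sum((y - yhat)^2)          # residual sum of squares
  tss <- sum((y - mean(y))^2)       # total sum of squares
  list(rss = rss, mse = rss / length(y), r_squared = 1 - rss / tss)
}

bin <- binary_metrics(y = c(0, 0, 1, 1), p = c(0.1, 0.4, 0.35, 0.8))
cont <- continuous_metrics(y = c(1, 2, 3), yhat = c(1, 2, 4))
str(bin)
str(cont)
```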

Examples

## ------------------------------------------------------------
## Example 1: Binary + continuous outcomes, with interactions
##             and must-have variables (factor expanded to dummies)
## ------------------------------------------------------------
set.seed(123)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

## Generate outcomes with some signal in x1, x2 and group, plus an interaction
eta_bin <- -0.5 + 1.2 * x2 - 0.8 * (group == "C") + 0.5 * x1 * x2
p <- 1 / (1 + exp(-eta_bin))
y_bin <- rbinom(n, 1, p)
y_cont <- 1.5 * x1 - 2 * (group == "B") + 0.7 * x1 * x2 + rnorm(n, sd = 0.7)

df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, group = group)

res1 <- select_auxiliary_variables_lasso_cv(
  df = df,
  outcome_vars = c("y_bin", "y_cont"),
  auxiliary_vars = c("x1", "x2", "group"),
  must_have_vars = c("x1", "group"), # 'group' (factor) expands to its dummies
  check_twoway_int = TRUE,
  nfolds = 3,
  verbose = FALSE,
  standardize = TRUE,
  return_models = FALSE
)

## Inspect selections and metadata
res1$selected_variables
res1$by_outcome
res1$selected_lambdas
names(which(res1$penalty_factors == 0)) # must-keep terms (incl. factor dummies & interactions)
res1$interaction_metadata$full_formula

## ------------------------------------------------------------
## Example 2: Single continuous outcome, main effects only
## ------------------------------------------------------------
set.seed(456)
n2 <- 120
a <- rnorm(n2)
b <- rnorm(n2)
f <- factor(sample(c("a", "b"), n2, replace = TRUE))
y <- 2 * a - 1 * (f == "b") + rnorm(n2, sd = 1)

toy <- data.frame(y = y, a = a, b = b, f = f)

res2 <- select_auxiliary_variables_lasso_cv(
  df = toy,
  outcome_vars = "y",
  auxiliary_vars = c("a", "b", "f"),
  check_twoway_int = FALSE, # main effects only
  nfolds = 3,
  verbose = FALSE
)

res2$selected_variables
res2$selected_lambdas
res2$goodness_of_fit$y

auxvecLASSO documentation built on Aug. 28, 2025, 9:09 a.m.