View source: R/select_auxiliary_variables_lasso_cv.R
select_auxiliary_variables_lasso_cv | R Documentation |
This function performs LASSO-penalized regression (logistic regression for binary outcomes or linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables. It allows for the inclusion of all two-way interactions among the auxiliary variables and the option to force certain variables to remain in the model through the use of zero penalty factors.
select_auxiliary_variables_lasso_cv(
  df,
  outcome_vars,
  auxiliary_vars,
  must_have_vars = NULL,
  check_twoway_int = TRUE,
  nfolds = 5,
  verbose = TRUE,
  standardize = TRUE,
  return_models = FALSE,
  parallel = FALSE
)
df |
A data frame containing the data for modeling. |
outcome_vars |
Character vector of outcome variable names to model. These can be either binary or continuous outcomes. Each
must exist in |
auxiliary_vars |
Character vector of auxiliary variable names to be used as predictors. |
must_have_vars |
Optional character vector of variable names that must be included
in the model (penalty factor 0). If interactions are included, any interaction containing
a must-have variable is also assigned zero penalty. The variables in |
check_twoway_int |
Logical; include all two-way interactions among auxiliary variables. Defaults to TRUE. |
nfolds |
Number of folds for cross-validation. Defaults to 5. |
verbose |
Logical; print progress messages. Defaults to TRUE. |
standardize |
Logical; standardize predictors before fitting. Defaults to TRUE. |
return_models |
Logical; return fitted cv.glmnet objects. Defaults to FALSE. |
parallel |
Logical; run cross-validation in parallel (requires doParallel). Defaults to FALSE. |
The function returns a list containing the variables selected across outcomes, the associated lambda values, goodness-of-fit statistics, and optionally the fitted models and interaction terms.
The function supports two types of outcome variables:
Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.
Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.
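The choice between the two model types can be sketched as follows. This is a minimal illustration, not the package's actual code: the helper guess_family() is hypothetical, and it assumes that any outcome with exactly two distinct non-missing values is treated as binary. The only external call is cv.glmnet() from the glmnet package, used with its documented family, alpha, nfolds and standardize arguments.

```r
library(glmnet)

# Hypothetical helper (not part of the package): pick a glmnet family
# from the outcome after dropping missing values.
guess_family <- function(y) {
  y <- y[!is.na(y)]
  if (length(unique(y)) == 2) {
    "binomial"   # binary outcome -> LASSO logistic regression
  } else if (is.numeric(y)) {
    "gaussian"   # continuous outcome -> LASSO linear regression
  } else {
    stop("Outcome must be binary or numeric.")
  }
}

# Simulated toy data (assumption, for illustration only).
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, paste0("x", 1:4)))
y_cont <- 2 * x[, 1] + rnorm(n)
y_bin  <- rbinom(n, 1, plogis(1.5 * x[, 2]))

fam_cont <- guess_family(y_cont)
fam_bin  <- guess_family(y_bin)

# alpha = 1 is the LASSO penalty (the glmnet default).
fit <- cv.glmnet(x, y_cont, family = fam_cont, alpha = 1,
                 nfolds = 5, standardize = TRUE)

# Non-zero coefficients at lambda.min, excluding the intercept.
b <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
selected <- names(b)[b != 0]
selected
```

Fitting the binary outcome works the same way with family = fam_bin.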
For factor variables in auxiliary_vars, dummy variables are created to represent each level of the factor. If a factor variable is specified in must_have_vars, all of its dummy variables are forced into the model (zero penalty), and any interactions containing those dummy variables are forced into the model as well.
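The dummy expansion and the zero penalty factors described above can be sketched in base R. This is an assumption-laden illustration rather than the package's implementation: it uses model.matrix() with default treatment contrasts, which drops the reference level of each factor (the package may encode levels differently), and the toy data frame and the object names mm, pf and must_have are made up for the example.

```r
# Toy data (assumption, for illustration only).
df <- data.frame(
  x1    = c(0.2, -1.1, 0.7, 1.5, -0.3, 0.9),
  x2    = c(1.0, 0.4, -0.6, 0.1, -1.2, 0.8),
  group = factor(c("A", "B", "C", "A", "B", "C"))
)
must_have <- c("x1", "group")

# Main effects plus all two-way interactions.
form <- ~ (x1 + x2 + group)^2
mm   <- model.matrix(form, data = df)

# 'factors' maps variables (rows) to model terms (columns);
# 'assign' maps each design-matrix column to its term (0 = intercept).
fac    <- attr(terms(form), "factors")
assign <- attr(mm, "assign")

# A term is forced in if it involves at least one must-have variable.
term_has_must <- colSums(fac[must_have, , drop = FALSE] > 0) > 0

keep <- assign > 0                      # drop the intercept column
X    <- mm[, keep, drop = FALSE]
pf   <- ifelse(term_has_must[assign[keep]], 0, 1)
names(pf) <- colnames(X)

pf                       # 0 = always kept, 1 = subject to the LASSO penalty
names(pf)[pf == 0]       # must-keep columns, including dummies and interactions
```

A vector like pf could then be passed to glmnet through its penalty.factor argument, so that the forced terms are never shrunk to zero.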
An object of class "select_auxiliary_variables_lasso_cv" with the following components:
selected_variables: Character vector of variables selected across all outcome models. This includes main-effect variables and any interaction terms.
by_outcome: Named list of character vectors, one per outcome, each containing the variables selected for that outcome.
selected_lambdas: Named numeric vector of lambda values (specifically, lambda.min) for each outcome.
penalty_factors: Named numeric vector of penalty factors (0 for must-keep terms, 1 otherwise).
List of cv.glmnet objects per outcome if return_models = TRUE, otherwise an empty list.
goodness_of_fit: Named list per outcome with cross-validation metrics (cv_error, cv_error_sd) and full data metrics (deviance_explained for binary outcomes, auc, accuracy, brier_score, rss, mse, r_squared, raw_coefs).
interaction_metadata: List containing metadata on interaction terms, main effects in interactions, and the full formula used (full_formula).
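For readers unfamiliar with the goodness-of-fit metrics named above, the following base R sketch shows one common way to compute several of them from observed values and predictions. The exact definitions used by the package are not stated here, so the 0.5 classification threshold, the rank-based AUC formula and the toy vectors are all assumptions.

```r
# Binary outcome: observed 0/1 values and predicted probabilities (toy data).
y_bin <- c(0, 0, 1, 1, 0, 1)
p_hat <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9)

# Mean squared difference between predicted probability and outcome.
brier_score <- mean((p_hat - y_bin)^2)

# Share of correct classifications at an assumed 0.5 threshold.
accuracy <- mean((p_hat >= 0.5) == (y_bin == 1))

# AUC via the rank-sum (Mann-Whitney) identity.
n1  <- sum(y_bin == 1)
n0  <- sum(y_bin == 0)
auc <- (sum(rank(p_hat)[y_bin == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Continuous outcome: observed values and fitted values (toy data).
y_cont <- c(1.0, 2.0, 3.0, 4.0)
y_hat  <- c(1.1, 1.9, 3.2, 3.8)

rss       <- sum((y_cont - y_hat)^2)                      # residual sum of squares
mse       <- rss / length(y_cont)                         # mean squared error
r_squared <- 1 - rss / sum((y_cont - mean(y_cont))^2)     # share of variance explained
```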
## ------------------------------------------------------------
## Example 1: Binary + continuous outcomes, with interactions
## and must-have variables (factor expanded to dummies)
## ------------------------------------------------------------
set.seed(123)
n <- 150
x1 <- rnorm(n)
x2 <- rnorm(n)
group <- factor(sample(c("A", "B", "C"), n, replace = TRUE))
## Generate outcomes with some signal in x1, x2 and group, plus an interaction
eta_bin <- -0.5 + 1.2 * x2 - 0.8 * (group == "C") + 0.5 * x1 * x2
p <- 1 / (1 + exp(-eta_bin))
y_bin <- rbinom(n, 1, p)
y_cont <- 1.5 * x1 - 2 * (group == "B") + 0.7 * x1 * x2 + rnorm(n, sd = 0.7)
df <- data.frame(y_bin = y_bin, y_cont = y_cont, x1 = x1, x2 = x2, group = group)
res1 <- select_auxiliary_variables_lasso_cv(
  df = df,
  outcome_vars = c("y_bin", "y_cont"),
  auxiliary_vars = c("x1", "x2", "group"),
  must_have_vars = c("x1", "group"), # 'group' (factor) expands to its dummies
  check_twoway_int = TRUE,
  nfolds = 3,
  verbose = FALSE,
  standardize = TRUE,
  return_models = FALSE
)
## Inspect selections and metadata
res1$selected_variables
res1$by_outcome
res1$selected_lambdas
names(which(res1$penalty_factors == 0)) # must-keep terms (incl. factor dummies & interactions)
res1$interaction_metadata$full_formula
## ------------------------------------------------------------
## Example 2: Single continuous outcome, main effects only
## ------------------------------------------------------------
set.seed(456)
n2 <- 120
a <- rnorm(n2)
b <- rnorm(n2)
f <- factor(sample(c("a", "b"), n2, replace = TRUE))
y <- 2 * a - 1 * (f == "b") + rnorm(n2, sd = 1)
toy <- data.frame(y = y, a = a, b = b, f = f)
res2 <- select_auxiliary_variables_lasso_cv(
  df = toy,
  outcome_vars = "y",
  auxiliary_vars = c("a", "b", "f"),
  check_twoway_int = FALSE, # main effects only
  nfolds = 3,
  verbose = FALSE
)
res2$selected_variables
res2$selected_lambdas
res2$goodness_of_fit$y