fit_outcome: Fit LASSO model for a single outcome with cross-validation

View source: R/fit_outcome.R

fit_outcome    R Documentation

Fit LASSO model for a single outcome with cross-validation

Description

Fits a LASSO regression model (logistic regression for binary outcomes or linear regression for continuous outcomes) for a single outcome variable. Rows where the outcome is missing are dropped, and the predictor matrix must not contain missing values. The model is fitted with the glmnet package, and cross-validation is used to select the optimal regularization parameter (lambda).

Usage

fit_outcome(
  yvar,
  df,
  X,
  penalty_factors,
  nfolds = 5,
  standardize = TRUE,
  parallel = FALSE,
  return_models = FALSE,
  verbose = FALSE
)

Arguments

yvar

Character scalar. The name of the outcome variable in df. This can be either a binary or continuous outcome variable.

df

Data frame containing the outcome and predictors. The outcome variable (yvar) and predictor variables must be included in df.

X

Model matrix whose rows align with those of df. The matrix must not contain missing values.

penalty_factors

Named numeric vector of penalty factors for each predictor variable. The names should match the column names of X.

nfolds

Number of folds for cross-validation. Default is 5.

standardize

Logical; should the predictors be standardized before fitting the model? Default is TRUE.

parallel

Logical; should the cross-validation be performed in parallel? Default is FALSE.

return_models

Logical; should the fitted cv.glmnet object be returned? Default is FALSE.

verbose

Logical; if TRUE, prints progress messages. Default is FALSE.

Details

  • The outcome variable (yvar) can be either binary or continuous:

    • Binary outcomes: LASSO logistic regression is used. The outcome variable must have exactly two levels after missing values are removed.

    • Continuous outcomes: LASSO linear regression is used. The outcome variable should be numeric.

  • Rows with missing values for the outcome (yvar) or predictors (X) will be dropped.

  • The function uses the glmnet package to fit the LASSO model.

  • Cross-validation is used to select the optimal regularization parameter (lambda).

  • Model performance metrics are computed on the non-missing rows: AUC, accuracy, and Brier score for binary outcomes; RSS, MSE, RMSE, MAE, and R-squared for continuous outcomes.
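
The sketch below illustrates this fitting logic with a direct call to glmnet::cv.glmnet. It is a simplified approximation rather than the package's internal code, and the object names (X_keep, cvfit) are illustrative only.

## Minimal sketch (illustrative only): choose the glmnet family from the
## outcome type, drop rows with a missing outcome, and fit with CV.
y <- df[[yvar]]
keep <- !is.na(y)                         # rows with an observed outcome
y <- y[keep]
X_keep <- X[keep, , drop = FALSE]         # predictors must be complete
family <- if (is.numeric(y)) "gaussian" else "binomial"
cvfit <- glmnet::cv.glmnet(
  x = X_keep, y = y, family = family,
  penalty.factor = penalty_factors[colnames(X_keep)],
  nfolds = 5, standardize = TRUE
)
coefs <- coef(cvfit, s = "lambda.min")    # sparse coefficients at lambda.min
selected <- setdiff(rownames(coefs)[as.vector(coefs) != 0], "(Intercept)")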

Value

A list containing:

selected

A character vector of the names of the selected variables (non-zero coefficients).

lambda_min

The value of lambda that minimizes the cross-validation error.

goodness

A list containing performance metrics for the model:

  • cross_validated: A list with cv_error (cross-validation error) and cv_error_sd (standard deviation of CV error).

  • full_data: A list with:

    • deviance_explained: Proportion of deviance explained by the model (only for binary outcomes).

    • auc: Area under the ROC curve (only for binary outcomes).

    • accuracy: Classification accuracy (only for binary outcomes).

    • brier_score: Brier score (mean squared error of predicted probabilities) (only for binary outcomes).

    • rss: Residual sum of squares (only for continuous outcomes).

    • mse: Mean squared error (only for continuous outcomes).

    • rmse: Root mean squared error (only for continuous outcomes).

    • mae: Mean absolute error (only for continuous outcomes).

    • r_squared: R-squared (only for continuous outcomes).

    • raw_coefs: Raw coefficients from the LASSO model.

    • abs_coefs: Absolute values of the coefficients.

model

The fitted cv.glmnet model, if return_models = TRUE. Otherwise, NULL.

Examples

## ============================================================
## Example 1: Binary outcome (binomial LASSO with CV)
## ============================================================
if (requireNamespace("glmnet", quietly = TRUE) &&
  requireNamespace("pROC", quietly = TRUE)) {
  set.seed(101)

  n <- 180
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  f <- factor(sample(c("A", "B", "C"), n, replace = TRUE))

  ## Construct a binary outcome with signal in x2 and x1:x2
  lin <- -0.3 + 1.0 * x2 + 0.6 * (x1 * x2) - 0.7 * (f == "C")
  p <- 1 / (1 + exp(-lin))
  yfac <- factor(rbinom(n, 1, p), labels = c("No", "Yes")) # 2-level factor

  df <- data.frame(y = yfac, x1 = x1, x2 = x2, f = f)

  ## Model matrix with main effects + one interaction, no intercept
  X <- model.matrix(~ x1 + x2 + f + x1:x2 - 1, data = df)

  ## Penalty factors must match X's columns (names + length).
  penalty_factors <- rep(1, ncol(X))
  names(penalty_factors) <- colnames(X)
  ## (Optional) keep x1 unpenalized:
  if ("x1" %in% names(penalty_factors)) penalty_factors["x1"] <- 0

  fit_bin <- fit_outcome(
    yvar            = "y",
    df              = df,
    X               = X,
    penalty_factors = penalty_factors,
    nfolds          = 3,
    standardize     = TRUE,
    parallel        = FALSE,
    return_models   = FALSE,
    verbose         = FALSE
  )

  ## Peek at the results
  fit_bin$selected
  fit_bin$lambda_min
  fit_bin$goodness$full_data$auc
  fit_bin$goodness$full_data$accuracy
}

## ============================================================
## Example 2: Continuous outcome (gaussian LASSO with CV)
## ============================================================
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(202)

  n <- 160
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  f <- factor(sample(c("L", "H"), n, replace = TRUE))

  y <- 1.5 * x1 + 0.8 * x2 - 1.0 * (f == "H") + 0.6 * (x1 * x2) + rnorm(n, sd = 0.7)
  df <- data.frame(y = y, x1 = x1, x2 = x2, f = f)

  ## Main effects only, no intercept
  X <- model.matrix(~ x1 + x2 + f - 1, data = df)

  penalty_factors <- rep(1, ncol(X))
  names(penalty_factors) <- colnames(X)

  fit_cont <- fit_outcome(
    yvar            = "y",
    df              = df,
    X               = X,
    penalty_factors = penalty_factors,
    nfolds          = 3,
    standardize     = TRUE,
    parallel        = FALSE,
    return_models   = FALSE,
    verbose         = FALSE
  )

  ## Key metrics
  fit_cont$selected
  fit_cont$lambda_min
  fit_cont$goodness$full_data$mse
  fit_cont$goodness$full_data$r_squared
}
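
## ============================================================
## Example 3: Returning the fitted cv.glmnet object
## ============================================================
## A sketch of using return_models = TRUE; it reuses the objects from
## Example 2 (df, X, penalty_factors) and is illustrative only.
if (requireNamespace("glmnet", quietly = TRUE)) {
  fit_mod <- fit_outcome(
    yvar            = "y",
    df              = df,
    X               = X,
    penalty_factors = penalty_factors,
    nfolds          = 3,
    return_models   = TRUE
  )

  ## With return_models = TRUE, the cv.glmnet object is returned in $model
  if (!is.null(fit_mod$model)) {
    plot(fit_mod$model)                   # CV error curve across lambda
    coef(fit_mod$model, s = "lambda.min") # coefficients at lambda.min
  }
}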

