Introduction to the auxvecLASSO package

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Load necessary libraries
library(auxvecLASSO)
library(survey)
library(sampling)
library(dplyr)

# Set seed for reproducibility
set.seed(123)

Introduction

Auxiliary variables can greatly improve performance when using models in survey data analyses, for example in contexts like survey calibration, imputation or prediction. Traditional methods for variable selection, such as the commonly used stepwise forward selection method, have several drawbacks: they are greedy/myopic, sensitive to small changes in data, and may often lead to selecting irrelevant variables due to inflated Type I Errors.

In contrast, the LASSO (Least Absolute Shrinkage and Selection Operator) offers several advantages: simultaneous selection and shrinkage, stability even in high-dimensional settings, computational efficiency, and discouragement of overfitting, leading to models that tend to generalize better to new data. In short, viewing auxiliary variable selection as a prediction/statistical learning problem can offer many advantages (see also, for example, Caughey & Hartman (2017), doi:10.2139/ssrn.3494436, and Mainzer et al. (2024), PMC7615727).
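For intuition, the core mechanics of cross-validated LASSO selection can be sketched directly with the glmnet package (assumed to be installed; the package's selection function, used below, follows the same general idea). The data here are simulated purely for illustration, and the seed is restored afterwards so that the random steps later in the vignette are unaffected.

# Illustration only: simulated data, unrelated to the api example
set.seed(42)
x_sim <- matrix(rnorm(500 * 10), 500, 10, dimnames = list(NULL, paste0("x", 1:10)))
y_sim <- rbinom(500, 1, plogis(0.8 * x_sim[, 1] - 0.6 * x_sim[, 2])) # only x1 and x2 matter

cv_fit <- glmnet::cv.glmnet(x_sim, y_sim, family = "binomial", nfolds = 5)
coef(cv_fit, s = "lambda.min") # coefficients shrunk exactly to zero are deselected

set.seed(123) # restore the vignette seed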

The auxvecLASSO package provides tools for selecting auxiliary variables using LASSO and for conducting diagnostics related to survey calibration. The main function select_auxiliary_variables_lasso_cv() lets users perform LASSO-penalized regression (logistic regression for binary outcomes or linear regression for continuous outcomes) with cross-validation to select auxiliary variables for modeling one or more outcome variables. It can include all two-way interactions among the auxiliary variables and offers the option to keep certain variables in the model, for example to enforce a hierarchical structure, through the use of zero penalty factors.
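The effect of a zero penalty factor can be sketched in the same way: columns with a penalty factor of zero are never shrunk to zero and therefore always stay in the model. The snippet below continues the simulated example above (penalty.factor is a glmnet argument; again the seed is restored afterwards).

# Illustration only: force x1 to stay in the model via a zero penalty factor
pf <- c(0, rep(1, 9)) # one penalty factor per column of x_sim
cv_fit_keep <- glmnet::cv.glmnet(x_sim, y_sim, family = "binomial",
                                 nfolds = 5, penalty.factor = pf)
coef(cv_fit_keep, s = "lambda.min") # x1 is retained regardless of the penalty

set.seed(123) # restore the vignette seed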

A relative strength of the package is the accompanying function assess_aux_vector(), which can be used to evaluate the performance of candidate auxiliary vectors.

In this vignette, we demonstrate how to apply the key functions in the auxvecLASSO package using the api dataset from the survey package.

In this example, we will:

  1. Select auxiliary variables using LASSO

  2. Evaluate the chosen auxiliary vector

Dataset

We will work with the api data from the survey package. It contains survey data about the API scores of a population of schools and additional auxiliary variables. Based on the apipop dataset, we will draw a stratified sample which will serve as the point of departure for our modeling.

We begin with the apipop dataset and add some binary variables to it to keep the calculations as simple as possible. Auxiliary variables can also have more than two categories, or even be continuous/numerical. See the survey package documentation for more information on allowed formats for auxiliary variables.

# Load the population data
data(api)

# Copy the population data and add binary variables
api_pop <- apipop
api_pop$api00_bin <- as.factor(ifelse(api_pop$api00 > 650, 1, 0))
api_pop$growth_bin <- as.factor(ifelse(api_pop$growth > 25, 1, 0))
api_pop$meals_bin <- as.factor(ifelse(api_pop$meals > 40, 1, 0))
api_pop$ell_bin <- as.factor(ifelse(api_pop$ell > 15, 1, 0))
api_pop$hsg_bin <- as.factor(ifelse(api_pop$hsg > 20, 1, 0))
api_pop$full_bin <- as.factor(ifelse(api_pop$full > 90, 1, 0))
api_pop$sch.wide_bin <- as.factor(ifelse(api_pop$sch.wide == "Yes", 1, 0))
api_pop$awards_bin <- as.factor(ifelse(api_pop$awards == "Yes", 1, 0))
api_pop$comp.imp_bin <- as.factor(ifelse(api_pop$comp.imp == "Yes", 1, 0))
api_pop$stype <- as.factor(api_pop$stype)

# Keep only relevant variables
api_pop <- api_pop |>
  dplyr::select("cds", "stype", "enroll", ends_with("_bin"))

Next, we draw a stratified sample based on stype to create a sample file.

strata_sample <- sampling::strata(
  api_pop, # The population dataset
  stratanames = c("stype"), # Stratification variable
  size = c(150, 200, 175), # Desired sample sizes per stratum
  method = "srswor" # Sampling without replacement
)

api_sample <- getdata(api_pop, strata_sample)
api_sample$SamplingWeight <- 1 / api_sample$Prob

We will also add a response indicator to the sample file.

# Artificial logistic regression coefficients (for both main effects and interactions)
coefficients <- c(
  "api00_bin" = 0.5,
  "growth_bin" = -0.3,
  "ell_bin" = -0.6,
  "interaction1" = 0.9, # Interaction term api00_bin:growth_bin
  "interaction2" = 1.2 # Interaction term api00_bin:ell_bin
)

# Create interaction terms for logistic model
api_sample$interaction1 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$growth_bin)
api_sample$interaction2 <- as.numeric(api_sample$api00_bin) * as.numeric(api_sample$ell_bin)

# Calculate the logit (log-odds) based on the artificial coefficients and interaction terms
logit <- -1.2 +
  coefficients["api00_bin"] * as.numeric(api_sample$api00_bin) +
  coefficients["growth_bin"] * as.numeric(api_sample$growth_bin) +
  coefficients["ell_bin"] * as.numeric(api_sample$ell_bin) +
  coefficients["interaction1"] * api_sample$interaction1 +
  coefficients["interaction2"] * api_sample$interaction2

# Compute the logistic probabilities
api_sample$response_prob <- 1 / (1 + exp(-logit)) # Logistic function

# Simulate the response variable (1 for response, 0 for non-response)
api_sample$response <- as.factor(rbinom(n = nrow(api_sample), size = 1, prob = api_sample$response_prob))

# --- Check the summary of the sample ---
summary(api_sample)

Selecting Auxiliary Variables via LASSO

Variables in our analyses will have different purposes. The variables response and enroll serve as the outcomes to be modeled; the binary school characteristics created above (api00_bin, growth_bin, comp.imp_bin, meals_bin, ell_bin, hsg_bin and full_bin), together with stype, serve as candidate auxiliary variables; and stype is additionally treated as a must-have variable because it is the stratification variable.

When evaluating auxiliary vector performance, we will also need a register variable with a known population mean (sch.wide_bin), survey variables observed only for respondents (enroll and full_bin), and domain variables for subgroup estimates (stype and awards_bin).

Variable selection using auxvecLASSO

The select_auxiliary_variables_lasso_cv() function performs LASSO-based variable selection for binary and continuous outcomes in a single call. In this example, we require that stype be included in the auxiliary vector since it is the stratification variable.

# Apply LASSO for selecting auxiliary variables
lasso_result <- select_auxiliary_variables_lasso_cv(
  df = api_sample,
  outcome_vars = c("response", "enroll"),
  auxiliary_vars = c(
    "api00_bin", "growth_bin", "comp.imp_bin", "meals_bin",
    "ell_bin", "hsg_bin", "full_bin", "stype"
  ),
  must_have_vars = "stype", # Force the stratification variable (stype) into the model
  check_twoway_int = TRUE, # Include two-way interactions
  nfolds = 5, # Use 5-fold cross-validation
  verbose = FALSE, # Don't print progress messages
  standardize = TRUE, # Standardize the predictors
  return_models = FALSE, # Don't return the models, only the selection results
  parallel = FALSE # Run without parallel processing
)

# Display the results
lasso_result

Understanding the Output

The output from the select_auxiliary_variables_lasso_cv() function includes several key components:

Selected variables

This is the list of all variables (main effects and interaction terms) that were selected by the LASSO model. Variables like api00_bin0, growth_bin1, stypeH, stypeM, etc., represent the individual effects of the auxiliary variables (binary or continuous) used in the model. api00_bin1:comp.imp_bin1, growth_bin1:stypeH, etc., represent interactions between two variables. These are considered important for predicting the outcome and have non-zero coefficients.

By outcome

response and enroll: These are the two outcomes modeled separately. This breakdown helps you see which variables are more relevant for each specific outcome.

Selected lambdas

Lambda is the LASSO regularization parameter that controls the strength of the penalty. The values reported here are the lambdas selected by cross-validation for each outcome.
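To see how lambda governs sparsity, one can fit a full LASSO path and compare the coefficients at a few penalty strengths. The sketch below assumes glmnet is available and uses simulated data unrelated to the api example.

# Illustration only: larger lambda values shrink more coefficients exactly to zero
z_sim <- matrix(rnorm(200 * 6), 200, 6, dimnames = list(NULL, paste0("z", 1:6)))
w_sim <- z_sim[, 1] - 0.5 * z_sim[, 2] + rnorm(200)
path <- glmnet::glmnet(z_sim, w_sim) # gaussian LASSO path
coef(path, s = c(0.5, 0.1, 0.01)) # compare sparsity at three lambda values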

Penalty factors

Two "variables" have been forced into the model (must-keep) with a penalty factor of zero. These are variables that must stay in the model regardless of the regularization. Please note that "variables" in this case refers to the columns of the model matrix. In our example, stype was specified as a must-have variable. It is a factor with three levels, where the first level is dropped as the reference category in the modeling phase (to avoid the dummy variable trap), so it contributes two model-matrix columns.

The remaining 43 "variables" are subject to regularization and can potentially be shrunk to zero depending on the strength of the lambda penalty.
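To see which model-matrix columns the zero penalty factors correspond to, the must-have factor can be expanded the same way (with the reference level dropped). A quick check on the sample data:

# stype has three levels, so it contributes two dummy columns (stypeH and stypeM)
mm_stype <- model.matrix(~ stype, data = api_sample)[, -1]
colnames(mm_stype) # the columns that receive a penalty factor of zero
head(mm_stype)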

Goodness-of-fit

This section describes the model fit at the chosen lambda.min (the lambda that minimizes cross-validation error). Note that the set of goodness-of-fit measures differs between binary and continuous outcomes.

Metrics

The fit metrics are reported separately for each outcome, i.e. once for response (binary) and once for enroll (continuous).

Coefficients at Lambda Min

This section displays the values of the coefficients (betas) at the selected lambda.min. These coefficients represent the effect of each variable in the model at the chosen regularization strength. For response, variables like api00_bin0 and growth_bin1 have non-zero coefficients, suggesting they are predictive of response. For enroll, several variables have large coefficients, including stypeH, stypeM, and comp.imp_bin1:stypeH, which appear to be influential in predicting enroll.

Variables with a zero coefficient have been regularized out of the model, meaning they didn't contribute to the prediction at the chosen lambda.min.

Interaction metadata

This section gives an overview of the results concerning interaction terms and the full formula tested.

Summary

This output demonstrates how LASSO performs both feature selection and regularization to reduce the model complexity and improve generalization. The model for response performs quite well, whereas the enroll model has room for improvement in terms of predictive power.

Assessing Auxiliary Vector Calibration and Diagnostics

Choice of auxiliary vector to examine

Next, we will assess the chosen auxiliary vector using the assess_aux_vector() function. This function provides a comprehensive assessment of auxiliary vector calibration performance for a given survey design. The function can also calibrate weights based on a specified calibration formula and population totals. It returns a detailed list of diagnostics, which can help evaluate and improve survey weights.

Based on the results from the LASSO modeling, together with a domain-expertise judgment of the statistical relationships, data sufficiency, and what usually works in survey analysis and calibration, the following auxiliary vector is assessed: stype, growth_bin, api00_bin, ell_bin, comp.imp_bin and hsg_bin, together with the interactions api00_bin:stype, hsg_bin:stype, comp.imp_bin:stype and api00_bin:growth_bin (see the calibration formula defined below).

Preparations

To make the example as realistic as possible, the response set from the sample file is analyzed (since, in practice, we don't have survey variable information for non-respondents).

api_resp <- api_sample[api_sample$response == 1, ]

We will also create a survey design object to pass to the assessment function.

design <- survey::svydesign(
  ids     = ~1,
  data    = api_resp,
  strata  = ~stype,
  weights = ~SamplingWeight
)

Since we request register variable diagnostics, the population means of the register variable need to be calculated.

register_pop_means <- list(
  total = list(sch.wide_bin = mean(api_pop$sch.wide_bin == "1")),
  by_domain = list(
    stype = tapply(
      api_pop$sch.wide_bin, api_pop$stype,
      function(x) mean(x == "1")
    ),
    awards_bin = tapply(
      api_pop$sch.wide_bin, api_pop$awards_bin,
      function(x) mean(x == "1")
    )
  )
)

We also need to calculate population totals for the weight calibration to be possible. This can be done using the package function generate_population_totals(), which requires a calibration formula as input.

## --- Define the calibration formula ---
calibration_formula <- ~ stype + growth_bin + api00_bin + ell_bin + comp.imp_bin + hsg_bin + api00_bin:stype + hsg_bin:stype + comp.imp_bin:stype + api00_bin:growth_bin

## --- Calculate population totals with a single terms object built on the POPULATION ---
pop_totals <- generate_population_totals(
  population_df = api_pop,
  calibration_formula
)
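For intuition, the population totals used in calibration are essentially the column sums of the population model matrix built from the calibration formula (the intercept column sums to the population size). A quick check, assuming the population file is complete for the variables in the formula (rows with missing values would be dropped by model.matrix()):

mm_pop <- model.matrix(calibration_formula, data = api_pop)
head(colSums(mm_pop)) # should correspond to the generated population totals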

Calling the assessment function

We use the objects created in the preparations phase and pass them as arguments to the assessment function.

Please note that the assessment function can be used to compare different auxiliary vectors by passing different calibration formulae (for example, one with and one without interaction terms) together with their matching population totals.

result_diagnostics <- assess_aux_vector(
  design = design,
  df = api_resp,
  already_calibrated = FALSE,
  calibration_formula = calibration_formula,
  calibration_pop_totals = pop_totals$population_totals,
  register_vars = c("sch.wide_bin"),
  register_pop_means = register_pop_means,
  survey_vars = c("enroll", "full_bin"),
  domain_vars = c("stype", "awards_bin"),
  diagnostics = c(
    "weight_variation",
    "register_diagnostics",
    "survey_diagnostics"
  ),
  verbose = FALSE
)

## --- Display output ---
result_diagnostics

Understanding the Output

The function assess_aux_vector() produces a structured summary of diagnostics that help you understand how your calibration or auxiliary vector adjustment performed. The output is organized into three main sections:

Weight variation metrics

This block describes how the analysis weights are distributed. Stable, well-behaved weights improve the reliability of estimates.
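For illustration, the same kind of statistics can be computed by hand from a vector of weights; here we use the weights of the input design (the function reports the corresponding metrics for the weights it evaluates).

w <- weights(design)
summary(w) # spread of the weights
sd(w) / mean(w) # coefficient of variation of the weights
max(w) / min(w) # ratio of largest to smallest weight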

Register variable diagnostics

For each register variable supplied with population means, the function reports design-based estimates, uncertainty, and bias relative to the benchmark.

Statistics are reported per variable and, if requested, per domain.

Based on the output, the calibrated estimates match the register means closely (no evidence of notable bias). For example, at the total level, the bias of the estimated mean of sch.wide_bin is −0.0027 with a p-value of 0.849, which indicates that the chosen auxiliary vector consists of valuable auxiliary variables. Similar results are seen at the domain level. As a test, the domain awards_bin=1 was included, in which all rows have sch.wide_bin=1; the output mean estimate of exactly 1.000 is consistent with the given data structure.
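To see what this diagnostic summarises, the analogous comparison can be made by hand with the survey package (shown here on the input design for illustration; assess_aux_vector() makes the comparison with the weights it evaluates).

est_reg <- survey::svymean(~sch.wide_bin, design)
est_reg # design-based proportion for each level of sch.wide_bin
register_pop_means$total$sch.wide_bin # register (population) benchmark for level 1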

Survey variable diagnostics

The assessment function provides design-based estimates and precision for the survey variable estimates. Bias/MSE/p-values are NA because there is no benchmark to compare against.

Overall mean enrollment was estimated at 603.6 students (RSE = 2.1%), and the estimated proportion of schools with full_bin = 1 was 0.516 (RSE = 5.5%). Domain-level RSEs generally remained below 10%, supporting reliable comparisons across stype and awards_bin.
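Comparable design-based estimates and relative standard errors can be produced directly with the survey package (again shown on the input design for illustration).

overall_enroll <- survey::svymean(~enroll, design, na.rm = TRUE)
overall_enroll
survey::cv(overall_enroll) # relative standard error of the mean

by_stype <- survey::svyby(~enroll, ~stype, design, survey::svymean, na.rm = TRUE)
by_stype
survey::cv(by_stype) # domain-level relative standard errors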

Conclusion

The auxvecLASSO package is designed to help survey practitioners treat auxiliary variable selection as a statistical learning problem. In this vignette, we walked through two key components of the package:

  1. select_auxiliary_variables_lasso_cv(), which selects auxiliary variables via cross-validated LASSO, and

  2. assess_aux_vector(), which evaluates a candidate auxiliary vector through calibration and diagnostics.

Together, these tools provide a workflow for:

  1. generating candidate auxiliary vectors, and

  2. checking whether the resulting weights behave sensibly and lead to accurate, stable estimates.

While this vignette illustrated the functions on a simple dataset, the same approach extends to larger and more complex surveys. Readers are encouraged to adapt the demonstrated steps to their own data, paying special attention to the diagnostics: stable weights, low bias on register variables, and acceptable precision on survey estimates are all signs that the chosen auxiliary vector is working well.

For further details, advanced options, and examples, consult the package documentation.


