knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Statistical inference after model selection is challenging because standard methods often ignore the selection process, leading to biased results like confidence intervals with lower-than-nominal coverage. This issue, known as post-selection inference (PoSI), is addressed by the PosiR package, which implements the methodology from @kuchibhotla2022. PosiR constructs confidence intervals that are simultaneously valid across a pre-defined universe of models, ensuring the desired coverage probability (e.g., 95%) regardless of the model selection process.
This vignette introduces Post-Selection Inference concepts, lists the package functions, explains the methodology, and provides examples.
The PosiR package provides two core functions:
simultaneous_ci()Key Arguments:
X: Design matrix (n Γ p, no intercept).
y: Response vector (length n).
Q_universe: Named list of model indices (including intercept if add_intercept = TRUE).
B: Number of bootstrap samples (default: 1000).
alpha: Significance level (default: 0.05).
verbose: Logical; if TRUE, shows progress messages.
Returns:
intervals: Data frame with columns model_id, coefficient_name, estimate, lower, upper, psi_hat_nqj, se_nqj.
K_alpha: Critical value for simultaneous coverage.
Other diagnostic outputs (e.g., T_star_b, ``warnings).
plot.simultaneous_ci_result()
Visualizes results from simultaneous_ci().
Key Arguments:
x: A simultaneous_ci_result object.
subset_pars: Vector of coefficient names to plot.
las.labels: Label orientation (e.g., 2 for perpendicular).
col.ci: Color of the confidence intervals.
main: Plot title.
PosiR constructs confidence intervals for coefficients $\theta_{q,j}$ (where $q$ denotes a model in the universe π¬ and $j$ denotes a specific coefficient within that model) that are simultaneously valid.
$$ P(\theta_{q,j} \in \widehat{\text{CI}}_{q,j} \text{ for all } q \in π¬, j \text{ in model } q) \ge 1-\alpha $$
The interval for $\theta_{q,j}$ is: $$ \widehat{\text{CI}}{q,j} = \left[ \widehat{\theta}{q,j} \pm \widehat{K}{\alpha} \frac{\widehat{\Psi}{n,q,j}^{1/2}}{\sqrt{n}} \right] $$ where:
$\widehat{\theta}_{q,j}$ is the estimate of the coefficient (e.g., from OLS) in model $q$ using the original data.
$\widehat{\Psi}{n,q,j}$ is an estimate of the asymptotic variance of $\sqrt{n}(\widehat{\theta}{q,j} - \theta_{q,j})$.
$\widehat{K}_{\alpha}$ is a critical value (quantile) obtained from a bootstrap procedure, designed to ensure simultaneous coverage.
The bootstrap procedure (Algorithm 1 from @kuchibhotla2022) is:
Original Estimate: Compute $\widehat{\theta}_{q,j}$ for all relevant $j$ in all models $q \in π¬$ using the original data.
Bootstrap Sampling: For $b = 1, \ldots, B$, resample pairs $(X_i, y_i)$ with replacement from the original data to compute the bootstrap estimates $\widehat{\theta}_{q,j}^{*b}$ for all $j, q$.
Variance Estimation:
$$ \widehat{\Psi}{n,q,j} := \frac{1}{B{valid}-1} \sum_{b \in \text{valid}} \left[ \sqrt{n} \left( \widehat{\theta}{q,j}^{*b} - \widehat{\theta}{q,j} \right) \right]^2 $$
Bootstrap Max-t Statistics (( T^{*b} )) $$ T^{b} := \max_{q \in π¬} \max_{j \in \text{coeffs}(q)} \left| n^{1/2} \widehat{\Psi}{n,q,j}^{-1/2} \left( \widehat{\theta}{q,j}^{b} - \widehat{\theta}_{q,j} \right) \right| $$
Critical Value: Compute $\widehat{K}_{\alpha}$ as the $(1-\alpha)$ quantile of the collected bootstrap max-t statistics ${ T^{1}, \ldots, T^{B} }$.
Confidence Intervals: $$ \widehat{\theta}{q,j} \pm \widehat{K}{\alpha} \sqrt{\widehat{\Psi}_{n,q,j}/n} $$
By using $\widehat{K}{\alpha}$ derived from the maximum statistic across the model universe, the resulting intervals achieve the desired simultaneous coverage probability: $$ P(\text{all } \theta{q,j} \in \widehat{\text{CI}}_{q,j}) \ge 1-\alpha $$
First, ensure devtools is installed:
if (!requireNamespace("devtools", quietly = TRUE)) { install.packages("devtools") } devtools::install_github("Chukyhenry/PosiR")
Optional dependencies:
pbapply: For progress bars during computation.
glmnet: For the Lasso example (Example 3).
install.packages("pbapply") install.packages("glmnet")
if (!requireNamespace("dplyr", quietly = TRUE)) { stop("Please install the 'dplyr' package to run this vignette.") } library(PosiR) library(dplyr) library(stats)
We simulate data where Age affects the response $\beta_{\text{Age}} = 0.5$.
set.seed(123) # for reproducibility X <- matrix(rnorm(200 * 2), ncol = 2, dimnames = list(NULL, c("Age", "Income"))) y <- 0.5 * X[, "Age"] + rnorm(200) # the true intercept is 0, the true coefficient for Age is 0.5, and the true coefficient for Income is 0.
We specify a universe $π¬$ containing two potential models: one with only Age and one with both Age and Income. The indices refer to the columns in X. Note that simultaneous_ci adds an intercept by default (add_intercept = TRUE), which will also be included in the inference.
Q <- list( Age_only = 1, # Model with Intercept + Age Age_Income = c(1, 2) # Model with Intercept + Age + Income ) # Indices refer to columns of X: 1="Age", 2="Income"
We call the function, providing the data X, y, and the model universe Q. We use B=1000 bootstrap samples for reasonably stable results, cores=2 to speed up computation via parallel processing (adjust based on your machine), and seed for reproducibility. verbose=FALSE suppresses progress messages.
result <- simultaneous_ci(X, y, Q, B = 1000, cores = 2, verbose = FALSE)
The output result is a list. Key components include:
intervals: A data frame with estimates and simultaneous CIs.
K_alpha: The calculated critical value $\widehat{K}_{\alpha}$.
# Show the calculated critical value print(paste("Calculated K_alpha:", round(result$K_alpha, 3))) # Display the intervals data frame print(result$intervals)
The columns of the resulting intervals data frame are:
model_id: Name of the model from Q.
coefficient_name: Name of the coefficient (including intercept).
estimate: Point estimate $\widehat{\theta}_{q,j}$.
lower, upper: Bounds of the $(1-\alpha)$% simultaneous confidence interval.
psi_hat_nqj: Bootstrap variance estimate $\widehat{\Psi}_{n,q,j}$.
se_nqj: Standard error estimate $\sqrt{\widehat{\Psi}_{n,q,j}/n}$.
We use the S3 plot method for simultaneous_ci_result objects. las.labels=2 rotates labels for readability.
plot(result, las.labels = 2, col.ci = "darkblue", main = "Simultaneous Confidence Intervals")
Age Coefficient: The point estimates for Age in both models are close to the true value of 0.5. Importantly, the simultaneous 95% confidence intervals (e.g., approx. [0.26, 0.67] for the Age_Income model) comfortably contain the true value 0.5 and exclude 0, correctly indicating a significant effect.
Income Coefficient: The point estimate for Income (only present in the Age_Income model) is close to 0. The simultaneous CI (approx. [-0.19, 0.17]) contains the true value 0, correctly suggesting no significant effect.
Intercept: The point estimates for the (Intercept) are close to 0. The simultaneous CIs contain the true value 0, indicating no significant intercept term.
Simultaneous Validity: These intervals are valid simultaneously. We can be 95% confident that all these reported intervals contain their respective true parameter values, regardless of whether we decided to focus on the "Age_only" or "Age_Income" model after looking at the data.
mtcars DatasetNow, letβs apply the method to a real dataset, mtcars, to predict miles per gallon (mpg).
We select hp (horsepower), wt (weight), and qsec (1/4 mile time) as potential predictors for mpg.
data(mtcars) X_mtcars <- as.matrix(mtcars[, c("hp", "wt", "qsec")]) y_mtcars <- mtcars$mpg
We consider models involving each predictor individually and a full model with all three. The indices refer to columns in X_mtcars. The intercept is added automatically by simultaneous_ci.
Q_mtcars <- list( HP_only = 1, # Model: mpg ~ Intercept + hp WT_only = 2, # Model: mpg ~ Intercept + wt QSEC_only = 3, # Model: mpg ~ Intercept + qsec Full_Model = 1:3 # Model: mpg ~ Intercept + hp + wt + qsec ) # Indices: 1="hp", 2="wt", 3="qsec"
simultaneous_ci()result_mtcars <- simultaneous_ci(X = X_mtcars, y = y_mtcars, Q_universe = Q_mtcars, B = 2000, cores = 2, seed = 123, verbose = FALSE) # Display the critical value print(paste("Calculated K_alpha:", round(result_mtcars$K_alpha, 3)))
We calculate the standard "naive" 95% confidence intervals using the normal approximation $z_{\alpha/2}$ quantile and the same standard error estimate $\sqrt{\widehat{\Psi}{n,q,j}/n}$ derived from the bootstrap. The naive interval is: $$ \widehat{\text{CI}}{q,j}^{\text{unadj}} = \left[ \widehat{\theta}{q,j} \pm z{\alpha/2} \frac{\widehat{\Psi}{n,q,j}^{1/2}}{\sqrt{n}} \right] $$ where $z{\alpha/2} \approx 1.96$ for $\alpha = 0.05$.
z_alpha_2 <- stats::qnorm(0.975) intervals_comparison <- result_mtcars$intervals %>% mutate( naive_lower = estimate - z_alpha_2 * se_nqj, naive_upper = estimate + z_alpha_2 * se_nqj ) %>% select(model_id, coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>% arrange(model_id, coefficient_name) # Ensure consistent ordering print(intervals_comparison, digits = 3)
plot(result_mtcars, las.labels = 2, main = "mtcars Data: Simultaneous CIs")
Interval Width: As expected, the simultaneous intervals (lower, upper) are uniformly wider than the corresponding naive intervals (naive_lower, naive_upper). This widening is caused by the multiplier $\widehat{K}{\alpha}$ being larger than $z{\alpha/2} \approx 1.96$.
Significance: For these mtcars models, the conclusions about which coefficients are statistically significant (i.e., interval excludes zero) do not change between the naive and simultaneous intervals. For example, hp and wt generally show significant negative associations with mpg, while qsec and the intercept terms show less significance.
PoSI Correction: The key takeaway is that the simultaneous intervals provide valid inference after considering multiple models. If we had used the naive intervals and selected, for instance, the Full_Model because it looked "best" based on some criterion, the naive intervals for that model's coefficients would not strictly be 95% confidence intervals due to the selection bias. The simultaneous intervals, however, maintain their 95% coverage guarantee across the entire set Q_mtcars. This correction ensures our conclusions are not overly optimistic.
A common workflow involves using a variable selection method like Lasso to choose a smaller model from many potential predictors, followed by inference on the selected model. PosiR can provide valid post-selection inference in this context by defining a universe $π¬$ that includes at least the Lasso-selected model.
We simulate data with $100$ observations and $50$ predictors, where only the first 5 predictors ($X_1, \ldots, X_5$) have non-zero coefficients.
set.seed(123) n_lasso <- 100 p_lasso <- 50 X_lasso <- matrix(rnorm(n_lasso * p_lasso), n_lasso, p_lasso, dimnames = list(NULL, paste0("X", 1:p_lasso))) # True coefficients: 1 for first 5, 0 for rest beta_true <- c(rep(1.0, 5), rep(0, p_lasso - 5)) true_intercept <- 0.5 # Generate response (linear model) y_lasso <- drop(true_intercept + X_lasso %*% beta_true + rnorm(n_lasso))
We use glmnet to perform Lasso regression and identify variables selected using cross-validation (lambda.min).
# This chunk only runs if glmnet is installed if (requireNamespace("glmnet", quietly = TRUE)) { library(glmnet) # Code using glmnet } else { message("glmnet is not available. Skipping this example.") } library(glmnet) # Use cv.glmnet to find optimal lambda cv_lasso_fit <- cv.glmnet(X_lasso, y_lasso, alpha = 1) # alpha=1 for Lasso # Get coefficients at lambda.min, excluding the intercept lasso_coeffs <- coef(cv_lasso_fit, s = "lambda.min")[-1, 1] # Identify indices of selected variables (non-zero coefficients) # Indices are relative to the columns of X_lasso (1 to p_lasso) select_var_index <- which(lasso_coeffs != 0) # Get names of selected variables select_var_names <- names(select_var_index) cat("Lasso selected variables (indices in X_lasso):", paste(select_var_index, collapse = ", "), "\n") cat("Lasso selected variables (names):", paste(select_var_names, collapse = ", "), "\n") # If no variables selected (unlikely here), handle gracefully for Q definition if (length(select_var_index) == 0) { warning("Lasso selected no variables. Defining Q with intercept-only and full model.") select_var_index1 <- 1 # Just intercept index for design matrix } else { # IMPORTANT: Need indices relative to the design matrix used in simultaneous_ci # which includes intercept as column 1. Add 1 to Lasso indices. select_var_index1 <- c(1, select_var_index + 1) }
# Fallback if glmnet is not installed if (!requireNamespace("glmnet", quietly = TRUE)) { warning("glmnet package not found. Using arbitrary selection for vignette.", call. = FALSE) select_var_index <- c(1, 2, 3, 4, 5, 9, 13, 18, 22, 24, 26, 29, 33, 36, 41) select_var_names <- paste0("X", select_var_index) select_var_index1 <- c(1, select_var_index + 1) }
Our universe $π¬$ for post-selection inference must include the model selected by Lasso. We can also include other models, like the full model, for comparison or robustness.
# Indices for X in simultaneous_ci context (assuming add_intercept=TRUE) # Intercept is 1, X1 is 2, ..., Xp is p+1 full_model_indices <- 1:(p_lasso + 1) Q_lasso <- list( # Model selected by Lasso (Intercept + selected X's) LassoSelected = select_var_index1, # Full model (Intercept + All X's) - optional, but common for context FullModel = full_model_indices ) # Check if Lasso selection was empty if (length(select_var_index1) == 1 && select_var_index1 == 1) { Q_lasso$LassoSelected <- 1 # Ensure it's just the intercept index } print("Models in Q_lasso (indices refer to design matrix with intercept):") # Print only first few indices for brevity if models are large print(lapply(Q_lasso, function(idx) if(length(idx)>10) c(idx[1:5], "...", idx[length(idx)]) else idx))
Run simultaneous_ci()We now run PosiR using the original data (X_lasso, y_lasso) and the universe Q_lasso containing the data-dependent Lasso model.
# Note: We pass the original X matrix (without intercept) to simultaneous_ci # It will add the intercept based on add_intercept=TRUE result_lasso <- simultaneous_ci( X_lasso, y_lasso, Q_lasso, B = 2000, # Recommend B=2000+ for high-dim cores = 2, seed = 123, verbose = FALSE )
intervals_lasso_comp <- result_lasso$intervals %>% filter(model_id == "LassoSelected") %>% # Focus on the Lasso model mutate( naive_lower = estimate - qnorm(0.975) * se_nqj, naive_upper = estimate + qnorm(0.975) * se_nqj ) %>% select(coefficient_name, estimate, lower, upper, naive_lower, naive_upper) %>% arrange(match(coefficient_name, c("(Intercept)", paste0("X", 1:p_lasso)))) # Order numerically print(intervals_lasso_comp, digits = 3)
# Plot only coefficients from the Lasso-selected model selected_plot <- result_lasso$intervals %>% filter(model_id == "LassoSelected") %>% pull(coefficient_name) plot(result_lasso, subset_pars = selected_plot, main = "Simultaneous CIs for Lasso-Selected Model", las.labels = 1)
Post-Selection Validity: The simultaneous confidence intervals (lower, upper) provide statistically valid 95% coverage after selecting the model using Lasso. They account for the uncertainty introduced by the data-driven variable selection step.
Protection Against False Discoveries: For the true predictors (X1-X5), both interval types correctly identify them as significant. However, for several noise predictors incorrectly selected by Lasso (X18, X24), the naive intervals misleadingly suggest statistical significance (they exclude the true value of 0). The simultaneous intervals, being wider due to the $\widehat{K}_{\alpha}$ adjustment, correctly contain 0 for these noise predictors, thus protecting against false positive conclusions.
Cost of Validity: This protection comes at the cost of wider intervals for all coefficients, reflecting the true uncertainty post-selection.
The PosiR package provides a practical implementation of the simultaneous inference framework from @kuchibhotla2022. By using the simultaneous_ci() function, researchers can obtain valid confidence intervals for linear model coefficients even when model selection has been performed, ensuring robust and reliable statistical conclusions. Remember that the validity is with respect to the specified universe of models $π¬$.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.