budgetIV: Partially identify causal effects with invalid instruments

budgetIV R Documentation

Partially identify causal effects with invalid instruments

Description

Computes the set of possible values of a causal parameter consistent with observational data and given budget constraints. See Penn et al. (2025) for technical definitions.

Usage

budgetIV(
  beta_y,
  beta_phi,
  phi_basis = NULL,
  tau_vec = NULL,
  b_vec = NULL,
  ATE_search_domain = NULL,
  X_baseline = NULL,
  delta_beta_y = NULL
)

Arguments

beta_y

Either a 1 \times d_{Z} matrix or a d_{Z}-dimensional vector representing the (estimated) cross covariance \mathrm{Cov}(Y, Z).

beta_phi

A d_{\Phi} \times d_{Z} matrix representing the (estimated) cross covariance \mathrm{Cov}(\Phi (X), Z).

phi_basis

A d_{\Phi}-dimensional expression (separated by commas) with each term representing a component of \Phi (X). The expression consists of d_{X} unique vars. The default value NULL can be used for a d_{X} = d_{\Phi}-dimensional linear model.

tau_vec

A K-dimensional vector of increasing, non-negative thresholds representing degrees of IV invalidity. The default value NULL can be used for a single threshold at 0.

b_vec

A K-dimensional vector of increasing positive integers, with the i-th entry representing the minimum number of IVs whose degree of invalidity does not surpass the i-th threshold in tau_vec. The default value NULL can be used for a single threshold at 0, with at least 50\% of IVs assumed to be valid.

ATE_search_domain

A d_{X}-column data.frame with column names equal to the vars in phi_basis. Rows correspond to values of the treatment X. The default value NULL can be used to generate a small d_{X}-dimensional grid.

X_baseline

Either a data.frame or list representing a baseline treatment x_0, with names equal to the vars in phi_basis. The default value NULL can be used for the baseline treatment 0 for each of the d_{X} vars.

delta_beta_y

A d_{Z}-dimensional vector of positive half-widths for box-shaped confidence bounds on beta_y. The default value NULL can be used to not include finite sample uncertainty.
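
How phi_basis, ATE_search_domain and X_baseline relate to one another can be sketched in base R. The helper eval_basis below is hypothetical (it is not part of budgetIVr) and only illustrates how each term of the basis expression could be evaluated at the treatment values. The contrast \Phi(x) - \Phi(x_0) is what a homogeneous effect \theta would multiply to give an ATE relative to the baseline, assuming the structural equation stated in Details.

```r
# Hypothetical helper (not part of budgetIVr): evaluate each basis term
# of an expression vector at the rows of a data.frame of treatment values.
eval_basis <- function(phi_basis, treatment_values) {
  terms <- lapply(phi_basis, function(term) {
    value <- eval(term, envir = treatment_values)
    # Recycle constant terms so every term has one value per row.
    rep_len(value, nrow(treatment_values))
  })
  do.call(cbind, terms)
}

phi_basis <- expression(x, x^2)                     # d_Phi = 2 terms, d_X = 1 var
ATE_search_domain <- data.frame(x = c(0, 0.5, 1))   # column name matches the var
X_baseline <- data.frame(x = 0)                     # baseline treatment x_0

phi_domain <- eval_basis(phi_basis, ATE_search_domain)   # 3 x 2 matrix
phi_baseline <- eval_basis(phi_basis, X_baseline)        # 1 x 2 matrix

# Contrast Phi(x) - Phi(x_0), one row per point in the search domain.
phi_contrast <- sweep(phi_domain, 2, as.vector(phi_baseline))
```

The column names of ATE_search_domain and the names in X_baseline must match the variables used in phi_basis, which is why both use "x" here.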

Details

Instrumental variables are defined by three structural assumptions: (A1) they are associated with the treatment; (A2) they are unconfounded with the outcome; and (A3) they affect the outcome exclusively through the treatment. Of these, only (A1) can be tested without further assumptions. The budgetIV function allows for valid causal inference when some proportion (possibly a small minority) of candidate instruments satisfy both (A2) and (A3). Tuneable thresholds chosen by the user also allow for bounds on the degree of invalidity of each instrument (i.e., bounds on the part of \mathrm{Cov}(Y, Z) not explained by the causal effect of X on Y). Full technical details are included in Penn et al. (2025).

budgetIV assumes that treatment effects are homogeneous, which implies a structural equation of the form Y = \theta \cdot \Phi(X) + g_y(Z, \epsilon_x), where \theta is a d_{\Phi}-dimensional vector and \Phi(X) is a d_{\Phi}-dimensional vector-valued function. A valid basis expansion \Phi (X) is assumed (e.g., linear, logistic, polynomial, RBF, neural embedding, PCA, UMAP, etc.). It is also assumed that d_{\Phi} \leq d_{Z}, which allows us to treat the basis functions as a complete linear model (see Theil (1953), or Sanderson et al. (2019) for a modern discussion focused on Mendelian randomization). The parameters \theta capture the unknown treatment effect. Violation of (A2) and/or (A3) will bias classical IV approaches through the statistical dependence between Z and g_y(Z, \epsilon_x), summarized by the covariance parameter \gamma := \mathrm{Cov} (g_y(Z, \epsilon_x), Z).
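
As a rough illustration of the inputs implied by this model, the sketch below simulates individual-level data and computes the cross covariances that beta_y and beta_phi are described as representing. All data, coefficients and variable names are hypothetical, and every simulated instrument happens to be valid; real applications would typically start from published summary statistics instead.

```r
set.seed(1)

n <- 1000     # sample size (hypothetical)
d_Z <- 4      # number of candidate instruments

# Simulated instruments, treatment and outcome.
Z <- matrix(rnorm(n * d_Z), nrow = n)
X <- as.vector(Z %*% c(1, 0.5, -0.5, 1)) + rnorm(n)
Y <- 2 * X - X^2 + rnorm(n)    # theta = (2, -1) for the basis (x, x^2)

# Basis expansion Phi(X) with d_Phi = 2 <= d_Z.
Phi <- cbind(X, X^2)

beta_y <- cov(Y, Z)       # 1 x d_Z matrix estimating Cov(Y, Z)
beta_phi <- cov(Phi, Z)   # d_Phi x d_Z matrix estimating Cov(Phi(X), Z)
```

The resulting shapes match the Arguments section: beta_y is 1 \times d_{Z} and beta_phi is d_{\Phi} \times d_{Z}.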

budgetIV constrains \gamma through a series of non-negative thresholds 0 \leq \tau_1 < \tau_2 < \ldots < \tau_K and corresponding integer budgets 0 < b_1 < b_2 < \ldots < b_K \leq d_Z. It is assumed for each i \in \{ 1, \ldots, K\} that at least b_i components of \gamma are no greater in magnitude than \tau_i. For instance, taking d_Z = 100, K = 1, b_1 = 5 and \tau_1 = 0 means assuming at least 5 of the 100 candidates are valid instrumental variables (in the sense that their ratio estimates \theta_j := \mathrm{Cov}(Y, Z_j)/\mathrm{Cov}(\Phi(X), Z_j) are unbiased).
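
A budget constraint can be checked directly for a known \gamma, which may help when choosing tau_vec and b_vec. The sketch below assumes the reading in which at least b_i components of \gamma lie within \tau_i in magnitude, matching the worked example of 5 valid instruments out of 100. The function satisfies_budget is hypothetical and not part of budgetIVr; in practice \gamma is unknown.

```r
# Hypothetical helper (not part of budgetIVr): TRUE if, for every
# threshold tau_i, at least b_i components of gamma satisfy |gamma_j| <= tau_i.
satisfies_budget <- function(gamma, tau_vec, b_vec) {
  stopifnot(length(tau_vec) == length(b_vec))
  # Number of components of gamma within each threshold.
  n_within <- vapply(tau_vec, function(tau) sum(abs(gamma) <= tau), integer(1))
  all(n_within >= b_vec)
}

gamma <- c(0, 0, 0.05, 0.3, 1.2)   # hypothetical invalidity of 5 candidates
tau_vec <- c(0, 0.1)

ok_loose <- satisfies_budget(gamma, tau_vec, b_vec = c(2, 3))   # 2 exact, 3 within 0.1
ok_strict <- satisfies_budget(gamma, tau_vec, b_vec = c(3, 4))  # asks for more than holds
```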

With delta_beta_y = NULL, budgetIV returns the identified set of causal effects that agree with both the budget constraints described above and the values of \mathrm{Cov}(Y, Z) and \mathrm{Cov}(\Phi(X), Z), assumed to be known exactly. Unlike classical partial identification methods (see Manski (1990) for a canonical example), the non-convex mixed-integer budget constraints yield a possibly disconnected solution set. Each connected subset has a different interpretation as to which of the candidate instruments Z are valid up to each threshold.
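
The disconnected structure is easiest to see in the simplest case, sketched below under stated assumptions: a single linear exposure (d_{\Phi} = 1), one threshold \tau_1 = 0, a budget requiring at least b_1 exactly valid instruments, and no zero entries in \mathrm{Cov}(X, Z). A value of \theta is then feasible only if at least b_1 ratio estimates coincide at it. The numbers are hypothetical, and this is an illustration of the idea rather than the algorithm used by budgetIV.

```r
# Hypothetical summary statistics for six candidate instruments.
beta_x <- c(1, 2, 4, 1, 2, 5)       # Cov(X, Z_j), all non-zero
beta_y <- c(2, 4, 8, -1, -2, 3)     # Cov(Y, Z_j)

# Ratio estimates theta_j = Cov(Y, Z_j) / Cov(X, Z_j).
ratio <- beta_y / beta_x

b_1 <- 2   # at least two instruments assumed exactly valid

# Round to guard against floating-point noise, then count agreeing ratios.
counts <- table(round(ratio, 8))
identified_set <- as.numeric(names(counts)[as.vector(counts) >= b_1])
```

Here identified_set contains two isolated values, each backed by a different group of instruments, which is the sense in which each connected subset carries its own interpretation of which candidates are valid.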

delta_beta_y specifies box constraints that quantify uncertainty in beta_y. In the examples, delta_beta_y is calculated through a Bonferroni correction and gives an (asymptotically) valid confidence set over beta_y. Under the so-called "no measurement error" assumption (see Bowden et al. (2016)), which is commonly applied in Mendelian randomization, the estimate of beta_y is assumed to be the dominant source of finite-sample uncertainty, with uncertainty in beta_phi considered negligible. Given an (asymptotically) valid confidence set for beta_y encoded by delta_beta_y, and under the "no measurement error" assumption, budgetIV returns an (asymptotically) valid confidence set for \theta when using just a single exposure.
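
One common way to obtain such half-widths is a Bonferroni-corrected normal interval for each component of beta_y. The sketch below assumes asymptotically normal estimates with known standard errors; the values in se_beta_y and the level alpha are hypothetical, and the packaged example data may have been produced differently.

```r
# Hypothetical standard errors for the d_Z components of beta_y.
se_beta_y <- c(0.02, 0.03, 0.025, 0.04, 0.02, 0.03)

alpha <- 0.05                 # target joint error level
d_Z <- length(se_beta_y)

# Two-sided critical value with the error level split across d_Z components.
z_crit <- qnorm(1 - alpha / (2 * d_Z))

# Half-widths of the box: beta_y[j] +/- delta_beta_y[j] for every j.
delta_beta_y <- z_crit * se_beta_y
```

By the union bound, the box covers all components of the true \mathrm{Cov}(Y, Z) simultaneously with probability at least 1 - alpha, asymptotically.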

Value

A data.table with each row corresponding to a set of bounds on the ATE at a given point in ATE_search_domain. Columns include: a non-unique identifier curve_index with a one-to-one mapping with U; lower_ATE_bound and upper_ATE_bound for the corresponding bounds on the ATE; a list U for the corresponding budget assignment; and a column for each unique variable in ATE_search_domain to indicate the treatment value at which the bounds are being calculated.
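
To make the layout concrete, the sketch below builds a small stand-in table with the column names listed above and summarizes it with base R. The values are hypothetical, the list column U is omitted for brevity, and a plain data.frame is used in place of the data.table that budgetIV returns.

```r
# Hypothetical stand-in for the returned table (U column omitted).
toy_solution <- data.frame(
  curve_index = c(1, 1, 2, 2),
  x = c(0.5, 1, 0.5, 1),
  lower_ATE_bound = c(0.1, 0.2, 0.9, 1.8),
  upper_ATE_bound = c(0.3, 0.6, 1.1, 2.2)
)

# One table of bounds per curve, i.e. per budget assignment.
by_curve <- split(toy_solution, toy_solution$curve_index)

# Outer envelope of the bounds across all curves at each treatment value.
envelope <- merge(
  aggregate(lower_ATE_bound ~ x, data = toy_solution, FUN = min),
  aggregate(upper_ATE_bound ~ x, data = toy_solution, FUN = max),
  by = "x"
)
```

The envelope is a convenient summary but discards any gaps between curves, so the per-curve bounds should be kept when the disconnected structure of the solution set matters.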

References

Jordan Penn, Lee Gunderson, Gecia Bravo-Hermsdorff, Ricardo Silva, and David Watson. (2025). BudgetIV: Optimal Partial Identification of Causal Effects with Mostly Invalid Instruments. AISTATS 2025.

Jack Bowden, Fabiola Del Greco M, Cosetta Minelli, George Davey Smith, Nuala A Sheehan, and John R Thompson. (2016). Assessing the suitability of summary data for two-sample Mendelian randomization analyses using MR-Egger regression: the role of the I^2 statistic. Int. J. Epidemiol. 45.6, pp. 1961–1974.

Charles F Manski. (1990). Nonparametric bounds on treatment effects. Am. Econ. Rev. 80.2, pp. 319–323.

Henri Theil. (1953). Repeated least-squares applied to complete equation systems. Centraal Planbureau Memorandum.

Eleanor Sanderson, George Davey Smith, Frank Windmeijer and Jack Bowden. (2019). An examination of multivariable Mendelian randomization in the single-sample and two-sample summary data settings. Int. J. Epidemiol. 48.3, pp. 713–727.

Examples

 
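library(budgetIVr)

# Load the simulated summary statistics
data(simulated_data_budgetIV)

# Estimated cross covariance Cov(Y, Z)
beta_y <- simulated_data_budgetIV$beta_y

# Estimated cross covariances Cov(Phi(X), Z), one row per basis term
beta_phi_1 <- simulated_data_budgetIV$beta_phi_1
beta_phi_2 <- simulated_data_budgetIV$beta_phi_2

beta_phi <- matrix(c(beta_phi_1, beta_phi_2), nrow = 2, byrow = TRUE)

# Half-widths of the box-shaped confidence bounds on beta_y
delta_beta_y <- simulated_data_budgetIV$delta_beta_y

# A single threshold at 0 with a budget of 3
tau_vec <- c(0)
b_vec <- c(3)

# Treatment values at which to bound the ATE
x_vals <- seq(from = 0, to = 1, length.out = 500)

ATE_search_domain <- expand.grid("x" = x_vals)

# Quadratic basis Phi(x) = (x, x^2) in the single var x
phi_basis <- expression(x, x^2)

# Baseline treatment x_0 = 0
X_baseline <- list("x" = c(0))

solution_set <- budgetIV(beta_y = beta_y,
                         beta_phi = beta_phi,
                         phi_basis = phi_basis,
                         tau_vec = tau_vec,
                         b_vec = b_vec,
                         ATE_search_domain = ATE_search_domain,
                         X_baseline = X_baseline,
                         delta_beta_y = delta_beta_y)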


budgetIVr documentation built on April 16, 2025, 5:11 p.m.