knitr::opts_chunk$set(
  collapse = TRUE,
  width = 200,
  comment = "#>"
)

Author: Brian Vegetabile, PhD

email : bvegetab@rand.org

Overview

This document provides a basic tutorial on the entbal package and using entropy balancing to create weights for use in analyses aiming to make causal inferences.

The document begins by providing an overview of weighting in causal inference to demonstrate the setting and assumptions that the package was originally created to handle. Users who are familiar with causal inference may skip this section. The document then provides an overview of the package and the main estimation function entbal, outlining the key functionality and inputs to the function. The tutorial then provides examples in three different settings: binary exposures, multi-leveled exposures, and continuous exposures.

The primary functions of the package are demonstrated using real-world data when possible and simulated datasets where necessary. These data sets will be discussed when introduced.

A Crash Course in Causal Inference and Entropy Balancing

Causal Inference

This section serves to set the stage for causal inference and weighted estimation and is by no means all-encompassing of the field of causal inference. For good overviews see: Imbens & Rubin (2015) Causal Inference for Statistics, Social, and Biomedical Sciences or Rosenbaum (2010) Design of Observational Studies. For a good overview of weighting analyses see Li et al. (2018) Balancing Covariates via Propensity Score Weighting. As stated in the overview, feel free to skip this section if you are familiar with these concepts.

Consider three sets of variables: an exposure variable $A$, an outcome variable $Y$, and a $d$-dimensional vector of covariates $X$ that are responsible for assigning exposure levels and potentially related to the outcome variable. We frame causal inference from the potential outcomes framework of Neyman & Rubin. That is, for each individual $i$ we posit that there exists a "potential" outcome for each exposure level that $A$ can take on; i.e., for each $A_i = a$ there exists a potential outcome variable $Y_i(a)$. Under the potential outcomes framework, causal effects are often defined as contrasts of these potential outcomes, e.g., if $A$ is binary the effect can be defined as $\tau_i = Y_i(1) - Y_i(0)$.

The Fundamental Problem of Causal Inference (see Holland (1986)) is that we can only ever observe one of these potential outcomes for an individual (without making additional strong assumptions); specifically, the observed $Y_i$ can be assumed to equal the potential outcome $Y_i(a)$ if exposure was $A_i=a$, and all other $Y_i(a')$ are unobserved (or missing). One relaxation, therefore, is to move away from individual inference and borrow information from individuals in the population to estimate averages of these contrasts of potential outcomes. The goal then becomes to estimate (sub)population effects, i.e., $\tau_g = E_{g}[Y(a') - Y(a)]$ where $g$ represents a density function that defines a (sub)population over which we want to perform inference and $A$ is now a general exposure variable. The general issue is that if there is a relationship between $X$ and $A$, then $E[Y|A=a] \ne E[Y(a)]$ (except under certain circumstances), and therefore simple regressions will not suffice to estimate causal effects and perform inferences. Our goal will be to find estimation strategies that allow us to estimate $E_g[Y(a)]$; a common strategy, discussed here, is weighted estimation.

The goal is to find weights $w$ such that $E_p[wY|A=a] = E_g[Y(a)]$. If the goal is to estimate the average treatment effect in the population (ATE), then for each unit $i$ receiving $A_i=a$ and with covariates $X_i = x$ the weights are of the form

$$w_i(a,x) = \frac{p_A(a)}{p_{A|X}(a|x)}$$

where the notation $p_{Z}(z)$ is the density function of the random variable $Z$ evaluated at $z$ and the denominator is related to the propensity score (see Rosenbaum & Rubin (1983)). The form of the weights can be generalized more broadly (as is done in Li et al. (2018) Balancing Covariates via Propensity Score Weighting), so we do not discuss the generalization of these weights further except as needed. The entbal package is focused on ATT and ATE estimation throughout.
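To make the form of these weights concrete, the following sketch constructs oracle ATE weights for a binary exposure when the true propensity score is known. The vectors TA (the 0/1 exposure) and ps (the true propensity score) are assumed to exist here; a simulation later in this tutorial defines objects with these names.

# Sketch: oracle ATE weights for a binary exposure, assuming TA is the 0/1
# exposure vector and ps is the true propensity score Pr(A = 1 | X = x)
p_a  <- ifelse(TA == 1, mean(TA), 1 - mean(TA))  # marginal density p_A(a)
p_ax <- ifelse(TA == 1, ps, 1 - ps)              # conditional density p_{A|X}(a|x)
w    <- p_a / p_ax                               # ratio-of-densities weights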

If treatment levels are discrete (e.g., binary or multi-leveled), then estimators of the form

$$\hat \tau(a) = \frac{\sum_i w_i I(A_i = a) Y_i }{\sum_i w_i I(A_i = a)}$$

can provide consistent estimation for $E_g[Y(a)]$ (or local regression estimators if the exposure is continuous). Similarly, weighted regressions could be performed using packages like the survey package in R.
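For instance, with the binary weights w sketched above and a hypothetical outcome vector Y, this estimator reduces to a contrast of weighted means:

# Sketch: weighted estimates of E_g[Y(1)] and E_g[Y(0)] and their contrast;
# weighted.mean(y, w) computes sum(w * y) / sum(w), matching the form above
tau_hat <- weighted.mean(Y[TA == 1], w[TA == 1]) -
           weighted.mean(Y[TA == 0], w[TA == 0])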

A note on strong assumptions for Causal Inference

Note: Establishing valid causal inference requires very strong assumptions, such as strong ignorability and positivity, that we do not present here. See Imbens & Rubin (2015) Causal Inference for Statistics, Social, and Biomedical Sciences for a detailed treatment of causal inference and the required assumptions.

Entropy Balancing

Estimation therefore requires two stages: 1) a first stage that estimates the weights, and 2) a second stage that estimates the effects on the outcome. Entropy balancing, and the entbal software, are focused on creating weights in the first stage that can be used to estimate effects in the subsequent stage.

There is an extensive literature on estimating weights in the first stage, much of it focused on parametrically or nonparametrically estimating the density function $p_{A|X}(a|x)$ or, when the exposure is binary, the propensity score $e(x) = Pr(A = 1 | X = x)$. Entropy balancing is a nonparametric, optimization-based approach that estimates the weights directly by exploiting properties of the covariate distributions that should hold if the weights have the correct ratio-of-densities interpretation.

The use of weights has implications for the observed distributions of $A$ and $X$, i.e., we can look at the weighted distributions of the covariates to assess how their properties change under the weighting we've employed.

Binary & Multi-level exposures

Consider estimating the ATE; we see that

$$ \begin{aligned} E_{X|A}[w g(X) | A=a] &= \int g(x) \frac{p_{A}(a)}{p_{A|X}(a|x)} p_{X|A}(x|a) \, \partial x \\ &= \int g(x) \frac{p_{A,X}(a,x)}{p_{A|X}(a|x)} \, \partial x \\ &= \int g(x) p_X(x) \, \partial x \\ &= E_X[g(X)] \end{aligned} $$

Here we see that under this weighting scheme, the expectations of functions of the covariates conditioned upon treatment should have the same values as expectations of functions of the covariates in the overall population. These properties of the weights suggest a strategy where we can find weights that are optimized to meet these characteristics subject to certain constraints.

Specifically, entropy balancing for binary and multi-leveled exposures does this by solving the following optimization problem for each treatment level $A=a$:

$$ \begin{aligned} \min_{w} \sum_{i:A_i =a} w_i \log\left(\frac{w_i}{q_i}\right) & \hspace{1em} \mbox{subject to} \\ \sum_{i:A_i =a} w_i g_k(X_{ij}) = E_h[g_k(X_{j})] & \hspace{1em} \mbox{for population } h \mbox{, covariate dimension } j \mbox{, and function } k \\ \sum_{i:A_i =a} w_i = 1, & \hspace{1em} w_i > 0 \mbox{ for all } i \end{aligned} $$

The optimization minimizes the Kullback-Leibler divergence between the desired weights $w$ and a set of base weights $q$. The base weights, $q_i$, are typically set to empirical weights ($q_i = [\sum_i I(A_i = a)]^{-1}$) that maximize the sample information. The parameters $E_h[g_k(X_{j})]$ are often called the "targets" and are set empirically from the data, i.e., for the ATE the parameter $E_X[g_k(X_{j})]$ may be empirically set to $\frac{1}{N} \sum_{i=1}^N g_k(X_{ij})$. The functions $g_k(\cdot)$ are typically chosen to be the moments of the distribution, i.e., for the ATE, $E_X[X_{j}^p]$ may be empirically set to $\frac{1}{N} \sum_{i=1}^N X_{ij}^p$, or their scaled versions. In the entbal package we only consider the weighted moments of each scaled covariate (i.e., $E[\tilde X_{j}^p]$, where $\tilde X_j$ is the scaled transformation of $X_j$).
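As a concrete illustration, the ATE targets for a single covariate are just the sample moments of its scaled version. The sketch below assumes a generic numeric covariate vector x:

# Sketch: empirical ATE targets E[X~^p] for one covariate x (hypothetical vector)
x_s <- as.numeric(scale(x))                       # scaled covariate
targets <- sapply(1:3, function(p) mean(x_s^p))   # first three moment targets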

Continuous exposures

Entropy balancing for continuous exposures is achieved by changing the constraints slightly to handle covariances between $A$ and $X$ and using all of the observations:

$$ \begin{aligned} \min_{w} \sum_{i} w_i \log\left(\frac{w_i}{q_i}\right) & \hspace{1em} \mbox{subject to} \\ \sum_{i} w_i f_l(A_{i}) = E_h[f_l(A)] & \hspace{1em} \mbox{for population } h \mbox{ and function } l \\ \sum_{i} w_i g_k(X_{ij}) = E_h[g_k(X_{j})] & \hspace{1em} \mbox{for population } h \mbox{, covariate dimension } j \mbox{, and function } k \\ \sum_{i} w_i f_l(A_{i})g_k(X_{ij}) = E_h[f_l(A)g_k(X_{j})] & \hspace{1em} \mbox{for population } h \mbox{, covariate dimension } j \mbox{, and functions } k,l \\ \sum_{i} w_i = 1, & \hspace{1em} w_i > 0 \mbox{ for all } i \end{aligned} $$

For more information on theoretical arguments see Tübbicke (2020) Entropy Balancing For Continuous Treatments and Vegetabile et al. (2021) Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures.

Overview of the package

Installation and Setup

If the package is not yet installed, you can install from CRAN using

install.packages('entbal', dependencies = TRUE)

Or, you can install the development package by running the following code from the console in R. Note that the package devtools is required for installing in this way:

devtools::install_github('bvegetabile/entbal', build_vignettes = TRUE)

Once installed, the package can be loaded using the familiar library command.

library(entbal)

Primary Function Overview and Inputs

The primary function in the package has the same name as the package, entbal. This function takes as input the following variables:

| Input Variable | Description |
|---------------------|----------------------------------------------------------------|
| formula | R-style formula representing exposure given pre-exposure covariates, e.g., TA ~ X1 + X2 + X3 |
| data | Data frame that contains the treatment variable TA and the covariates X1, X2, X3. If a data frame is not provided, the variables will be searched for in the global environment. |
| eb_pars | List of parameters required for the entropy balancing optimization. These are discussed further below. |
| suppress_warnings | A logical to print warnings from a function that checks the input variables to eb_pars. Defaults to FALSE. |

The example usage outlined in the help() file for entbal is provided below:

entbal(
  formula,
  data = NULL,
  eb_pars = entbal::ebpars_default_binary(),
  suppress_warnings = F
)

The exact usage of the function may be more apparent later in the examples.

Overview of parameters needed for eb_pars

The variable that will be of most interest to users is the list of parameters eb_pars that is input to entbal. This sets the parameters for the entropy balancing optimization, defines the exposure type, and sets other necessary options. Note that a script is run at the beginning of entbal that attempts to diagnose issues with inputs to eb_pars. These warnings can be suppressed by setting suppress_warnings to TRUE.
The variables expected in the list eb_pars that are necessary for any exposure type are:

| Variable Name | Description/Options |
| -------------- | ----------------------------------------------------------------- |
| exp_type | Exposure type. Choose binary, multi, or continuous. If no exposure type is chosen, the function attempts to set this using simple rules. It is best to set this option, though (see the default functions below, e.g., ebpars_default_binary() for binary exposures). |
| n_moments | Number of moments to balance in the optimization. A good default is 3. We do not recommend numbers higher than 4, and typically 2 or 3 is sufficient. Increasing n_moments typically decreases the effective sample size. |
| optim_method | Optimization method. Suggested default is L-BFGS-B, i.e., the limited-memory BFGS algorithm with bounding constraints. This method allows for setting bounding constraints that ensure that the objective does not overflow. The other acceptable input is BFGS. |
| max_iters | Maximum number of iterations of the optimization routine. Suggested default is 1000. |
| verbose | Logical indicating whether the optimization should print information to the screen. A value of FALSE implies no printing. |

Based on the values of the above variables, additional inputs may be necessary:

| Variable Name | Condition; Description/Options |
| ----------------- | ----------------------------------------------------------------- |
| estimand | Condition: if exp_type == "binary" or exp_type == "multi". The estimand of interest, which constructs the targets to reweight the data to. For the average treatment effect in the population (ATE), set estimand = "ATE". When the exposure is binary, set estimand = "ATT" for the average effect on the treated or estimand = "ATC" for the effect on the control group. When the exposure is multi, set estimand = "ATZ". |
| which_z | Condition: if estimand == "ATT", estimand == "ATC", or estimand == "ATZ". Defines the baseline, or referent, group. |
| bal_tol | Condition: if optim_method == "L-BFGS-B". Defines a tolerance on the gradient of the optimization. The gradient directly relates to the balance conditions. |
| opt_constraints | Condition: if optim_method == "L-BFGS-B". Bounding constraints on the parameters. Defaults to c(-100, 100). Changing these values should only be explored if there are convergence issues or if balance cannot be attained. |
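For orientation, here is a minimal sketch of a hand-built eb_pars list for a binary exposure with an ATT estimand, using the field names and defaults documented above. The value which_z = 1 (marking the treated group as the referent) is an assumption for illustration; in practice the helper functions described next are the easier route.

# Sketch: a hand-built eb_pars list; field names follow the tables above
eb_pars <- list(
  exp_type = 'binary',            # binary exposure
  estimand = 'ATT',               # average effect on the treated
  which_z = 1,                    # assumed referent group: TA == 1
  n_moments = 3,                  # balance the first three moments
  optim_method = 'L-BFGS-B',      # bounded limited-memory BFGS
  bal_tol = 1e-08,                # tolerance on the balance gradient
  opt_constraints = c(-100, 100), # documented default bounds
  max_iters = 1000,
  verbose = FALSE
)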

Helper Functions for eb_pars

There are three functions that can be used to create a list of parameters for the user, so that they do not have to be built by hand. These are:

# Returns list of default parameters for binary exposure weights
ebpars_default_binary()  

# Returns list of default parameters for multi-exposure weights
ebpars_default_multi()   

# Returns list of default parameters for continuous exposure weights
ebpars_default_cont()    

For example, ebpars_default_binary() returns a list as follows:

> ebp <- entbal::ebpars_default_binary()
> ebp
$exp_type
[1] "binary"

$n_moments
[1] 3

$max_iters
[1] 1000

$estimand
[1] "ATE"

$verbose
[1] TRUE

$optim_method
[1] "l-bfgs-b"

$bal_tol
[1] 1e-08

$opt_constraints
[1] -1000  1000

These can be edited as necessary. For example, by changing the estimand as follows:

ebp$estimand <- 'ATT'

Or can be set to reduce printing to the screen:

ebp$verbose <- FALSE

These defaults are often helpful for quickly iterating with the software and reducing frustrations by ensuring that the list of eb_pars has the required variables.

Binary Exposures

A Simple Example for ATE Weights: Data Generation & Exploration

Consider the following data-generating mechanism and assume an ultimate goal of population inference, i.e., estimation of the ATE:

set.seed(2020)
nobs <- 1000
X1 <- rnorm(nobs); X2 <- rnorm(nobs); X3 <- rnorm(nobs);
ps <- plogis(3 * X1 - 0.75 * (X2 - 0.5)^2 +  2 * X3 +  X2 * X3 + 0.75)
TA <- rbinom(nobs, 1, ps)
dat <- data.frame('TA' = TA, 'X1' = X1, 'X2' = X2, 'X3' = X3)

We can visualize the marginal distributions of the covariates (white histograms) as well as the conditional distributions of covariates given the exposure assignment.

We see that the effect of non-equal exposure probability is two conditional distributions that are some distance apart, and that neither looks like the population distribution where we hope to perform inference (the marginal distribution). The second row of the figure shows the empirical cumulative distribution function of each conditional distribution along with the true population distribution (black line). Again, these demonstrate that there are differences in the distributions of the covariates in each exposure group and that neither is similar to the population over which we hope to perform inference.

plot_hists <- function(xval, ta, brks){
  # Marginal histogram with conditional histograms overlaid by exposure group
  hist(xval, breaks = brks)
  hist(xval[ta == 0], breaks = brks, add = TRUE, col = col_c(0.5))
  hist(xval[ta == 1], breaks = brks, add = TRUE, col = col_t(0.5))
}
plot_ecdf <- function(xval, ta, xpts){
  # Empirical CDFs by exposure group against the true population CDF
  x_0_ecdf <- ecdf(xval[ta == 0])
  x_1_ecdf <- ecdf(xval[ta == 1])
  plot(0, type = 'n', xlim = c(-4, 4), ylim = c(0, 1))
  lines(xpts, pnorm(xpts))
  lines(xpts, x_0_ecdf(xpts), lwd = 3, col = col_c(0.75))
  lines(xpts, x_1_ecdf(xpts), lwd = 3, col = col_t(0.75))
}

brks <- seq(-4,4,length.out = 25)
xpts <- seq(-4,4,length.out = 100)
col_c <- function(x) rgb(0,0,0.75,x)
col_t <- function(x) rgb(0.75,0,0,x)

par(mfrow = c(2,3), oma = c(0,0,0,0), mar = c(4,4,2,2)+0.1)
plot_hists(X1, TA, brks)
plot_hists(X2, TA, brks)
plot_hists(X3, TA, brks)
legend('topright', c('Exposed', 'Control', 'Marginal'), fill = c(col_t(.5), col_c(.5), 'white'))

plot_ecdf(X1, TA, xpts)
plot_ecdf(X2, TA, xpts)
plot_ecdf(X3, TA, xpts)

Similarly, for binary exposures we can summarize these differences numerically using a function in the entbal package called baltable. Its usage is

baltable(Xmat, TA, wts = NULL, show_unweighted = T, n_digits = 3)

When we apply this function on the synthetic data we see the following:

entbal::baltable(dat[,2:4], dat$TA)

In this table, MeanGroup0 and MeanGroup1 are the means in each group and SEGroup0 and SEGroup1 are the estimated standard deviations in each group. The column labeled StdDiffMeans is the standardized difference in means, calculated as $$ \Delta_j = \frac{\bar X_{1j} - \bar X_{0j}}{\sqrt{\frac{1}{2}\left(s_{0j}^2 + s_{1j}^2\right)}}, $$ the column labeled LogRatioSE is calculated as $$ \Gamma_j = \log(s_{1j}) - \log(s_{0j}), $$ and the column labeled MaxKS is the maximum difference between the empirical cumulative distribution functions of the two groups.
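These statistics are straightforward to compute by hand. The sketch below reproduces the three quantities for the covariate X1 in the simulated data; evaluating the KS statistic on the pooled sample points is an assumption about the implementation, and baltable's internal computation may differ in detail.

# Sketch: balance statistics for X1, computed directly from the definitions
x0 <- dat$X1[dat$TA == 0]
x1 <- dat$X1[dat$TA == 1]
std_diff  <- (mean(x1) - mean(x0)) / sqrt((var(x0) + var(x1)) / 2)
log_ratio <- log(sd(x1)) - log(sd(x0))
grid      <- sort(dat$X1)
max_ks    <- max(abs(ecdf(x1)(grid) - ecdf(x0)(grid)))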

A Simple Example for ATE Weights: Weight Estimation

To estimate weights, we first set the parameters for the optimization

ebp <- ebpars_default_binary(estimand = 'ATE')

We can see what parameters are generated by examining ebp.

ebp

We see that the estimand is the ATE and that the algorithm intends to balance three moments. We can use these within the entbal function to calculate weights (recall from the last section that the data are stored in dat). In the example below, the formula "TA ~ ." implies that the weights should be estimated with TA as the exposure variable and that all other variables should be balanced. WARNING: be cautious about using this formula dot notation in your applications if your outcome variable is in the data set!

ebmod <- entbal(TA ~ ., 
                data = dat, 
                eb_pars = ebp)

The algorithm appears to have converged and now we can assess balance given the weights.

The function to do this is summary.

summary(ebmod)

summary first reports the reference levels for each label in the table. It then provides the unweighted balance statistics in the top panel and the weighted balance statistics in the lower panel. The table concludes with assessments of the effective sample sizes.

Here we see that the algorithm has provided excellent balance: the StdDiffMeans and LogRatioSE are both zero and the MaxKS is small.

For binary exposures only, the entbal package provides simple plotting of the empirical CDFs (both weighted and unweighted). Using plot on the ebmod object with the option which_vars specifies which columns of the data frame to plot.

plot(ebmod, which_vars = c(1,2,3))

This was a simple example that demonstrated how to get weights for ATE estimation, but it does not touch on issues of estimating the outcome function. Using these weights is discussed in the next section.

ATT Weights, Outcome Estimation, and the LaLonde Data Sets

In order to provide an example where the outcome is known in advance, we leverage the LaLonde (1986) Evaluating the Econometric Evaluations of Training Programs with Experimental Data data sets that are available in the package. The data sets from LaLonde (1986) included in entbal are described most accurately in Dehejia & Wahba (1999) Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs.

The dw99 data sets are provided with the entbal package and can be loaded using

# Experimental Evaluation Data
data("dw99nsw")

# Observation Controls from Current Population Survey
data("dw99cps1")
data("dw99cps2")
data("dw99cps3")

# Observation Controls from the Panel Study on Income Dynamics
data("dw99psid1")
data("dw99psid2")
data("dw99psid3")

In the names above, nsw refers to the National Supported Work Demonstration, which was the source of the experimental evaluation.

The psid and cps data sets are combined with the nsw treatment units to create a pseudo-observational setting where we know the experimental result and can compare against it. The psid1 and cps1 data sets are the largest and represent the full set of observational controls. Each subsequent data set was curated by hand under the assumption that the observations were "closest" to those from the nsw treated units. Throughout we will use the dw99psid2 data set to keep computational runtimes manageable.

Each data set contains the following variables

| Variable Name | Type | Description/Options |
| -------------- | --------- | ------------------------------------------------ |
| TA | Exposure | Indicator of treated unit: 1 $\equiv$ exposed |
| age | Covariate | Age of individual - continuous |
| education | Covariate | Highest grade of education - continuous |
| black | Covariate | Indicator of being "Black" - binary |
| hispanic | Covariate | Indicator of being "Hispanic" - binary |
| married | Covariate | Indicator of being "Married" - binary |
| nodegree | Covariate | Indicator of having "No Degree" - binary |
| RE74 | Covariate | Retrospective earnings, 1974 - continuous |
| RE75 | Covariate | Retrospective earnings, 1975 - continuous |
| RE78 | Outcome | Earnings, 1978 - continuous |

Performing a simple analysis using the nsw experimental data, we see that the effect of the program was approximately a difference of $\$1794.3$ in earnings in 1978.

summary(lm(RE78 ~ TA, data = dw99nsw))

Performing a similar analysis using the observational data set would imply a negative effect of the program where there was approximately a difference of $-\$3646.8$ in earnings in 1978 using the observational controls.

summary(lm(RE78 ~ TA, data = dw99psid2))

Let's now demonstrate how to control for variables using entropy balancing to ensure these groups are more similar. In this example, because we are pairing the nsw treated units to a larger pool of controls, the goal is estimating the treatment effect on the treated, or the ATT. This is because the experimental units were not representative of the overall population.

To set up our weight estimation, we first load the data. We remove the outcome variable from the data set to simplify the formula for entbal and then define the parameters using ebpars_default_binary, setting estimand = "ATT".

data("dw99psid2")
outcome <- dw99psid2[, ncol(dw99psid2)]  # RE78: earnings in 1978 (the outcome)
dat <- dw99psid2[, -ncol(dw99psid2)]     # exposure TA and covariates only
ebpars <- ebpars_default_binary(estimand = 'ATT')

Running the entropy balancing algorithm we see that we get very good balance, but there is a low effective sample size after applying these weights.

eb_lalonde <- entbal(TA ~ ., 
                     eb_pars = ebpars, 
                     data = dat)
summary(eb_lalonde)

We can then use these weights with the survey package in R to estimate the effect.

library(survey)
sdesign <- svydesign(ids = ~1, weights = eb_lalonde$wts, data = dw99psid2)
sglm <- svyglm(RE78 ~ TA, sdesign)
summary(sglm)

We observe that there is now a positive effect of the exposure and that it is approximately the same magnitude as the estimate from the experimental nsw data set.
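As a quick cross-check, the same point estimate can be computed directly from the estimated weights as a contrast of weighted means, mirroring the estimator introduced earlier; this sketch uses the wts element of the fitted object, as in the svydesign call above.

# Sketch: direct weighted-means contrast using the entropy balancing weights
w <- eb_lalonde$wts
weighted.mean(outcome[dat$TA == 1], w[dat$TA == 1]) -
  weighted.mean(outcome[dat$TA == 0], w[dat$TA == 0])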

Multi-leveled Exposures

The entbal package can be useful for estimation beyond binary exposure settings. In this section, we explore usage for multi-leveled exposures. Below we provide code to simulate a simple data-generating process for a three-level exposure setting.

set.seed(2021)
create_multisimdata <- function(n_obs = 1000, n_dim = 5){
  # Correlated covariates: X = Z %*% t(Amat) for a random mixing matrix Amat
  Amat <- matrix(rnorm(n_dim**2), n_dim, n_dim)
  Z <- matrix(rnorm(n_obs * n_dim), nrow = n_obs, ncol = n_dim)
  X <- Z %*% t(Amat)
  XD <- cbind(1, X)

  # Random coefficients for the three linear predictors
  B1 <- rnorm(n_dim + 1, sd = 0.25)
  B2 <- rnorm(n_dim + 1, sd = 0.25)
  B3 <- rnorm(n_dim + 1, sd = 0.25)

  f1 <- XD %*% B1
  f2 <- XD %*% B2
  f3 <- XD %*% B3

  # Softmax assignment probabilities for the three exposure levels
  p1 <- exp(f1) / (exp(f1) + exp(f2) + exp(f3))
  p2 <- exp(f2) / (exp(f1) + exp(f2) + exp(f3))
  p3 <- exp(f3) / (exp(f1) + exp(f2) + exp(f3))

  p <- cbind(p1, p2, p3)

  # Draw each unit's exposure level from its assignment probabilities
  A <- apply(p, 1, function(x) sample(1:3, size = 1, prob = x))

  data.frame('TA' = A, X)
}

mdat <- create_multisimdata()

Default parameters for the entbal function can be set up using ebpars_default_multi, and summary is provided to assess covariate balance. By default ebpars_default_multi returns an estimand of the ATE type.

m_ebpars <- ebpars_default_multi()
m_ebpars$verbose <- FALSE

We then estimate weights and assess balance:

m_ate_wts <- entbal(TA ~ ., eb_pars = m_ebpars, data = mdat)
summary(m_ate_wts)

From the output of summary, we see that the first panel has four sets of numbers representing the unweighted summary statistics: those for the targets implied by the estimand (Mean, SD) and those for the conditional distributions of each exposure group (M:3 SD:3, M:2 SD:2, M:1 SD:1).

The next panel contains the weighted summary statistics, and we can see that these all look similar to the targets. The remaining panels are balance statistics that are versions of the standardized difference in means and the log of the ratio of standard deviations. We see that the balance is exact in this case.
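We can also verify one of these conditions by hand. The sketch below assumes the fitted object stores its weights in m_ate_wts$wts (the same element used with the survey package earlier) and compares the weighted group means of X1 to the marginal mean they should all match.

# Sketch: weighted means of X1 within each exposure group vs. the ATE target
w <- m_ate_wts$wts
sapply(1:3, function(a) weighted.mean(mdat$X1[mdat$TA == a], w[mdat$TA == a]))
mean(mdat$X1)  # the target that all three weighted means should match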

Another useful setting is average-treatment-effect-on-the-treated style estimation with multiple treatments. In entbal you can specify the parameters using the estimand = 'ATZ' argument to ebpars_default_multi; below we set which_z = 2 to indicate that Group 2 should be the reference group.

m_ebpars <- ebpars_default_multi(estimand = 'ATZ', which_z = 2)
m_ebpars$verbose <- FALSE

Running entbal with these parameters, we see that the targets now reflect the Group 2 statistics and all weighted values are closer to those.

m_atz_wts <- entbal(TA ~ ., eb_pars = m_ebpars, data = mdat)
summary(m_atz_wts)

Continuous Exposures

Finally, entbal can handle continuous exposures. Below we create a function that simulates data using the process described in Vegetabile et al. (2021) Nonparametric estimation of population average dose-response curves using entropy balancing weights for continuous exposures.

set.seed(2021)
create_contsimdata <- function(n_obs = 1000){
  MX1 <- -0.5
  MX2 <- 1
  MX3 <- 0.3

  X1 <- rnorm(n_obs, mean = MX1, sd = 1)
  X2 <- rnorm(n_obs, mean = MX2, sd = 1)
  X3 <- rnorm(n_obs, mean = 0, sd = 1)
  X4 <- rnorm(n_obs, mean = MX2, sd = 1)
  X5 <- rbinom(n_obs, 1, prob = MX3)

  Z1 <- exp(X1 / 2)
  Z2 <- (X2 / (1 + exp(X1))) + 10
  Z3 <- (X1 * X3 / 25) + 0.6
  Z4 <- (X4 - MX2)**2 
  Z5 <- X5

  muA <- 5 * abs(X1) + 6 * abs(X2) + 3 * abs(X5) + abs(X4)

  A <- rchisq(n_obs, df = 3, ncp = muA)

  datz <- data.frame('TA' = A, 'X1' = Z1, 'X2' = Z2, 'X3' = Z3, 'X4' = Z4, 'X5' = Z5)
  datz
}
cdat <- create_contsimdata()

As with the previous two examples, ebpars_default_cont is useful for setting up the parameters for usage.

c_ebpars <- ebpars_default_cont()
c_ebpars$verbose <- FALSE

The algorithm is simply run using these parameters as was done in earlier examples.

c_wts <- entbal(TA ~ ., eb_pars = c_ebpars, data = cdat)
summary(c_wts)

Entropy balancing for continuous exposures should ensure that there is approximately no correlation between the exposure and the covariates after weighting. We see that this is the case in the summary statistics.
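One way to verify this directly is to compute the weighted correlations between the exposure and each covariate. The sketch below assumes the weights are stored in c_wts$wts, as with the fitted objects above.

# Sketch: weighted exposure-covariate correlations; these should be near zero
wcor <- cov.wt(cdat, wt = c_wts$wts, cor = TRUE)
round(wcor$cor['TA', -1], 3)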

Conclusion

Hopefully this has provided a quick introduction to entropy balancing and the entbal package. This package is constantly being developed to include advances in entropy balancing. Please stop by often as things change.


