ECICourse::upload_images_to_canvas() knitr::opts_chunk$set( echo = TRUE, message = TRUE, collapse = TRUE )
Read along with this tutorial, and run the R code in RStudio as you go (copy and paste, or type directly into R).
Start by loading packages:
library(ECICourse) library(tidyverse)
Also load the DeclareDesign package:
library(DeclareDesign)
In this tutorial, we will simulate experiments with both complete and block random assignment. And for each randomization strategy, we will estimate the average treatment effect (ATE) using difference-in-means and regression estimators. The focus will be on understanding how the variance (or standard deviation) of the estimators is affected by different randomization and estimation strategies. We will not attempt to determine whether a particular assignment and estimation strategy is the "best," because there is no such thing as a universally "best" experimental design.
Imagine we will conduct an experiment involving a sample of employees from a large company (with a population of 10,000 employees). The experiment will randomly assign some employees to receive special job training, and we hope to measure the impact of this training on subsequent job performance. There are two variables that are expected to correlate with job performance. One is a binary variable, W
, indicating whether the employee received a bonus in the previous year. The other is an ordered categorical variable describing employees' level of education, X
.
First, we will define the population and a random sample for the experiment.
population <- declare_population( N = 10000, W = draw_binary(prob = .5, N = N), # bonus X = draw_ordered(x = rnorm(N, 1 + .5 * W, 2), # formal education breaks = c(0, 1, 1.5, 3.5)))
W
(bonus) is a binary variable, and X
(education) is an ordered categorical variable. Simulated values of X
are higher among employees with W = 1
and lower among employees with W = 0
. Conversely, employees with higher values of X
are more likely to have received a bonus:
population() |> as_tibble() |> group_by(X) |> summarise(prop_with_bonus = mean(W)) |> ggplot() + aes(x = X, y = prop_with_bonus) + geom_col() + labs(y = 'Proportion who received bonus', x = 'Education level') + theme_bw()
The experiment will measure an outcome variable Y
(job performance). The potential outcomes are expected to differ under treatment Z = 1
and control Z = 0
by a constant amount. Moreover, the potential outcomes for job performance are expected to be dependent on W
and X
. The code below simulates these potential outcomes.
potential_outcomes <- declare_potential_outcomes( Y_Z_0 = round(10 * pmin(10, 5 + 2*W + X + rnorm(N)), 0)/10, Y_Z_1 = pmin(10, Y_Z_0 + .5))
The value of Y_Z_0
(i.e., the potential outcome Y(0)) is a random normal variable with mean 5 + 2*W + X
and a standard deviation of 1, rounded to a single decimal place, and taking a maximum value of 10. The value of Y_Z_1
is equal to Y_Z_0 + .5
. We can plot simulated values of Y_Z_0
(i.e., Y(0), or the job performance rating without the training):
(population + potential_outcomes) |> draw_data() |> ggplot(aes(x = factor(X), y = Y_Z_0, colour = factor(W))) + geom_point(position = position_jitterdodge( jitter.width = .2, jitter.height = 0, dodge.width = .5), alpha = .25) + labs(y = 'Y(0)', x = 'X', colour = 'W') + guides(colour = guide_legend(override.aes = list(alpha = 1))) + theme_bw()
The experiment will attempt to measure the ATE, which in these simulations, is equal to .5. This is the increase in job performance caused by the training.
ate <- declare_inquiry(ATE = mean(Y_Z_1 - Y_Z_0), label = 'ATE')
For the simulated experiments, we will always work with sample of N = 300
drawn from this population.
exp_sample <- declare_sampling(S = draw_rs(N = N, n = 300))
The remaining design choices differ according to the treatment assignment and estimation strategies. So for now we will combine the previous steps into a single object called basic_design
.
basic_design <- population + potential_outcomes + ate + exp_sample
The most straightforward randomization strategy is complete randomization with equal probability of assignment into treatment and control:
complete_assn <- declare_assignment( Z = complete_ra(N, prob = .5), label = 'Complete')
The other strategies we will consider are block random assignment, blocked by covariates W
, X
, or both W
and X
jointly.
block_W_assn <- declare_assignment( Z = conduct_ra(prob = .5, blocks = factor(W)), label = 'Block by W') block_X_assn <- declare_assignment( Z = conduct_ra(prob = .5, blocks = factor(X)), label = 'Block by X') block_WX_assn <- declare_assignment( Z = conduct_ra(prob = .5, blocks = factor(W):factor(X)), label = 'Block by W and X')
These different randomization strategies lead to different numbers of individuals assigned to treatment and control in each of the subgroups defined by values of W
(bonus) and X
(education). To illustrate, consider a simulated experiment based on complete random assignment. We can tabulate the number of individuals assigned to treatment or control (Z
) at each level of X
:
(basic_design + complete_assn) |> draw_data() |> count(X, Z) |> pivot_wider(names_from = Z, values_from = n, names_prefix = 'Z == ')
Now compare the distribution of treatment assignments against the treatment assignments obtained through block randomization on X
:
(basic_design + block_X_assn) |> draw_data() |> count(X, Z) |> pivot_wider(names_from = Z, values_from = n, names_prefix = 'Z == ')
Within each block, the number of individuals assigned to treatment and control differs by at most 1. Similarly, under blocked random assignment, the values of X
and W
within each of the two treatment groups should be more balanced (i.e., the distributions of X
and W
should be relatively similar in the treatment and control groups) when compared to complete randomization.
(basic_design + complete_assn) |> draw_data() |> group_by(Z) |> summarise(avg_X = mean(X), avg_W = mean(W)) (basic_design + block_WX_assn) |> draw_data() |> group_by(Z) |> summarise(avg_X = mean(X), avg_W = mean(W))
The most straightforward estimator is the simple difference-in-means estimator. This estimator calculates the mean value of Y
among employees with Z == 1
and Z == 0
, and then takes the difference between these two values. We can define such an estimator in DeclareDesign as:
ate_dim <- declare_estimator(Y ~ Z, .method = difference_in_means, inquiry = "ATE", label = 'DIM')
In practice, there are a number of ways we might implement such an estimator. One is using the mean
function (taking averages of Y
over different groups of employees depending on their value of Z
). Another is using a linear regression without covariates. Consider the following simulated experiment, based on complete random assignment:
reveal_Y <- declare_measurement(Y = reveal_outcomes(Y ~ Z)) sim <- (basic_design + complete_assn + reveal_Y) |> draw_data() |> as_tibble() sim
Estimating the ATE using mean
looks something like this:
sim |> group_by(Z) |> summarise(mean_Y = mean(Y)) |> pivot_wider(names_from = 'Z', values_from = 'mean_Y', names_prefix = 'avg Y|Z=') |> mutate(ATE = `avg Y|Z=1` - `avg Y|Z=0`)
Estimating the ATE using linear regression (and no other covariates) looks like this:
fit <- lm(Y ~ Z, data = sim) summary(fit)
The average treatment effect in this regression is the coefficient for Z
, which equals r coef(fit)['Z']
. As you can see, the two methods produce the same estimated ATE (the change in job performance due to the job training). This is not true in general, but in this special case, the two methods agree on the estimated treatment effect.
The other estimators we will consider are based on linear regression with covariates. There are three such estimators we will work with, corresponding with the three block randomization strategies defined earlier.
ate_reg_W <- declare_estimator(Y ~ Z + W, .method = lm_robust, term = 'Z', inquiry = "ATE", label = 'reg Y ~ Z + W') ate_reg_X <- declare_estimator(Y ~ Z + X, .method = lm_robust, term = 'Z', inquiry = "ATE", label = 'reg Y ~ Z + X') ate_reg_WX <- declare_estimator(Y ~ Z + W + X, .method = lm_robust, term = 'Z', inquiry = "ATE", label = 'reg Y ~ Z + W + X')
As the code above implies, in practice, we use lm
to estimate the ATE when adjusting or controlling for covariates. For example:
fit <- lm(Y ~ Z + W + X, data = sim) summary(fit)
In this regression, the coefficient for Z
is the estimated average treatment effect, and the variables W
and X
are included as control variables in order to reduce the variance of the estimated treatment effect. (A warning: It is not automatically the case that a coefficient named after a treatment variable will equal the average treatment effect. It is true in these examples based on OLS, but this is not true for other types of regression.)
We can simulate many experiments using each of the 4 random assignment strategies, and then estimate the ATE using the difference-in-means estimator.
complete_dim <- basic_design + complete_assn + reveal_Y + ate_dim block_W_dim <- basic_design + block_W_assn + reveal_Y + ate_dim block_X_dim <- basic_design + block_X_assn + reveal_Y + ate_dim block_WX_dim <- basic_design + block_WX_assn + reveal_Y + ate_dim comparison_dim <- diagnose_designs(complete_dim, block_W_dim, block_X_dim, block_WX_dim, sims = 1000) # this might take a minute or two to complete comparison_dim
The SD Estimate
column gives the standard deviation of the estimated ATEs for each of the four assignment strategies. Compared to complete randomization, blocking by W
(bonus) reduces the standard deviation, but not by much. Blocking by X
(education) lowers the standard deviation by a more substantial amount, and blocking by X
and W
lowers it even further. We can visualize the dispersion of the estimated ATEs across all simulations in order to see these differences more clearly:
comparison_dim$simulations_df |> transmute(estimate, assignment = str_replace(design, '_dim', '') |> factor() |> fct_inorder()) |> ggplot(aes(x = estimate, colour = assignment)) + stat_density(alpha = 0, position = 'identity') + labs(colour = 'Assignment', x = 'Estimated ATE', title = 'Estimates from DIM Estimators') + theme_bw()
Next, we will perform a similar analysis as above. This time, however, we will simulate all of the experiments using complete randomization, and estimate the ATE using the four different estimators (difference-in-means and the three regressions with covariates).
# complete_dim was defined above complete_reg_W <- basic_design + complete_assn + reveal_Y + ate_reg_W complete_reg_X <- basic_design + complete_assn + reveal_Y + ate_reg_X complete_reg_WX <- basic_design + complete_assn + reveal_Y + ate_reg_WX comparison_complete <- diagnose_designs(complete_dim, complete_reg_W, complete_reg_X, complete_reg_WX, sims = 1000) comparison_complete
The pattern of improvement parallels the previous analysis. Adding the covariate W
provides a small benefit, adding X
has a more substantial benefit, and adding W
and X
together yields further improvement over just X
.
comparison_complete$simulations_df |> transmute(estimate, estimator = str_replace(design, 'complete_', '') |> factor() |> fct_inorder()) |> ggplot(aes(x = estimate, colour = estimator)) + stat_density(alpha = 0, position = 'identity') + labs(colour = 'Estimator', x = 'Estimated ATE', title = 'Estimates Under Complete Assignment') + theme_bw()
The motivation behind blocked assignment is to remove variation in Y
(job performance) that is related to variation in the blocking variable(s) (e.g., bonus, education, or both). Once this variation in the outcome variable has been reduced via block randomization, the difference-in-means estimator should be about as efficient as a regression estimator that includes the blocking variable(s). We can still use the regression estimator to estimate the ATE for an experiment that uses block randomization, but because the coefficients for the blocking variables will be very close to zero, there is no expected benefit over the difference-in-means estimator. To illustrate, compare the regression estimator (with covariates W
and X
) against the difference-in-means estimators, when both estimators are applied to an experiment that randomizes treatment while blocking on W
and X
.
block_WX_reg_WX <- basic_design + block_WX_assn + reveal_Y + ate_reg_WX block_WX_dim <- basic_design + block_WX_assn + reveal_Y + ate_dim comparison_block_WX <- diagnose_designs(block_WX_reg_WX, block_WX_dim, sims = 1000) comparison_block_WX
There is almost no difference in the dispersion of the estimated ATEs, as we can see here:
comparison_block_WX$simulations_df |> transmute(estimate, estimator = str_replace(design, 'block_WX_', '') |> factor() |> fct_inorder()) |> ggplot(aes(x = estimate, colour = estimator)) + stat_density(alpha = 0, position = 'identity') + labs(colour = 'Estimator', x = 'Estimated ATE', title = 'Estimates Under Blocked Assignment by W and X') + theme_bw()
However, even if we block randomize by some variables, a regression estimator that includes covariates that were not used for blocking can yield improvements over the simple difference-in-means estimator. For example, if we block randomize treatment by W
(bonus), but then also include X
(education) in the regression estimator, we can see that the regression estimator that includes X
achieves a lower standard deviation than the difference-in-means estimator:
# block_W_dim was defined above block_W_reg_X <- basic_design + block_W_assn + reveal_Y + ate_reg_X comparison_block_W <- diagnose_designs(block_W_dim, block_W_reg_X, sims = 1000) comparison_block_W comparison_block_W$simulations_df |> transmute(estimate, estimator = str_replace(design, 'block_W_', '') |> factor() |> fct_inorder()) |> ggplot(aes(x = estimate, colour = estimator)) + stat_density(alpha = 0, position = 'identity') + labs(colour = 'Estimator', x = 'Estimated ATE', title = 'Estimates Under Blocked Assignment by W') + theme_bw()
The same thing is true for regression estimators. If we block assignment on W
, then a regression estimator that also includes X
as a covariate will have (in expectation) smaller standard deviation for the estimated ATE (the coefficient on Z
) than a regression that does not include X
:
sim2 <- draw_data(basic_design + block_W_assn + reveal_Y) fit_without_X <- lm(Y ~ Z, data = sim2) summary(fit_without_X) fit_with_X <- lm(Y ~ Z + X, data = sim2) summary(fit_with_X)
Notice that the standard error on Z
is smaller in the regression that includes X
as a covariate.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.