knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Before getting started, make sure that you have installed the package.
To install the package from CRAN:
install.packages("DiDforBigData")
To install the package from Github:
devtools::install_github("setzler/DiDforBigData")
If the package is installed, make sure that you load it:
library(DiDforBigData)
control_group
argumentDepending on empirical context, the parallel-trends assumption may only hold for one of these possible control groups:
"never-treated"
: These are the control units that are never observed receiving the treatment within the sampling time frame."future-treated"
: These are the control units that are observed receiving the treatment within the sampling time frame."all"
: These include both the "never-treated" and "future-treated" groups.DiDforBigData
makes it easy to choose among these possibilities by setting the control_group
argument of the DiD
function.
We start by simulating some data to work with:
sim = SimDiD(seed=123, sample_size = 1000) simdata = sim$simdata print(simdata)
We can check what the true ATT is in the simulated data. In particular, let's focus on the ATT for Cohort 2007:
print(sim$true_ATT[cohort==2007])
We also set up the variable names as a list so that DiDforBigData
knows which variable is the outcome, which variable is the treatment cohort, etc.:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
Now, we are ready to estimate DiD. We focus on the ATT for the 2007 cohort at event times -3,...,5. Initially, the control group is not set, so the function will use the default option control_group = "all"
:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5) print(did$results_cohort[Cohort==2007])
We now consider changing the estimation to only use the "never-treated" control group:
did = DiD(inputdata = simdata, varnames = varnames, control_group = "never-treated", min_event = -3, max_event=5) print(did$results_cohort[Cohort==2007])
Note that the sample size for the control group, Ncontrol
, is now substantially lower. Finally, we consider changing this to only use the "future-treated" group:
did = DiD(inputdata = simdata, varnames = varnames, control_group = "future-treated", min_event = -3, max_event=5) print(did$results_cohort[Cohort==2007])
We see that, because there are no future-treated cohorts left at event time +5, it is no longer possible to provide an ATT estimate at event time +5 if using the future-treated control group.
base_event
argumentIn DiD designs, the researcher must choose a base pre-period against which differences are measured. A natural choice is base_event = -1
, which means to use the time period just before treatment onset at event time 0.
As discussed by Callaway & Sant'Anna (2021), one possibility is that the treatment group begins responding to treatment prior to treatment onset, which is called "anticipation." If anticipation begins at period -1, then differences measured relative to base_event = -1
yield inconsistent DiD estimates.
Fortunately, there is an easy solution: choose a base pre-period from before anticipation began. If it seems reasonable to assume that the treatment group could not have anticipated treatment 3 years before treatment onset, then base_event = -3
should be free of anticipation.
We can simulate data that is subject to anticipation using the anticipation
argument in the SimDiD
function:
sim = SimDiD(seed=123, sample_size = 200, anticipation = 2) simdata = sim$simdata print(simdata)
Let's focus on the average ATT across all cohorts. There are two periods of treatment effects prior to treatment, which we can verify by checking the true ATT from the simulation:
print(sim$true_ATT[cohort=="Average"])
We set up the varnames
to prepare for estimation:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
If we estimate DiD using the default argument that base_event = -1
, the estimate will be biased and inconsistent:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=3) print(did$results_average)
We now set base_event = -3
to avoid the anticipation at -1 and -2:
did = DiD(inputdata = simdata, varnames = varnames, base_event = -3, min_event = -3, max_event=3) print(did$results_average)
We see that the estimate is now close to the true ATT.
covariate_names
entryWe sometimes worry that treatment cohorts are selected on time-varying observables, and those time-varying observables also directly affect the outcome. If the growth rate in the observables differs between the treatment and control groups, it creates a violation of the parallel-trends assumption: the treatment and control groups would have experienced different growth profiles in the absence of treatment due to their different observables.
Fortunately, this is easy to fix: Since the confounding variables are observable, we just need to control for those observables. DiDforBigData
requires only that the list of variable names, varnames
, is modified to include a covariate_names
entry. For example, varnames$covariate_names = c("X1","X2")
tells it to control linearly for time-variation in X1 and X2.
There is an option in SimDiD()
to add a couple of covariates.
sim = SimDiD(sample_size=1000, time_covars=TRUE) simdata = sim$simdata print(simdata) print(sim$true_ATT[cohort==2007])
There are two covariates, X1 and X2. In particular, X2 differs in growth rates across treatment cohorts, which means that it causes violations of parallel trends if not controlled.
We set up the varnames
to prepare for estimation, ignoring the covariates for now:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
We verify that DiD gives the wrong answer for cohort 2007:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5) print(did$results_cohort[Cohort==2007])
To control linearly for X1 and X2, we just need to add them to the covariate_names
argument of varnames
:
varnames$covariate_names = c("X1","X2")
Now we check DiD with controls for time-variation in X1 and X2:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -3, max_event=5) print(did$results_cohort[Cohort==2007])
We see that the bias has been removed thanks to the control variables.
cluster_names
entryBy default, this package always provides heteroskedasticity-robust standard errors. However, in difference-in-differences applications, it is often the case that treatment is assigned to groups of individuals (e.g., a change in state-wide policy treats all individuals in a state simultaneously). If those groups are also subject to common shocks, this induces correlation in the estimation errors within cluster, and standard errors will tend to be too small.
Fortunately, this is easy to fix: If the groups within which the estimation errors are correlated are known to the researcher, we just need to cluster standard errors by group. DiDforBigData
requires only that the list of variable names, varnames
, is modified to include a cluster_names
entry. For example, varnames$cluster_names = c("group1","group2")
tells it to use multi-way clustering in a way that accounts for common shocks to each of observable groups, "group1" and "group2".
Note: When estimating a regression that combines multiple treatment cohorts and/or multiple event times, it is necessary to always cluster on unit (individual). DiDforBigData
adds this clustering by default.
There is an option in SimDiD()
using clusters=TRUE
to group individuals into bins that are differentially selected for treatment and that also face common shocks within each bin:
sim = SimDiD(sample_size=1000, clusters = TRUE) simdata = sim$simdata print(simdata) print(sim$true_ATT[cohort=="Average"])
We set up the varnames
to prepare for estimation:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
We check the usual standard errors, which are clustered on unit based on varnames$id_name
by default:
did = DiD(inputdata = simdata, varnames = varnames, min_event = -1, max_event=3) print(did$results_average)
Next, we cluster on the "cluster" variable by adding it to the varnames
and re-estimating:
varnames$cluster_names = "cluster" did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3) print(did$results_average)
parallel_cores
argumentIf you have the parallel
package installed, it is trivial to execute your DiD estimation in parallel by setting the parallel_cores
argument in the DiD()
command. For example, DiD(..., parallel_cores = 4)
will utilize 4 cores in parallel. To determine how many cores are available on your system, just run the command parallel::detectCores()
in R. (I suggest leaving at least 1 core free to keep your system from freezing.)
While parallel processing usually does not interact well with the progress bar, I have written a modified version of R's parallelization protocol that correctly updates the progress bar as it completes its tasks. Thus, if you have the progress
package installed, you will correctly see a progress bar and predicted completion time while your DiD
estimation is executing in parallel.
We simulate some data:
sim = SimDiD(seed=123, sample_size = 1000) simdata = sim$simdata
We set up the varnames
to prepare for estimation:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
We run the estimation not in parallel:
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3)
We run the estimation in parallel with 2 processes:
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, parallel_cores = 2)
Esets
argumentSuppose you wish to average the DiD estimate across a few event times, with a corresponding standard error for the average across event times. This is done by providing a list to the Esets
argument in the DiD()
command.
For example, if Esets = list(c(1,2,3))
, then the output will include the average DiD for event times e=1,2,3
. Multiple sets of event times can be provided, e.g., Esets = list(c(1,2,3), c(1,3))
will provide the average across e=1,2,3
as well as the average for e=1
and e=3
.
Event set averages are returned in the $results_Esets
argument of the output list. Sample size is not provided, as it is unclear how to define the sample size when averaging multiple statistics, each of which has a different sample size.
We simulate some data:
sim = SimDiD(seed=123, sample_size = 1000) simdata = sim$simdata
We set up the varnames
to prepare for estimation:
varnames = list() varnames$time_name = "year" varnames$outcome_name = "Y" varnames$cohort_name = "cohort" varnames$id_name = "id"
We run the estimation with two event set averages:
did = DiD(inputdata = copy(simdata), varnames = varnames, min_event = -1, max_event=3, Esets = list(c(1,2,3), c(1,3))) print(did)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.