This R package provides a big-data-friendly and memory-efficient difference-in-differences (DiD) estimator for staggered (and non-staggered) treatment contexts. It supports controlling for time-varying covariates, heteroskedasticity-robust standard errors, and (single and multi-way) clustered standard errors. It addresses 4 issues that arise in the context of large administrative datasets:
1. DiDforBigData will provide estimation and inference for staggered DiD with millions of observations on a personal laptop. It is orders of magnitude faster than other available software if the sample size is large; see the demonstration here.
2. DiDforBigData helps by using much less memory than other software; see the demonstration here.
3. It depends on data.table for big data management and sandwich for robust standard error estimation, which are already installed with most R distributions. Optionally, it will use the fixest package to speed up the estimation if it is installed. If the progress package is installed, it will also provide a progress bar so you know how much longer the estimation will take.
4. DiDforBigData makes parallelization easy as long as the parallel package is installed.

To install the package from CRAN:
install.packages("DiDforBigData")
To install the package from GitHub:
devtools::install_github("setzler/DiDforBigData")
To use the package after it is installed:
library(DiDforBigData)
It is also recommended to check that these optional packages are installed, for example by loading them:
library("progress")
library("fixest")
library("parallel")
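Because library() stops with an error when a package is missing, a non-failing check can be more convenient on a shared server. The following is a minimal sketch using base R's requireNamespace(); the variable names are illustrative and not part of the package.

```r
# Check which optional packages are available without raising an error.
optional_packages <- c("progress", "fixest", "parallel")

# requireNamespace() returns TRUE or FALSE instead of stopping.
installed <- vapply(optional_packages, requireNamespace,
                    logical(1), quietly = TRUE)
print(installed)

# Report anything that is absent so it can be installed later.
not_installed <- optional_packages[!installed]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}
```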
There are only 3 functions in this package:
1. SimDiD(): This function simulates data.
2. DiDge(): This function estimates DiD for a single cohort and a single event time.
3. DiD(): This function estimates DiD for all available cohorts and event times.

Details for each function are available from the Function Documentation.
Before estimation, set up a variable list with the names of your variables:
varnames = list()
varnames$time_name = "year"
varnames$outcome_name = "Y"
varnames$cohort_name = "cohort"
varnames$id_name = "id"
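As an illustration of the data layout these names point to, the sketch below builds a small toy panel in base R and checks that every name in the list matches a column. The layout is an assumption made for the example (one row per unit and year, with `cohort` holding the year in which a unit is first treated), and all values are made up; consult the function documentation for the exact requirements.

```r
# Hypothetical toy panel: 6 units observed yearly from 2005 to 2015.
yourdata <- expand.grid(id = 1:6, year = 2005:2015)

# Assumption: 'cohort' is the year in which each unit is first treated.
first_treated <- c(2008, 2008, 2010, 2010, 2012, 2012)
yourdata$cohort <- first_treated[yourdata$id]

# Made-up outcome: noise plus an effect of 1 once a unit is treated.
set.seed(1)
yourdata$Y <- rnorm(nrow(yourdata)) + 1 * (yourdata$year >= yourdata$cohort)

# Same variable list as above, written in one call.
varnames <- list(time_name = "year", outcome_name = "Y",
                 cohort_name = "cohort", id_name = "id")

# Verify that each name refers to an existing column before estimating.
missing_columns <- setdiff(unlist(varnames), names(yourdata))
if (length(missing_columns) > 0) {
  stop("Columns not found: ", paste(missing_columns, collapse = ", "))
}
```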
To estimate DiD for a single cohort and event time, use the DiDge command. For example:
DiDge(inputdata = yourdata, varnames = varnames,
      cohort_time = 2010, event_postperiod = 3)
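To make the quantity concrete, here is a hand-computed two-group, two-period DiD for cohort 2010 at event time 3, written in base R on noise-free toy data. It is a sketch of the textbook comparison, not the package's implementation: the base period (one year before treatment) and the control group (units first treated after the post-period year) are assumptions chosen for the example.

```r
# Toy balanced panel: ids 1-2 are treated in 2010, ids 3-4 in 2016.
panel <- expand.grid(id = 1:4, year = 2008:2014)
panel$cohort <- ifelse(panel$id <= 2, 2010, 2016)

# Outcome: unit effect + common trend + a treatment effect of 2.
panel$Y <- panel$id + 0.5 * (panel$year - 2008) +
  2 * (panel$year >= panel$cohort)

cohort_time <- 2010
event_postperiod <- 3
base_event <- -1                      # assumed base period for this sketch

post_year <- cohort_time + event_postperiod   # 2013
pre_year <- cohort_time + base_event          # 2009

# Change in the mean outcome between the base year and the post year.
mean_change <- function(d) {
  mean(d$Y[d$year == post_year]) - mean(d$Y[d$year == pre_year])
}

treated <- panel[panel$cohort == cohort_time, ]
control <- panel[panel$cohort > post_year, ]  # not yet treated by 2013

did_estimate <- mean_change(treated) - mean_change(control)
print(did_estimate)
```

Because the toy outcome has no noise and a common trend, the estimate recovers the built-in effect of 2 exactly.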
A detailed manual explaining the various features available in DiDge is available here or by running this command in R:
?DiDge
To estimate DiD for many cohorts and event times, use the DiD command. For example:
DiD(inputdata = yourdata, varnames = varnames,
    min_event = -3, max_event = 5)
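As a rough picture of what an event window from -3 to 5 covers, the sketch below enumerates the cohort and event-time pairs in a hypothetical sample and keeps those whose calendar year is observed. The cohorts and sample years are made up, and this only illustrates the bookkeeping; how DiD() itself treats unobserved pairs, base periods, and control groups is described in its manual.

```r
# Hypothetical sample: three cohorts, data observed from 2005 to 2015.
cohorts <- c(2008, 2010, 2012)
sample_years <- 2005:2015
min_event <- -3
max_event <- 5

# Every combination of cohort and event time in the requested window.
pairs <- expand.grid(cohort = cohorts, event = min_event:max_event)

# Calendar year in which each pair would be measured.
pairs$year <- pairs$cohort + pairs$event

# Keep only the pairs whose calendar year appears in the sample.
pairs <- pairs[pairs$year %in% sample_years, ]
pairs <- pairs[order(pairs$cohort, pairs$event), ]

# Number of observable event times per cohort.
print(table(pairs$cohort))
```

In this toy setup the 2012 cohort contributes fewer event times than the others, because its later event times fall after the last sample year.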
A detailed manual explaining the various features available in DiD is available here or by running this command in R:
?DiD
For more information, read the following articles:
Acknowledgements: Thanks to Mert Demirer and Kirill Borusyak for helpful comments.