Nothing
#' Extended two-way fixed effects
#'
#' @md
#' @description
#' Estimates an "extended" two-way fixed effects regression, with fully
#' saturated interaction effects _a la_ Wooldridge (2021, 2023). At its heart,
#' `etwfe` is a convenience function that automates a number of tedious and
#' error prone preparation steps involving both the data and model formulae.
#' Computation is passed on to the \code{\link[fixest]{feols}} (linear) /
#' \code{\link[fixest]{feglm}} (nonlinear) functions from the **fixest**
#' package. `etwfe` should be paired with its companion [`emfx`] function.
#'
#' @param fml A two-side formula representing the outcome (lhs) and any control
#' variables (rhs), e.g. `y ~ x1 + x2`. If no controls are required, the rhs
#' must take the value of 0 or 1, e.g. `y ~ 0`.
#' @param tvar Time variable. Can be a string (e.g., "year") or an expression
#' (e.g., year).
#' @param gvar Group variable. Can be either a string (e.g., "first_treated")
#' or an expression (e.g., first_treated). In a staggered treatment setting,
#' the group variable typically denotes treatment cohort.
#' @param data The data frame that you want to run ETWFE on.
#' @param ivar Optional index variable. Can be a string (e.g., "country") or an
#' expression (e.g., country). Leaving as NULL (the default) will result in
#' group-level fixed effects being used, which is more efficient and
#' necessary for nonlinear models (see `family` argument below). However, you
#' may still want to cluster your standard errors by your index variable
#' through the `vcov` argument. See Examples below.
#' @param xvar Optional interacted categorical covariate for estimating
#' heterogeneous treatment effects. Enables recovery of the marginal
#' treatment effect for distinct levels of `xvar`, e.g. "child", "teenager",
#' or "adult". Note that the "x" prefix in "xvar" represents a covariate that
#' is *interacted* with treatment, as opposed to a regular control variable.
#' @param tref Optional reference value for `tvar`. Defaults to its minimum
#' value (i.e., the first time period observed in the dataset).
#' @param gref Optional reference value for `gvar`. You shouldn't need to
#' provide this if your `gvar` variable is well specified. But providing an
#' explicit reference value can be useful/necessary if the desired control
#' group takes an unusual value.
#' @param cgroup What control group do you wish to use for estimating treatment
#' effects. Either "notyet" treated (the default) or "never" treated.
#' @param fe What level of fixed effects should be used? Defaults to "vs"
#' (varying slopes), which is the most efficient in terms of estimation and
#' terseness of the return model object. The other two options, "feo" (fixed
#' effects only) and "none" (no fixed effects whatsoever), trade off
#' efficiency for additional information on other (nuisance) model
#' parameters. Note that the primary treatment parameters of interest should
#' remain unchanged regardless of choice.
#' @param family Which [`family`] to use for the estimation. Defaults to NULL,
#' in which case [`fixest::feols`] is used. Otherwise passed to
#' [`fixest::feglm`], so that valid entries include "logit", "poisson", and
#' "negbin". Note that if a non-NULL family entry is detected, `ivar` will
#' automatically be set to NULL.
#' @param ... Additional arguments passed to [`fixest::feols`] (or
#' [`fixest::feglm`]). The most common example would be a `vcov` argument.
#' @return A \code{\link[fixest]{fixest}} object with fully saturated
#' interaction effects, and a few additional attributes used for
#' post-estimation in `emfx`.
#'
#' @importFrom fixest demean feols feglm
#' @importFrom stats reformulate setNames
#' @importFrom Formula as.Formula
#'
#' @section Heterogeneous treatment effects:
#'
#' Specifying `etwfe(..., xvar = <xvar>)` will generate interaction effects
#' for all levels of `<xvar>` as part of the main regression model. The
#' reason that this is useful (as opposed to a regular, non-interacted
#' covariate in the formula RHS) is that it allows us to estimate
#' heterogeneous treatment effects as part of the larger ETWFE framework.
#' Specifically, we can recover heterogeneous treatment effects for each
#' level of `<xvar>` by passing the resulting `etwfe` model object on to
#' `emfx()`.
#'
#' For example, imagine that we have a categorical variable called "age" in
#' our dataset, with two distinct levels "adult" and "child". Running
#' `emfx(etwfe(..., xvar = age))` will tell us how the efficacy of treatment
#' varies across adults and children. We can then also leverage the in-built
#' hypothesis testing infrastructure of `marginaleffects` to test whether
#' the treatment effect is statistically different across these two age
#' groups; see Examples below. Note the same principles carry over to
#' categorical variables with multiple levels, or even continuous variables
#' (although continuous variables are not as well supported yet).
#'
#' @section Performance tips:
#'
#' Under most situations, `etwfe` should complete very quickly. For its part,
#' `emfx` is quite performant too and should take a few seconds or less for
#' datasets under 100k rows. However, `emfx`'s computation time does tend to
#' scale non-linearly with the size of the original data, as well as the
#' number of interactions from the underlying `etwfe` model. Without getting
#' too deep into the weeds, the numerical delta method used to recover the
#' ATEs of interest has to estimate two prediction models for *each*
#' coefficient in the model and then compute their standard errors. So, it's
#' a potentially expensive operation that can push the computation time for
#' large datasets (> 1m rows) up to several minutes or longer.
#'
#' Fortunately, there are two complementary strategies that you can use to
#' speed things up. The first is to turn off the most expensive part of the
#' whole procedure---standard error calculation---by calling `emfx(..., vcov
#' = FALSE)`. Doing so should bring the estimation time back down to a few
#' seconds or less, even for datasets in excess of a million rows. While the
#' loss of standard errors might not be an acceptable trade-off for projects
#' where statistical inference is critical, the good news is this first
#' strategy can still be combined our second strategy. It turns out that
#' collapsing the data by groups prior to estimating the marginal effects can
#' yield substantial speed gains of its own. Users can do this by invoking
#' the `emfx(..., collapse = TRUE)` argument. While the effect here is not as
#' dramatic as the first strategy, our second strategy does have the virtue
#' of retaining information about the standard errors. The trade-off this
#' time, however, is that collapsing our data does lead to a loss in accuracy
#' for our estimated parameters. On the other hand, testing suggests that
#' this loss in accuracy tends to be relatively minor, with results
#' equivalent up to the 1st or 2nd significant decimal place (or even
#' better).
#'
#' Summarizing, here's a quick plan of attack for you to try if you are
#' worried about the estimation time for large datasets and models:
#'
#' 0. Estimate `mod = etwfe(...)` as per usual.
#'
#' 1. Run `emfx(mod, vcov = FALSE, ...)`.
#'
#' 2. Run `emfx(mod, vcov = FALSE, collapse = TRUE, ...)`.
#'
#' 3. Compare the point estimates from steps 1 and 2. If they are are similar
#' enough to your satisfaction, get the approximate standard errors by
#' running `emfx(mod, collapse = TRUE, ...)`.
#'
#' @references
#' Wooldridge, Jeffrey M. (2021). \cite{Two-Way Fixed Effects, the
#' Two-Way Mundlak Regression, and Difference-in-Differences Estimators}.
#' Working paper (version: August 16, 2021). Available:
#' http://dx.doi.org/10.2139/ssrn.3906345
#'
#' Wooldridge, Jeffrey M. (2023). \cite{Simple Approaches to Nonlinear
#' Difference-in-Differences with Panel Data}. The Econometrics Journal,
#' 26(3), C31-C66. Available: https://doi.org/10.1093/ectj/utad016
#' @seealso [fixest::feols()], [fixest::feglm()] which power the underlying
#' estimation routines. [`emfx`] is a companion function that handles
#' post-estimation aggregation to extract quantities of interest.
#' @examples
#' \dontrun{
#' # We’ll use the mpdta dataset from the did package (which you’ll need to
#' # install separately).
#'
#' # install.packages("did")
#' data("mpdta", package = "did")
#'
#' #
#' # Basic example
#' #
#'
#' # The basic ETWFE workflow involves two consecutive function calls:
#' # 1) `etwfe` and 2) `emfx`
#'
#' # 1) `etwfe`: Estimate a regression model with saturated interaction terms.
#' mod = etwfe(
#' fml = lemp ~ lpop, # outcome ~ controls (use 0 or 1 if none)
#' tvar = year, # time variable
#' gvar = first.treat, # group variable
#' data = mpdta, # dataset
#' vcov = ~countyreal # vcov adjustment (here: clustered by county)
#' )
#'
#' # mod ## A fixest model object with fully saturated interaction effects.
#'
#' # 2) `emfx`: Recover the treatment effects of interest.
#'
#' (mod_es = emfx(mod, type = "event")) # dynamic ATE a la an event study
#'
#' # Etc. Other aggregation type options are "simple" (the default), "group"
#' # and "calendar"
#'
#' # To visualize results, use the native plot method (see `?plot.emfx`)
#' plot(mod_es)
#'
#' # Notice that we don't get any pre-treatment effects with the default
#' # "notyet" treated control group. Switch to the "never" treated control
#' # group if you want this.
#' etwfe(
#' lemp ~ lpop, tvar = year, gvar = first.treat, data = mpdta,
#' vcov = ~countyreal,
#' cgroup = "never" ## <= use never treated group as control
#' ) |>
#' emfx("event") |>
#' plot()
#'
#' #
#' # Heterogeneous treatment effects
#' #
#'
#' # Example where we estimate heterogeneous treatment effects for counties
#' # within the 8 US Great Lake states (versus all other counties).
#'
#' gls = c("IL" = 17, "IN" = 18, "MI" = 26, "MN" = 27,
#' "NY" = 36, "OH" = 39, "PA" = 42, "WI" = 55)
#'
#' mpdta$gls = substr(mpdta$countyreal, 1, 2) %in% gls
#'
#' hmod = etwfe(
#' lemp ~ lpop, tvar = year, gvar = first.treat, data = mpdta,
#' vcov = ~countyreal,
#' xvar = gls ## <= het. TEs by gls
#' )
#'
#' # Heterogeneous ATEs (could also specify "event", etc.)
#'
#' emfx(hmod)
#'
#' # To test whether the ATEs across these two groups (non-GLS vs GLS) are
#' # statistically different, simply pass an appropriate "hypothesis" argument.
#'
#' emfx(hmod, hypothesis = "b1 = b2")
#'
#' plot(emfx(hmod))
#'
#' #
#' # Nonlinear model (distribution / link) families
#' #
#'
#' # Poisson example
#'
#' mpdta$emp = exp(mpdta$lemp)
#'
#' etwfe(
#' emp ~ lpop, tvar = year, gvar = first.treat, data = mpdta,
#' vcov = ~countyreal,
#' family = "poisson" ## <= family arg for nonlinear options
#' ) |>
#' emfx("event")
#' }
#'
#' @export
etwfe = function(
fml = NULL,
tvar = NULL,
gvar = NULL,
data = NULL,
ivar = NULL,
xvar = NULL,
tref = NULL,
gref = NULL,
cgroup = c("notyet", "never"),
fe = c("vs", "feo", "none"),
family = NULL,
...
) {
cgroup = match.arg(cgroup)
fe = match.arg(fe)
rhs = ctrls = vs = ref_string = xvar_dm_df = ctrls_fml_vars = xvar_fml_vars = NULL
gref_min_flag = FALSE
if (is.null(fml)) stop("A non-NULL `fml` argument is required.\n")
if (is.null(data)) stop("A non-NULL `data` argument is required.\n")
data = as.data.frame(data)
## NSE ----
nl = as.list(seq_along(data))
names(nl) = names(data)
tvar = eval(substitute(tvar), nl, parent.frame())
if (is.numeric(tvar)) tvar = names(data)[tvar]
gvar = eval(substitute(gvar), nl, parent.frame())
if (is.numeric(gvar)) gvar = names(data)[gvar]
ivar = eval(substitute(ivar), nl, parent.frame())
if (is.numeric(ivar)) ivar = names(data)[ivar]
xvar = eval(substitute(xvar), nl, parent.frame())
if (is.numeric(xvar)) xvar = names(data)[xvar]
if (is.null(gvar)) stop("A non-NULL `gvar` argument is required.\n")
if (is.null(tvar)) stop("A non-NULL `tvar` argument is required.\n")
if (!is.null(family)) ivar = NULL
if ("group" %in% c(xvar, gvar, tvar, ivar)) {
stop(
"A variable called 'group' was detected in your model formula.",
"This is a reserved name within etwfe.",
"Please rename the 'group' column in your dataset, or use a different variable (a copy of 'group' is fine)."
)
}
fml_paste = paste(fml)
lhs = fml_paste[2]
ctrls = fml_paste[3]
if ("group" %in% c(xvar, gvar, tvar, ivar, lhs, ctrls)) {
stop(
"A variable called 'group' was detected in your model formula.",
"This is a reserved name within etwfe.",
"Please rename the 'group' column in your dataset, or use a different variable (a copy of 'group' is fine)."
)
}
if (length(ctrls) == 0) {
ctrls = NULL
} else if (ctrls %in% c("0", "1")) {
ctrls = NULL
} else {
ctrls_dm = unique(paste0(strsplit(ctrls, " \\+ | \\* | \\: ")[[1]], "_dm"))
if (fe == "vs") {
vs = paste0("[", gsub(" \\+", ",", ctrls), "]") ## For varying slopes later
}
}
if (is.null(gref)) {
ug = unique(data[[gvar]])
ut = unique(data[[tvar]])
gref = ug[ug > max(ut)]
if (length(gref)==0) gref = ug[ug < min(ut)]
if (length(gref)==0 && cgroup=="notyet") gref = max(ug)
if (length(gref)==0) {
stop("The '", cgroup,"' control group for ", gvar, " could not be identified. You can provide a bespoke group reference level via the `gref` argument.\n")
}
if (length(gref) > 1) {
gref = min(gref) ## placeholder. could do something a bit smarter here like bin post periods.
## also: what about NA vals?
}
if (gref < min(ut)) gref_min_flag = TRUE
} else {
# Sanity check proposed gref level
if (!(gref %in% unique(data[[gvar]]))) {
stop("Proposed reference level ", gref, " not found in ", gvar, ".\n")
}
if (gref < min(unique(data[[tvar]]))) gref_min_flag = TRUE
}
ref_string = paste0(", ref = ", gref)
if (is.null(tref)) {
tref = min(data[[tvar]], na.rm = TRUE)
} else if (!(tref %in% unique(data[[tvar]]))) {
stop("Proposed reference level ", tref, " not found in ", tvar, ".\n")
}
if (length(tref) > 1) {
tref = min(tref, na.rm = TRUE) ## placeholder. could do something a bit smarter here like bin post periods.
## also: what about NA vals?
}
if (cgroup == "notyet") {
ref_string = paste0(ref_string, ", ref2 = ", tref)
data[[".Dtreat"]] = data[[tvar]] >= data[[gvar]] & data[[gvar]] != gref
if (!gref_min_flag) {
data[[".Dtreat"]] = ifelse(data[[tvar]] < gref, data[[".Dtreat"]], NA)
} else {
data[[".Dtreat"]] = ifelse(data[[tvar]] > gref, data[[".Dtreat"]], NA)
}
} else {
## Placeholder .Dtreat for never treated group
# data[[".Dtreat"]] = TRUE
## Force reference group to be the year before treatment if cgroup == "never"
data[[".Dtreat"]] = data[[tvar]] != data[[gvar]] - 1L
}
rhs = paste0(".Dtreat : ", rhs)
rhs = paste0(rhs, "i(", gvar, ", i.", tvar, ref_string, ")")
## Demean and interact controls ----
if (!is.null(ctrls)) {
dm_fml = reformulate(gvar, response = ctrls)
ctrls_dm_df = demean(dm_fml, data = data, as.matrix = FALSE)
ctrls_dm_df = setNames(ctrls_dm_df, ctrls_dm)
data = cbind(data, ctrls_dm_df)
if (length(ctrls_dm) > 1) {
ctrls_fml_vars = paste("(", paste(ctrls_dm, collapse = " + "), ")")
} else {
ctrls_fml_vars = paste(ctrls_dm)
}
rhs = paste(rhs, "/", ctrls_fml_vars)
if (fe != "vs") {
ictrls = strsplit(ctrls, split = " \\+ ")[[1]]
ictrls = paste(
c(
ctrls,
paste(paste0("i(", gvar, ", ", ictrls, ", ref = ", gref, ")"), collapse = " + "),
paste(paste0("i(", tvar, ", ", ictrls, ", ref = ", tref, ")"), collapse = " + ")
),
collapse = " + "
)
rhs = paste(rhs, "+", ictrls)
}
}
## Demean the interacted covariate (for heterogeneous ATEs) ----
if (!is.null(xvar)) {
data$.Dtreated_cohort = ifelse(data[[gvar]] != gref & !is.na(data[[gvar]]), 1, 0) # generate a treatment-dummy
xvar_dm_fml = reformulate(gvar, response = xvar)
xvar_dm_df = demean(xvar_dm_fml, data = data, weights = data$.Dtreated_cohort, as.matrix = FALSE) # weights: only use the treated cohorts (units) to demean
if (length(xvar)==ncol(xvar_dm_df)){
xvar_dm_df = setNames(xvar_dm_df, paste0(xvar, "_dm")) # give a name
xvar_fml_vars = paste0(xvar, "_dm")
} else {
names(xvar_dm_df) = paste0(names(xvar_dm_df), "_dm") # give a name
xvar_fml_vars = paste0("(",paste(names(xvar_dm_df), collapse = "+"), ")")
}
data = cbind(data, xvar_dm_df)
if (is.null(ctrls)) {
rhs = paste0(
rhs,
" / ", xvar_fml_vars,
" + i(", tvar, ", ", xvar_fml_vars, ", ref = ", tref, ")"
)
# splice together with ctrl vars if necessary
} else {
rhs = paste0(
gsub(
ctrls_fml_vars,
paste0("(", ctrls_fml_vars, "+", xvar_fml_vars, ")"),
rhs
),
" + i(", tvar, ", ", xvar_fml_vars, ", ref = ", tref, ")"
)
}
}
## Fixed effects ----
if (fe != "none") {
if (is.null(ivar)) {
fes = reformulate(paste0(c(gvar, tvar), vs))
} else {
fes = reformulate(paste0(c(ivar, tvar), vs))
}
fes = paste(fes)[2]
} else {
fes = 0
rhs = paste0(
rhs,
"+ i(", gvar, ", ref = ", gref, ") + i(", tvar, ", ref = ", tref, ")"
)
}
## Estimation ----
## Formula
if( !is.null(xvar) ) {# Formula with interaction
# one could add gvar:xvar, but the result is equivalent
Fml = as.Formula(paste(
lhs, " ~ ", rhs, "|", fes
))
} else {# formula without interaction
Fml = as.Formula(paste(lhs, " ~ ", rhs, "|", fes))
}
## Estimate
if (is.null(family)) {
est = feols(Fml, data = data, notes = FALSE, ...)
} else {
est = feglm(Fml, data = data, notes = FALSE, family = family, ...)
}
# catch for offset if/when passing to emfx later
# hacky but works (i.e., overcomes the xpd / envir mismatch)
if (!is.null(est$call$offset) && !is.null(est$model_info$offset)) {
est$call$offset = reformulate(est$model_info$offset)
}
## Overload class and new attributes (for post-estimation) ----
class(est) = c("etwfe", class(est))
attr(est, "etwfe") = list(
yvar = lhs,
gvar = gvar,
tvar = tvar,
xvar = xvar,
ivar = ivar,
gref = gref,
tref = tref,
cgroup = cgroup
)
## Return ----
return(est)
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.