required <- c("clubSandwich", "geepack")
do_eval <- all(sapply(required, requireNamespace, quietly = TRUE))
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "",
  message = FALSE,
  eval = do_eval
)
The panelr package contributes two categories of things: the panel_data
object, along with some tools to create and manipulate it, and regression
modeling functions for panel data, which the later parts of this document cover.
panel_data frames
Check out the other vignette for a lot of detail on how to take your raw data
and reshape it into a panel_data format. Here's a short version, using some
example data provided by this package.
library(panelr)
data("teen_poverty")
teen_poverty
These data come from a subset of young women surveyed as part of the
National Longitudinal Survey of Youth starting in 1979. The teen_poverty
data come in "wide" format, meaning there is one row per respondent and
each of the repeated measures is in a separate column for each wave.
We need to convert this to "long" format, in which you have one row for each
respondent in each wave of the 5-wave survey. We'll use long_panel() for that.
teen <- long_panel(teen_poverty, begin = 1, end = 5, label_location = "end")
teen
Now we have a panel_data object! It is a special version of a tibble, which is
itself a special kind of data.frame. panel_data objects work very hard to make
sure you never accidentally drop the variables that are the identifiers for
each respondent and the indicators for which wave the row corresponds to.
panel_data objects also try to stay in order by ID and wave.
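To make the wide versus long distinction concrete without relying on panelr, here is a minimal base R sketch. The toy data frame and its column names (id, pov1 to pov3) are invented for illustration, and stats::reshape() is only a stand-in here, not what long_panel() does internally.

```r
# Hypothetical toy data in "wide" format: one row per respondent,
# one column per wave for the repeated measure `pov`.
wide <- data.frame(
  id   = 1:3,
  pov1 = c(0, 1, 0),
  pov2 = c(1, 1, 0),
  pov3 = c(0, 0, 1)
)

# Base R reshape() to "long" format: one row per respondent per wave.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("pov1", "pov2", "pov3"),
  v.names   = "pov",
  timevar   = "wave",
  times     = 1:3,
  idvar     = "id"
)

# Sort by ID and wave, the ordering that panel_data objects try to maintain.
long <- long[order(long$id, long$wave), ]
rownames(long) <- NULL
long
```

The long version has 3 respondents times 3 waves = 9 rows, with a single pov column and a wave indicator.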
Note that if your raw data are already in long format, you can use the
panel_data() function to convert them to panel_data format.
data("WageData")
wages <- panel_data(WageData, id = id, wave = t)
panel_data frames are designed to work with tidyverse packages, particularly
dplyr. When used inside mutate(), functions like lag() work properly by taking
the previous value for the specific respondent. If you ever need to do
something that is easier to do with a "regular" data frame, you can just use
the unpanel() function to convert the panel_data frame back to normal.
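What "taking the previous value for the specific respondent" means can be sketched in base R. The data frame below is hypothetical, and ave() is used only to mimic the per-respondent behavior described above; it is not how panelr implements it.

```r
# Hypothetical long-format panel: 2 respondents, 3 waves each.
d <- data.frame(
  id   = c(1, 1, 1, 2, 2, 2),
  wave = c(1, 2, 3, 1, 2, 3),
  x    = c(10, 11, 12, 20, 21, 22)
)
d <- d[order(d$id, d$wave), ]

# Naive lag: shifts the whole column, so respondent 2's first wave
# wrongly inherits respondent 1's last value.
d$x_lag_naive <- c(NA, head(d$x, -1))

# Panel-aware lag: shift within each respondent, so the first wave of
# every respondent is NA.
d$x_lag_panel <- ave(d$x, d$id, FUN = function(v) c(NA, head(v, -1)))
d
```

In row 4 (respondent 2, wave 1) the naive lag carries over the value 12 from respondent 1, while the panel-aware lag is missing, as it should be.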
The original motivation to create this package was to automate the process of fitting "within-between" models, sometimes called "between-within" or "hybrid" models (see Allison, 2009; Bell & Jones, 2015). These combine the benefits of what econometricians call "fixed effects" models (robustness to time-invariant confounding chief among them) with those of what they call "random effects" models, which allow the inclusion of time-invariant predictors. Within-between models yield coefficients for the time-varying predictors that are identical to their fixed effects equivalents, while keeping the flexibility to include random effects and time-invariant predictors (this was noticed by Mundlak, 1978). They are fit via multilevel models, which allow for some other nice possibilities like random slopes and generalized linear model specifications.
From here, I'll give a somewhat technical description of these models. If you just want to look at how to estimate them in R, skip ahead to the next mini-section.
Note that fixed effects models can be fit using individual demeaning. That is, you can subtract the entity's own mean for each predictor and the dependent variable and fit a model via OLS that is equivalent to the so-called least squares dummy variable approach (in which dummy variables for every entity ID are included as predictors).
Let's get a bit more technical. We have entities $i = 1, ..., n$ who are measured at times $t = 1, ..., T$. Our dependent variable is $y_{it}$, the variable $y$ for individual $i$ at time $t$. We have predictors that vary over time, $x_{it}$; variables that do not vary over time, $z_i$; variables we did not measure that do not vary over time, $\alpha_i$; and random error, $\epsilon_{it}$.
The fixed effects model, then, looks like this:
$$ y_{it} = \mu_t + \beta_1x_{it} + \gamma z_i + \alpha_i + \epsilon_{it} $$
Although $\alpha_i$ is not observed, it can be estimated by including a dummy variable for each $i$. The $\gamma$ is undefined because the $z_i$ are perfectly collinear with the $\alpha_i$ dummy variables.
The individual-mean-centered version of the fixed effects model is based on calculating the mean of $y$ and $x$ for each $i$, giving $\bar{y_i}$ and $\bar{x_i}$, and subtracting those means from each $y_{it}$ and $x_{it}$. The model can be expressed like this, including $\bar{z_i}$ and $\bar{\alpha_i}$ for demonstration:
$$ y_{it} - \bar{y_i} = \mu_t + \beta_1(x_{it} - \bar{x_i}) + (z_i - \bar{z_i} = 0) + (\alpha_i - \bar{\alpha_i} = 0) + (\epsilon_{it} - \bar{\epsilon_i}) $$
By de-meaning everything, all the time-invariant variables drop out:
$$ y_{it} - \bar{y_i} = \mu_t + \beta_1(x_{it} - \bar{x_i}) + (\epsilon_{it} - \bar{\epsilon_i}) $$
This is often called the "within" estimator. You can take these de-meaned variables and fit an OLS regression and get valid estimates (with some adjustments to the standard errors).
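As a rough check of the equivalence between individual demeaning and the least squares dummy variable approach, here is a base R sketch on simulated data. The data-generating values are arbitrary assumptions, and the period effects $\mu_t$ are left out to keep it short.

```r
set.seed(1)
n   <- 30                              # entities
n_t <- 5                               # time points per entity
id  <- rep(seq_len(n), each = n_t)

alpha <- rnorm(n)[id]                  # unobserved time-invariant heterogeneity
x     <- 0.5 * alpha + rnorm(n * n_t)  # x is correlated with alpha
y     <- 2 * x + alpha + rnorm(n * n_t)

# Within estimator: subtract each entity's own mean, then fit OLS.
x_w <- x - ave(x, id)
y_w <- y - ave(y, id)
within_fit <- lm(y_w ~ x_w)

# LSDV: include a dummy variable for every entity ID.
lsdv_fit <- lm(y ~ x + factor(id))

c(within = unname(coef(within_fit)["x_w"]),
  lsdv   = unname(coef(lsdv_fit)["x"]))
```

The two slopes agree to numerical precision. The standard errors reported by lm() for the demeaned fit are not adjusted for the degrees of freedom used up by the entity means, which is the adjustment mentioned above.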
You can also do something slightly different and get the same results with multilevel models. Take this, for example:
$$ y_{it} = \beta_{0i} + \beta_1(x_{it} - \bar{x_i}) + (\epsilon_{it} - \bar{\epsilon_i}) $$
Where $\beta_{0i}$ is a random intercept estimated for each $i$. This is equivalent to subtracting $\bar{y_i}$ in terms of the estimation of $\beta_1$. But in the multilevel modeling framework, we can include those time-invariant $z_i$ as well. Conceptually, they are basically being included in a model predicting $\beta_{0i}$:
$$ \beta_{0i} = \beta_0 + \gamma z_i + u_{0i} $$
Where $u_{0i}$ is the random error of the model predicting $\beta_{0i}$.
In fact, we can include the $\bar{x_i}$ in our multilevel model as well and they are used just like the $z_i$:
$$ \beta_{0i} = \beta_0 + \beta_2 \bar{x_i} + \gamma z_i + u_{0i} $$
Now we can substitute into the previous multilevel equation and we have our within-between model:
$$
y_{it} = \beta_{0} + \beta_1(x_{it} - \bar{x_i})
+ \beta_2 \bar{x_i} + \gamma z_i + u_{0i} + \epsilon_{it}
$$
The $\beta_1$ has the same interpretation as in the fixed effects model: it is the effect of within-entity deviations of $x$ on within-entity deviations of $y$. The $\beta_2$ is basically predicting the $\bar{y_i}$, however, so these coefficients are helpful for predicting differences in mean levels across entities. The same is true for the $z_i$.
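The claim that $\beta_1$ matches the fixed effects estimate can be sketched with simulated data in base R. Note the assumption: pooled OLS via lm() stands in for the multilevel fit here, so only the point estimate of the within term is being compared, not the standard errors or the random intercept variance. All variable names and parameter values are made up for the illustration.

```r
set.seed(2)
n   <- 40
n_t <- 4
id  <- rep(seq_len(n), each = n_t)

z     <- rbinom(n, 1, 0.5)[id]          # time-invariant predictor
alpha <- rnorm(n)[id]                   # unobserved heterogeneity
x     <- 0.7 * alpha + rnorm(n * n_t)   # time-varying predictor
y     <- 1.5 * x + 0.5 * z + alpha + rnorm(n * n_t)

x_mean <- ave(x, id)    # between-entity component, x-bar_i
x_dev  <- x - x_mean    # within-entity component, x_it - x-bar_i

# Fixed effects ("within") estimate from demeaned y and x.
fe <- lm(I(y - ave(y, id)) ~ x_dev)

# Within-between specification: demeaned x, its entity mean, and z.
wb <- lm(y ~ x_dev + x_mean + z)

c(fixed_effects = unname(coef(fe)["x_dev"]),
  within_between = unname(coef(wb)["x_dev"]))
```

The within coefficient is identical in both fits because the demeaned x_dev is orthogonal to anything that is constant within an entity, including x_mean and z.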
A similar model is what I call the "contextual" model, because this is how it is often interpreted (see, e.g., Raudenbush & Bryk, 2002). Here we do not demean the $x_{it}$:
$$ y_{it} = \beta_{0} + \beta_1 x_{it} + \beta_2 \bar{x_i} + \gamma z_i + u_{0i} + \epsilon_{it} $$
Believe it or not, the $\beta_1$ is unchanged in this model; it is the $\beta_2$ that changes. The interpretation of $\beta_2$ becomes the difference between the within- and between-entity effects. A significant coefficient for $\beta_2$ means significant differences between the within- and between-entity effects. For those who are familiar, this is like a variable-by-variable Hausman test. Substantively, $\beta_2$ is often interpreted as a contextual effect.
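The relationship between the two parameterizations can be checked numerically. This is a sketch with simulated data that again uses pooled OLS via lm() as a stand-in for the multilevel fit, so it illustrates only the algebra of the coefficients; the names and parameter values are assumptions.

```r
set.seed(3)
n   <- 40
n_t <- 4
id  <- rep(seq_len(n), each = n_t)

alpha <- rnorm(n)[id]
x     <- 0.7 * alpha + rnorm(n * n_t)
y     <- 1.5 * x + alpha + rnorm(n * n_t)

x_mean <- ave(x, id)
x_dev  <- x - x_mean

wb  <- lm(y ~ x_dev + x_mean)   # within-between: demeaned x plus entity mean
ctx <- lm(y ~ x + x_mean)       # contextual: raw x plus entity mean

within_wb  <- unname(coef(wb)["x_dev"])
between_wb <- unname(coef(wb)["x_mean"])

# The coefficient on x is unchanged; the coefficient on the mean becomes
# the between effect minus the within effect.
c(within_wb      = within_wb,
  within_ctx     = unname(coef(ctx)["x"]),
  contextual     = unname(coef(ctx)["x_mean"]),
  between_within = between_wb - within_wb)
```

Because x equals x_dev plus x_mean, both models span the same column space, which is why the contextual coefficient is exactly the between estimate minus the within estimate in this OLS setting.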
From this framework, we can do cross-level interactions, random slopes, generalized linear models, and all kinds of interesting stuff.
In the fixed effects framework, it is generally considered wrong to operationalize an interaction between two time-varying variables (let's call them $w$ and $x$) by taking the product of their individual-demeaned forms. That is, you are not supposed to generate the interaction term $xw_{it}$ by doing this:
$$ xw_{it} = (x_{it} - \bar{x_{i}}) \times (w_{it} - \bar{w_i}) $$
Instead, the conventional wisdom goes, you should first take the product of the observed variables and subtract the individual-level mean of that product, like so:
$$ xw_{it} = x_{it}w_{it} - \overline{xw}_i $$
Where $\overline{xw}_i$ can also be expressed as $\frac{\sum_{t=1}^{T_i}{x_{it}w_{it}}}{T_i}$, the sum of all products for each $i$ divided by the number of time points for each $i$, $T_i$.
Giesselmann and Schmidt-Catran (2020) show that this conventional method for generating $xw_{it}$ does not have the unbiasedness that the individual terms do. I'll leave it to them to explain why exactly this is, but the solution is to start with the first, wrong version of $xw_{it}$, which I'll call $xw_{it}^*$, and subtract its mean too:
$$
\begin{aligned}
xw_{it}^* &= (x_{it} - \bar{x_{i}}) \times (w_{it} - \bar{w_i}) \\
xw_{it} &= xw_{it}^* - \overline{xw_i^*}
\end{aligned}
$$
I call this the "double-demeaning" approach to interactions, in contrast to
the one-time demeaning in the conventional approach. By default, wbm()
calculates interactions via the double-demeaning method. You can change this
via the interaction.style argument if you need your results to match other
software.
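The arithmetic behind the two ways of building the interaction term can be sketched in base R. The simulated x and w are placeholders, and this only reproduces the formulas given above; it is not a claim about how wbm() computes the terms internally.

```r
set.seed(4)
id <- rep(1:3, each = 4)   # 3 entities, 4 time points each
x  <- rnorm(12)
w  <- rnorm(12)

x_dev <- x - ave(x, id)    # x_it - x-bar_i
w_dev <- w - ave(w, id)    # w_it - w-bar_i

# Product of the demeaned variables (the "first, wrong" version).
xw_star <- x_dev * w_dev

# Conventional approach: demean the product of the observed variables.
xw_conventional <- x * w - ave(x * w, id)

# Double-demeaning: demean the product of the demeaned variables.
xw_double <- xw_star - ave(xw_star, id)

# Entity-level means of each version of the interaction term.
round(cbind(
  star         = tapply(xw_star, id, mean),
  conventional = tapply(xw_conventional, id, mean),
  double       = tapply(xw_double, id, mean)
), 10)
```

Both the conventional and double-demeaned terms have zero mean within each entity, while the raw product of demeaned variables generally does not; the two zero-mean versions still differ from each other observation by observation.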
The workhorse function for within-between models is wbm(), which is built on
top of lme4's lmer() and glmer(). It is not so hard to understand how to treat
your data to estimate within-between models, but the programming can be a
challenge to those who aren't skilled with R (or whatever else they might use)
and is error-prone in any case.
The main thing to know in order to use wbm() is how the model formula
works, because it's a little different from your typical regression model.
It is split into up to 3 parts, each for a different kind of variable. Each
part is separated by a |. The pattern is like this:
dependent ~ time_varying | time_invariant | cross_lev_interactions + (random_slopes | id)
So you start with your dependent variable on the left-hand side like normal and
then what comes next are variables that vary over time. You will only get
within-entity estimates for these variables. Next are time-invariant variables;
the between-entity terms for the time-varying variables are added automatically
so no need to try to include them here. Finally, in the third part you can
specify cross-level interactions (i.e., within-entity by
between-entity/time-invariant) as well as additional random effects terms
using the lme4-style syntax. By default, (1 | id) (or whatever the ID
variable is) is added internally for a random intercept so you do not need to
include it yourself.
Let's walk through an example with the wages data we looked at briefly
earlier. We'll predict the logarithm of wages (lwage) using weeks worked
(wks), union membership (union), marital status (ms), blue (vs. white) collar
job status (occ), black race (blk), and female sex (fem).
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages)
summary(model)
As you can see, the output distinguishes within- and between-entity effects.
When you see imean() around a variable, that is the between-entity effect
represented as the individual mean.
Here, we see there seems to be a wage penalty for switching from white collar
to blue collar work (occ) and although married people earn more (imean(ms)),
just becoming married (ms) coincides with a drop in earnings. We also see a
boost in earnings from joining a union (union).
Maybe we think the timing of the marriage effect is off and the true effect
occurs the time period after a person becomes married. We can ask for the
lagged effect using lag().
model <- wbm(lwage ~ wks + union + lag(ms) + occ | blk + fem, data = wages)
summary(model)
Well, that doesn't change the direction of the estimate, but it moves it close enough to 0 that we can't say much about it one way or another.
Keep in mind that you do not have to stick to linear models. Using the family
argument (just like glm()), you can estimate logit (family = binomial),
probit (family = binomial(link = "probit")), poisson (family = poisson), or
other model families and links as needed.
Now maybe we want to include an effect of time since wages tend to go up for
everyone, on average, over time. We can just include the time variable in the
formula or set use.wave to TRUE.
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages,
             use.wave = TRUE)
summary(model)
Including t wipes out some of those previously observed effects. Believe it
or not, we just fit a growth curve model!
Now, we might think people have different trajectories. We can include that as a random slope, which will go in the third part of the formula.
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem | (t | id),
             use.wave = TRUE, data = wages)
summary(model)
And now we have a latent growth curve model. The general effect on the other
coefficients is more uncertainty and attenuated estimates. It's worth
keeping in mind that it is sometimes wrong to use a growth curve model like this
if you think the variables in your model cause the time trend; if you think
wages are going up because more people are moving into white collar work, then
including the growth curve will make it harder for you to see the true effect
of occ.
By default, wbm() does as the name suggests. But if you'd rather have the
contextual model described earlier, in which the means are not subtracted from
the time-varying variables, that's an option too.
model <- wbm(lwage ~ wks + union + ms + occ | blk + fem, data = wages,
             model = "contextual")
summary(model)
Now the individual means have a new interpretation as the difference in effect compared to the within-entity estimates.
If you don't want to use any of the time-invariant variables, you can also just ask for the "within" estimator:
model <- wbm(lwage ~ wks + union + ms + occ, data = wages, model = "within")
summary(model)
This can help declutter your output when you really just don't care about the between-subjects effects.
You don't have to estimate these models using multilevel models and in fact you may get better inferences by avoiding some of the assumptions inherent to multilevel modeling (see McNeish, 2019). You can use the semiparametric generalized estimating equations (GEE) approach to estimation, with the main tradeoff being that you can no longer use random slopes or anything like that. But if you only care about the average effects across all entities, GEE can be a better approach that doesn't require you to be right about the distribution of effects and several other assumptions.
wbgee() builds on geeglm() from the geepack package and works just like wbm().
model <- wbgee(lwage ~ wks + union + ms + occ | blk + fem, data = wages)
summary(model)
This gives us more conservative estimates, in general. Note that by default,
wbgee() uses an AR-1 working error correlation structure in estimation.
This makes sense in general but at times it may make sense to use
"exchangeable" as the argument to cor.str, which assumes all within-entity
correlations are equal regardless of time lag. Other options include
"unstructured", which can be very computationally intensive, and "independence",
assuming no correlation within entities.
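To visualize what these working correlation structures assume, here is a small base R sketch that builds the implied within-entity correlation matrices for 4 waves. The correlation value rho = 0.5 is an arbitrary assumption, and this is an illustration of the structures themselves, not of geepack's estimation code.

```r
rho     <- 0.5   # assumed correlation parameter, for illustration only
n_waves <- 4

# Absolute time lag between every pair of waves.
lag_mat <- abs(outer(seq_len(n_waves), seq_len(n_waves), "-"))

# AR-1: correlation decays geometrically with the time lag.
ar1 <- rho^lag_mat

# Exchangeable: the same correlation for every pair of distinct waves.
exchangeable <- ifelse(lag_mat == 0, 1, rho)

# Independence: no correlation within entities.
independence <- diag(n_waves)

ar1
exchangeable
```

Under AR-1 the correlation between waves 1 and 4 is rho cubed (0.125 here), while under the exchangeable structure it stays at rho (0.5) no matter how far apart the waves are.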
Like wbm(), you can do generalized linear models via the family argument.
It is for these generalized linear models that GEEs are likely to stand out the
most in terms of added benefit above and beyond the multilevel models, although
this is not a well-tested question to my knowledge.
Sometimes, theory may suggest that increases in a variable have a different effect than decreases in a variable. For instance, getting married and getting divorced are probably not equivalent (in the sense that one is the exact opposite of the other) in their effects on other outcomes. Allison (2019) described a method for estimating models with asymmetric effects based on first differences.
First, you take first differences:
$$ y_{it} - y_{it-1} = (\mu_t - \mu_{t-1}) + \beta(x_{it} - x_{it -1}) + (\epsilon_{it} - \epsilon_{it-1}) $$
We need a slightly different model for asymmetric effects in which we decompose the differences into positive and negative variables.
Our asymmetric effects model will be:
$$ y_{it} - y_{it-1} = (\mu_t - \mu_{t-1}) + \beta^+x_{it}^+ + \beta^-x_{it}^- + (\epsilon_{it} - \epsilon_{it-1}) $$
Where
$$
\begin{aligned}
x_{it}^+ &= x_{it} - x_{it-1} \text{ if } (x_{it} - x_{it-1}) > 0, \text{ otherwise } 0 \\
x_{it}^- &= -(x_{it} - x_{it-1}) \text{ if } (x_{it} - x_{it-1}) < 0, \text{ otherwise } 0
\end{aligned}
$$
In other words, if the difference is positive, it becomes part of the $x_{it}^+$ and if it is negative, it is multiplied by -1 to be made positive and is made part of the $x_{it}^-$ variable. If the effects are symmetric, $\beta^+ = -\beta^-$.
After fitting the model via GLS, we can then do a test of the contrasts of the $\beta^+$ and $\beta^-$ coefficients as a formal way to assess the presence of asymmetric effects.
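The decomposition into positive and negative changes can be sketched in base R. The toy panel below is invented, the outcome is built to have a perfectly symmetric effect with no noise, and plain OLS via lm() is used instead of GLS purely to keep the sketch self-contained.

```r
# Hypothetical panel: 2 entities, 4 waves, binary predictor x.
d <- data.frame(
  id   = rep(1:2, each = 4),
  wave = rep(1:4, times = 2),
  x    = c(0, 1, 1, 0,  1, 0, 1, 1)
)
d$y <- 3 + 2 * d$x   # symmetric effect of 2 by construction, no noise

# First differences within each entity (first wave has no difference).
first_diff <- function(v) c(NA, diff(v))
d$dx <- ave(d$x, d$id, FUN = first_diff)
d$dy <- ave(d$y, d$id, FUN = first_diff)

# Split the differences into positive and negative parts, both coded >= 0.
d$x_plus  <- pmax(d$dx, 0)    # increases in x
d$x_minus <- pmax(-d$dx, 0)   # decreases in x, multiplied by -1

# OLS on the differenced data (rows with NA differences are dropped).
fit <- lm(dy ~ x_plus + x_minus, data = d)
coef(fit)
```

Because the simulated effect is symmetric, the coefficient on x_plus is 2 and the coefficient on x_minus is -2, matching the condition $\beta^+ = -\beta^-$.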
Here's how it works with the panelr function, asym().
model <- asym(lwage ~ ms + occ + union + wks, data = wages)
summary(model)
As you can see, in a model comparable to our within-between model from earlier, the effects seem quite symmetric.
Let's look at the teen data from earlier, where spouse indicates whether
the respondent is living with a spouse, inschool indicates whether the
respondent is enrolled in school, and hours is the hours worked in the week
of the survey.
summary(asym(hours ~ spouse + inschool, data = teen))
Here we see an asymmetric effect of marriage: gaining a spouse corresponds with fewer hours worked, but there's no effect on work hours when a spouse is lost. You can see in the lower table that this difference in coefficients is associated with a fairly low p value. There is only weak evidence of an asymmetric effect for entering/leaving school.
The downside to the first differences method is that it does not generalize to non-continuous dependent variables — you can't run a logit model with a differenced binary outcome. Allison (2019) showed that you can do a modified form for such situations.
Instead of including the $x_{it}^+$ and $x_{it}^-$ as predictors, you create new variables $z_{it}^+$ and $z_{it}^-$ that are the cumulative sums of the positive and negative differences, respectively, up to and including time $t$.
$$
\begin{aligned}
z_{it}^+ &= \sum_{s = 1}^{t}{x_{is}^+} \\
z_{it}^- &= \sum_{s = 1}^{t}{x_{is}^-}
\end{aligned}
$$
Note that at $t = 1$, both are set to 0. I'll leave the details as to why this works to the manuscript, but he shows that we're left with the following equation:
$$ y_{it} = \mu_t + \beta^+ z_{it}^+ + \beta^-z_{it}^- + \alpha_i + \epsilon_{it} $$
So we can treat this like a fixed effects model in which we just need to address the $\alpha_i$. For situations like this that call for a conditional logit, as Allison used in his paper, another option is the GEE with logit link.
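The cumulative variables can be built in a few lines of base R. This sketch uses an invented toy panel and follows the definitions above, with the first wave's difference set to 0; it is not taken from the panelr source.

```r
# Hypothetical panel: 2 entities, 4 waves, binary predictor x.
d <- data.frame(
  id   = rep(1:2, each = 4),
  wave = rep(1:4, times = 2),
  x    = c(0, 1, 1, 0,  1, 0, 1, 1)
)

# Within-entity differences, with the first wave set to 0.
dx <- ave(d$x, d$id, FUN = function(v) c(0, diff(v)))

# Running totals of the increases and of the (sign-flipped) decreases.
d$z_plus  <- ave(pmax(dx, 0), d$id, FUN = cumsum)
d$z_minus <- ave(pmax(-dx, 0), d$id, FUN = cumsum)
d
```

A useful sanity check is that z_plus minus z_minus equals the change in x since each entity's first wave, so the two cumulative variables together carry the same information as x but let increases and decreases have separate coefficients.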
Let's try with the teen data, which also appears in Allison (2019). Here our
outcome variable is pov, poverty, and there's a new predictor, mother, an
indicator for whether the respondent has ever had any children.
model <- asym_gee(pov ~ mother + spouse + inschool + hours, data = teen,
                  family = binomial(link = "logit"), use.wave = TRUE,
                  wave.factor = TRUE)
summary(model)
The results are broadly similar in terms of coefficient estimates to those
obtained by Allison. Unlike Allison, we do not have good evidence of an
asymmetric effect in the case of spouse but we do have one in the case of
hours. Note that mother never goes down so the negative version of this
variable is dropped from the model with a message. To match Allison, I also
used use.wave to include the wave variable and wave.factor to make it
a factor variable.
Allison, P. D. (2009). Fixed effects regression models. Thousand Oaks, CA: SAGE Publications. https://doi.org/10.4135/9781412993869.d33
Allison, P. D. (2019). Asymmetric fixed-effects models for panel data. Socius, 5, 1–12. https://doi.org/10.1177/2378023119826441
Bell, A., & Jones, K. (2015). Explaining fixed effects: Random effects modeling of time-series cross-sectional and panel data. Political Science Research and Methods, 3, 133–153. https://doi.org/10.1017/psrm.2014.7
Giesselmann, M., & Schmidt-Catran, A. W. (2020). Interactions in fixed effects regression models. Sociological Methods & Research, 1–28. https://doi.org/10.1177/0049124120914934
McNeish, D. (2019). Effect partitioning in cross-sectionally clustered data without multilevel models. Multivariate Behavioral Research, Advance online publication. https://doi.org/10.1080/00273171.2019.1602504