    knitr::opts_chunk$set(
      collapse = FALSE,
      comment = "",
      fig.path = "README-",
      message = FALSE
    )
    options("panelr.table.format" = "multiline")

This is an R package designed to aid in the analysis of panel data, designs in which the same group of respondents/entities is contacted/measured multiple times. panelr provides some useful infrastructure, like a panel_data object class, and automates some emerging methods for analyses of these data.

wbm() automates the "within-between" (also known as "between-within" and "hybrid") specification that combines the desirable aspects of both fixed effects and random effects econometric models and fits them using the lme4 package in the backend. Bayesian estimation of these models is supported by interfacing with the brms package (wbm_stan()), and GEE estimation is available via geepack (wbgee()).

It also automates the fairly new "asymmetric effects" specification described by Allison (2019) and supports estimation via GLS for linear asymmetric effects models (asym()) and via GEE for non-Gaussian models (asym_gee()).

panelr is now available via CRAN.

    install.packages("panelr")

panel_data frames

While not strictly required, the best way to start is to declare your data as panel data. I'll load the example data WageData to demonstrate.

    library(panelr)
    data("WageData")
    colnames(WageData)

The two key variables here are t and id. t is the wave of the survey that a row of the data refers to, while id identifies the survey respondent. This is a perfectly balanced data set, so there are 7 observations for each of the 595 respondents. We will use those two pieces of information to create a panel_data object.

    wages <- panel_data(WageData, id = id, wave = t)
    wages

We have to tell panel_data() which column refers to the unique identifiers for respondents/entities (the latter when you have something like countries or companies instead of people) and which column refers to the period/wave of data collection.

Note that the resulting panel_data object will remember which of the columns is the ID column and which is the wave column. It will also fight you a bit when you do things that might have the side effect of dropping those columns or putting them out of time order.

panel_data frames are modified tibbles (tibble package) that are grouped by entity (i.e., the ID column).

panel_data frames are meant to play nice with the tidyverse. Here's a quick sample of how a tidy workflow with panelr can work:

    library(dplyr)
    data("WageData")
    # Create `panel_data` object
    wages <- panel_data(WageData, id = id, wave = t) %>%
      # Pass to mutate, which will calculate statistics groupwise when appropriate
      mutate(
        wage = exp(lwage), # reverse transform the log wage variable
        mean_wage_individual = mean(wage), # means calculated separately by entity
        lag_wage = lag(wage) # mutate() will calculate lagged values correctly
      ) %>%
      # Use `panelr`'s complete_data() to filter for entities that have
      # enough observations
      complete_data(wage, union, min.waves = 5) %>% # drop if there aren't 5 completions
      # You can use unpanel() if you need to do rowwise or columnwise operations
      unpanel() %>%
      mutate(
        mean_wage_grand = mean(wage)
      ) %>%
      # You'll need to convert back to panel_data if you want to keep using panelr functions
      panel_data(id = id, wave = t)

wbm() --- the within-between model

Anyone can fit a within-between model without the use of this package, as it is just a particular specification of a multilevel model. With that said, doing so requires some programming and can be rather prone to error. Even in the best case, it is cumbersome and inefficient to create the necessary variables.

wbm() is the primary model-fitting function that you'll use from this package. It fits within-between models for you, utilizing lme4 as a backend for estimation.

A three-part model syntax is used that goes like this:

    dv ~ varying_variables | invariant_variables | cross_level_interactions/random effects

It works like a typical formula otherwise. The bars just tell panelr how to treat the variables. Note also that you can specify random slopes using lme4-style syntax in the third part of the formula as well. A random intercept for the ID variable is included by default and doesn't need to be specified in the formula.

Lagged variables are supported as well through the lag() function. Unlike base R, panelr lags the variables correctly --- wave 1 observations will have NA values for the lagged variable rather than taking the final wave value of the previous entity.

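To make the difference concrete, here is a minimal base-R sketch on a made-up two-entity data frame. The toy data and the lag_naive/lag_by_id column names are assumptions for illustration only, and the code does not show how panelr implements lag() internally. It contrasts a naive shift over the whole column with a shift done separately within each entity.

```r
# Toy long-format panel: 2 entities, 3 waves each, sorted by id then wave
toy <- data.frame(
  id   = rep(c("a", "b"), each = 3),
  wave = rep(1:3, times = 2),
  x    = c(1, 2, 3, 10, 20, 30)
)

# Naive shift over the whole column: entity b's wave 1 row inherits
# entity a's final-wave value
toy$lag_naive <- c(NA, head(toy$x, -1))

# Shift within each entity: the wave 1 row of every entity is NA
toy$lag_by_id <- ave(toy$x, toy$id, FUN = function(v) c(NA, head(v, -1)))

toy
```

Row 4 (entity b, wave 1) is where the two columns disagree: the naive lag carries over entity a's last value, while the within-entity lag is missing, as it should be.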
Here we will specify a model using the wages data. We will predict logged wages (lwage) using two time-varying variables --- lagged union membership (union) and contemporaneous weeks worked (wks) --- along with a time-invariant predictor, a binary indicator for black race (blk). For demonstrative purposes, we'll fit a random slope for lag(union) and a cross-level interaction between blk and wks.

    model <- wbm(lwage ~ lag(union) + wks | blk | blk * wks + (lag(union) | id), data = wages)
    summary(model)

Note that imean() is an internal function that calculates the individual-level mean, which represents the between-subjects effects of the time-varying predictors. The within effects are the time-varying predictors at the occasion level with the individual-level mean subtracted. If you want the model specified such that the occasion-level predictors do not have the mean subtracted, use the model = "contextual" argument. The "contextual" label refers to the way these terms are normally interpreted when the model is specified that way.

You may also use model = "between" to fit what econometricians call the random effects model, which does not disaggregate the within- and between-entity variation.

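As a rough illustration of the decomposition described above, the following base-R sketch builds the between and within components of a time-varying predictor by hand. The toy data and the x_between/x_within names are assumptions made for this example; wbm() creates its own versions of these terms for you.

```r
# Toy long-format panel: 2 entities, 3 occasions each
toy <- data.frame(
  id = rep(c("a", "b"), each = 3),
  x  = c(1, 2, 3, 10, 20, 30)
)

# Between component: the individual-level mean of x, repeated on every row
# of that individual (ave() applies mean() within each id by default)
toy$x_between <- ave(toy$x, toy$id)

# Within component: the occasion-level value minus the individual-level mean
toy$x_within <- toy$x - toy$x_between

toy
```

By construction the within component sums to zero inside each entity, so it carries only over-time variation, while the between component carries only differences across entities.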
widen_panel() and long_panel()

Two functions that should cover your bases for the tricky business of reshaping panel data are included. Sometimes, like for doing SEM-based analyses, you need your data in wide format --- i.e., one row per entity. widen_panel() makes that easy and should require minimal trial and error or thinking.

Perhaps more often, your raw data are already in wide format and you need to get it into long format to do cool stuff like use wbm(). That can be very tricky, but long_panel() (I didn't think lengthen_panel() or longen_panel() quite worked as names) should cover most situations. You tell it what the labels for periods are (e.g., does it range from 1 to 5, "A" to "E", or something else?), where they are located (before or after the variable's name?), and what kinds of formatting go before/after it. Check out the vignette for more details and some worked examples.

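For a quick feel of a round trip, here is a small sketch on made-up wide data. The data frame and its Q1_W1-style column names are assumptions for illustration; the argument names (prefix, begin, end, label_location, separator) are used as I understand them from the package documentation, so check ?long_panel and ?widen_panel before relying on them.

```r
library(panelr)

# Made-up wide data: one row per entity, Q1 measured in waves 1 to 3,
# with the wave label placed after the variable name ("Q1_W1", "Q1_W2", ...)
wide <- data.frame(
  id    = 1:3,
  Q1_W1 = c(1, 4, 7),
  Q1_W2 = c(2, 5, 8),
  Q1_W3 = c(3, 6, 9)
)

# Period labels run from 1 to 3, sit at the end of the variable name,
# and are preceded by "_W"
long <- long_panel(wide, id = "id", prefix = "_W", begin = 1, end = 3,
                   label_location = "end")
long

# And back to one row per entity
wide_again <- widen_panel(long, separator = "_")
wide_again
```

The long result should have one row per entity-wave combination (9 rows here), and widening it again should return to one row per entity.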
I'm happy to receive bug reports, suggestions, questions, and (most of all) contributions to fix problems and add features. I prefer you use the GitHub issues system over trying to reach out to me in other ways. Pull requests for contributions are encouraged.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
The source code of this package is licensed under the MIT License.