pmdplyr: Panel Maneuvers in dplyr"

knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  comment = "#>"
)
library(pmdplyr)

The pmdplyr package is an extension to dplyr designed for cleaning and managing panel and hierarchical data. It contains variations on the dplyr::mutate() and dplyr::join() functions that address common panel data needs, and contains functions for managing and cleaning panel data.

Unlike other panel data packages, functions in pmdplyr are all designed to work even if there is more than one observation per individual per period. This comes in handy if each individual is observed multiple times per period - for example, multiple classes per student per term; or if you have hierarchical data - for example, multiple companies per country.

There are three vignettes in total describing the contents of pmdplyr:

  1. CURRENT VIGNETTE "pmdplyr"/"Get Started", which describes the pibble panel data object type, and the pmdplyr tools for creating well-behaved ID and time variables id_variable() and time_variable().

  2. "dplyr variants", which that describes pmdplyr variations on dplyr functions mutate() (mutate_cascade() and mutate_subset()), _join (inexact_join and safe_join()), and lag (tlag()).

  3. "Panel Tools", which describes novel tools that pmdplyr provides for cleaning and manipulating panel data (panel_fill(), panel_locf(), fixed_check(), fixed_force(), between_i(), within_i(), mode_order()).


pibble

The pibble data type is a tibble data frame with additional attributes .i, which is a set of variables in the pibble that identifies individuals, .t, which is a variable in the pibble that indentifies the time index, and .d, which identifies the gap between periods of .t. If the gap between time periods doesn't matter and any consecutive time periods should be treated as consecutive, set .d = 0.

pibble status will be maintained if the pibble is modified using functions from pmdplyr or dplyr (unless you use those functions in a way that drops or renames the .i or .t variables, in which case pibble status will be lost). Other data manipulation functions may remove pibble status, requiring it to be re-declared as a pibble.

Most functions in pmdplyr will allow you to declare .i and .t in the function itself. But if a pibble is passed in via a dplyr verb, there is no need.

pibble() and as_pibble()

pibbles can be declared in two main ways: raw, via pibble():

pibble(...,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE
)

or by transforming an existing data.frame, list, or tbl_df using as_pibble():

as_pibble(x,
  .i = NULL,
  .t = NULL,
  .d = 1,
  .uniqcheck = FALSE,
  ...
)

Both functions work exactly as tibble::tibble() and tibble::as_tibble() do, except that they also take the arguments .i, .t, and .d, with .i and .t accepting either unquoted or quoted variable names. If you'd like your pibble checked to see if .i and .t uniquely identify your observations, set .uniqcheck = TRUE. It will do this automatically the first time in each R session you create a pibble, but if you'd like it to keep doing it, use uniqcheck.

As a side bonus, you can check if the variables a, b, c uniquely identify the observations in data set d by running as_pibble(d, .i = c(a, b, c), .uniqcheck = TRUE). No warning? It's uniquely identified!

# .d = 1 by default, so in this data,
# a = 1, b = 3 comes one period after a = 1, b = 2.
basic_pibble <- pibble(
  a = c(1, 1, 1, 2, 2, 2),
  b = c(1, 2, 3, 2, 3, 3),
  c = 1:6,
  .i = a,
  .t = b
)

data(SPrail)
# In SPrail, insert_date does not imply regular gaps between
# time periods, so we set .d = 0
declared_pibble <- as_pibble(SPrail,
  .i = c(origin, destination),
  .t = insert_date,
  .d = 0
)

panel_convert()

pmdplyr also has the function panel_convert() which allows you to convert between different popular R panel data objects, including pibble. This can come in handy for creating pibbles, or exporting your cleaned pibble to use with a package that does panel data analysis (which pmdplyr does not):

panel_convert(
  data,
  to,
  ...
)

Where data is a panel data object, either pibble, tsibble, pdata.frame, or panel_data, and to is the type of object you'd like returned, which you can refer to by object name, object class, or package name: get a pibble with "pmdplyr", "pibble", or "tbl_pb", a tsibble with "tsibble" or "tbl_ts", a pdata.frame with "plm" or "pdata.frame", or a panel_data with "panelr" or "panel_data". ... sends additional arguments to the functions used to declare those objects.

When using panel_convert, be aware that any grouping will be lost, and you must have the relevant package of your to option installed (tsibble, plm, or panelr). When your data object is a pdata.frame, it is recommended to also have sjlabelled installed.

All valid objects of the non-pibble types can be converted to pibbles, but the reverse is not true, since pibble does not enforce some strict requirements that other types do:

Feature/Requirement | pibble | tsibble| pdata.frame | panel_data -----------------------|-----------|-----------|----------------|------------ ID | .i | key | index[1] | id Time | .t | index | index[2] | wave Gap control | .d | regular | No | No
ID must exist | No | No | Yes | Yes Time must exist | No | Yes | Yes | Yes[1] Only one ID variable[2]| No | No | Yes | Yes Unique identification | No | Yes | No[3] | No[3]

[1] pdata.frame does not require that time be provided, but if not provided will create it based on original ordering of the data. The pdata.frame option to set index equal to an integer for a balanced panel and have it figure out the rest by itself is not supported.

[2] Use id_variable() (described below) to generate a single ID variable from multiple if one is required.

[3] pdata.frame and panel_data do not require that ID and time uniquely identify the observations on declaring the data, but functions in these packages may not work correctly without unique identification.

In addition to the above, be aware that the different packages have different requirements on which variable classes can be Time variables. time_variable() (described below) can build an integer variable that will work in all packages.


ID and Time Variables

Any panel data dataset is defined by ID variable(s) (.i) that indicate the individual person/firm/country/etc. you're talking about, and a time variable (.t) that tell you when the observation in question was recorded.

While pmdplyr allows multiple ID variables, many panel data packages only allow one. id_variable() allows you to turn multiple ID variables into a single variable.

Many panel data packages, including pmdplyr, prefer a time variable that is an integer, so that the difference between each period is 1 (or .d in pmdplyr's case). The function time_variable() can help you generate an integer time variable from a Date-class variable, or from one or more variables containing, for example, year and month.

id_variable()

id_variable() syntax follows:

id_variable(...,
  .method = "number",
  .minwidth = FALSE
)

where ... is the set of identity variables that you want to combine into a single one (or, potentially, a single variable you'd like to encode numerically).

.method describes the way in which you'd like the variable encoded:

df <- data.frame(
  country = c(
    "US", "US", "US", "US",
    "ENG", "ENG", "ENG", "ENG"
  ),
  city = c(
    "NYC", "NYC", "Cambridge", "NYC",
    "Cambridge", "London", "Manchester", "Manchester"
  )
) %>%
  mutate(
    numeric_ID = id_variable(country, city),
    random_ID = id_variable(country, city, .method = "random"),
    char_ID = id_variable(country, city, .method = "character")
  )

df

time_variable()

time_variable() syntax follows:

time_variable(...,
  .method = "present",
  .datepos = NA,
  .start = 1,
  .skip = NA,
  .breaks = NA,
  .turnover = NA,
  .turnover_start = NA
)

Where ... is the set of variables that you want to combine into a single, integer-class time variable. The rest of the options determine how the variable(s) will be read or transformed; the need for each varies depending on the structure of the original data and which .method is used.

.method can take the values:

data(SPrail)

# Since we have a date variable, we can easily create integers that increment for each
# year, or for each month, etc.
# Likely we'd only really need one of these four, depending on our purposes
SPrail <- SPrail %>%
  dplyr::mutate(
    year_time_id = time_variable(insert_date, .method = "year"),
    month_time_id = time_variable(insert_date, .method = "month"),
    week_time_id = time_variable(insert_date, .method = "week"),
    day_time_id = time_variable(insert_date, .method = "day")
  )

# Let's see what we've got
SPrail %>% 
  select(insert_date, ends_with("time_id")) %>% 
  head()

# Perhaps I'd like quarterly data
# (although in this case there are only two months, not much variation there)
SPrail <- SPrail %>%
  dplyr::mutate(quarter_time_id = time_variable(insert_date,
    .method = "month",
    .breaks = c(1, 4, 7, 10)
  ))
# Should line up properly with month
SPrail %>% 
  count(month_time_id, quarter_time_id)

# Maybe I'd like Monday to come immediately after Friday!
SPrail <- SPrail %>%
  dplyr::mutate(weekday_time_id = time_variable(insert_date,
    .method = "day",
    .skip = c(6, 7)
  ))

# Perhaps I'm interested in ANY time period in the data and just want to enumerate them in order
SPrail <- SPrail %>%
  dplyr::mutate(any_present_time_id = time_variable(insert_date,
    .method = "present"
  ))

# Note the weekday_time_id NAs - these are weekends! We told it to skip those.
head(SPrail %>% select(insert_date, day_time_id, weekday_time_id, any_present_time_id))

# Maybe instead of being given a nice time variable, I was given it in string form
SPrail <- SPrail %>% dplyr::mutate(time_string = as.character(insert_date))
# As long as the character positions are consistent we can still use it
SPrail <- SPrail %>%
  dplyr::mutate(day_from_string_id = time_variable(time_string,
    .method = "day",
    .datepos = c(3, 4, 6, 7, 9, 10)
  ))
# Results are identical from using the actual Date variable
cor(SPrail$day_time_id, SPrail$day_from_string_id)


Try the pmdplyr package in your browser

Any scripts or data that you put into this service are public.

pmdplyr documentation built on July 2, 2020, 4:08 a.m.