tlag: Time-lag a variable

View source: R/tlag.R

Description

This function retrieves the time-lagged values of a variable, using the time variable defined in .t in the function or by as_pibble(). tlag() is highly unusual among time-lag functions in that it is usable even if observations are not uniquely identified by .t (and .i, if defined).

Usage

tlag(
  .var,
  .df = get(".", envir = parent.frame()),
  .n = 1,
  .default = NA,
  .quick = FALSE,
  .resolve = "error",
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = NA,
  .uniqcheck = FALSE
)

Arguments

.var

Unquoted variable from .df to be lagged.

.df

Data frame, pibble, or tibble (usually the object that contains .var) that contains the panel structure variables either listed in .i and .t, or earlier declared with as_pibble(). If tlag is called inside of a dplyr verb, this can be omitted and the data will be picked up automatically.

.n

Number of periods to lag by. 1 by default. Note that this is automatically scaled by .d: if .d = 2 and .n = 1, then the lag of .t = 3 will be .t = 1. Negative values are allowed and are equivalent to calling tlead() with the corresponding positive value. Note that .n is ignored if .d = 0.
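The way .n is scaled by .d can be illustrated with a small base-R sketch. This is not pmdplyr's implementation: the helper lag_lookup() is hypothetical and only mimics the lookup rule described above, for a single individual whose time values are unique.

```r
# Hypothetical helper (not part of pmdplyr): for each time t, fetch the value
# observed at t - n * d, or `default` if that period is not in the data.
lag_lookup <- function(x, t, n = 1, d = 1, default = NA) {
  idx <- match(t - n * d, t)   # position of the lagged period, NA if absent
  out <- x[idx]
  out[is.na(idx)] <- default
  out
}

t <- c(1, 3, 5, 6)
x <- c(10, 30, 50, 60)

lagged_d2 <- lag_lookup(x, t, n = 1, d = 2)   # t = 3 looks back to t = 1
lagged_d1 <- lag_lookup(x, t, n = 1, d = 1)   # only t = 6 has a t - 1 in the data
lead_d2   <- lag_lookup(x, t, n = -1, d = 2)  # negative n looks forward instead
```

With .d = 2 the observation at t = 3 picks up the value from t = 1, whereas with .d = 1 it finds nothing because t = 2 is not in the data.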

.default

Fill-in value used when lagged observation is not present. Defaults to NA.

.quick

If .i and .t uniquely identify observations in your data, and either .d = 0 or there are no time gaps for any individuals (perhaps use panel_fill() first), set .quick = TRUE to improve speed. tlag() will not check whether these conditions hold (except unique identification, which will be checked if .uniqcheck = TRUE or if .i or .t are specified in-function), so make sure they do or you will get strange results.

.resolve

If there is more than one observation per individual/period, and the value of .var is identical for all of them, that's no problem. But what should tlag() do if they're not identical? Set .resolve = 'error' (or, really, any string) to throw an error in this circumstance. Or, set .resolve to a function (ideally, a vectorized one) that can be used within dplyr::summarize() to select a single value per individual/period. For example, set .resolve = mean to get the mean value of all observations present for that individual/period.
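As a rough illustration of what a .resolve function accomplishes, the base-R sketch below first collapses duplicated periods to a single value and then looks up the previous period. It is a conceptual sketch only, assuming a single individual and a gap of 1 between periods; the toy data frame and column names are made up, and tlag() itself may work differently internally.

```r
# Toy data (made up): two observations in period 1 with different values of x
df <- data.frame(t = c(1, 1, 2, 3), x = c(10, 20, 30, 40))

# Step 1: collapse to one value per period, here with mean()
per_period <- aggregate(x ~ t, data = df, FUN = mean)

# Step 2: every row looks up the collapsed value from the previous period
df$x_lag <- per_period$x[match(df$t - 1, per_period$t)]
```

Both rows in period 1 get NA because there is no period 0, and the row in period 2 gets 15, the mean of the two period-1 values.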

.group_i

By default, if .i is specified or found in the data, tlag() will group the data by .i, ignoring any grouping already implemented. Set .group_i = FALSE to avoid this.

.i

Quoted or unquoted variable(s) that identify the individual cases. Note that setting any one of .i, .t, or .d will override all three already applied to the data, and will return data that is as_pibble()d with all three, unless .setpanel=FALSE.

.t

Quoted or unquoted variable indicating the time. pmdplyr accepts two kinds of time variables: numeric variables where a fixed distance .d will take you from one observation to the next, or, if .d=0, any standard variable type with an order. Consider using the time_variable() function to create the necessary variable if your data uses a Date variable for time.

.d

Number indicating the gap in .t between one period and the next. For example, if .t indicates a single day but data is collected once a week, you might set .d=7. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set .d = 0. The default .d = NA here will become .d = 1 if either .i or .t is declared.
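The difference between a fixed gap and .d = 0 can be sketched in base R. Again, this is an illustration of the two rules under the assumption of one individual with unique time values, not pmdplyr's own code; the vectors are made up.

```r
# Made-up yearly series with 2010 missing
t <- c(2009, 2011, 2012, 2013)
x <- c(1, 2, 3, 4)

# Fixed gap of 1: look for an observation exactly one year earlier
lag_gap <- x[match(t - 1, t)]

# Gap ignored (the idea behind .d = 0): take the most recent prior
# observation in time order, however far back it is
ord <- order(t)
lag_prev <- rep(NA_real_, length(x))
lag_prev[ord] <- c(NA, x[ord][-length(x)])
```

Here the 2011 row has no lag under the fixed-gap rule because 2010 is absent, but under the gap-ignoring rule it picks up the 2009 value.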

.uniqcheck

Logical parameter. Set to TRUE to always check whether .i and .t uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of .i, .t, or .d is set.

Examples

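# Load pmdplyr for tlag() and the Scorecard data
# (the %>% pipe used below is assumed to be available once pmdplyr is attached)
library(pmdplyr)
data(Scorecard)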

# The Scorecard data is uniquely identified by unitid and year.
# However, there are sometimes gaps between years.
# In cases like this, using dplyr::lag() will still use the row before,
# whereas tlag() will respect the gap and give a NA, much like plm::lag()
# (although tlag is slower than either, sorry)
Scorecard <- Scorecard %>%
  dplyr::mutate(pmdplyr_tlag = tlag(earnings_med,
    .i = unitid,
    .t = year
  ))
Scorecard <- Scorecard %>%
  dplyr::arrange(year) %>%
  dplyr::group_by(unitid) %>%
  dplyr::mutate(dplyr_lag = dplyr::lag(earnings_med)) %>%
  dplyr::ungroup()

# more NAs in the pmdplyr version - observations with a gap and thus no real lag present in data
sum(is.na(Scorecard$pmdplyr_tlag))
sum(is.na(Scorecard$dplyr_lag))

# If we want to ignore gaps, or have .d = 0, and .i and .t uniquely identify observations,
# we can use the .quick option to match dplyr::lag()
Scorecard <- Scorecard %>%
  dplyr::mutate(pmdplyr_quick_tlag = tlag(earnings_med,
    .i = unitid,
    .t = year,
    .d = 0,
    .quick = TRUE
  ))
sum(Scorecard$dplyr_lag != Scorecard$pmdplyr_quick_tlag, na.rm = TRUE)

# Where tlag shines is when you have multiple observations per .i/.t
# If the value of .var is constant within .i/.t, it will work just as you expect.
# If it's not, it will throw an error, or you can set
# .resolve to tell tlag how to select a single value from the many
# Maybe we want to get the lagged average earnings within degree award type
Scorecard <- Scorecard %>%
  dplyr::mutate(
    last_year_earnings_by_category =
      tlag(earnings_med,
        .i = pred_degree_awarded_ipeds, .t = year,
        .resolve = function(x) mean(x, na.rm = TRUE)
      )
  )
# Or maybe I want the lagged earnings across all types - .i isn't necessary!
Scorecard <- Scorecard %>%
  dplyr::mutate(last_year_earnings_all = tlag(earnings_med,
    .t = "year",
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))
# Curious why the first nonmissing obs show up in 2012?
# It's because there's no 2008 or 2010 in the data, so when 2009 or 2011 look back
# a year, they find nothing!
# We could get around this by setting .d = 0 to ignore gap length
# Note this can be a little slow.
Scorecard <- Scorecard %>%
  dplyr::mutate(last_year_earnings_all = tlag(earnings_med,
    .t = year, .d = 0,
    .resolve = function(x) mean(x, na.rm = TRUE)
  ))

pmdplyr documentation built on July 2, 2020, 4:08 a.m.