mutate_cascade: Perform mutate one time period at a time ('Cascading mutate')


View source: R/major_mutate_variations.R

Description

This function is a wrapper for dplyr::mutate() which performs mutate one time period at a time, allowing each period's calculation to complete before moving on to the next. This allows changes in one period to 'cascade down' to later periods. This is (number of time periods) times slower than regular mutate() and is generally only used for mutations where an existing variable is being defined in terms of its own lag() or tlag(). It is similar in concept to cumsum() (and slower), but is much more flexible, and works with data that has multiple observations per individual-period using tlag(). For example, this could be used to calculate the current value of a savings account given a variable with each period's deposits, withdrawals, and interest, or to calculate the cumulative number of credits a student has taken across all classes.
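The difference between a one-shot lag and a cascading calculation can be sketched in base R. This is only an illustration of the idea using the savings-account example, not pmdplyr's implementation; the names deposits, rate, one_shot, and balance are hypothetical.

```r
# Hypothetical savings account: one deposit per period, 10% interest per period
deposits <- c(100, 50, 0, 25)
rate <- 0.10

# One-shot (vectorized) version: the lag is taken once from the ORIGINAL values,
# so the updated value from one period never reaches the next period
lagged <- c(NA, head(deposits, -1))
one_shot <- deposits + (1 + rate) * lagged

# Cascading version: each period is completed before the next period reads it
balance <- deposits
for (t in seq_along(balance)[-1]) {  # start at period 2, leaving period 1 as is
  balance[t] <- deposits[t] + (1 + rate) * balance[t - 1]
}

print(one_shot)
print(balance)
```

In the cascading version, each period's balance includes interest on everything accumulated so far, which is the behaviour a running balance needs.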

Usage

mutate_cascade(
  .df,
  ...,
  .skip = TRUE,
  .backwards = FALSE,
  .group_i = TRUE,
  .i = NULL,
  .t = NULL,
  .d = NA,
  .uniqcheck = FALSE,
  .setpanel = TRUE
)

Arguments

.df

Data frame or tibble.

...

Specification to be passed to mutate().

.skip

Set to TRUE to skip the first period present in the data (or present within each group for grouped data) when applying mutate(). Since most uses of mutate_cascade() will involve a lag() or tlag(), this avoids creating an NA in the first period that then cascades down. By default this is TRUE. If you set this to FALSE you should probably have some method for avoiding a first-period NA in your ... entry, perhaps using the default option in dplyr::lag or the .default option in tlag.
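To see why skipping the first period matters, here is a small base R sketch of the same kind of calculation. The helper cascade() and its skip_first argument are hypothetical stand-ins for illustration only, not part of pmdplyr.

```r
x <- c(10, 20, 30)

# Hypothetical helper: cascade x[t] = x[t] + 0.5 * (previous x), period by period
cascade <- function(x, skip_first) {
  start <- if (skip_first) 2 else 1
  for (t in seq(start, length(x))) {
    # In period 1 there is no earlier period, so the lag is NA
    prev <- if (t == 1) NA_real_ else x[t - 1]
    x[t] <- x[t] + 0.5 * prev
  }
  x
}

skipped <- cascade(x, skip_first = TRUE)
not_skipped <- cascade(x, skip_first = FALSE)

print(skipped)
print(not_skipped)
```

Without the skip, the NA created in the first period is read by the second period, and so on, until every period is NA.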

.backwards

Set to TRUE to run mutate_cascade() from the last period to the first, rather than from the first to the last.

.group_i

By default, if .i is specified or found in the data, mutate_cascade will group the data by .i, ignoring any grouping already implemented (although the original grouping structure will be returned at the end). Set .group_i = FALSE to avoid this.

.i

Quoted or unquoted variables that identify the individual cases. Note that setting any one of .i, .t, or .d will override all three already applied to the data, and will return data that is as_pibble()d with all three, unless .setpanel=FALSE.

.t

Quoted or unquoted variables indicating the time. pmdplyr accepts two kinds of time variables: numeric variables where a fixed distance .d will take you from one observation to the next, or, if .d=0, any standard variable type with an order. Consider using the time_variable() function to create the necessary variable if your data uses a Date variable for time.

.d

Number indicating the gap in .t between one period and the next. For example, if .t indicates a single day but data is collected once a week, you might set .d=7. To ignore gap length and assume that "one period ago" is always the most recent prior observation in the data, set .d=0. The default .d = NA here will become .d = 1 if either .i or .t is declared.
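The effect of the gap on what counts as "one period ago" can be illustrated with a rough base R sketch. The helper lag_by_gap() is hypothetical and only mimics the idea for a single individual whose rows are already sorted by time; it is not how tlag() is implemented.

```r
# Hypothetical daily time index with weekly observations and one missing week
t_obs <- c(1, 8, 15, 29)
value <- c(5, 6, 7, 9)

# Hypothetical helper: value from "one period ago" given a gap d
lag_by_gap <- function(t_obs, value, d) {
  if (d == 0) {
    # Ignore gap length: use the most recent prior row (assumes sorted times)
    return(c(NA, head(value, -1)))
  }
  # Otherwise use the observation exactly d time units earlier, if there is one
  value[match(t_obs - d, t_obs)]
}

lag_d7 <- lag_by_gap(t_obs, value, d = 7)
lag_d0 <- lag_by_gap(t_obs, value, d = 0)
lag_d1 <- lag_by_gap(t_obs, value, d = 1)

print(lag_d7)
print(lag_d0)
print(lag_d1)
```

With a gap of 7 the observation at time 29 has no lag, because nothing was recorded at time 22. With a gap of 0 it falls back to the observation at time 15. With a gap of 1 no observation has a lag at all, which is why weekly data on a daily index needs the gap set explicitly.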

.uniqcheck

Logical parameter. Set to TRUE to always check whether .i and .t uniquely identify observations in the data. By default this is set to FALSE and the check is only performed once per session, and only if at least one of .i, .t, or .d is set.

.setpanel

Logical parameter. TRUE by default, and so if .i, .t, and/or .d are declared, will return a pibble set in that way.

Details

To apply mutate_cascade() to non-panel data and without any grouping (perhaps to mimic standard Stata replace functionality), add a variable to your data indicating the order you'd like mutate performed in (perhaps using dplyr::row_number()) and set .t to that new variable.

Examples

if (interactive()) {
library(pmdplyr)
library(magrittr) # provides the %>% pipe used below
data(Scorecard)
# I'd like to build a decaying function that remembers previous earnings but at a declining rate
# Let's only use nonmissing earnings
# And let's say we're only interested in four-year colleges in Colorado
# (mutate_cascade + tlag can be very slow so we're working with a smaller sample)
Scorecard <- Scorecard %>%
  dplyr::filter(
    !is.na(earnings_med),
    pred_degree_awarded_ipeds == 3,
    state_abbr == "CO"
  ) %>%
  # And declare the panel structure
  as_pibble(.i = unitid, .t = year)
Scorecard <- Scorecard %>%
  # Almost all instances involve a variable being set to a function of a lag of itself
  # we don't want to overwrite so let's make another
  # Note that earnings_med is an integer -
  # but we're about to make non-integer decay function, so call it a double!
  dplyr::mutate(decay_earnings = as.double(earnings_med)) %>%
  # Now we can cascade: each year's value adds half of the previous year's
  # already-updated value
  mutate_cascade(
    decay_earnings = decay_earnings +
      .5 * tlag(decay_earnings, .quick = TRUE)
  )
}

pmdplyr documentation built on July 2, 2020, 4:08 a.m.