epi_slide: Slide a function over variables in an 'epi_df' object


epi_slide    R Documentation

Slide a function over variables in an epi_df object

Description

Slides a given function over variables in an epi_df object. This is useful for computations like rolling averages. The function supports many ways to specify the computation, but by far the most common use case is as follows:

# Create new column `cases_7dm` that contains a 7-day trailing median of cases
epi_slide(edf, cases_7dm = median(cases), .window_size = 7)

For two very common use cases, we provide optimized functions that are much faster than epi_slide: epi_slide_mean() and epi_slide_sum(). We recommend using these functions when possible.

See vignette("epi_df") for more examples.

Usage

epi_slide(
  .x,
  .f,
  ...,
  .window_size = NULL,
  .align = c("right", "center", "left"),
  .ref_time_values = NULL,
  .new_col_name = NULL,
  .all_rows = FALSE
)

Arguments

.x

An epi_df object. If ungrouped, we group by geo_value and any columns in other_keys. If grouped, we make sure the grouping is by geo_value and other_keys.

.f

Function, formula, or missing; together with ... specifies the computation to slide. The return of the computation should either be a scalar or a 1-row data frame. Data frame returns will be tidyr::unpack()-ed if unnamed, and will become tidyr::pack()-ed columns if named. See examples.

  • If .f is missing, then ... will specify the computation via tidy-evaluation. This is usually the most convenient way to use epi_slide. See examples.

  • If .f is a formula, then the formula should use .x (not the same as the input epi_df) to operate on the columns of the input epi_df, e.g. ~mean(.x$var) to compute a mean of var.

  • If a function, .f must have the form function(x, g, t, ...), where:

    • x is a data frame with the same column names as the original object, minus any grouping variables, with only the windowed data for one group-.ref_time_value combination

    • g is a one-row tibble containing the values of the grouping variables for the associated group

    • t is the .ref_time_value for the current window

    • ... are additional arguments
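
As a rough illustration of the function form, the sketch below slides a within-window range computation over a toy epi_df. The data and the name window_range are made up for this example; only as_epi_df() and epi_slide() come from epiprocess, and the output column name relies on the documented "slide_value" default.

```r
library(epiprocess)

# Hypothetical toy data: two locations, ten daily observations each
edf <- as_epi_df(tibble::tibble(
  geo_value = rep(c("ca", "ny"), each = 10),
  time_value = rep(as.Date("2024-01-01") + 0:9, times = 2),
  cases = c(1:10, seq(10, 100, by = 10))
))

# x: windowed rows for one group and reference time
# g: one-row tibble holding the group keys
# t: the reference time value for the current window
window_range <- function(x, g, t, ...) {
  max(x$cases) - min(x$cases)
}

# Scalar returns land in a column named "slide_value" by default
out <- epi_slide(edf, window_range, .window_size = 3)
```

Because window_range ignores g and t, the same computation could also be written with tidy evaluation; the function form is mainly useful when the group key or the reference time is needed inside the computation.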

...

Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then the ... is interpreted as a "data-masking" expression or expressions for tidy evaluation.

.window_size

The size of the sliding window. The accepted values depend on the type of the time_value column in .x:

  • if time type is Date and the cadence is daily, then .window_size can be an integer (which will be interpreted in units of days) or a difftime with units "days"

  • if time type is Date and the cadence is weekly, then .window_size must be a difftime with units "weeks"

  • if time type is a yearmonth or an integer, then .window_size must be an integer
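
The following sketch shows the two accepted ways of writing a 7-day window for daily data, plus how a weekly window would be constructed. The toy data and object names are assumptions made for illustration.

```r
library(epiprocess)

# Hypothetical daily data for a single location
daily <- as_epi_df(tibble::tibble(
  geo_value = "ca",
  time_value = as.Date("2024-01-01") + 0:13,
  cases = 1:14
))

# Daily cadence: an integer is interpreted in units of days ...
by_int <- epi_slide(daily, cases_7dm = median(cases), .window_size = 7)

# ... and a difftime with units "days" describes the same window
by_difftime <- epi_slide(
  daily,
  cases_7dm = median(cases),
  .window_size = as.difftime(7, units = "days")
)

# Weekly cadence requires a difftime with units "weeks",
# for example a 4-week window:
four_weeks <- as.difftime(4, units = "weeks")
```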

.align

The alignment of the sliding window.

  • If "right" (default), then the window has its end at the reference time. This is likely the most common use case, e.g. .window_size=7 and .align="right" slides over the past week of data.

  • If "left", then the window has its start at the reference time.

  • If "center", then the window is centered at the reference time. If the window size is odd, then the window will have floor(window_size/2) points before and after the reference time; if the window size is even, then the window will be asymmetric and have one more value before the reference time than after.
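
To make the three alignments concrete, here is a small base R sketch that computes how many time steps a window covers before and after the reference time. The helper window_offsets is hypothetical and not part of epiprocess; it only mirrors the rules described above and is not taken from the package internals.

```r
# Hypothetical helper (not part of epiprocess): number of time steps
# covered before and after the reference time for a given window
window_offsets <- function(window_size, align = c("right", "center", "left")) {
  align <- match.arg(align)
  before <- switch(align,
    right = window_size - 1,        # window ends at the reference time
    left = 0,                       # window starts at the reference time
    center = floor(window_size / 2) # even sizes get the extra step before
  )
  # The reference time itself accounts for the remaining step
  c(before = before, after = window_size - 1 - before)
}

window_offsets(7, "right")  # before = 6, after = 0
window_offsets(7, "center") # before = 3, after = 3
window_offsets(6, "center") # before = 3, after = 2
```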

.ref_time_values

The time values at which to compute the slide values. By default, this is all the unique time values in .x.

.new_col_name

Name for the new column that will contain the computed values. The default is "slide_value" unless your slide computations output data frames, in which case they will be unpacked (as in tidyr::unpack()) into the constituent columns and those names used. New columns should not be given names that clash with the existing columns of .x.

.all_rows

If .all_rows = FALSE, the default, then the output epi_df will have only the rows that had a time_value in .ref_time_values. Otherwise, all the rows from .x are included, with a missing value marker in the slide computation columns for rows whose time_value is not in .ref_time_values (typically NA, but more technically the result of vctrs::vec_cast-ing NA to the type of the slide computation output).
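
A minimal sketch of how .ref_time_values and .all_rows interact, using made-up data; the object names (edf, ref_days, subset_out, full_out) are chosen for this example only.

```r
library(epiprocess)

# Hypothetical toy data: two locations, ten daily observations each
edf <- as_epi_df(tibble::tibble(
  geo_value = rep(c("ca", "ny"), each = 10),
  time_value = rep(as.Date("2024-01-01") + 0:9, times = 2),
  cases = c(1:10, 11:20)
))

ref_days <- as.Date(c("2024-01-07", "2024-01-10"))

# Default .all_rows = FALSE: only rows at the reference dates are returned
subset_out <- epi_slide(
  edf,
  cases_7dm = median(cases),
  .window_size = 7,
  .ref_time_values = ref_days
)

# .all_rows = TRUE: every row of edf is returned, and rows at other dates
# carry a missing value in cases_7dm
full_out <- epi_slide(
  edf,
  cases_7dm = median(cases),
  .window_size = 7,
  .ref_time_values = ref_days,
  .all_rows = TRUE
)
```

Restricting .ref_time_values can save time when only a few dates are of interest, while .all_rows = TRUE keeps the output aligned row-for-row with the input.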

Details

Advanced uses of .f via tidy evaluation

If specifying the computation via tidy evaluation (that is, .f is missing and ... carries the expressions), in addition to the standard .data and .env, we make some additional "pronoun"-like bindings available:

  • .x, which is like .x in dplyr::group_modify; an ordinary object like an epi_df rather than an rlang pronoun like .data; this allows you to use additional dplyr, tidyr, and epiprocess operations. If you have multiple expressions in ..., this won't let you refer to the output of the earlier expressions, but .data will.

  • .group_key, which is like .y in dplyr::group_modify.

  • .ref_time_value, which is the element of .ref_time_values that determined the time window for the current computation.
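
The bindings listed above might be used as in the sketch below, which records, for each window, the number of rows it contains, the group it belongs to, and its reference time. The data and the new column names (n_obs, window_geo, window_end) are illustrative assumptions.

```r
library(epiprocess)

# Hypothetical toy data: two locations, five daily observations each
edf <- as_epi_df(tibble::tibble(
  geo_value = rep(c("ca", "ny"), each = 5),
  time_value = rep(as.Date("2024-01-01") + 0:4, times = 2),
  cases = c(1:5, 11:15)
))

out <- epi_slide(
  edf,
  n_obs = nrow(.x),                   # .x: the windowed data as an ordinary object
  window_geo = .group_key$geo_value,  # .group_key: one row of grouping values
  window_end = .ref_time_value,       # reference time that defined this window
  .window_size = 3
)
```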

Value

An epi_df object with one or more new slide computation columns added.

See Also

epi_slide_opt for optimized slide functions

Examples

# Get the 7-day trailing standard deviation of cases and the 7-day trailing mean of cases
cases_deaths_subset %>%
  epi_slide(
    cases_7sd = sd(cases, na.rm = TRUE),
    cases_7dav = mean(cases, na.rm = TRUE),
    .window_size = 7
  ) %>%
  dplyr::select(geo_value, time_value, cases, cases_7sd, cases_7dav)

# The same as above, but unpacking using an unnamed data.frame with a formula
cases_deaths_subset %>%
  epi_slide(
    ~ data.frame(
      cases_7sd = sd(.x$cases, na.rm = TRUE),
      cases_7dav = mean(.x$cases, na.rm = TRUE)
    ),
    .window_size = 7
  ) %>%
  dplyr::select(geo_value, time_value, cases, cases_7sd, cases_7dav)

# The same as above, but packing using a named data.frame with a tidy evaluation
# expression
cases_deaths_subset %>%
  epi_slide(
    slide_packed = data.frame(
      cases_7sd = sd(.x$cases, na.rm = TRUE),
      cases_7dav = mean(.x$cases, na.rm = TRUE)
    ),
    .window_size = 7
  ) %>%
  dplyr::select(geo_value, time_value, cases, slide_packed)

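# The same computation, specified as a function on grouped data; the unnamed
# data.frame return is unpacked into new columns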
cases_deaths_subset %>%
  group_by(geo_value) %>%
  epi_slide(
    function(x, g, t) {
      data.frame(
        cases_7sd = sd(x$cases, na.rm = TRUE),
        cases_7dav = mean(x$cases, na.rm = TRUE)
      )
    },
    .window_size = 7
  ) %>%
  dplyr::select(geo_value, time_value, cases, cases_7sd, cases_7dav)

# Use the geo_value or the ref_time_value in the slide computation
cases_deaths_subset %>%
  epi_slide(~ .x$geo_value[[1]], .window_size = 7)

cases_deaths_subset %>%
  epi_slide(~ .x$time_value[[1]], .window_size = 7)

cmu-delphi/epiprocess documentation built on Oct. 29, 2024, 5:37 p.m.