epix_slide: Slide a function over variables in an 'epi_archive' or...
In cmu-delphi/epiprocess: Tools for basic signal processing in epidemiology

epix_slide

R Documentation

Slide a function over variables in an `epi_archive` or `grouped_epi_archive`

Description

Slides a given function over variables in an epi_archive object. This behaves similarly to epi_slide(), with the key exception that it is version-aware: the sliding computation at any given reference time t is performed on data that would have been available as of t. This function is intended for use in accurate backtesting of models; see vignette("backtesting", package="epipredict") for a walkthrough.
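As a rough illustration of what "data that would have been available as of t" means, the following base R sketch builds a version-aware snapshot by hand. The toy `updates` table and the `snapshot_as_of()` helper are hypothetical names introduced here for illustration; this is not the package's implementation, which works on `epi_archive` objects via `epix_as_of()`.

```r
# Hypothetical toy revision log (not from the package): each row is one
# reported value for a `time_value`, as published in a given `version`.
updates <- data.frame(
  time_value = as.Date(c("2020-06-01", "2020-06-01", "2020-06-02")),
  version    = as.Date(c("2020-06-02", "2020-06-04", "2020-06-03")),
  value      = c(10, 12, 20)
)

# Hypothetical helper: the data that would have been available as of `t`,
# i.e. the latest version <= t for each time_value.
snapshot_as_of <- function(updates, t) {
  avail <- updates[updates$version <= t, , drop = FALSE]
  # Sort so that the latest version of each time_value comes last ...
  avail <- avail[order(avail$time_value, avail$version), , drop = FALSE]
  # ... and keep only that last row per time_value.
  keep <- !duplicated(avail$time_value, fromLast = TRUE)
  avail[keep, c("time_value", "value"), drop = FALSE]
}

# The 2020-06-01 observation is revised in version 2020-06-04, so the two
# snapshots disagree about its value.
snap_early <- snapshot_as_of(updates, as.Date("2020-06-03"))
snap_late  <- snapshot_as_of(updates, as.Date("2020-06-04"))
```

A backtest that used `snap_late` to evaluate a forecast made on 2020-06-03 would be relying on a revision that had not yet been published; version-aware sliding avoids this.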

Usage

epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

## S3 method for class 'epi_archive'
epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

## S3 method for class 'grouped_epi_archive'
epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

Arguments

`.x`	An `epi_archive` or `grouped_epi_archive` object. If ungrouped, all data in `.x` will be treated as part of a single data group.
`.f`	Function, formula, or missing; together with `...` specifies the computation to slide. To "slide" means to apply a computation over a sliding (a.k.a. "rolling") time window for each data group. The window is determined by the `.before` parameter (see details for more). If a function, `.f` must have the form `function(x, g, t, ...)`, where `x` is an epi_df with the same column names as the archive's `DT`, minus the `version` column; `g` is a one-row tibble containing the values of the grouping variables for the associated group; `t` is the ref_time_value for the current window; and `...` are additional arguments. If a formula, `.f` can operate directly on columns accessed via `.x$var` or `.$var`, as in `~ mean(.x$var)` to compute a mean of a column `var` for each group-`ref_time_value` combination. The group key can be accessed via `.y` or `.group_key`, and the reference time value can be accessed via `.z` or `.ref_time_value`. If `.f` is missing, then `...` will specify the computation.
`...`	Additional arguments to pass to the function or formula specified via `.f`. Alternatively, if `.f` is missing, then the `...` is interpreted as a "data-masking" expression or expressions for tidy evaluation; in addition to referring to columns directly by name, the expressions have access to `.data` and `.env` pronouns as in `dplyr` verbs, and can also refer to `.x` (not the same as the input epi_archive), `.group_key`, and `.ref_time_value`. See details for more.
`.before`	How many time values before the `.ref_time_value` should each snapshot handed to the function `.f` contain? If provided, it should be a single value that is compatible with the time_type of the time_value column (more below), but most commonly an integer. This window endpoint is inclusive. For example, if `.before = 7`, `time_type` in the archive is "day", and the `.ref_time_value` is January 8, then the smallest time_value in the snapshot will be January 1. If missing, then the default is no limit on the time values, so the full snapshot is given.
`.versions`	Reference time values / versions for sliding computations; each element of this vector serves both as the anchor point for the `time_value` window for the computation and as the `max_version` passed to `epix_as_of` when fetching the data in this window. If missing, then this will be set to a regularly-spaced sequence of values covering the range of `version`s in the `DT` plus the `versions_end`; the spacing of values will be guessed (using the GCD of the skips between values).
`.new_col_name`	Either `NULL` or a string indicating the name of the new column that will contain the derived values. The default, `NULL`, will use the name "slide_value" unless your slide computations output data frames, in which case they will be unpacked into the constituent columns and those names used. If the resulting column name(s) overlap with the column names used for labeling the computations, which are `group_vars(x)` and `"version"`, then the values for these columns must be identical to the labels we assign.
`.all_versions`	(Not the same as `.all_rows` parameter of `epi_slide`.) If `.all_versions = TRUE`, then the slide computation will be passed the version history (all `version <= .version` where `.version` is one of the requested `.versions`) for rows having a `time_value` of at least `.version - .before`. Otherwise, the slide computation will be passed only the most recent `version` for every unique `time_value`. Default is `FALSE`.
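To make the interaction of `.before` and `.all_versions` concrete, here is a minimal base R sketch of which rows fall in scope for a single reference version. The `updates` table and the variable names are hypothetical and only mimic the filtering rules described above; the package itself operates on the archive's `DT` and may differ in detail.

```r
# Hypothetical toy revision log: one row per (time_value, version) update.
updates <- data.frame(
  time_value = as.Date(c("2020-06-01", "2020-06-05", "2020-06-05",
                         "2020-06-07", "2020-06-07")),
  version    = as.Date(c("2020-06-02", "2020-06-06", "2020-06-08",
                         "2020-06-08", "2020-06-09")),
  value      = c(1, 5, 6, 7, 8)
)

ref_version <- as.Date("2020-06-08")  # one element of `.versions`
before <- 3                           # `.before`, in days

# The window start is inclusive: ref_version - before.
window_start <- ref_version - before

# Rows published no later than ref_version, within the time window.
in_scope <- updates[updates$version <= ref_version &
                      updates$time_value >= window_start, , drop = FALSE]

# Analogue of `.all_versions = TRUE`: the full version history in scope.
history <- in_scope

# Analogue of `.all_versions = FALSE`: only the most recent version of each
# time_value in scope.
in_scope <- in_scope[order(in_scope$time_value, in_scope$version), , drop = FALSE]
latest <- in_scope[!duplicated(in_scope$time_value, fromLast = TRUE), , drop = FALSE]
```

Here `history` keeps both published versions of the 2020-06-05 observation, while `latest` keeps only the revised one; the update published on 2020-06-09 is excluded from both because it postdates the reference version.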

Details

A few key distinctions between the current function and epi_slide():

In .f functions for epix_slide, one should not assume that the input data will contain any rows with time_value matching the computation's .ref_time_value (accessible via attributes(<data>)$metadata$as_of); for typical epidemiological surveillance data, observations pertaining to a particular time period (time_value) are first reported as_of some instant after that time period has ended.
The input class and columns are similar but different: epix_slide (with the default .all_versions=FALSE) keeps all columns and the epi_df-ness of the first argument to each computation; epi_slide only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the essential geo_value column. (With .all_versions=TRUE, epix_slide will provide an epi_archive rather than an epi_df to each computation.)
The output class and columns are similar but different: epix_slide() returns a tibble containing only the grouping variables, version, and the new column(s) from the slide computations, whereas epi_slide() returns an epi_df with all original variables plus the new columns from the slide computations. (Both will mirror the grouping or ungroupedness of their input, with one exception: epi_archives can have trivial (zero-variable) groupings, but these will be dropped in epix_slide results as they are not supported by tibbles.)
There are no size stability checks or element/row recycling to maintain size stability in epix_slide, unlike in epi_slide. (epix_slide is roughly analogous to dplyr::group_modify, while epi_slide is roughly analogous to dplyr::mutate followed by dplyr::arrange.) This is detailed in the "advanced" vignette.
.all_rows is not supported in epix_slide; since the slide computations are allowed more flexibility in their outputs than in epi_slide, we can't guess a good representation for missing computations for excluded group-.ref_time_value pairs.
The .versions default for epix_slide is based on making an evenly-spaced sequence out of the versions in the DT plus the versions_end, rather than the time_values.
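The default `.versions` described in the last point can be pictured with the following base R sketch, which guesses a step from the GCD of the gaps between observed versions. The helpers `gcd2()` and `guess_versions()` are hypothetical and written only for this illustration, assuming a daily `time_type` and at least two distinct versions; the package's actual guessing logic may differ.

```r
# Hypothetical helper: greatest common divisor of two non-negative integers.
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Hypothetical helper: an evenly spaced sequence covering the observed
# versions plus `versions_end`, with the step guessed as the GCD of the gaps.
guess_versions <- function(versions, versions_end) {
  anchors <- sort(unique(c(versions, versions_end)))
  skips <- as.integer(diff(anchors))  # gaps between consecutive versions, in days
  step <- Reduce(gcd2, skips)
  seq(min(anchors), max(anchors), by = step)
}

# Versions observed in a toy archive, with gaps of 2 and 4 days, and a
# `versions_end` two days after the last observed version.
observed <- as.Date(c("2020-06-01", "2020-06-03", "2020-06-07"))
guessed <- guess_versions(observed, versions_end = as.Date("2020-06-09"))
```

With gaps of 2, 4, and 2 days the guessed step is 2 days, so `guessed` also includes 2020-06-05, a version in which no update was published.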

Apart from the above distinctions, the interfaces of epix_slide() and epi_slide() are the same.

Furthermore, the current function can be considerably slower than epi_slide(), for two reasons: (1) it must repeatedly fetch properly-versioned snapshots from the data archive (via epix_as_of()), and (2) it performs a "manual" sliding of sorts, and does not benefit from the highly efficient slider package. For this reason, it should never be used in place of epi_slide(), and should only be used when version-aware sliding is necessary (as is its purpose).

Value

A tibble whose columns are: the grouping variables (if any); version, containing the reference time values for the slide computation; and a column named according to the .new_col_name argument, containing the slide values. It will be grouped by the grouping variables.

Examples

library(dplyr)

# Reference time points for which we want to compute slide values:
versions <- seq(as.Date("2020-06-02"),
  as.Date("2020-06-15"),
  by = "1 day"
)

# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    .f = ~ mean(.x$case_rate_7d_av),
    .before = 2,
    .versions = versions,
    .new_col_name = "case_rate_7d_av_recent_av"
  ) %>%
  ungroup()
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example).  In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#                                                discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == .ref_time_value` as of `.ref_time_value`.
# The example below shows this type of behavior in more detail.

# Examining characteristics of the data passed to each computation with
# `.all_versions=FALSE`.
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = FALSE,
    .versions = versions
  ) %>%
  ungroup() %>%
  arrange(geo_value, version)

# --- Advanced: ---

# `epix_slide` with `.all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `.all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:

archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = TRUE,
    .versions = versions
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)

cmu-delphi/epiprocess documentation built on April 12, 2025, 12:51 p.m.