epix_slide: Slide a function over variables in an 'epi_archive' or...

View source: R/methods-epi_archive.R

epix_slide    R Documentation

Slide a function over variables in an epi_archive or grouped_epi_archive

Description

Slides a given function over variables in an epi_archive object. This behaves similarly to epi_slide(), with the key exception that it is version-aware: the sliding computation at any given reference time t is performed on data that would have been available as of t. This function is intended for use in accurate backtesting of models; see vignette("backtesting", package="epipredict") for a walkthrough.

Usage

epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

## S3 method for class 'epi_archive'
epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

## S3 method for class 'grouped_epi_archive'
epix_slide(
  .x,
  .f,
  ...,
  .before = Inf,
  .versions = NULL,
  .new_col_name = NULL,
  .all_versions = FALSE
)

Arguments

.x

An epi_archive or grouped_epi_archive object. If ungrouped, all data in .x will be treated as part of a single data group.

.f

Function, formula, or missing; together with ... specifies the computation to slide. To "slide" means to apply a computation over a sliding (a.k.a. "rolling") time window for each data group. The window is determined by the .before parameter (see details for more). If a function, .f must have the form function(x, g, t, ...), where

  • "x" is an epi_df with the same column names as the archive's DT, minus the version column

  • "g" is a one-row tibble containing the values of the grouping variables for the associated group

  • "t" is the ref_time_value for the current window

  • "..." are additional arguments

If a formula, .f can operate directly on columns accessed via .x$var or .$var, as in ~ mean(.x$var) to compute a mean of a column var for each group-ref_time_value combination. The group key can be accessed via .y or .group_key, and the reference time value can be accessed via .z or .ref_time_value. If .f is missing, then ... will specify the computation.
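As a minimal sketch of the two forms described above, the following computes the same per-group mean once with a function and once with a formula. It assumes the epiprocess and dplyr packages are attached and reuses the archive_cases_dv_subset example archive from the Examples section; the short versions vector is chosen here only for illustration.

```r
library(epiprocess)
library(dplyr)

# A few reference versions, chosen only for illustration.
versions <- seq(as.Date("2020-06-02"), as.Date("2020-06-05"), by = "1 day")

# Function form: x is the snapshot, g the group key, t the reference time value.
by_function <- archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, g, t) mean(x$case_rate_7d_av),
    .before = 2,
    .versions = versions
  ) %>%
  ungroup()

# Formula form: the snapshot is available as .x.
by_formula <- archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    ~ mean(.x$case_rate_7d_av),
    .before = 2,
    .versions = versions
  ) %>%
  ungroup()
```

Since .new_col_name is left at NULL and each computation returns a single number, both results should store their values in a column named "slide_value".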

...

Additional arguments to pass to the function or formula specified via .f. Alternatively, if .f is missing, then ... is interpreted as a "data-masking" expression or expressions for tidy evaluation; in addition to referring to columns directly by name, the expressions have access to .data and .env pronouns as in dplyr verbs, and can also refer to .x (not the same as the input epi_archive), .group_key, and .ref_time_value. See details for more.
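The data-masking form can be sketched as follows, with .f omitted and the computation given as a named expression in .... This assumes epiprocess and dplyr are attached and reuses the archive_cases_dv_subset example archive; the output name recent_av and the versions vector are made up for illustration.

```r
library(epiprocess)
library(dplyr)

# A few reference versions, chosen only for illustration.
versions <- seq(as.Date("2020-06-02"), as.Date("2020-06-05"), by = "1 day")

# .f is missing, so the named expression in `...` is evaluated with data
# masking: `case_rate_7d_av` refers to that column of each snapshot.
masked <- archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    recent_av = mean(case_rate_7d_av),  # `recent_av` is a hypothetical name
    .before = 2,
    .versions = versions
  ) %>%
  ungroup()
```

The name given to the expression is expected to become the name of the output column, as in dplyr::summarize.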

.before

How many time values before the .ref_time_value should each snapshot handed to the function .f contain? If provided, it should be a single value that is compatible with the time_type of the time_value column (more below), but most commonly an integer. This window endpoint is inclusive. For example, if .before = 7, time_type in the archive is "day", and the .ref_time_value is January 8, then the smallest time_value in the snapshot will be January 1. If missing, then the default is no limit on the time values, so the full snapshot is given.
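The window arithmetic in the example above can be checked with plain base R dates. This is only an illustration of the inclusive endpoint, not a call into the package; the variable names and the year 2020 are made up for the sketch.

```r
# Reference time value and window size from the example above.
ref_time_value <- as.Date("2020-01-08")
before <- 7

# The earliest time_value that can appear in the snapshot (inclusive).
window_start <- ref_time_value - before

# Candidate daily time values around the window.
time_values <- seq(as.Date("2019-12-25"), as.Date("2020-01-10"), by = "1 day")

# Time values that fall inside [window_start, ref_time_value].
in_window <- time_values[time_values >= window_start &
                           time_values <= ref_time_value]
```

Here window_start is January 1, and the window spans at most 8 daily time values (January 1 through January 8). In practice a snapshot will often contain fewer, because the most recent time values may not have been reported yet as of the reference time.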

.versions

Reference time values / versions for sliding computations; each element of this vector serves both as the anchor point for the time_value window for the computation and as the version as of which we fetch data in this window (via epix_as_of()). If missing, then this will be set to a regularly-spaced sequence of values covering the range of versions in the DT plus the versions_end; the spacing of values will be guessed (using the GCD of the skips between values).
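The idea behind the default can be sketched in base R: take the distinct versions plus versions_end, compute the gaps between them, and use the greatest common divisor of those gaps as the step. This illustrates the description above and is not the package's internal code; gcd2 and the toy dates are hypothetical.

```r
# Hypothetical helper: greatest common divisor of two non-negative integers.
gcd2 <- function(a, b) if (b == 0) a else gcd2(b, a %% b)

# Toy stand-ins for the distinct versions in the DT and for versions_end.
observed_versions <- as.Date(c("2020-06-01", "2020-06-03", "2020-06-09"))
versions_end <- as.Date("2020-06-15")

all_versions <- sort(unique(c(observed_versions, versions_end)))

# Gaps (in days) between consecutive versions: 2, 6, 6.
skips <- as.integer(diff(all_versions))

# Guessed spacing: the GCD of the gaps.
step <- Reduce(gcd2, skips)

# Regularly spaced sequence covering the whole version range.
default_versions <- seq(min(all_versions), max(all_versions), by = step)
```

With these toy dates the guessed spacing is 2 days, so the sequence runs from 2020-06-01 to 2020-06-15 in 2-day steps and includes every observed version.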

.new_col_name

Either NULL or a string indicating the name of the new column that will contain the derived values. The default, NULL, will use the name "slide_value" unless your slide computations output data frames, in which case they will be unpacked into the constituent columns and those names used. If the resulting column name(s) overlap with the column names used for labeling the computations, which are group_vars(x) and "version", then the values for these columns must be identical to the labels we assign.

.all_versions

(Not the same as .all_rows parameter of epi_slide.) If .all_versions = TRUE, then the slide computation will be passed the version history (all version <= .version where .version is one of the requested .versions) for rows having a time_value of at least .version - .before. Otherwise, the slide computation will be passed only the most recent version for every unique time_value. Default is FALSE.

Details

A few key distinctions between the current function and epi_slide():

  1. In .f functions for epix_slide, one should not assume that the input data contain any rows with time_value matching the computation's .ref_time_value (accessible via attributes(<data>)$metadata$as_of); for typical epidemiological surveillance data, observations pertaining to a particular time period (time_value) are first reported as_of some instant after that time period has ended.

  2. The input class and columns are similar but different: epix_slide (with the default .all_versions=FALSE) keeps all columns and the epi_df-ness of the first argument to each computation; epi_slide only provides the grouping variables in the second input, and will convert the first input into a regular tibble if the grouping variables include the essential geo_value column. (With .all_versions=TRUE, epix_slide will provide an epi_archive rather than an epi_df to each computation.)

  3. The output class and columns are similar but different: epix_slide() returns a tibble containing only the grouping variables, version, and the new column(s) from the slide computations, whereas epi_slide() returns an epi_df with all original variables plus the new columns from the slide computations. (Both will mirror the grouping or ungroupedness of their input, with one exception: epi_archives can have trivial (zero-variable) groupings, but these will be dropped in epix_slide results as they are not supported by tibbles.)

  4. There are no size stability checks or element/row recycling to maintain size stability in epix_slide, unlike in epi_slide. (epix_slide is roughly analogous to dplyr::group_modify, while epi_slide is roughly analogous to dplyr::mutate followed by dplyr::arrange.) This is detailed in the "advanced" vignette.

  5. .all_rows is not supported in epix_slide; since the slide computations are allowed more flexibility in their outputs than in epi_slide, we can't guess a good representation for missing computations for excluded group-.ref_time_value pairs.

  6. The .versions default for epix_slide is based on making an evenly-spaced sequence out of the versions in the DT plus the versions_end, rather than the time_values.

Apart from the above distinctions, the interfaces between epix_slide() and epi_slide() are the same.

Furthermore, the current function can be considerably slower than epi_slide(), for two reasons: (1) it must repeatedly fetch properly-versioned snapshots from the data archive (via epix_as_of()), and (2) it performs a "manual" sliding of sorts, and does not benefit from the highly efficient slider package. For this reason, it should not be used as a general replacement for epi_slide(), but only when version-aware sliding is necessary (as is its purpose).
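To make the two costs above concrete, here is a rough sketch of what a version-aware slide amounts to when written by hand: one epix_as_of() call per requested version, followed by an explicit window filter and a per-group summary. It is an approximation for illustration, not the package's implementation, and it assumes epiprocess and dplyr are attached and that archive_cases_dv_subset is available; the variable names are made up.

```r
library(epiprocess)
library(dplyr)

versions <- seq(as.Date("2020-06-02"), as.Date("2020-06-05"), by = "1 day")
before <- 2

# One snapshot fetch and one windowed summary per requested version.
manual <- lapply(versions, function(v) {
  # Data as it would have looked as of version v.
  snapshot <- epix_as_of(archive_cases_dv_subset, v)
  snapshot %>%
    as_tibble() %>%
    # Keep only time values inside the window ending at v.
    filter(time_value >= v - before) %>%
    group_by(geo_value) %>%
    summarize(slide_value = mean(case_rate_7d_av), .groups = "drop") %>%
    # Label each result with the version it was computed for.
    mutate(version = v)
}) %>%
  bind_rows()
```

Each iteration rebuilds a snapshot from the archive, which is why the cost grows with the number of requested versions.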

Value

A tibble whose columns are: the grouping variables; version, containing the reference time values for the slide computation; and a column named according to the .new_col_name argument, containing the slide values.

Examples

library(dplyr)

# Reference time points for which we want to compute slide values:
versions <- seq(as.Date("2020-06-02"),
  as.Date("2020-06-15"),
  by = "1 day"
)

# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    .f = ~ mean(.x$case_rate_7d_av),
    .before = 2,
    .versions = versions,
    .new_col_name = "case_rate_7d_av_recent_av"
  ) %>%
  ungroup()
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example).  In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#                                                discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == .ref_time_value` as of `.ref_time_value`.
# The example below shows this type of behavior in more detail.

# Examining characteristics of the data passed to each computation with
# `.all_versions=FALSE`.
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = FALSE,
    .versions = versions
  ) %>%
  ungroup() %>%
  arrange(geo_value, version)

# --- Advanced: ---

# `epix_slide` with `.all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `.all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:

archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = TRUE,
    .versions = versions
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)


cmu-delphi/epiprocess documentation built on Oct. 29, 2024, 5:37 p.m.