View source: R/methods-epi_archive.R
epix_slide | R Documentation |
Slide a function over variables in an epi_archive or grouped_epi_archive
Slides a given function over variables in an epi_archive object. This
behaves similarly to epi_slide(), with the key exception that it is
version-aware: the sliding computation at any given reference time t is
performed on data that would have been available as of t. This function
is intended for use in accurate backtesting of models; see
vignette("backtesting", package="epipredict") for a walkthrough.
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
## S3 method for class 'epi_archive'
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
## S3 method for class 'grouped_epi_archive'
epix_slide(
.x,
.f,
...,
.before = Inf,
.versions = NULL,
.new_col_name = NULL,
.all_versions = FALSE
)
.x | An […] |
.f | Function, formula, or missing; together with […] If a formula, […] |
... | Additional arguments to pass to the function or formula specified via […] |
.before | How many time values before the […] |
.versions | Reference time values / versions for sliding computations; each element of this vector serves both as the anchor point for the […] |
.new_col_name | Either […] |
.all_versions | (Not the same as […] |
A few key distinctions between the current function and epi_slide():
In .f functions for epix_slide, one should not assume that the input
data contain any rows with time_value matching the computation's
.ref_time_value (accessible via attributes(<data>)$metadata$as_of); for
typical epidemiological surveillance data, observations pertaining to a
particular time period (time_value) are first reported as_of some
instant after that time period has ended.
The input class and columns are similar but different: epix_slide
(with the default .all_versions=FALSE) keeps all columns and the
epi_df-ness of the first argument to each computation; epi_slide only
provides the grouping variables in the second input, and will convert the
first input into a regular tibble if the grouping variables include the
essential geo_value column. (With .all_versions=TRUE, epix_slide will
provide an epi_archive rather than an epi_df to each computation.)
The output class and columns are similar but different: epix_slide()
returns a tibble containing only the grouping variables, time_value, and
the new column(s) from the slide computations, whereas epi_slide()
returns an epi_df with all original variables plus the new columns from
the slide computations. (Both will mirror the grouping or ungroupedness of
their input, with one exception: epi_archives can have trivial
(zero-variable) groupings, but these will be dropped in epix_slide
results as they are not supported by tibbles.)
There are no size stability checks or element/row recycling to maintain
size stability in epix_slide, unlike in epi_slide. (epix_slide is
roughly analogous to dplyr::group_modify, while epi_slide is roughly
analogous to dplyr::mutate followed by dplyr::arrange.) This is detailed
in the "advanced" vignette.
.all_rows is not supported in epix_slide; since the slide
computations are allowed more flexibility in their outputs than in
epi_slide, we can't guess a good representation for missing computations
for excluded group-.ref_time_value pairs.
The .versions default for epix_slide is based on making an
evenly-spaced sequence out of the versions in the DT plus the
versions_end, rather than the time_values.
Apart from the above distinctions, the interfaces between epix_slide() and
epi_slide() are the same.
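The group_modify/mutate analogy in the list above can be illustrated with plain dplyr on a toy tibble. This is only a sketch of the two output shapes, not of either slide function; the tibble `toy` and its column `x` are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not part of any package.
toy <- tibble(
  geo_value = c("ca", "ca", "ca", "ny", "ny"),
  time_value = as.Date("2020-06-01") + c(0, 1, 2, 0, 1),
  x = c(1, 2, 3, 10, 20)
)

# group_modify(): each group's computation may return any number of rows and
# columns; only the grouping variables are carried over automatically.
modified <- toy %>%
  group_by(geo_value) %>%
  group_modify(~ tibble(mean_x = mean(.x$x), n = nrow(.x))) %>%
  ungroup()

# mutate(): each new column must have one value per input row (or length 1,
# which is recycled), and all original columns are kept.
mutated <- toy %>%
  group_by(geo_value) %>%
  mutate(mean_x = mean(x)) %>%
  ungroup()
```

Here `modified` has one row per group with only `geo_value`, `mean_x`, and `n`, while `mutated` keeps all five input rows and adds `mean_x` alongside the original columns.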
Furthermore, the current function can be considerably slower than
epi_slide(), for two reasons: (1) it must repeatedly fetch
properly-versioned snapshots from the data archive (via epix_as_of()),
and (2) it performs a "manual" sliding of sorts, and does not benefit from
the highly efficient slider package. For this reason, it should never be
used in place of epi_slide(), and only used when version-aware sliding is
necessary (as is its purpose).
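To make the two costs concrete, here is a conceptual base-R sketch of version-aware sliding on a toy data frame. It is not the package's implementation: `toy_archive`, `toy_as_of()`, and `toy_slide_mean()` are hypothetical names introduced here, and the window is assumed to cover the `before` days up to and including each reference version.

```r
# Hypothetical toy archive: one row per (geo_value, time_value, version)
# update. The value for 2020-06-02 is revised from 20 to 26 on 2020-06-04.
toy_archive <- data.frame(
  geo_value = "ca",
  time_value = as.Date(c("2020-06-01", "2020-06-02", "2020-06-02", "2020-06-03")),
  version = as.Date(c("2020-06-02", "2020-06-03", "2020-06-04", "2020-06-04")),
  value = c(10, 20, 26, 30)
)

# Hypothetical helper: the snapshot of `archive` as it stood on `version`.
toy_as_of <- function(archive, version) {
  known <- archive[archive$version <= version, , drop = FALSE]
  known <- known[order(known$geo_value, known$time_value, known$version), ,
    drop = FALSE
  ]
  # Keep only the most recent update for each (geo_value, time_value) pair.
  keys <- paste(known$geo_value, known$time_value)
  known[!duplicated(keys, fromLast = TRUE), , drop = FALSE]
}

# Hypothetical helper: for each reference version, rebuild the snapshot, keep
# the window of `before` days ending at that version, and take the mean.
toy_slide_mean <- function(archive, versions, before) {
  rows <- lapply(seq_along(versions), function(i) {
    v <- versions[i]
    snap <- toy_as_of(archive, v)
    in_window <- snap$time_value >= v - before & snap$time_value <= v
    window <- snap[in_window, , drop = FALSE]
    if (nrow(window) == 0L) {
      return(NULL) # nothing was available yet for this reference version
    }
    data.frame(version = v, slide_value = mean(window$value))
  })
  do.call(rbind, Filter(Negate(is.null), rows))
}

result <- toy_slide_mean(
  toy_archive,
  versions = as.Date("2020-06-02") + 0:2,
  before = 2
)
```

Each reference version triggers a fresh snapshot and a fresh window computation, which is the repeated work described above. In this toy data no snapshot ever contains a row with `time_value` equal to the reference version, and the 2020-06-04 result uses the revised value 26 where the 2020-06-03 result used 20.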
A tibble whose columns are: the grouping variables, time_value,
containing the reference time values for the slide computation, and a
column named according to the .new_col_name argument, containing the slide
values.
library(dplyr)

# Reference time points for which we want to compute slide values:
versions <- seq(as.Date("2020-06-02"),
  as.Date("2020-06-15"),
  by = "1 day"
)

# A simple (but not very useful) example (see the archive vignette for a more
# realistic one):
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    .f = ~ mean(.x$case_rate_7d_av),
    .before = 2,
    .versions = versions,
    .new_col_name = "case_rate_7d_av_recent_av"
  ) %>%
  ungroup()
# We requested time windows that started 2 days before the corresponding time
# values. The actual number of `time_value`s in each computation depends on
# the reporting latency of the signal and `time_value` range covered by the
# archive (2020-06-01 -- 2021-11-30 in this example). In this case, we have
# * 0 `time_value`s, for ref time 2020-06-01 --> the result is automatically
#   discarded
# * 1 `time_value`, for ref time 2020-06-02
# * 2 `time_value`s, for the rest of the results
# * never the 3 `time_value`s we would get from `epi_slide`, since, because
#   of data latency, we'll never have an observation
#   `time_value == .ref_time_value` as of `.ref_time_value`.
# The example below shows this type of behavior in more detail.

# Examining characteristics of the data passed to each computation with
# `.all_versions=FALSE`.
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        time_range = if (nrow(x) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$time_value), max(x$time_value))
        },
        n = nrow(x),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = FALSE,
    .versions = versions
  ) %>%
  ungroup() %>%
  arrange(geo_value, version)

# --- Advanced: ---

# `epix_slide` with `.all_versions=FALSE` (the default) applies a
# version-unaware computation to several versions of the data. We can also
# use `.all_versions=TRUE` to apply a version-*aware* computation to several
# versions of the data, again looking at characteristics of the data passed
# to each computation. In this case, each computation should expect an
# `epi_archive` containing the relevant version data:
archive_cases_dv_subset %>%
  group_by(geo_value) %>%
  epix_slide(
    function(x, gk, rtv) {
      tibble(
        versions_start = if (nrow(x$DT) == 0L) {
          "NA (0 rows)"
        } else {
          toString(min(x$DT$version))
        },
        versions_end = x$versions_end,
        time_range = if (nrow(x$DT) == 0L) {
          "0 `time_value`s"
        } else {
          sprintf("%s -- %s", min(x$DT$time_value), max(x$DT$time_value))
        },
        n = nrow(x$DT),
        class1 = class(x)[[1L]]
      )
    },
    .before = 5, .all_versions = TRUE,
    .versions = versions
  ) %>%
  ungroup() %>%
  # Focus on one geo_value so we can better see the columns above:
  filter(geo_value == "ca") %>%
  select(-geo_value)