source("_common.R")
The epi_archive
data structure provided by epiprocess
provides convenient
ways to work with data sets that are subject to revision (a common occurrence in
the public health space, as situational awareness improves). In comparison to an
epi_df
object, which can viewed as a data snapshot at a point in time, an
epi_archive
object stores the full version history of a data set. Paying
attention to data revisions can be important for all sorts of downstream data
analysis and modeling tasks.
In this vignette we will:
epi_archive
object from a data frame,epi_df
form,epi_archive
objects together,epi_archive
formatWe will work with a signal on the percentage of doctor's visits with CLI (COVID-like illness) computed from medical insurance claims, available through the COVIDcast API. This signal is subject to very heavy and regular revision; you can read more about it on its API documentation page.
The data is included in this package (via the epidatasets
package) and can be loaded with:
library(epiprocess) library(data.table) library(dplyr) library(purrr) library(ggplot2) # This fetches the raw data backing the archive_cases_dv_subset object. dv <- archive_cases_dv_subset$DT %>% as_tibble()
The data can also be fetched from the Delphi Epidata API with the following query:
library(epidatr) dv <- pub_covidcast( source = "doctor-visits", signals = "smoothed_adj_cli", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx", time_values = epirange(20200601, 20211201), issues = epirange(20200601, 20211201) ) %>% rename(version = issue, percent_cli = value)
An epi_archive()
object can be constructed from a data frame, data table, or
tibble, provided that it has (at least) the following columns:
geo_value
: the geographic value associated with each row of measurements.time_value
: the time value associated with each row of measurements.version
: the time value specifying the version for each row of measurements.
For example, if in a given row the version
is January 15, 2022 and
time_value
is January 14, 2022, then this row contains the measurements of
the data for January 14, 2022 that were available one day later.As we can see from the above, the data frame returned by
epidatr::pub_covidcast()
has the columns required for the epi_archive
format, with issue
playing the role of version
. We can now use
as_epi_archive()
to bring it into epi_archive
format.
dv_archive <- dv %>% select(geo_value, time_value, version, percent_cli) %>% as_epi_archive(compactify = TRUE) dv_archive
See the epi_archive()
documentation for more information about its internal
structure.
epi_df
formA key method of an epi_archive
class is epix_as_of()
, which generates a
snapshot of the archive in epi_df
format. This represents the most up-to-date
values of the signal variables as of a given version.
edf <- epix_as_of(dv_archive, as.Date("2021-06-01")) print(edf) print(max(edf$time_value))
Note that that the max time value in the epi_df
object is May 29, 2021, even
though the specified version date was June 1, 2021 (note that the as_of
field
printed above helps us see the date of the snapshot). From this we can infer
that the doctor's visits signal was 2 days latent on June 1.
Now, let's investigate how much this data was revised. We will plot the most
up-to-date version of the time series in black (edf_latest
below) and then
overlay several revisions from the archive, spaced one month apart, as colored
lines (snapshots
below). We will also mark the version dates with dotted
vertical lines.
edf_latest <- epix_as_of(dv_archive, dv_archive$versions_end) max_version <- max(dv_archive$DT$version) versions <- seq(as.Date("2020-06-01"), max_version - 1, by = "1 month") monthly_snapshots <- map(versions, function(v) { epix_as_of(dv_archive, v) %>% mutate(version = v) }) %>% bind_rows( edf_latest %>% mutate(version = max_version) ) %>% mutate(latest = version == max_version) ggplot( monthly_snapshots %>% filter(!latest), aes(x = time_value, y = percent_cli) ) + geom_line(aes(color = factor(version)), na.rm = TRUE) + geom_vline(aes(color = factor(version), xintercept = version), lty = 2) + facet_wrap(~geo_value, scales = "free_y", ncol = 1) + scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "% of doctor's visits with CLI") + theme(legend.position = "none") + geom_line( data = monthly_snapshots %>% filter(latest), aes(x = time_value, y = percent_cli), inherit.aes = FALSE, color = "black", na.rm = TRUE )
We can see some interesting and highly nontrivial revision behavior: at some points in time the provisional data snapshots grossly underestimate the latest curve (look in particular at Florida close to the end of 2021), and at others they overestimate it (both states towards the beginning of 2021), though not quite as dramatically. Modeling the revision process, which is often called backfill modeling, is an important statistical problem in it of itself.
There are many ways to examine how signals change across revisions. We provide
the convenient analysis wrapper revision_summary()
, which computes simple
summary statistics for each key (by default, (geo_value,time_value)
pairs). In
addition to the per key summary, it also returns an overall summary. Here is an
a sample of the output:
revision_details <- revision_summary(dv_archive, print_inform = TRUE)
We can see from the output that, as mentioned above, this data set has a lot of revisions: there are no keys that have no revision at all and 34% of the keys change by 10% or more when revised.
To do more detailed analysis than is possible with the above printing, we can
inspect the returned revision_details
tibble. Here we collect a number of
statistics for each state:
revision_details %>% group_by(geo_value) %>% summarise( n_rev = mean(n_revisions), min_lag = min(min_lag), max_lag = max(max_lag), spread = mean(spread), rel_spread = mean(rel_spread), time_near_latest = mean(time_near_latest) )
Most of the states have similar stats on most of these features, except for the
time_near_latest
stat, which is the amount of time that it takes for the
revisions to converge to within 20% of the final value and stay there. It is the
highest for CA and the lowest for TX.
epi_archive
objectsA common operation on datasets is merging (or joining) them together, such as
when we grab data from multiple sources for joint analysis or modeling. Merging
two epi_archive
objects together is a bit tricky however, since we need to handle
datasets that might get revised at different times. The function epix_merge()
is made to smooth this out. Below we merge the working epi_archive
of versioned
percentage CLI from outpatient visits to another one of versioned COVID-19 case
reporting data, which we fetch the from the COVIDcast
API, on the
rate scale (counts per 100,000 people in the population).
library(epidatr) y <- pub_covidcast( source = "jhu-csse", signals = "confirmed_7dav_incidence_prop", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx", time_values = epirange(20200601, 20211201), issues = epirange(20200601, 20211201) ) %>% select(geo_value, time_value, version = issue, case_rate_7d_av = value) %>% as_epi_archive(compactify = TRUE) dv_cases_archive <- epix_merge(dv_archive, y, sync = "locf", compactify = TRUE) print(dv_cases_archive)
dv_cases_archive <- archive_cases_dv_subset print(dv_cases_archive)
Note that we have used the sync = "locf"
argument to specify that we want to
synchronize the two datasets on their disjoint revisions by using the last
observation carried forward (LOCF). For more information, see epix_merge()
.
One of the most common use cases of epiprocess::epi_archive()
object is for
accurate model backtesting. See vignette("backtesting", package="epipredict")
for an in-depth demo, using a pre-built forecaster in that package.
This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
The percent_cli
data is a modified part of the COVIDcast Epidata API Doctor
Visits
data.
This dataset is licensed under the terms of the Creative Commons Attribution
4.0 International license.
Copyright Delphi Research Group at Carnegie Mellon University 2020.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.