source("_common.R")
The epi_df
data structure provided by epiprocess
provides convenient ways to
perform common processing tasks. In this vignette, we will:
epi_df
from a data frameepi_slide()
sum_groups_epi_df()
complete.epi_df()
and {tsibble}
epi_df
formatAs in vignette("epiprocess")
, we will fetch daily reported COVID-19 cases from
CA, FL, NY, and TX (note: here we're using new, not cumulative cases) using the
epidatr
package, and then convert
this to epi_df
format.
library(epiprocess) library(dplyr)
The data is included in this package (via the epidatasets
package) and can be loaded with:
edf <- cases_deaths_subset %>% select(geo_value, time_value, cases) %>% arrange(geo_value, time_value)
The data can also be fetched from the Delphi Epidata API with the following query:
library(epidatr) d <- as.Date("2024-03-20") edf <- pub_covidcast( source = "jhu-csse", signals = "confirmed_incidence_num", geo_type = "state", time_type = "day", geo_values = "ca,fl,ny,tx,ga,pa", time_values = epirange(20200301, 20211231), as_of = d ) %>% select(geo_value, time_value, cases = value) %>% arrange(geo_value, time_value) %>% as_epi_df(as_of = d)
The data has 2,684 rows and 3 columns.
epi_slide
A very common operation in time series processing is aggregating the values of
the time series by applying some function on a rolling time window of data
points. The key tool that allows this is epi_slide()
. The function always
first makes sure to group the data by the grouping variables of the epi_df
object, which includes the geo_value
and possibly other_keys
columns. It
then applies the rolling slide computation inside each group.
The epi_slide()
function has three ways to specify the computation to be
performed:
Usually, the most convenient way to setup a computation in epi_slide()
is to
pass in an expression for tidy evaluation. In this case, we can simply define
the name of the new column directly as part of the expression, setting it equal
to a computation in which we can access any columns of .x
by name, just as we
would in a call to, say, dplyr::mutate()
. For example:
slide_output <- edf %>% epi_slide(cases_7sd = sd(cases, na.rm = TRUE), .window_size = 7)
As a simple sanity check, we visualize the 7-day trailing averages computed on top of the original counts:
library(ggplot2) ggplot(slide_output, aes(x = time_value)) + geom_col(aes(y = cases, fill = geo_value), alpha = 0.5, show.legend = FALSE) + geom_line(aes(y = cases_7sd, col = geo_value), show.legend = FALSE) + facet_wrap(~geo_value, scales = "free_y") + scale_x_date(minor_breaks = "month", date_labels = "%b %y") + labs(x = "Date", y = "Reported COVID-19 cases")
As we can see from the Texas plot, the state moved to weekly reporting of COVID-19 cases in summer of 2021.
Note that without epi_slide()
, the computation is much less convenient. For
instance, a rough equivalent of the above computation would be the following,
which is easy to get wrong:
edf %>% complete(geo_value, time_value = seq.Date(min(time_value), max(time_value), by = "day")) %>% arrange_canonical() %>% group_by(geo_value) %>% mutate(cases_7sd = slider::slide_dbl(cases, .f = sd, na.rm = TRUE, .before = 7, .after = 0))
Furthermore epi_slide()
allows for selecting .ref_time_value
, which the
latter recipe does not support.
We can also pass a function to the second argument in epi_slide()
. In this
case, the passed function .f
must have the form function(x, g, t, ...)
,
where
x
is an epi_df with the same column names as the input epi_dfg
is a one-row tibble containing the values of the grouping variables
for the associated group, for instance g$geo_value
t
is the ref_time_value for the current window...
are additional argumentsThe same computation as above can be done with a function:
edf %>% epi_slide(.f = function(x, g, t) sd(x$cases, na.rm = TRUE), .window_size = 7)
epi_slide()
with a formulaThe same computation as above can be done with a formula, where all references
to the columns must be made with the prefix .x$...
, for instance:
edf %>% epi_slide(~ sd(.x$cases, na.rm = TRUE), .window_size = 7)
Note that the name of the column defaults to slide_value
in the unnamed
formula or function case. This can be adjusted with .new_col_name
.
If your formula (or function) returns a data.frame, then the columns of the
data.frame will be unpacked into the resulting epi_df
(in the sense of
tidyr::unpack()
). For example, the following computes the 7-day trailing
average of daily cases as well as the the 7-day trailing standard deviation of
daily cases:
edf %>% epi_slide( ~ data.frame(cases_mean = mean(.x$cases, na.rm = TRUE), cases_sd = sd(.x$cases, na.rm = TRUE)), .window_size = 7 )
For the two most common sliding operations, we offer two optimized versions:
epi_slide_mean()
and epi_slide_sum()
. These are much faster than
epi_slide()
, so we recommend using them when you are only interested in the
mean or sum of a column. The following computes the 7-day trailing mean of daily
cases:
edf %>% group_by(geo_value) %>% epi_slide_mean("cases", .window_size = 7, na.rm = TRUE) edf %>% group_by(geo_value) %>% epi_slide_sum("cases", .window_size = 7, na.rm = TRUE)
The natural next step is to use the sliding window to forecast future values.
However to do this correctly, we should make sure that our data is historically
accurate. The data structure we use for that is the epi_archive
and the
analogous slide function is epix_slide()
. To read further along this train of
thought, see vignette("epi_archive")
.
epi_df
and aggregating groups with sum_groups_epi_df
An epi_df
object can have more key columns than just geo_value
and
time_value
. For example, if we have demographic attributes like age group, we
can add this as a key column. We can then aggregate the data by these key
columns using sum_groups_epi_df()
. Let's use influenza hospitalization rate
data from the CDC system FluSurv as an example. We can get it from the Delphi
Epidata API
library(epidatr) flu_data <- pub_flusurv( locations = "ca", epiweeks = epirange(201801, 202001), ) %>% select(location, epiweek, issue, rate_age_0, rate_age_1, rate_age_2, rate_age_3, rate_age_4) %>% tidyr::pivot_longer(cols = starts_with("rate_age_"), names_to = "age_group", values_to = "rate") flu_data
We can now convert this data to an epi_df
object and set the age_group
column as an additional group key:
flu_data <- flu_data %>% as_epi_df(other_keys = "age_group", as_of = as.Date("2024-03-20")) flu_data
Note that the epi_df
object now has an additional key column age_group
. This
means that there should only be one row for each combination of geo_value
,
time_value
, and age_group
in the dataset (this is enforced at construction
time).
Now we can aggregate the data by age_group
, if we want to compute the total:
group_cols <- key_colnames(exclude = "age_group") flu_data %>% sum_groups_epi_df("rate", group_cols = group_cols)
complete.epi_df
Sometimes you may have missing data in your time series. This can be due to
actual missing data, or it can be due to the fact that the data is only reported
on certain days. In the latter case, it is often useful to fill in the missing
data with explicit zeros. This can be done with the complete.epi_df()
function.
First, let's create a data set with some missing data. We will reuse the dataset
edf
from above, but modify it slightly.
edf_missing <- edf %>% filter(geo_value %in% c("ca", "tx")) %>% group_by(geo_value) %>% slice(1:3, 5:6) edf_missing %>% print(n = 10)
Now let's fill in the missing data with explicit zeros:
edf_missing %>% complete( time_value = seq.Date(min(time_value), max(time_value), by = "day"), fill = list(cases = 0) ) %>% print(n = 12)
tsibble
We can also use the tsibble
package to detect and fill time gaps. We'll work
with county-level reported COVID-19 cases in MA and VT.
The data is included in this package (via the epidatasets
package) and can be loaded with:
library(epiprocess) library(dplyr) library(readr) x <- covid_incidence_county_subset
The data can also be fetched from the Delphi Epidata API with the following query:
library(epidatr) d <- as.Date("2024-03-20") # Get mapping between FIPS codes and county&state names: y <- read_csv("https://github.com/cmu-delphi/covidcast/raw/c89e4d295550ba1540d64d2cc991badf63ad04e5/Python-packages/covidcast-py/covidcast/geo_mappings/county_census.csv", # nolint: line_length_linter col_types = c( FIPS = col_character(), CTYNAME = col_character(), STNAME = col_character() ) ) %>% filter(STNAME %in% c("Massachusetts", "Vermont"), STNAME != CTYNAME) %>% select(geo_value = FIPS, county_name = CTYNAME, state_name = STNAME) # Fetch only counties from Massachusetts and Vermont, then append names columns as well x <- pub_covidcast( source = "jhu-csse", signals = "confirmed_incidence_num", geo_type = "county", time_type = "day", geo_values = paste(y$geo_value, collapse = ","), time_values = epirange(20200601, 20211231), as_of = d ) %>% select(geo_value, time_value, cases = value) %>% inner_join(y, by = "geo_value", relationship = "many-to-one", unmatched = c("error", "drop")) %>% as_epi_df(as_of = d)
The data contains 16,212 rows and 5 columns.
tsibble
formatFor manipulating and wrangling time series data, the
tsibble
already provides a host of
useful tools. A tsibble object (formerly, of class tbl_ts
) is basically a
tibble (data frame) but with two specially-marked columns: an index column
representing the time variable (defining an order from past to present), and a
key column identifying a unique observational unit for each time point. In
fact, the key can be made up of any number of columns, not just a single one.
In an epi_df
object, the index variable is time_value
, and the key variable
is typically geo_value
(though this need not always be the case: for example,
if we have an age group variable as another column, then this could serve as a
second key variable). The epiprocess
package thus provides an implementation
of as_tsibble()
for epi_df
objects, which sets these variables according to
these defaults.
library(tsibble) xt <- as_tsibble(x) head(xt) key(xt) index(xt) interval(xt)
We can also set the key variable(s) directly in a call to as_tsibble()
.
Similar to SQL keys, if the key does not uniquely identify each time point (that
is, the key and index together do not not uniquely identify each row), then
as_tsibble()
throws an error:
head(as_tsibble(x, key = "county_name"))
As we can see, there are duplicate county names between Massachusetts and Vermont, which caused the error.
head(duplicates(x, key = "county_name"))
Keying by both county name and state name, however, does work:
head(as_tsibble(x, key = c("county_name", "state_name")))
One of the major advantages of the tsibble
package is its ability to handle
implicit gaps in time series data. In other words, it can infer what time
scale we're interested in (say, daily data), and detect apparent gaps (say, when
values are reported on January 1 and 3 but not January 2). We can subsequently
use functionality to make these missing entries explicit, which will generally
help avoid bugs in further downstream data processing tasks.
Let's first remove certain dates from our data set to create gaps:
state_naming <- read_csv("https://github.com/cmu-delphi/covidcast/raw/c89e4d295550ba1540d64d2cc991badf63ad04e5/Python-packages/covidcast-py/covidcast/geo_mappings/state_census.csv", # nolint: line_length_linter col_types = c(NAME = col_character(), ABBR = col_character()) ) %>% transmute(state_name = NAME, abbr = tolower(ABBR)) %>% as_tibble() # First make geo value more readable for tables, plots, etc. x <- x %>% inner_join(state_naming, by = "state_name", relationship = "many-to-one", unmatched = c("error", "drop")) %>% mutate(geo_value = paste(substr(county_name, 1, nchar(county_name) - 7), state_name, sep = ", ")) %>% select(geo_value, time_value, cases) xt <- as_tsibble(x) %>% filter(cases >= 3)
The functions has_gaps()
, scan_gaps()
, count_gaps()
in the tsibble
package each provide useful summaries, in slightly different formats.
head(has_gaps(xt)) head(scan_gaps(xt)) head(count_gaps(xt))
We can also visualize the patterns of missingness:
library(ggplot2) ggplot( count_gaps(xt), aes( x = reorder(geo_value, desc(geo_value)), color = geo_value ) ) + geom_linerange(aes(ymin = .from, ymax = .to)) + geom_point(aes(y = .from)) + geom_point(aes(y = .to)) + coord_flip() + labs(x = "County", y = "Date") + theme(legend.position = "none")
Using the fill_gaps()
function from tsibble
, we can replace all gaps by an
explicit value. The default is NA
, but in the current case, where missingness
is not at random but rather represents a small value that was censored (only a
hypothetical with COVID-19 reports, but certainly a real phenomenon that occurs
in other signals), it is better to replace it by zero, which is what we do here.
(Other approaches, such as LOCF: last observation carried forward in time, could
be accomplished by first filling with NA
values and then following up with a
second call to tidyr::fill()
.)
fill_gaps(xt, cases = 0) %>% head()
Note that the time series for Addison, VT only starts on August 27, 2020, even
though the original (uncensored) data set itself was drawn from a period that
went back to June 6, 2020. By setting .full = TRUE
, we can at zero-fill over
the entire span of the observed (censored) data.
xt_filled <- fill_gaps(xt, cases = 0, .full = TRUE) head(xt_filled)
Explicit imputation for missingness (zero-filling in our case) can be important
for protecting against bugs in all sorts of downstream tasks. For example, even
something as simple as a 7-day trailing average is complicated by missingness.
The function epi_slide()
looks for all rows within a window of 7 days anchored
on the right at the reference time point (when .window_size = 7
).
But when some days in a given week are missing because they were censored
because they had small case counts, taking an average of the observed case
counts can be misleading and is unintentionally biased upwards. Meanwhile,
running epi_slide()
on the zero-filled data brings these trailing averages
(appropriately) downwards, as we can see inspecting Plymouth, MA around July 1,
2021.
xt %>% as_epi_df(as_of = as.Date("2024-03-20")) %>% group_by(geo_value) %>% epi_slide(cases_7dav = mean(cases), .window_size = 7) %>% ungroup() %>% filter( geo_value == "Plymouth, MA", abs(time_value - as.Date("2021-07-01")) <= 3 ) %>% print(n = 7) xt_filled %>% as_epi_df(as_of = as.Date("2024-03-20")) %>% group_by(geo_value) %>% epi_slide(cases_7dav = mean(cases), .window_size = 7) %>% ungroup() %>% filter( geo_value == "Plymouth, MA", abs(time_value - as.Date("2021-07-01")) <= 3 ) %>% print(n = 7)
We do not yet provide tools for geographic aggregation in epiprocess
. However,
we have some Python geocoding utilities available. Reach out to us if this is
functionality you would like to see us add to epiprocess
.
The percent_cli
data is a modified part of the COVIDcast Epidata API Doctor
Visits
data.
This dataset is licensed under the terms of the Creative Commons Attribution
4.0 International license.
Copyright Delphi Research Group at Carnegie Mellon University 2020.
This document contains a dataset that is a modified part of the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University as republished in the COVIDcast Epidata API. This data set is licensed under the terms of the Creative Commons Attribution 4.0 International license by the Johns Hopkins University on behalf of its Center for Systems Science in Engineering. Copyright Johns Hopkins University 2020.
From the COVIDcast Epidata API: These signals are taken directly from the JHU CSSE COVID-19 GitHub repository without changes.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.