This package provides tools for building COVID-19 models, with a focus on
forecasting and hotspot prediction models. It is designed to be used in
conjunction with the covidcast
and evalcast
packages.
This package is not on CRAN yet, so it can be installed using the
devtools
package:
devtools::install_github("cmu-delphi/covidcast", ref = "main", subdir = "R-packages/modeltools")
Building the vignettes, such as this getting started guide, takes a significant amount of time. They are not included in the package by default. If you want to include vignettes, then use this modified command:
devtools::install_github("cmu-delphi/covidcast", ref = "main", subdir = "R-packages/modeltools", build_vignettes = TRUE, dependencies = TRUE)
For this getting started vignette, we'll fetch COVID-19 case rates, both raw and smoothed via 7-day trailing averages, from the USAFacts data source. We'll look at data at the state level for four states, between the start of June and mid November.
library(covidcast) start_day <- "2020-06-01" end_day <- "2020-11-15" geo_values <- c("ca", "fl", "ny", "tx") signals <- suppressMessages( covidcast_signals(data_source = "usa-facts", signal = c("confirmed_incidence_prop", "confirmed_7dav_incidence_prop"), start_day = start_day, end_day = end_day, geo_type = "state", geo_values = geo_values)) summary(signals[[1]]) summary(signals[[2]])
One of the most basic tools in the modeltools
package is slide_by_geo()
,
which is based on the family of functions provided by the slider
package. In
modeltools
, to "slide" means to apply the function or formula over a trailing
window of n
days of data, grouped by geo_value
. Many other functions in the
package, such as pct_change()
and estimate_deriv()
, use slide_by_geo()
as
their workhorse.
For example, to apply a 7-day trailing average to the raw values in the first
covidcast_signal
data frame, we can specify a formula via the slide_fun
argument of slide_by_geo()
:
library(modeltools) library(dplyr) slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7) %>% select(geo_value, time_value, value, slide_value) %>% arrange(geo_value) %>% head(10)
The formula specified via slide_fun
has access to all columns present in the
original covidcast_signal
data frame, and must refer to them with the prefix
.x$
. Here the function Mean()
is a simple wrapper around mean()
that omits
NA
values by default (provided by the modeltools
package).
Notice that slide_by_geo()
returns a data frame with a new column appended
that contains the results (from sliding the formula), named "slide_value" by
default. We can instead specify a name up front using the col_name
argument:
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7, col_name = "7dav") %>% select(geo_value, time_value, value, `7dav`) %>% arrange(geo_value) %>% head(10)
As a simple sanity check, we compare the 7-day trailing average computed via
slide_by_geo()
to the values of the smoothed signal from the API:
slide_by_geo(signals[[1]], slide_fun = ~ Mean(.x$value), n = 7, col_name = "7dav") %>% full_join(signals[[2]], by = c("geo_value", "time_value")) %>% mutate(difference = `7dav` - value.y) %>% filter(time_value >= "2020-06-07") %>% summarize(max(abs(difference)))
Note that this check would fail before June 7 because the API access to data
before June 1, whereas slide_by_geo()
does not (when a trailing window of n
days isn't available, slide_by_geo()
just applies the function or formula to
whatever data is available).
We can also pass a function for the slide_fun
argument in slide_by_geo()
. In
this case, the passed function must have the following argument structure: x
,
a data frame the same column names as the original data frame; followed by any
number of named additional arguments; and ending with ...
, to capture general
additional arguments. Recreating the last example of a 7-day trailing average:
slide_by_geo(signals[[1]], slide_fun = function(x, ...) Mean(x$value), n = 7, col_name = "7dav") %>% select(geo_value, time_value, value, `7dav`) %>% arrange(geo_value) %>% head(10)
As a more sophisticated example, here we show how to sensorize the doctor visits signal. The sensor values are defined by the predicted values from a local (in time) regression of past case rates (response) on past doctor visits (covariate).
# Regression for doctor visits sensorization dv_regression = function(x, m = 3, ...) { n = nrow(x) if (n <= m+1) return(NA) # Take care of trivial case return(tryCatch(suppressWarnings(suppressMessages({ # Fit a regression, leaving out the last m days of data lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ]) # Form prediction for the last day of data, and return predict(lm_obj, newdata = x[n, ]) })), error = function(e) return(NA))) } # Fetch doctor visits signal and join it to case rates joined = suppressMessages( covidcast_signal(data_source = "doctor-visits", signal = "smoothed_adj_cli", start_day = start_day, end_day = end_day, geo_type = "state", geo_values = geo_values)) %>% full_join(signals[[2]], by = c("geo_value", "time_value")) # Perform sensorization for each state; use the last n = 56 days (8 weeks) of # data, minus the last m = 3 days, for the regression slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3, col_name = "sensor") %>% select(geo_value, time_value, value.x, value.y, sensor) %>% arrange(geo_value) %>% head(10)
Above, the first 4 elements are NA
because of insufficient training data (we
need at least 2 training samples for simple linear regression, and we omit the
last 3 days of data).
As a final example, we show how to use slide_by_geo()
to output more complex
objects. We amend the last example so that dv_regression()
returns both the
predicted value (sensor) and the fitted linear model object. In order to use
this with slide_by_geo()
, we need to set the col_type
argument to be "list":
dv_regression = function(x, m = 3, ...) { n = nrow(x) if (n <= m+1) return(list(lm_obj = NA, sensor = NA)) # Trivial case return(tryCatch(suppressWarnings(suppressMessages({ # Fit a regression, leaving out the last days of data n = nrow(x) lm_obj = lm(value.y ~ value.x, data = x[1:(n-m+1), ]) # Form prediction for the last day of data sensor = predict(lm_obj, newdata = x[n, ]) # Return the fitted lm object and prediction list(lm_obj = lm_obj, sensor = sensor) })), error = function(e) return(list(lm_obj = NA, sensor = NA)))) } joined <- slide_by_geo(joined, slide_fun = dv_regression, n = 56, m = 3, col_name = "sensor_obj", col_type = "list") class(joined$sensor_obj) names(joined$sensor_obj[[5]]) # The first 4 are lists filled with NAs joined$sensor_obj[[5]]$lm_obj joined$sensor_obj[[5]]$sensor
This allows for post-inspection of the sensor models, which may be helpful for
diagnostic purposes. Note the way in which we structure the return argument in
dv_regression()
in its failure cases: it is always a list with the same named
arguments. This helps keep the post-inspection code just a bit more simple (it
is already a hairy enough, to protect against extracting coefficients from NA
objects).
joined %>% rowwise() %>% mutate(sensor = sensor_obj$sensor, intercept = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA, coef(sensor_obj$lm_obj)[1]), slope = ifelse(isTRUE(is.na(sensor_obj$lm_obj)), NA, coef(sensor_obj$lm_obj)[2])) %>% select(geo_value, time_value, sensor, intercept, slope) %>% arrange(geo_value) %>% head(20)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.