In this vignette, we'll take a look at estimating derivatives of signals using
the estimate_deriv()
function. We'll again demonstrate this functionality on
state-level COVID-19 case rates, smoothed via 7-day trailing averages, from the
USAFacts data source.
library(covidcast) start_day <- "2020-06-01" end_day <- "2020-11-15" geo_values <- c("ca", "fl", "ny", "tx") case_rates <- suppressMessages( covidcast_signal(data_source = "usa-facts", signal = "confirmed_7dav_incidence_prop", start_day = start_day, end_day = end_day, geo_type = "state", geo_values = geo_values)) summary(case_rates)
The function for estimating derivatives is called estimate_deriv()
, and (aside
from a covidcast_signal
data frame) takes two primary arguments: method
,
indicating the method to use for derivative estimation; and n
indicating the
trailing sample size (number of days) to use in training the given method. Here
we use method = "lin"
, the default, which uses the slope from a simple linear
regression, and n = 14
, also the default.
library(modeltools) library(dplyr) case_rates <- estimate_deriv(case_rates, method = "lin", n = 14) case_rates %>% arrange(geo_value) %>% select(geo_value, time_value, value, deriv)
We can see that a column deriv
has been added to the output data frame, which
contains the derivative estimates. Below we visualize these estimates in tandem
with the signal itself. The red dots mark time points at which the derivative
estimate exceeds a threshold (arbitrarily chosen) of 0.25. These seem to roughly
but reasonably mark times of upswing in the underlying signal.
library(ggplot2) library(gridExtra) state = "fl" threshold = 0.25 p1 <- ggplot(case_rates %>% filter(geo_value == state), aes(x = time_value, y = value)) + geom_line() + geom_point(data = case_rates %>% filter(geo_value == state, deriv >= threshold), aes(x = time_value, y = value), color = "red") + labs(x = "Date", y = "Cases per 100,00 people") p2 <- ggplot(case_rates %>% filter(geo_value == state), aes(x = time_value, y = deriv)) + geom_line() + geom_hline(yintercept = threshold, linetype = 2) + labs(x = "Date", y = "Derivative (linear regression)") grid.arrange(p1, p2, nrow = 1)
Now we consider method = "ss"
, which uses a smoothing spline for the estimate
of the derivative. That is, at each time point, we fit a natural cubic spline to
the data from the trailing n
days, and return the derivative of the underlying
fitted spline at the current time as the estimate. Here we set n = 28
, a bit
higher sample size, and fit the spline in two ways: first, using a fixed degrees
of freedom of 8; and second, using cross-validation to choose the amount of
regularization (tuning parameter). This is accomplished by passing additional
arguments to estimate_deriv()
, which are in turn passed on to the underlying
function it uses to fit smoothing splines, stats::smooth.spline()
. Note that
we also set a custom name for the output column with the estimated derivatives,
via the col_name
argument.
case_rates <- estimate_deriv(case_rates, method = "ss", n = 28, col_name = "deriv_ss1", df = 8) case_rates <- estimate_deriv(case_rates, method = "ss", n = 28, col_name = "deriv_ss2", cv = TRUE) p1 <- ggplot(case_rates %>% filter(geo_value == state), aes(x = time_value, y = value)) + geom_line() + geom_point(data = case_rates %>% filter(geo_value == state, deriv_ss1 >= threshold), aes(x = time_value, y = value), color = "red") + geom_point(data = case_rates %>% filter(geo_value == state, deriv_ss2 >= threshold), aes(x = time_value, y = value), color = "blue", shape = 21) + labs(x = "Date", y = "Cases per 100,00 people") p2 <- ggplot(case_rates %>% filter(geo_value == state), aes(x = time_value)) + geom_line(aes(y = deriv_ss1), color = "red") + geom_line(aes(y = deriv_ss2), color = "blue") + geom_hline(yintercept = threshold, linetype = 2) + labs(x = "Date", y = "Derivative (smoothing spline)") grid.arrange(p1, p2, nrow = 1)
The estimated derivates---in red for the smoothing spline with a fixed degrees of freedom of 8, and in blue for that tuned by cross-validation---appear less smooth than those above, from linear regression. Using cross-validation offers more adaptivity to the time-varying level of smoothness, as is apparent from comparing the red and blue derivative estimates in October and November.
Lastly we consider method = tf"
, which uses trend filtering for estimating the
derivative. That is, at each time point, we fit a discrete spline of quadratic
order to the data from the trailing n
days, and return the discrete derivative
of the underlying fitted spline at the current time as the estimate. As before,
we fit the spline in two ways: first, using a fixed degrees of freedom of 8; and
second, using cross-validation to choose the amount of regularization. Since the
optimization here takes a while (it's based on computing a full solution path
for the trend filtering problem, via the genlasso::trendfilter()
function), we
only compute derivatives for Florida.
case_rates_state <- case_rates %>% filter(geo_value == state) case_rates_state <- estimate_deriv(case_rates_state, method = "tf", n = 28, col_name = "deriv_tf1", df = 8) case_rates_state <- estimate_deriv(case_rates_state, method = "tf", n = 28, col_name = "deriv_tf2", cv = TRUE) p1 <- ggplot(case_rates_state, aes(x = time_value, y = value)) + geom_line() + geom_point(data = case_rates_state %>% filter(deriv_tf1 >= threshold), aes(x = time_value, y = value), color = "red") + geom_point(data = case_rates_state %>% filter(deriv_tf2 >= threshold), aes(x = time_value, y = value), color = "blue", shape = 21) + labs(x = "Date", y = "Cases per 100,00 people") p2 <- ggplot(case_rates_state, aes(x = time_value)) + geom_line(aes(y = deriv_tf1), color = "red") + geom_line(aes(y = deriv_tf2), color = "blue") + geom_hline(yintercept = threshold, linetype = 2) + labs(x = "Date", y = "Derivative (trend filtering)") grid.arrange(p1, p2, nrow = 1)
The estimated derivates now appear a bit smoother than the last ones, from the smoothing spline methods. Again, using cross-validation offers a noticeable improvement in adapting to to the time-varying level of smoothness, as is very clear from the differences between red and blue derivative estimates in October and November.
In the call to estimate_deriv()
, we can set keep_obj = TRUE
to keep around a
second column with the fitted model objects. For example, here, we can look at
the p-values associated with the estimated slopes from lsfit()
.
case_rates <- estimate_deriv(case_rates, method = "lin", n = 14, keep_obj = TRUE) class(case_rates$deriv_obj) ls.print(case_rates$deriv_obj[[7]]) case_rates <- case_rates %>% rowwise() %>% mutate(p_value = quiet( tryCatch(ls.print(deriv_obj)$coef.table[[1]][2,"Pr(>|t|)"], error = function(e) NA))) case_rates %>% arrange(geo_value) %>% select(geo_value, time_value, value, deriv, deriv_obj, p_value)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.