knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(EpiEstim) library(ggplot2)
The EpiEstim
package has been extended to allow users to estimate the time varying reproduction number (R~t~) from temporally aggregated incidence data (Nash et al.). This approach reconstructs daily incidence from data supplied at any timescale. This vignette will take you through the different ways aggregated incidence data can be supplied, the additional parameters needed, and an example using aggregated UK COVID-19 data. Please also see the FAQs section.
Incidence data
Many diseases are not reported on a daily basis. EpiEstim can now use incidence data that has been aggregated in multiple ways, e.g.:
Daily incidence data is reconstructed from aggregated data using an expectation-maximisation (EM) algorithm. There are three stages of the EM algorithm:
The daily incidence that is reconstructed after the final iteration of the EM algorithm is then used to estimate R~t~ using the same process as the original estimate_R()
function, with sliding weekly time windows used as the default.
Aggregation windows
Aggregation windows can be specified using the parameter dt
, and can be provided in one of three ways:
dt = 7L
for weekly datadt = c(2L,2L,3L)
Serial interval distribution
The serial interval can be provided on a daily timescale (as usual), either as the mean and standard deviation (parametric distribution) or the full distribution (non-parametric distribution). See 'full_EpiEstim_vignette' for more details.
To estimate R~t~ from temporally aggregated incidence data, we simply use the estimate_R()
function with two additional parameters required, dt
and dt_out
, and some optional parameters, recon_opt
, iter
, tol
, and grid
.
estimate_R()
for aggregated dataestimate_R(incid = aggregated_incidence, dt = 7L, dt_out = 7L, recon_opt = "naive", iter = 10L, tol = 1e-6, grid = list(precision = 0.001, min = -1, max = 1), config = config, method = method)
As described above, dt
can be supplied as a single integer, a vector of repeating integers, or a full vector of integers matching the length of the incidence data.
dt_out
is the length of the sliding windows used to estimate R~t~ from the reconstructed daily incidence data, this is dt = 7L
(weekly sliding windows) by default. We recommend that dt_out
is at least equal to the length of the longest aggregation window (dt
) in the data.
recon_opt
can be one of two options: "naive"
or "match"
. This specifies how to handle the initial incidence data that cannot be reconstructed by the EM algorithm (e.g. the incidence data for the aggregation window that precedes the first aggregation window that R can be estimated for). If "naive"
is chosen, the naive disaggregation of the incidence data will be kept. If "match"
is chosen, the incidence in the preceding aggregation window will be reconstructed by assuming that the growth rate matches that of the first estimation window. This is "naive"
by default.
There are three other optional parameters that can be modified, however, we recommend that the default values are used:
iter
is the number of iterations of the EM algorithm used to reconstruct the daily incidence data. This is iter = 10L
by default.
tol
is the tolerance value used for the convergence check. The tolerance is how much the final iteration of the reconstructed daily incidence is allowed to differ from the reconstructed incidence produced in the previous iteration without returning a warning. This is tol = 1e-6
by default.
grid
is a list of "precision", "min", and "max" values to define a grid of growth rate parameters used inside the EM algorithm. The grid is used to convert reproduction number estimates for each aggregation of incidence data into growth rates, which are then used to reconstruct the daily incidence data assuming exponential growth. The grid will auto-adjust if it is not large enough, so we recommend using the default values.
The SI distibution can be specified as normal using the method
and config
parameters (see the full_EpiEstim_vignette for more details).
This example will take you through a workflow using weekly incidence data for UK COVID-19 cases. (For detailed description of the data see Nash et al.)
incid <- readRDS("./aggregated_data/UK_covid_cases.rds")
Incidence
Let us say we have a vector of weekly incidence data for COVID-19 cases.
incid
We need to specify how the data is aggregated, which in this case, is by constant weekly aggregation windows. We do this by supplying dt
with a single integer (7L).
dt <- 7L
We can take an estimate from the literature to specify a parametric SI with a mean of 6.3 days and a standard deviation of 4.2 days (Bi et al 2020).
mean_si <- 6.3 std_si <- 4.2 method <- "parametric_si" config <- make_config(list(mean_si = mean_si, std_si = std_si))
Estimate R~t~
Now that we have our aggregated incidence, our aggregation time window, and SI distribution, we can supply these to the estimate_R()
function. We do not need to specify dt_out
, iter
, tol
, or grid
because we are going to use the default values.
output <- EpiEstim::estimate_R(incid = incid, dt = dt, recon_opt = "match", method = method, config = config)
The output consists of multiple elements, including the reconstructed daily incidence data:
head(output$I)
And the R~t~ estimates (in this case, using the default weekly sliding windows):
head(output$R)
In this example, you will notice that R~t~ estimation does not start until day 8. This is because the daily incidence data cannot be reconstructed, and R~t~ estimation cannot start, until the first day of the second aggregation window. The start of R~t~ estimation may also be delayed if incidence is too low, but this was not the case here.
Plot results
As normal, simply plot the full or partial output.
plot(output) # full output plot(output, "incid") # Reconstructed daily incidence only plot(output, "R") # Rt estimates only plot(output, "SI") # SI estimates only
Convergence
Convergence is checked automatically to ensure that the final iteration of the reconstructed daily incidence does not differ from the previous iteration beyond a tolerance of 10$^{-6}$ by default. If convergence is not reached, a warning will be returned and the algorithm can be run with more iterations by modifying iter =
. The tolerance threshold can also be modified using tol =
.
In order to reconstruct daily incidence data, the method requires that R~t~ is estimated for each aggregation window in turn, which is translated into a growth rate and used to reconstruct daily incidence assuming exponential growth. As there is no past incidence data beyond the first aggregation window, R~t~ cannot be estimated and the daily incidence cannot be reconstructed until the first day of the second aggregation window.
Additionally, R~t~ estimation will not start until case numbers are sufficiently high.
Please also see the FAQ section in the main "EpiEstim Vignette".
Nash RK, Cori A, Nouvellet P. Estimating the epidemic reproduction number from temporally aggregated incidence data: a statistical modelling approach and software tool. medRxiv pre-print.
Bi Q, et al. Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study. Lancet. 2020.
Wallinga J, Lipsitch M. How generation intervals shape the relationship between growth rates and reproductive numbers. Proceedings of the Royal Society B: Biological Sciences. 2007 Feb 22;274(1609):599–604.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.