knitr::opts_chunk$set(echo = TRUE)
library(kableExtra)
# Model Implementation {#sec:modelimplementation}
Here we give a high-level overview of the workflow required for defining and fitting a model with **epidemia**. The primary model fitting function is `epim()`. This takes a model description and additional arguments relating to the fitting algorithm, and proceeds to fit the model using a precompiled **Stan** program. This is similar to the workflow for fitting Bayesian regression models with **rstanarm**. A key difference, however, is that the models fit by **epidemia** are generally complex, and are therefore inherently more difficult to specify. We simplify this process by taking a modular approach; models are defined through three distinct parts: transmission, infections and observations. These components of the model are defined with the functions `epirt()`, `epiinf()` and `epiobs()` respectively. The package contains an example data set `EuropeCovid` which contains data on daily death counts from Covid-19 in 11 European Countries from February through May 2020, and a set of binary indicators of non-pharmaceutical interventions. This is used as an example throughout. wzxhzdk:1 We begin by describing `epim()` in more detail, and then proceed to discuss the three modeling functions. ## Model Fitting {#sec:fitting}
`epim()` is the only model fitting function in **epidemia**. It has arguments `rt`, `inf`, and `obs` which expect a description of the transmission model, infection model and all observational models respectively. Together, these fully define the joint distribution of data and parameters. Each of these model components are described in terms of variables that are expected to live in a single data frame, `data`. This data frame must be compatible with the model components, in the sense that *it holds all variables defined in these models*. For our example, these variables are the following. wzxhzdk:2 The `data` argument is described in more detail in Section \@ref(sec:data). In addition to taking a model description and a data frame, `epim()` has various additional arguments which specify how the model should be fit. If `algorithm = "sampling"` then the model will be fit using **Stan**'s adaptive Hamiltonian Monte Carlo sampler [@hoffman_2014]. This is done internally by calling `sampling()` from **rstan**. If instead this is `"meanfield"` or `"fullrank"`, then **Stan**'s Variational Bayes algorithms [@Kucukelbir_2015; @Kucukelbir_2017] are employed by calling `vb()` from **rstan**. Any unnamed arguments in the call to `epim()` are passed directly onto the **rstan** sampling function. `epim()` returns a fitted model object of class `epimodel`, which contains posterior samples from the model along with other useful objects. In general, Hamiltonian Monte Carlo should be used for final inference. Nonetheless, this is often computationally demanding, and Variational Bayes can often be used fruitful for quickly iterating models. All arguments for `epim()` are described in Table \@ref(tab:model-imp-epim-args).
wzxhzdk:3
## Transmission {#sec:imp_transmission}
`epirt()` defines the model for time-varying reproduction numbers, which was described in Section 1.4 of the model description [article](model-description.html). Recall that these are modeled as a transformed linear predictor. `epirt()` has a `formula` argument which defines the linear predictor $\eta$, an argument `link` defining the link function `g`, and additional arguments to specify priors on parameters making up $\eta$. A general **R** formula gives a symbolic description of a model. It takes the form `y ~ model`, where `y` is the response and `model` is a collection of terms separated by the `+` operator. `model` fully defines a linear predictor used to predict `y`. In this case, the "response" being modeled are reproduction numbers which are unobserved. `epirt()` therefore requires that the left hand side of the formula takes the form `R(group, date)`, where `group` and `date` refer to variables representing the modeled populations and dates respectively. The right hand side can consist of fixed effects, random effects, and autocorrelation terms. For our example, a viable call to `epirt()` is the following. wzxhzdk:4 Here, two fixed effects are included which represent the effects of implementing lockdown and banning public events. These effects are assumed constant across countries. They could alternatively be partially pooled by using the term `(lockdown + public_events | country)`. For information on how to interpret such terms, please see [partial pooling](partial-pooling.html). Using `link = scaled_logit(7)` lets the link function be the scaled logit link described by Equation (1.7) in the model description article, where $K = 7$ is the maximum possible value for reproduction numbers. For simplicity, we have omitted any prior arguments, however these should generally be specified explicitly. Please see [prior](priors.html) for detailed information on how to use priors. All arguments for `epirt()` are listed in Table \@ref(tab:model-imp-epim-args).
wzxhzdk:5
## Infections {#sec:imp_infections}
The infection model is represented by `epiinf()`. In the most basic version, this defines the distribution of the generation time of the disease, the number of days for which to seed infections, and the prior distribution on seeded infections. These three parameters are controlled by the arguments `gen`, `seed_days` and `prior_seeds` respectively. A possible model is the following. wzxhzdk:6 `EuropeCovid$si` is a numeric vector representing the distribution for the serial interval of Covid-19. There is an implicit assumption that the generation time can be approximated well by the serial interval. Seeds are modeled hierarchically, and are described by (1.4) and (1.5) in the model description. $\tau$ has been assigned an exponential prior with a mean of 50. Seeded infections are assumed to occur over a period of 6 days. `epiinf()` has additional arguments that allow the user to extend the basic model. Using `latent = TRUE` replaces the renewal equation with Equation (1.8). Daily infections are then treated as latent parameters that are sampled along with other parameters. The `family` argument specifies the distribution $p(i'_t, d)$, while `prior_aux` defines the prior on the coefficient of dispersion $d$. Recall from the model description that the infection process may be modified to explicitly account for changes in infection rates as the remaining susceptible population is depleted. In particular, the adjustment ensures that cumulative infections never breaches the population size. It can be employed by setting `pop_adjust = TRUE` and using the `pop` argument to point towards a static variable in the data frame giving the population size. All argument to `epiinf()` are described in Table \@ref(tab:model-imp-epiinf-args).
wzxhzdk:7
## Observations {#sec:imp_observations}
An observational model is defined by a call to `epiobs()`. In particular, this must also make explicit the model for the multipliers $\alpha_t$, and must also specify the coefficients $\pi_k$. `epiobs()` has a `formula` argument. The left hand side must indicate the observation vector to be modeled, while the right hand side defines a linear predictor for $\alpha_t$. The argument `i2o` plays a similar role to the `gen` argument in `epiinf()`, however it instead corresponds the vector $\pi$ in Equation (1.2). Take for example the task of modeling daily `deaths`, which as we saw is a variable in `data`. A possible model is the following. wzxhzdk:8 Here $\alpha_t$ corresponds to the infection fatality rate (IFR), and is modeled as an intercept transformed by the scaled-logit link. This implies that the IFR is constant over time and its value lies somewhere between 0\% and 2\%. If the prior on the intercept (specified by the `prior_intercept` argument) is chosen to be symmetric around zero, then the prior mean for the IFR is 1\%. `EuropeCovid$inf2death` is a numeric simplex vector that gives the same delay distribution as used in @Flaxman2020. This is a density function for a discretized mixture of Gamma random variables. Additional arguments include `family`, which specifies the sampling distribution $p(y_t, \phi)$. There are also arguments allowing the user to control prior distributions for effects in the linear predictor for $\alpha_t$, and the prior on the auxiliary variable $\phi$. All arguments to `epiobs()` are shown in Table \@ref(tab:model-imp-epiobs-args).
wzxhzdk:9
## Data {#sec:data}
Before fitting our first model in Section \@ref(sec:first-fit), we elaborate on the `data` argument to `epim()`. Recall that this must contain all variables used in the transmission and infection models, and in all observational models. For our example, `data` looks like wzxhzdk:10 The columns `country` and `date` define the region and time period corresponding to each of the remaining variables. `epim()` assumes that the first seeding day (i.e. the start of the epidemic) in each region is the first date found in the data frame. The last data found for each region is the final data at which the epidemic is simulated. It is up to the user to appropriately choose these dates. For our example, the first and last dates for each group can be seen as follows. wzxhzdk:11 Here, the start dates have been heuristically chosen to be 30 days prior to observing 10 cumulative deaths in each country. ## A First Fit {#sec:first-fit}
We are now ready to fit our first model. For this we return to the model fitting function `epim()`. The following command is used to instruct **epidemia** to run Markov chains in parallel, rather than sequentially, if multiple cores are detected. wzxhzdk:12 Our call to `epim()` is as follows. We use `refresh = 0` to suppress printing output in this article, however, this should not generally be used as such output is useful. wzxhzdk:13 The print method for `epimodel` objects prints summary statistics for model parameters. These are obtained from the sampled posterior distribution. Parameter are displayed according to which part of the model they belong to (transmission, observations, infections). An estimate of the standard deviation, labeled `MAD_SD` is displayed. This is the median absolute deviation from the median, and is more robust than naive estimates of the standard deviation for long-tailed distributions. wzxhzdk:14 Alternatively, the summary method can be used. This gives quantiles of the posterior draws, and also displays some MCMC diagnostics. wzxhzdk:15 # References


ImperialCollegeLondon/epidemia documentation built on June 26, 2021, 7:40 a.m.