# Model Implementation {#sec:modelimplementation}
Here we give a high-level overview of the workflow required for defining and fitting a model with **epidemia**. The primary model fitting function is `epim()`. This takes a model description and additional arguments relating to the fitting algorithm, and proceeds to fit the model using a precompiled **Stan** program. This is similar to the workflow for fitting Bayesian regression models
with **rstanarm**. A key difference, however, is that the models fit by **epidemia** are generally complex, and are therefore inherently more difficult to specify. We simplify this process by taking a modular approach; models are defined through three distinct parts: transmission, infections and observations. These components of the model are defined with the functions `epirt()`, `epiinf()` and
`epiobs()` respectively.
The package contains an example data set `EuropeCovid` which contains data
on daily death counts from Covid-19 in 11 European Countries from February through May 2020, and a set
of binary indicators of non-pharmaceutical interventions. This is
used as an example throughout.
wzxhzdk:1
We begin by describing `epim()` in more detail, and then proceed to discuss the three modeling functions.
## Model Fitting {#sec:fitting}
`epim()` is the only model fitting function in **epidemia**. It has arguments `rt`, `inf`, and `obs` which
expect a description of the transmission model, infection model and all observational models respectively.
Together, these fully define the joint distribution of data and parameters. Each of these model components are
described in terms of variables that are expected to live in a single data frame, `data`. This data frame
must be compatible with the model components, in the sense that *it holds all variables defined in these
models*. For our example, these variables are the following.
wzxhzdk:2
The `data` argument is described in more detail in Section \@ref(sec:data).
In addition to taking a model description and a data frame,
`epim()` has various additional arguments which specify how the model should be
fit. If `algorithm = "sampling"` then the model
will be fit using **Stan**'s adaptive Hamiltonian Monte Carlo sampler [@hoffman_2014]. This
is done internally by calling `sampling()` from **rstan**. If instead this is `"meanfield"` or `"fullrank"`, then **Stan**'s Variational Bayes
algorithms [@Kucukelbir_2015; @Kucukelbir_2017] are employed by calling `vb()` from **rstan**. Any unnamed
arguments in the call to `epim()` are passed directly onto the **rstan**
sampling function. `epim()` returns a fitted model object of class
`epimodel`, which contains posterior samples from the model along with other
useful objects.
In general, Hamiltonian Monte Carlo should be used for final inference. Nonetheless, this is often computationally demanding, and Variational Bayes can often be
used fruitful for quickly iterating models. All arguments for `epim()` are described in Table
\@ref(tab:model-imp-epim-args).
wzxhzdk:3
## Transmission {#sec:imp_transmission}
`epirt()` defines the model for time-varying reproduction numbers, which
was described in Section 1.4 of the model description [article](model-description.html). Recall that these are modeled as a transformed linear predictor. `epirt()` has a `formula` argument which defines the linear predictor $\eta$, an argument `link` defining the link function `g`, and additional arguments to specify priors on parameters making up $\eta$.
A general **R** formula gives a symbolic description of a model. It takes the form
`y ~ model`, where `y` is the response and `model` is a collection of terms
separated by the `+` operator. `model` fully defines a linear predictor used to
predict `y`. In this case, the "response" being modeled are reproduction numbers which are
unobserved. `epirt()` therefore requires that
the left hand side of the formula takes the form `R(group, date)`, where `group`
and `date` refer to variables representing the modeled populations and dates respectively.
The right hand side can consist of fixed effects, random effects, and
autocorrelation terms. For our example, a viable call to `epirt()` is the
following.
wzxhzdk:4
Here, two fixed effects are included which represent the
effects of implementing lockdown and banning public events. These effects are assumed constant across
countries. They could alternatively be partially
pooled by using the term `(lockdown + public_events | country)`.
For information on how to interpret such terms, please see [partial pooling](partial-pooling.html). Using `link = scaled_logit(7)` lets the link function be the
scaled logit link described by Equation (1.7) in the model description article,
where $K = 7$ is the maximum possible value for reproduction numbers.
For simplicity, we have omitted any prior arguments, however these
should generally be specified explicitly. Please see
[prior](priors.html) for detailed information on how to use priors. All arguments
for `epirt()` are listed in Table \@ref(tab:model-imp-epim-args).
wzxhzdk:5
## Infections {#sec:imp_infections}
The infection model is represented by `epiinf()`. In the most basic version, this defines the distribution of the generation time of the disease, the number of days for which to seed infections, and the prior distribution on seeded infections. These three parameters are
controlled by the arguments `gen`, `seed_days` and `prior_seeds` respectively.
A possible model is the following.
wzxhzdk:6
`EuropeCovid$si` is a numeric vector representing the distribution for the serial interval of Covid-19. There is an implicit assumption that the generation time can be approximated well by the serial interval. Seeds are modeled hierarchically, and are described by (1.4) and (1.5) in the model description. $\tau$ has been assigned an exponential prior with a mean of 50. Seeded infections are assumed to occur over a period of 6 days.
`epiinf()` has additional arguments that allow the user to extend the basic
model. Using `latent = TRUE` replaces the renewal equation with
Equation (1.8). Daily infections are then treated as latent parameters that are
sampled along with other parameters. The `family` argument specifies
the distribution $p(i'_t, d)$, while `prior_aux` defines the prior on the
coefficient of dispersion $d$.
Recall from the model description that the infection process may be
modified to explicitly account for changes in infection rates as the remaining
susceptible population is depleted. In particular, the adjustment ensures that cumulative infections
never breaches the population size. It can be employed by
setting `pop_adjust = TRUE` and using the
`pop` argument to point towards a static variable in the data frame giving the population size. All argument to `epiinf()` are
described in Table \@ref(tab:model-imp-epiinf-args).
wzxhzdk:7
## Observations {#sec:imp_observations}
An observational model is defined by a call to `epiobs()`. In particular,
this must also make explicit the model for the multipliers $\alpha_t$, and must
also specify the coefficients $\pi_k$. `epiobs()` has a `formula` argument. The left hand
side must indicate the observation vector to be modeled, while the
right hand side defines a linear predictor for $\alpha_t$. The argument `i2o` plays a similar role to the `gen` argument in `epiinf()`,
however it instead corresponds the vector $\pi$ in Equation (1.2).
Take for example the task of modeling daily `deaths`, which as we saw is
a variable in `data`. A possible model is the following.
wzxhzdk:8
Here $\alpha_t$ corresponds to the infection fatality rate (IFR), and is modeled as an
intercept transformed by the scaled-logit link. This implies that the IFR is constant over time and its value lies somewhere between 0\% and 2\%. If the prior on
the intercept (specified by the `prior_intercept` argument) is chosen to be
symmetric around zero, then the prior mean for the IFR is 1\%. `EuropeCovid$inf2death` is a numeric simplex vector that gives the same delay distribution as used in
@Flaxman2020. This is a density function for a discretized mixture of Gamma
random variables.
Additional arguments include `family`, which specifies the sampling distribution
$p(y_t, \phi)$. There are also arguments allowing the user to control
prior distributions for effects in the linear predictor for $\alpha_t$, and the prior on the
auxiliary variable $\phi$. All arguments to `epiobs()` are shown in Table
\@ref(tab:model-imp-epiobs-args).
wzxhzdk:9
## Data {#sec:data}
Before fitting our first model in Section \@ref(sec:first-fit), we elaborate on
the `data` argument to `epim()`. Recall that this must contain all variables used in the transmission and infection models, and in all observational models. For our example,
`data` looks like
wzxhzdk:10
The columns `country` and `date` define the region and time period corresponding to each of the remaining
variables. `epim()` assumes that the first seeding day (i.e. the start of the epidemic) in each region is the first date found in the data frame. The last data found for each region is the final data at which the epidemic is simulated. It is up to the user to appropriately choose these dates. For our example,
the first and last dates for each group can be seen as follows.
wzxhzdk:11
Here, the start dates have been heuristically chosen to be 30 days prior to
observing 10 cumulative deaths in each country.
## A First Fit {#sec:first-fit}
We are now ready to fit our first model. For this we return to the model
fitting function `epim()`. The following command is used to instruct
**epidemia** to run Markov chains in parallel, rather than sequentially, if
multiple cores are detected.
wzxhzdk:12
Our call to `epim()` is as follows. We use `refresh = 0`
to suppress printing output in this article, however, this should
not generally be used as such output is useful.
wzxhzdk:13
The print method for `epimodel` objects prints summary statistics for
model parameters. These are obtained from the sampled posterior distribution.
Parameter are displayed
according to which part of the model they belong to (transmission, observations,
infections). An estimate of the standard deviation, labeled `MAD_SD` is
displayed. This is the median absolute deviation from the median, and is more
robust than naive estimates of the standard deviation for long-tailed
distributions.
wzxhzdk:14
Alternatively, the summary method can be used. This gives quantiles of the
posterior draws, and also displays some MCMC diagnostics.
wzxhzdk:15
# References