# Model Description {#sec:modeldescription}
We present the modeling framework implemented by the package.
Section \@ref(sec:basic-model) outlines the bare-bones version of the
model, which is elaborated on in Sections \@ref(sec:observations),
\@ref(sec:infections) and \@ref(sec:transmission). Section
\@ref(sec:extensions) extends the model and introduces multilevel
modeling, treating infections as parameters, and accounting for
population effects.
## Basic Model {#sec:basic-model}
We now formulate the basic version of the model for one homogeneous
population. The same model can be used for multiple regions or groups
jointly. Suppose we observe a non-negative time series of count data
$Y = (Y_1, \ldots, Y_n)$ for a single population. This could, for example,
be daily death or case incidence. $Y_t$ is modeled as deriving from past
new infections $i_s$, $s < t$, and some parameter $\alpha_t > 0$, a
multiplier, which in most contexts represents an instantaneous
*ascertainment rate*. The general model can be expressed as
\begin{align}
Y_t & \sim p(y_t , \phi), (\#eq:ysampling) \\\\
y_t & = \alpha_t \sum_{s < t} i_s \pi_{t-s}, (\#eq:obsmean)
\end{align}
where $y_t$ is the expected value of
the data distribution and $\phi$ is an auxiliary parameter. $\pi_{k}$ is
typically the time distribution from an infection to an observation,
which we refer to as the *infection to observation* distribution. More
generally, however, $\pi_k$ can be used to obtain any linear combination
of past infections. New infections $i_t$ at times $t>0$ are modeled
through a renewal equation, and are tempered by a non-negative parameter
$R_t$ which represents the reproduction number at time $t$. Formally
\begin{align}
i_t &= R_t \sum_{s < t} i_s g_{t-s}, (\#eq:renewal)
\end{align}
where $g_k$ is a probability mass
function for the time between infections. The recursion is initialized
with *seeded* infections $i_{v:0}$, $v < 0$, which are treated as
unknown parameters. All parameters are assigned priors, i.e.
\begin{equation}
i_{v:0}, R, \phi, \alpha \sim p(\cdot),
\end{equation}
where $R = (R_1, \ldots, R_n)$ and $\alpha =
(\alpha_1, \ldots, \alpha_n)$. The posterior distribution is then
proportional to prior and likelihood, i.e.
\begin{equation}
p(i_{v:0}, R, \phi, \alpha \mid Y) \propto p(i_{v:0})p(R)p(\phi)p(\alpha) \prod_{t} p(Y_t \mid y_t, \phi).
\end{equation}
This posterior distribution is
represented in a **Stan** program, and an adaptive Hamiltonian
Monte Carlo sampler [@hoffman_2014] is used to approximately draw
samples from it. These samples allow for inference on the parameters, in
addition to simulating data from the posterior predictive distribution.
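The deterministic part of the basic model, Equations \@ref(eq:renewal) and \@ref(eq:obsmean), can be sketched in a few lines of plain Python. This is only an illustration of the recursions and not the package's implementation; the function names `renewal_infections` and `expected_observations`, and all numeric inputs, are hypothetical.

```python
def renewal_infections(seeds, R, g):
    """Extend seeded infections i_v..i_0 with i_1..i_n via the renewal equation.

    g[k - 1] holds g_k, the probability of a generation time of k days.
    """
    infections = list(seeds)
    for R_t in R:
        t = len(infections)  # position of the day being computed
        max_lag = min(t, len(g))
        # sum_{s < t} i_s g_{t - s}, written in terms of the lag k = t - s
        pressure = sum(infections[t - k] * g[k - 1] for k in range(1, max_lag + 1))
        infections.append(R_t * pressure)
    return infections


def expected_observations(infections, alpha, pi, n_seed):
    """Return y_1..y_n, where y_t = alpha_t * sum_{s < t} i_s pi_{t - s}."""
    y = []
    for j, alpha_t in enumerate(alpha):
        t = n_seed + j  # position of modeled day j + 1 in `infections`
        max_lag = min(t, len(pi))
        weighted = sum(infections[t - k] * pi[k - 1] for k in range(1, max_lag + 1))
        y.append(alpha_t * weighted)
    return y


# Illustrative inputs (all values are made up for this sketch).
seeds = [10.0] * 6               # constant seeded infections i_v..i_0
R = [1.5] * 10 + [0.8] * 10      # reproduction numbers R_1..R_n
g = [0.2, 0.5, 0.3]              # generation distribution g_1..g_3
pi = [0.1, 0.3, 0.6]             # infection to observation distribution
alpha = [0.01] * len(R)          # constant multiplier alpha_t

infections = renewal_infections(seeds, R, g)
y = expected_observations(infections, alpha, pi, n_seed=len(seeds))
```

With $R_t = 1$ and constant seeds, the recursion keeps infections constant, which is a quick sanity check on the indexing.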
Reproduction numbers $R$ and multipliers $\alpha$ can be modeled
flexibly with Bayesian regression models and, by sharing parameters, are
the means by which multiple regions or groups are tied together through
multilevel models. One can, for example, model $R$ as depending on a
binary covariate for a control measure, say full lockdown. The
coefficient for this can be *partially pooled* between multiple
populations. The effect is to share information between groups, while
still permitting between-group variation.
## Observations {#sec:observations}
As mentioned, $Y_t$ is usually a count of some event type occurring at
time $t$. These events are precipitated by past infections.
Prototypical examples include daily cases or deaths. $\alpha_t$ is a
multiplier, and when modeling count data, it typically is interpreted as
an *ascertainment rate*, i.e. the proportion of events at time $t$
that are recorded in the data. For case or death data this would be the
infection ascertainment rate (IAR) or the infection fatality rate (IFR)
respectively.
The multiplier $\alpha$ plays a similar role for observations as
$R$ does for infections; tempering expected observations for
time-specific considerations. As such, **epidemia** treats
$\alpha$ in a similar manner to reproduction numbers, and allows the
user to specify a regression model for it. Section
\@ref(sec:transmission) discusses this in detail in the context of
reproduction numbers, and this discussion is not repeated here. The model [schematic](model-schematic.html) details the
model for $\alpha$, as well as for observational models in general.
The sampling distribution $p(y_t, \phi)$ (Equation
\@ref(eq:ysampling)) should generally be informed by parts of the data
generating mechanism not captured by the mean $y_t$: i.e. any
mechanisms which may induce additional variation around $y_t$. Options
for $p(y_t, \phi)$ include the Poisson, quasi-Poisson and
negative-binomial families. The Poisson family has no auxiliary
parameter $\phi$, while for the latter two families this represents a
non-negative *dispersion parameter* which is assigned a prior.
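To see how the choice of family changes the spread around $y_t$, the sketch below compares variances under common textbook parameterizations. These are assumptions made for illustration: the quasi-Poisson variance is taken as $\phi y_t$ and the negative-binomial variance as $y_t + y_t^2 / \phi$ (the mean-dispersion form used by Stan's `neg_binomial_2`). The exact parameterization used by the package may differ, and `observation_variance` is a hypothetical helper.

```python
def observation_variance(mean, family, phi=None):
    """Variance of an observation with the given mean (assumed parameterizations)."""
    if family == "poisson":
        return mean                      # no auxiliary parameter
    if phi is None or phi <= 0:
        raise ValueError("phi must be positive for this family")
    if family == "quasi_poisson":
        return phi * mean                # variance proportional to the mean
    if family == "neg_binomial":
        return mean + mean ** 2 / phi    # variance grows quadratically in the mean
    raise ValueError(f"unknown family: {family}")


mean = 50.0
variances = {
    "poisson": observation_variance(mean, "poisson"),
    "quasi_poisson": observation_variance(mean, "quasi_poisson", phi=2.0),
    "neg_binomial": observation_variance(mean, "neg_binomial", phi=10.0),
}
```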
**epidemia** allows simultaneous modeling of multiple observation
vectors. In this case, we simply superscript $Y_t^{(l)}$,
$\alpha_{t}^{(l)}$ and $\pi^{(l)}$, and assign independent
sampling distributions for each type. Separate regression models are
then specified for each multiplier $\alpha_{t}^{(l)}$. Leveraging
multiple observation types can often enhance a model. For example, high
quality death data existed during the first wave of the Covid-19
pandemic in Europe. Case data gradually increased in reliability over
time, and has the advantage of picking up changes in transmission
dynamics much quicker than death data.
## Infections {#sec:infections}
Infections $i_t$ propagate over time through the discrete renewal
equation \@ref(eq:renewal). This is *self-exciting*: past infections
give rise to new infections. The theoretical motivation for this lies in
counting processes and is explained in more detail in @Bhatt2020. The
equation is connected to Hawkes processes and the Bellman-Harris
branching process [@bellman_1948; @bellman_1952; @mishra2020]. Such
processes have been used in numerous previous studies [@fraser_2007;
@Cori2013; @nouvellet_2018; @cauchemez_2008], and are also connected
to compartmental models such as the SEIR model [@champredon_2018].
Equation \@ref(eq:renewal) implies that infections $i_t$, $t > 0$
are deterministic given $R$ and seeded infections $i_{v:0}$.
**epidemia** sets a prior on $i_{v:0}$ by first assuming that
daily seeds are constant over the seeding period. Formally, $i_{k} =
i$ for each $k \in \{v, \ldots, 0\}$. The parameter $i$ can be assigned
a range of prior distributions. One option is to model it hierarchically;
for example as
\begin{align}
i & \sim \text{Exp}(\tau^{-1}), (\#eq:seeds)\\\\
\tau & \sim \text{Exp}(\lambda_0), (\#eq:tau)
\end{align}
where $\lambda_0 > 0$ is a rate hyperparameter. This
prior is uninformative, and allows seeds to be largely determined by
initial transmission rates and the chosen start date of the epidemic.
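A short simulation can give a feel for the hierarchical prior in Equations \@ref(eq:seeds) and \@ref(eq:tau). The sketch assumes that $\text{Exp}(\cdot)$ is parameterized by its rate, so that $i$ has conditional mean $\tau$ and $\tau$ has mean $1 / \lambda_0$. The helper name and the value of $\lambda_0$ are hypothetical.

```python
import random


def sample_seed_prior(lambda0, n_draws, rng):
    """Draw daily seeded infections i from the hierarchical exponential prior."""
    draws = []
    for _ in range(n_draws):
        tau = rng.expovariate(lambda0)    # tau ~ Exp(rate = lambda0)
        i = rng.expovariate(1.0 / tau)    # i | tau ~ Exp(rate = 1 / tau), mean tau
        draws.append(i)
    return draws


rng = random.Random(1)
draws = sample_seed_prior(lambda0=0.05, n_draws=20000, rng=rng)
prior_mean = sum(draws) / len(draws)      # should be close to 1 / lambda0 = 20
```

Marginally, $i$ has mean $1 / \lambda_0$ but a heavier tail than a single exponential, which is what lets the data move the seeds over a wide range.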
Several extensions to the infection model are possible in
**epidemia**, including extending \@ref(eq:renewal) to better
capture dynamics such as super-spreading events, and also adjusting the
process for the size of the remaining susceptible population. These
extensions are discussed in Sections \@ref(sec:latent) and
\@ref(sec:popadjust) respectively. The basic infection model is shown in
the model [schematic](model-schematic.html).
## Transmission {#sec:transmission}
Reproduction numbers are modeled flexibly. One can form a linear
predictor consisting of fixed effects, random effects and
autocorrelation terms, which is then transformed via a suitable link
function. Formally
\begin{equation}
R = g^{-1}(\eta),
\end{equation}
where $g$ is a link function and $\eta$ is a
linear predictor. In full generality, $\eta$ can be expressed as
\begin{equation}
\eta = \beta_0 + X \beta + Z b + Q \gamma, (\#eq:linpred)
\end{equation} where $\beta_0$ is an intercept, $X$ is an $n \times p$ model matrix for the $p$-vector of fixed effects $\beta$, $Z$ is an $n \times q$ model matrix for the $q$-vector of group-specific parameters $b$, and $Q$ is an $n \times r$ model matrix for the $r$-vector of autocorrelation terms $\gamma$. The
columns of $X$ are predictors explaining changes in transmission.
These could, for example, be binary vectors encoding non-pharmaceutical
interventions, as in @Flaxman2020. A number of families can be used for
the prior on $\beta$, including normal, Cauchy, and hierarchical
shrinkage families. The parameters $b$ are modeled hierarchically as
\begin{equation}
b \sim N(0, \Sigma),
\end{equation}
where $\Sigma$ is a covariance matrix that is itself assigned a prior. The
particular form for $\Sigma$, as well as its prior, is discussed in
more detail in this [article](priors.html). These partially-pooled
parameters are particularly useful when multiple regions are being
modeled simultaneously. In this case, they allow information on
transmission rates to be shared between groups.
$Q$ is a binary matrix specifying which of the autocorrelation terms
in $\gamma$ to include for each period $t$. Currently,
**epidemia** supports only random walk processes. However, multiple
such processes can be included, and can have increments that occur at a
different time scale to $R$; for example weekly increments can be
used.
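As a rough illustration of how $Q$ selects autocorrelation terms, the sketch below builds a binary matrix that maps each day to one weekly random walk value and then evaluates Equation \@ref(eq:linpred) without the $Z b$ term. The function names, the Gaussian increments and all numeric values are assumptions made for this example, not the package's internals.

```python
import random


def weekly_indicator_matrix(n_days, period=7):
    """Binary n_days x r matrix Q; row t has a single 1 in the column of its week."""
    n_terms = (n_days + period - 1) // period
    return [[1 if t // period == j else 0 for j in range(n_terms)]
            for t in range(n_days)]


def random_walk(n_terms, sd, rng):
    """Random walk values gamma_1..gamma_r with Gaussian increments."""
    walk, level = [], 0.0
    for _ in range(n_terms):
        level += rng.gauss(0.0, sd)
        walk.append(level)
    return walk


def linear_predictor(beta0, X, beta, Q, gamma):
    """eta = beta0 + X beta + Q gamma (group-specific terms Z b omitted)."""
    eta = []
    for x_row, q_row in zip(X, Q):
        fixed = sum(x * b for x, b in zip(x_row, beta))
        auto = sum(q * gam for q, gam in zip(q_row, gamma))
        eta.append(beta0 + fixed + auto)
    return eta


n_days = 14
Q = weekly_indicator_matrix(n_days)
gamma = random_walk(len(Q[0]), sd=0.1, rng=random.Random(0))
X = [[0]] * 7 + [[1]] * 7        # one binary covariate switched on in week 2
eta = linear_predictor(beta0=0.3, X=X, beta=[-0.5], Q=Q, gamma=gamma)
```

Because each row of $Q$ contains a single one, $\eta_t$ only changes through the walk at week boundaries, even though $R$ is defined daily.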
### Link Functions {#sec:links}
Choosing an appropriate link function $g$ is difficult. $R_t$ is
non-negative, but is clearly not able to grow exponentially: regardless
of the value of the linear predictor $\eta_t$, one expects $R_t$ to
be bounded by some maximum value $K$. In other words, $R_t$ has some
*carrying capacity*. One of the simplest options for $g$ is the
log-link. This satisfies non-negativity, and also allows for easily
interpretable effect sizes; a one unit change in a predictor scales
$R_t$ by a constant factor. Nonetheless, it does not respect the carrying
capacity $K$, often placing too much prior mass on large values of
$R_t$. With this in mind, **epidemia** offers an alternative link
function satisfying
\begin{equation}
g^{-1}(x) = \frac{K}{1 + e^{-x}}. (\#eq:scaledlogit)
\end{equation} This is a
generalization of the logit-link, and we refer to it as the
*scaled-logit*.
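The difference between the two links is easy to see numerically. The following sketch evaluates both inverse links on the same linear predictor values; the choice $K = 6$ is purely illustrative, and the function names are hypothetical.

```python
import math


def inv_log_link(eta):
    """Inverse log-link: R = exp(eta), unbounded above."""
    return math.exp(eta)


def inv_scaled_logit(eta, K):
    """Inverse scaled-logit: R = K / (1 + exp(-eta)), bounded by K."""
    if eta >= 0:
        return K / (1.0 + math.exp(-eta))
    z = math.exp(eta)            # avoids overflow for very negative eta
    return K * z / (1.0 + z)


K = 6.0
etas = [-2.0, 0.0, 2.0, 5.0]
log_R = [inv_log_link(e) for e in etas]
scaled_R = [inv_scaled_logit(e, K) for e in etas]
```

At $\eta = 0$ the scaled-logit gives $R = K / 2$, and for large $\eta$ it saturates at $K$, whereas the log-link keeps growing.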
## Extensions {#sec:extensions}
Various extensions to the basic model just presented are possible,
including multilevel modeling, adding variation to the infection
process, and explicitly accounting for population effects. These are
discussed in turn.
### Joint Modeling of Multiple Populations {#sec:mult}
Consider modeling the evolution of an epidemic across multiple regions
or populations. Of course, separate models can be specified for each
group. This approach is fast as each model can be fit in parallel.
Nonetheless, often there is little high quality data for some groups,
particularly in the early stages of an epidemic. A joint model can
benefit from improved parameter estimation by *sharing signal across
groups*. This can be done by partially or fully pooling effects
underlying reproduction numbers $R$.
We give an example for concreteness. Suppose the task is to infer the
effect of a series of $p$ control measures on transmission rates.
Letting $R^{(m)}$ be the vector of reproduction numbers for the
$m^{\text{th}}$ group, one could write
\begin{equation}
R^{(m)} = g^{-1}\left( \beta_0 + b_0^{(m)} + X^{(m)}
(\beta + b^{(m)}) \right),
\end{equation}
where $X^{(m)}$
is an $n \times p$ matrix whose rows are binary vectors indicating
which of the $p$ measures have been implemented in the
$m^{\text{th}}$ group at that point in time. The parameters
$b_{0}^{(m)}$ allow each region to have its own initial reproduction
number $R_0$, while $b^{(m)}$ allow for region-specific policy
effects. These parameters can be partially pooled by letting
\begin{equation}
(b_0^{(m)}, b^{(m)}) \sim N(0,\tilde{\Sigma}),
\end{equation}
for each $m$, and assigning a
hyperprior to the covariance matrix $\tilde{\Sigma}$.
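The structure of this example can be sketched as follows for fixed parameter values. The sketch uses a log-link by default and made-up numbers; in the actual model the group-specific terms are drawn from their prior and inferred from data, which is not shown here. `group_reproduction_numbers` is a hypothetical helper.

```python
import math


def group_reproduction_numbers(beta0, b0_m, beta, b_m, X_m, inv_link=math.exp):
    """R^(m) = g^{-1}(beta0 + b0^(m) + X^(m) (beta + b^(m))) for one group m."""
    R_m = []
    for row in X_m:
        effect = sum(x * (bk + bmk) for x, bk, bmk in zip(row, beta, b_m))
        R_m.append(inv_link(beta0 + b0_m + effect))
    return R_m


beta0 = math.log(3.0)            # shared baseline on the log scale
beta = [-0.4, -0.8]              # shared effects of p = 2 control measures

# Rows indicate which measures are active on each day (illustrative).
X = [[0, 0], [0, 0], [1, 0], [1, 1], [1, 1]]

# Group-specific deviations (b0^(m), b^(m)); fixed here rather than sampled.
deviations = {
    "region_a": (0.10, [0.05, -0.10]),
    "region_b": (-0.20, [-0.05, 0.15]),
}
R_by_region = {
    name: group_reproduction_numbers(beta0, b0_m, beta, b_m, X)
    for name, (b0_m, b_m) in deviations.items()
}
```

Each region shares $\beta_0$ and $\beta$ but departs from them through its own deviations, which is the sense in which the effects are partially pooled.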
In addition to hierarchical modeling of parameters making up $R$,
seeded infections can also be modeled hierarchically. Equations
\@ref(eq:seeds) and \@ref(eq:tau) are replaced with
\begin{align}
i^{(m)} &\sim \text{Exp}(\tau^{-1}), \\\\
\tau & \sim \text{Exp}(\lambda_0),
\end{align}
where
$i^{(m)}$ is the daily seeded infections for the
$m^{\text{th}}$ group.
### Infections as Parameters {#sec:latent}
Recall the renewal equation (Equation \@ref(eq:renewal)) which
describes how infections propagate in the basic model. Infections
$i_t$ for $t > 0$ are a deterministic function of seeds
$i_{v:0}$ and reproduction numbers $R$. If infection counts are
large, then this process may be realistic enough. However, when
infection counts are low, there could be variation in day-to-day infections
caused by a heavy tailed offspring distribution and super-spreader
events. This may cause actual infections to deviate from those implied
by the renewal equation. Although the *expected* number of offspring
of any given infection is driven by $R$, in practice the actual number
of offspring can exhibit considerable variation around this. To capture
this randomness, replace Equation \@ref(eq:renewal) with
\begin{align}
i_t &\sim p(i_t', d), (\#eq:infextended) \\\\
i_{t}' &= R_t \sum_{s < t} i_s g_{t-s}.
\end{align}
This treats $i_t$ as latent
parameters which must be sampled. The renewal equation now describes
only their *mean value* $i_t'$. $p(i_t', d)$ is parameterised by
the mean and the coefficient of dispersion $d$, which is assigned a
prior. This extension can be motivated formally through counting
processes. Please see @Bhatt2020 for more details.
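The following is a minimal forward simulation of Equation \@ref(eq:infextended). The text does not fix the family $p(i_t', d)$, so the sketch assumes a normal distribution with mean $i_t'$ and variance $d \, i_t'$ (so that $d$ is the variance-to-mean ratio), with negative draws clamped to zero as a crude stand-in for truncation. These choices, and the name `latent_infections`, are assumptions for illustration only.

```python
import math
import random


def latent_infections(seeds, R, g, d, rng):
    """Simulate infections whose mean follows the renewal equation."""
    infections = list(seeds)
    for R_t in R:
        t = len(infections)
        max_lag = min(t, len(g))
        # Mean i'_t from the renewal equation, using past sampled infections.
        mean = R_t * sum(infections[t - k] * g[k - 1] for k in range(1, max_lag + 1))
        sd = math.sqrt(d * mean)             # assumed: variance = d * mean
        draw = rng.gauss(mean, sd)
        infections.append(max(draw, 0.0))    # clamp at zero (assumption)
    return infections


seeds = [10.0] * 3
R = [1.2] * 30
g = [0.2, 0.5, 0.3]
noisy = latent_infections(seeds, R, g, d=2.0, rng=random.Random(42))
exact = latent_infections(seeds, R, g, d=0.0, rng=random.Random(42))
```

Setting $d = 0$ recovers the deterministic renewal equation, while larger $d$ lets low infection counts fluctuate substantially from day to day.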
### Depletion of the Susceptible Population {#sec:popadjust}
Nothing in Equation \@ref(eq:renewal) prevents cumulative infections
from exceeding the total population size $P$. In particular, if $R_t > 1$ then infections can grow exponentially over time. This does not
always present a problem for modeling. Indeed the posterior distribution
usually constrains past infections to reasonable values. Nonetheless,
forecasting in the basic model will be unrealistic if projected
infections grow too large. As the susceptible population diminishes, the
transmission rate is expected to fall.
**epidemia** can apply a simple transformation to ensure that
cumulative infections remain bounded by $P$, and that transmission
rates are adjusted for changes in the susceptible population. Let
$S_t \in [0,P]$ be the number of susceptible individuals in the
population at time $t$. Just like infections, this is treated as a
continuous quantity. $S_t$ consists of those who have not been
infected by time $t$, and have not been removed from the susceptible
class by other means; e.g. vaccination.
Let $i'_t$ denote *unadjusted infections* from the model.
This is given by \@ref(eq:renewal) in the basic model or by
\@ref(eq:infextended) if the extension of Section \@ref(sec:latent) is
applied. These are interpreted as the number of infections that would occur if the entire population were susceptible, and are adjusted with
\begin{equation}
i_t = S_{t-1}\left( 1 - \exp\left(-\frac{i'_t}{P}\right)\right).
(\#eq:pop-adjust)
\end{equation}
The motivation for this is provided in @Bhatt2020. Equation \@ref(eq:pop-adjust) satisfies intuitive properties: if $i'_t = 0$ then $i_t = 0$, and as $i'_t
\to \infty$ we have that $i_t \to S_{t-1}$. All infections at
time $t$ are then removed from the susceptible population, so that
\begin{equation}
S_t = S_{t-1} - i_t.
(\#eq:st-rec)
\end{equation}
We are left to define $S_{v-1}$, the susceptible population the day
before modeling begins. If this is the start of an epidemic,
it is natural to take $S_{v-1} = P$. Nonetheless, it is often of
interest to begin modeling later, when a degree of immunity already
exists within the population. In this case, **epidemia** allows
the user to assign a prior distribution to $S_{v-1} / P$. This must lie
between $0$ and $1$.
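Equations \@ref(eq:pop-adjust) and \@ref(eq:st-rec) can be combined with the basic renewal equation as in the sketch below. Two simplifications are assumptions of this example rather than statements about the package: seeded infections are subtracted from the initial susceptible population directly, and the unadjusted infections $i'_t$ are computed from past adjusted infections. The function name and inputs are hypothetical.

```python
import math


def adjusted_infections(seeds, R, g, P, S_init=None):
    """Renewal equation with the population adjustment; returns (infections, S_n)."""
    infections = list(seeds)
    # Susceptibles after seeding (assumption: seeds are removed one-for-one).
    S = (P if S_init is None else S_init) - sum(seeds)
    for R_t in R:
        t = len(infections)
        max_lag = min(t, len(g))
        unadjusted = R_t * sum(infections[t - k] * g[k - 1]
                               for k in range(1, max_lag + 1))
        i_t = S * (1.0 - math.exp(-unadjusted / P))   # population adjustment
        S -= i_t                                      # S_t = S_{t-1} - i_t
        infections.append(i_t)
    return infections, S


P = 1000.0
infections, S_final = adjusted_infections(
    seeds=[10.0] * 3, R=[3.0] * 60, g=[0.2, 0.5, 0.3], P=P
)
total = sum(infections)
```

Even with $R_t = 3$ held fixed for 60 days, cumulative infections stay below $P$, because each day's infections are capped by the remaining susceptibles.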
#### Accounting for Vaccinations
Previous infection is one avenue through which individuals are removed
from the susceptible population. Immunity can also be incurred through
vaccination. **epidemia** provides a basic way to incorporate such
effects.
Let $v_t$ be the proportion of the susceptible population at time $t$
who are removed through some means other than infection. These are
individuals who have never been infected but may have been previously
vaccinated, and their immunity is assumed to have developed at
time $t$.
**epidemia** requires $v_t$ to be supplied by the user. Then
\@ref(eq:st-rec) is replaced with
\begin{equation}
S_t = \left(S_{t-1} - i_t\right) \left(1 - v_t \right).
\end{equation}
Of course, $v_t$ is a difficult quantity to estimate. It requires the user
to estimate the time lag for a vaccine dose to become effective, and also to
adjust for potentially different efficacies of vaccines and doses.
Recognizing this, we allow the update
\begin{equation}
S_t = \left(S_{t-1} - i_t\right) \left(1 - v_t \xi \right),
\end{equation}
where $\xi$ is a noise term that is assigned a prior
distribution. $\xi$ helps to account for potentially systematic biases
in calculating vaccine efficacy.
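The two susceptible updates above differ only in the factor $\xi$, so a single helper covers both. This is a sketch with hypothetical names and values. It assumes $v_t \xi$ lies in $[0, 1]$ and rejects inputs that violate this.

```python
def update_susceptibles(S_prev, i_t, v_t, xi=1.0):
    """S_t = (S_{t-1} - i_t) * (1 - v_t * xi); xi = 1 gives the noise-free update."""
    removed_fraction = v_t * xi
    if not 0.0 <= removed_fraction <= 1.0:
        raise ValueError("v_t * xi must lie in [0, 1]")
    return (S_prev - i_t) * (1.0 - removed_fraction)


S_no_vaccine = update_susceptibles(S_prev=1000.0, i_t=50.0, v_t=0.0)
S_vaccine = update_susceptibles(S_prev=1000.0, i_t=50.0, v_t=0.1)
S_noisy = update_susceptibles(S_prev=1000.0, i_t=50.0, v_t=0.1, xi=0.8)
```

With $v_t = 0$ the update reduces to Equation \@ref(eq:st-rec), and $\xi < 1$ corresponds to supplied vaccination figures that overstate the immunity actually acquired.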
# References