# R package hhh4addon

The R package `hhh4addon` extends the functionality of the `surveillance` package (Meyer et al 2017), more specifically the implementation of the endemic-epidemic model class in the function `hhh4`. It adds the following features:

- fitting models with higher-order lags, i.e. with dependence on more than just the immediately preceding observation,
- analytical computation of predictive and marginal (stationary or periodically stationary) moments of fitted models.

As `hhh4addon` can only be used in combination with `surveillance`, we assume familiarity with this package and in particular the `hhh4` function in the following.
We only give a brief description of the endemic-epidemic framework and the `hhh4` function; details can be found in Meyer et al (2017) and the vignettes `vignette("hhh4")` and `vignette("hhh4_spacetime")` from the `surveillance` package. Counts $X_{it}$ from units $i = 1, \dots, m$ and time $t$ are modelled as
$$
X_{it} \mid \mathbf{X}_{t - 1} \sim \text{NegBin}(\mu_{it}, \psi_i); \quad X_{it} \perp X_{jt} \mid \mathbf{X}_{t - 1}
$$
$$
\mu{it} = e_{it}\nu_{it} + \lambda_{it}X_{i, t - 1} + \phi_{it}\sum_{j \neq i} \lfloor w_{ji}\rfloor X_{j, t - 1}.
$$
Here, the negative binomial distribution is parametrized by its mean $\mu_{it}$ and an overdispersion parameter $\psi_i$ so that $\text{Var}(X_{it} \mid \mathbf{X}_{t - 1}) = \mu_{it}\cdot(1 + \psi_i\mu_{it})$. The term $e_{it}\nu_{it}$ is referred to as the endemic component of incidence, where $e_{it}$ is a population offset. The remaining autoregressive terms form the epidemic component, with $\lambda_{it}X_{i, t - 1}$ often called the autoregressive part and $\phi_{it}\sum_{j \neq i} \lfloor w_{ji}\rfloor X_{j, t - 1}$ the neighbourhood part. Various specifications for the normalized weights $\lfloor w_{ji}\rfloor$ exist; see `vignette("hhh4_spacetime")` from the `surveillance` package. The parameters $\nu_{it}$, $\lambda_{it}$ and $\phi_{it}$ are themselves modelled in a log-linear fashion. While in principle covariates can enter here, it is common to include only an intercept and sine/cosine terms for seasonality, e.g.
$$
\log(\nu_{it}) = \alpha^{(\nu)}_i + \beta^{(\nu)}_i \sin(2\pi t/\omega) + \gamma^{(\nu)}_i \cos(2\pi t/\omega),
$$
$$
\log(\lambda_{it}) = \alpha^{(\lambda)}_i + \beta^{(\lambda)}_i \sin(2\pi t/\omega) + \gamma^{(\lambda)}_i \cos(2\pi t/\omega),
$$
$$
\log(\phi_{it}) = \alpha^{(\phi)}_i + \beta^{(\phi)}_i \sin(2\pi t/\omega) + \gamma^{(\phi)}_i \cos(2\pi t/\omega).
$$
Models of this type can be fitted using the function `hhh4` from `surveillance`. Numerous aspects of the model can be specified via the `control` list, with the parametrization of $\nu_{it}$, $\lambda_{it}$ and $\phi_{it}$ steered by the elements `end`, `ar` and `ne`. Whether unit-specific parameters are necessary and identifiable depends on the data at hand; see the vignettes from `surveillance` for more information on model building.
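For a univariate weekly time series, a `control` list of the kind used throughout this vignette might look as follows. This is only a minimal sketch assuming a period of 52 weeks; the seasonal terms are written out explicitly here, while the examples below generate them with `addSeason2formula`:

```r
# Minimal sketch of an hhh4 control list for a univariate weekly time series;
# the sine/cosine pair encodes yearly seasonality (period 52 weeks):
control_sketch <- list(
  end = list(f = ~ 1 + sin(2 * pi * t / 52) + cos(2 * pi * t / 52)),  # endemic nu_t
  ar  = list(f = ~ 1 + sin(2 * pi * t / 52) + cos(2 * pi * t / 52)),  # autoregressive lambda_t
  family = "NegBin1"  # negative binomial with a single overdispersion parameter
)
# such a list is then passed as the second argument of hhh4
```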
The additional functionality offered by `hhh4addon` is the inclusion of higher-order lags, i.e. it provides methods to fit models of the form
$$
X_{it} \mid \mathbf{X}_{t - 1}, \dots, \mathbf{X}_{t - D} \sim \text{NegBin}(\mu_{it}, \psi_i)
$$
$$
\mu_{it} = \nu_{it} + \lambda_{it}\sum_{d = 1}^D \lfloor u_d\rfloor X_{i, t - d} + \phi_{it}\sum_{j\neq i}\sum_{d = 1}^D \lfloor w_{ji}\rfloor \lfloor u_d\rfloor X_{j, t - d}
$$
where the weights $u_d$ are normalized so that $\sum_{d = 1}^D \lfloor u_d\rfloor = 1$. This means that instead of the previous observation $\mathbf{X}_{t - 1}$, a weighted average of the $D$ preceding observations $\mathbf{X}_{t - 1},\dots, \mathbf{X}_{t - D}$ enters. Note that the weights $u_d$ are shared between the autoregressive and neighbourhood components; it is not possible to specify them separately.
Currently four parameterizations of the weights $u_d$ are implemented. The default is a geometric, i.e. exponentially decaying, lag weighting structure
$$
u_d = \alpha(1 - \alpha)^{d - 1}, \quad \alpha \in (0, 1).
$$
A second option is a (shifted) Poisson weighting,
$$
u_d = \frac{\alpha^{d - 1}}{(d - 1)!}\exp(-\alpha), \quad \alpha > 0,
$$
which unlike the geometric formulation does not force the first lag weight $u_1$ to be the largest. Thirdly, there are linearly decaying lag weights,
$$
u_d = \max(1 - \alpha d, 0), \quad \alpha \in (0, 1).
$$
The last pre-implemented lag structure is a simple AR(2) version with
$$
u_1 = \alpha, \quad u_2 = 1 - \alpha, \quad \alpha \in (0, 1).
$$
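To illustrate these weighting schemes, the following base-R snippet (independent of `hhh4addon` itself) computes normalized geometric and shifted Poisson weights for $D = 5$ lags; the values $\alpha = 0.8$ and $\alpha = 2$ are purely illustrative:

```r
# Geometric and shifted Poisson lag weights for D = 5 lags, normalized so that
# they sum to one (example values alpha = 0.8 and alpha = 2, respectively):
D <- 5
d <- 1:D
u_geom <- 0.8 * (1 - 0.8)^(d - 1)                 # geometric, alpha = 0.8
u_pois <- 2^(d - 1) / factorial(d - 1) * exp(-2)  # shifted Poisson, alpha = 2
round(u_geom / sum(u_geom), 3)  # weights decay from lag 1 on
round(u_pois / sum(u_pois), 3)  # largest weight is not at lag 1
```

Truncating at $D$ lags is what makes the explicit normalization necessary; the normalized versions are what the floor notation $\lfloor u_d \rfloor$ refers to.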
When `hhh4addon` is loaded, a modified version `hhh4_lag` of `hhh4` is available. The following additional specifications can be made in the `control` list:

- `funct_lag`: a function to compute lag weights from a (scalar) parameter `par_lag`. Moreover, the function needs to take the smallest and largest lag to receive a positive weight as arguments (`min_lag` and `max_lag`). The four parameterizations mentioned above can be specified via the functions `geometric_lag`, `poisson_lag`, `linear_lag` and `ar2_lag`. Alternatively, a user-defined function can be provided; in this case we recommend consulting the source code of e.g. `geometric_lag` and adapting it accordingly.
- `par_lag`: the weighting parameter entering into `funct_lag`; for `geometric_lag`, `linear_lag` and `ar2_lag` this is $\text{logit}(\alpha)$, for `poisson_lag` it is $\log(\alpha)$. These choices enable unconstrained optimization of the (profile) likelihood when $\alpha$ is estimated from the data (see below).
- `min_lag`: the lowest lag to receive a positive weight (the weights for lags 1 through `min_lag - 1` are forced to zero). The default value 1 should be kept in most cases.
- `max_lag`: the highest lag to be included. Note that the `subset` specified in `control` needs to be compatible with `max_lag` (specifically, `subset[1] > max_lag` needs to hold).

The return object of `hhh4_lag` is of class `hhh4lag`, which inherits from the regular `hhh4` class.
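Since `par_lag` is specified on a transformed scale, it helps to keep the back-transformations in mind. A small base-R illustration, with $\alpha = 0.8$ purely as an example value:

```r
# par_lag scales: logit(alpha) for geometric_lag, linear_lag and ar2_lag;
# log(alpha) for poisson_lag.
alpha <- 0.8
par_lag_logit <- log(alpha / (1 - alpha))  # logit scale
par_lag_log   <- log(alpha)                # log scale
plogis(par_lag_logit)  # inverse logit recovers alpha = 0.8
exp(par_lag_log)       # inverse log recovers alpha = 0.8
```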
We exemplify this extension with a simple univariate analysis of the `salmonella.agona` data from `surveillance` (see plot below). All syntax also translates to the multivariate case.
```r
library(surveillance)
library(hhh4addon)
data("salmonella.agona") # get data
# convert old "disProg" to new "sts" data class:
salmonella <- disProg2sts(salmonella.agona)
# plot data:
plot(salmonella)
```
First we fit a univariate `hhh4` model with only first lags and seasonality in both the endemic (`end`) and epidemic/autoregressive (`ar`) component, i.e.
$$
\mu_t = \nu_t + \lambda_t X_{t - 1},
$$
$$
\log(\nu_t) = \alpha^{(\nu)} + \beta^{(\nu)} \sin(2\pi t/\omega) + \gamma^{(\nu)} \cos(2\pi t/\omega),
$$
$$
\log(\lambda_t) = \alpha^{(\lambda)} + \beta^{(\lambda)} \sin(2\pi t/\omega) + \gamma^{(\lambda)} \cos(2\pi t/\omega).
$$
```r
control_salmonella <- list(end = list(f = addSeason2formula(~ 1)),
                           ar = list(f = addSeason2formula(~ 1)),
                           family = "NegBin1", subset = 6:312)
# Note: we fit to subset 6:312 to ensure comparability with the higher-order lag
# models fitted next.
fit_salmonella_ar1 <- hhh4(salmonella, control_salmonella) # use regular hhh4 function
AIC(fit_salmonella_ar1)
```
Next we fit a higher-order lag model with
$$
\mu_t = \nu_t + \lambda_t\sum_{d = 1}^5 \lfloor u_d\rfloor X_{t - d},
$$
geometric lag weights $u_d$ and a fixed value of $\alpha = 0.8$. We fit the model to data from week 6 onwards.
```r
par_lag_08 <- log(0.8/(1 - 0.8)) # the par_lag value corresponding to alpha = 0.8
control_salmonella_08 <- list(end = list(f = addSeason2formula(~ 1)),
                              ar = list(f = addSeason2formula(~ 1)),
                              funct_lag = geometric_lag, # default
                              max_lag = 5, # default
                              par_lag = par_lag_08, # new parameter
                              family = "NegBin1", subset = 6:312)
# Note that funct_lag = geometric_lag and max_lag = 5 are the defaults in hhh4_lag and
# would not need to be specified explicitly.
fit_salmonella_geom_08 <- hhh4_lag(salmonella, control_salmonella_08) # now use hhh4_lag
plot(fit_salmonella_geom_08, names = "")
AIC(fit_salmonella_geom_08)
```
We can see that in terms of AIC this model is better than the previously fitted model with only first lags. To estimate `par_lag` (i.e. a suitable transformation of $\alpha$) from the data we can use the wrapper `profile_par_lag`, which re-fits the model for different values of `par_lag` and uses numerical optimization (`optim`) to find the optimal one ($\alpha$ is thus estimated via a profile likelihood approach).
```r
control_salmonella_geom <- list(end = list(f = addSeason2formula(~ 1)),
                                ar = list(f = addSeason2formula(~ 1)),
                                funct_lag = geometric_lag, max_lag = 5,
                                family = "NegBin1", subset = 6:312)
fit_salmonella_geom <- profile_par_lag(salmonella, control_salmonella_geom)
AIC(fit_salmonella_geom)
summary(fit_salmonella_geom)
```
The best fit is achieved with $\alpha = 0.56$, i.e. almost half of the epidemic component's contribution comes from lags of order larger than one (the standardized lag weights are stored in `fit_salmonella_geom$distr_lag`).
An (older) alternative for estimating $\alpha$ is `fit_par_lag`, which instead of applying `optim` fits the model for a vector of values for `par_lag` provided by the user (argument `range_par`). Under certain circumstances this is computationally faster than the use of `optim`.
```r
grid_alpha <- seq(from = 0.01, to = 0.99, by = 0.02)
grid_par_lag <- log(grid_alpha/(1 - grid_alpha)) # move to logit scale
fit_salmonella_geom_grid <- fit_par_lag(salmonella, control_salmonella_geom,
                                        range_par = grid_par_lag)
```
The function `fit_par_lag` returns a list containing the best model (`$best_mod`) and the AIC values of the models corresponding to the different values of `par_lag` (`$AICs`). We can thus plot the AIC as a function of $\alpha$.
```r
plot(grid_alpha, fit_salmonella_geom_grid$AICs, type = "l",
     xlab = expression(alpha), ylab = "AIC")
```
A remark on the computation of the AIC values: the AIC of a model which was fitted with `fit_par_lag` or `profile_par_lag` is 2 points higher than that of a model with the same value of `par_lag`, but specified manually instead of being estimated. This is due to the loss of one degree of freedom (estimating `par_lag` costs one additional parameter).
```r
par_lag_0.56 <- log(0.56/(1 - 0.56)) # par_lag value corresponding to alpha = 0.56
control_salmonella_geom.056 <- list(end = list(f = addSeason2formula(~ 1)),
                                    ar = list(f = addSeason2formula(~ 1),
                                              use_distr_lag = TRUE),
                                    par_lag = par_lag_0.56,
                                    family = "NegBin1", subset = 6:312)
fit_salmonella_geom.056 <- hhh4_lag(salmonella, control_salmonella_geom.056)
AIC(fit_salmonella_geom.056); AIC(fit_salmonella_geom); AIC(fit_salmonella_geom_grid$best_mod)
```
For comparison we also fit models with Poisson and AR(2) lags.
```r
# Poisson lags:
control_salmonella_pois <- control_salmonella
control_salmonella_pois$funct_lag <- poisson_lag
fit_salmonella_pois <- profile_par_lag(salmonella, control_salmonella_pois)
AIC(fit_salmonella_pois)
# AR(2) lags:
control_salmonella_ar2 <- control_salmonella
control_salmonella_ar2$funct_lag <- ar2_lag
fit_salmonella_ar2 <- profile_par_lag(salmonella, control_salmonella_ar2)
AIC(fit_salmonella_ar2)
```
These parameterizations lead to a slightly worse model fit than the geometric lag weights.
Most of the functionality for `hhh4` objects is by now also available for `hhh4lag` objects (as returned by `hhh4_lag`, `profile_par_lag` and `fit_par_lag`). For instance, we can simulate from a fitted model. Note that in this case the starting values need to be specified as a matrix with `max_lag` rows.
```r
# simulate 1000 trajectories for weeks 261 through 270, using values from
# 256 through 260 as starting values:
set.seed(17)
sim <- simulate(fit_salmonella_geom, subset = 261:270,
                y.start = fit_salmonella_geom$stsObj@observed[256:260, , drop = FALSE],
                nsim = 1000, simplify = TRUE)
# plot one trajectory:
plot(261:270, sim[, , 1], type = "h", xlab = "week", ylab = "No. infected")
```
Also, we can generate rolling one-week-ahead forecasts (i.e. from models which are iteratively re-fitted each week). This is done using the new function `hhh4addon::oneStepAhead_hhh4lag` (the function `surveillance::oneStepAhead` cannot be applied and will throw an error). By default, the lag weighting parameter `par_lag` is not re-fitted (as this can lead to quite long computation times); to re-fit it, set the argument `refit_par_lag` to `TRUE`. The result of `hhh4addon::oneStepAhead_hhh4lag` can be handled just like the return value of `surveillance::oneStepAhead`. We illustrate this for weeks 261--312 as the validation period.
```r
owa_forecasts_geom <- oneStepAhead_hhh4lag(fit_salmonella_geom, tp = c(260, 311),
                                           refit_par_lag = FALSE)
# forecasts are done for tp[1] + 1, ..., tp[2] + 1
colMeans(scores(owa_forecasts_geom))
# for comparison: model with only first lags:
owa_forecasts_ar1 <- oneStepAhead(fit_salmonella_ar1, tp = c(260, 311))
colMeans(scores(owa_forecasts_ar1)) # average scores
```
The geometric-lag model thus performs slightly better in terms of the logarithmic and Dawid-Sebastiani scores, but worse in terms of the ranked probability score (all scores are negatively oriented, i.e. smaller values indicate better forecasts).
The second functionality of `hhh4addon` concerns predictive and marginal (stationary or periodically stationary) moments. For the endemic-epidemic class these quantities can be computed without the need for simulation; see Held, Meyer and Bracher (2017) and Bracher and Held (2019) for the theoretical background.
To illustrate the use of predictive moments, we re-fit the above models to weeks 6--260 of the `salmonella.agona` data. The rest is again kept for validation of our predictions.
```r
control_salmonella.sub <- list(end = list(f = addSeason2formula(~ 1), lag = 1),
                               ar = list(f = addSeason2formula(~ 1), lag = 1),
                               family = "NegBin1", subset = 6:260)
fit_salmonella_ar1.sub <- hhh4(salmonella, control_salmonella.sub)
fit_salmonella_geom.sub <- profile_par_lag(salmonella, control_salmonella.sub)
```
Predictive moments for weeks 261, 262, ... can now be calculated and plotted. Note that these correspond to a path forecast rather than a weekly updated rolling forecast as before, i.e. the forecasts for weeks 261, 262,... are all conditioned on the observation from week 260.
```r
pred_mom_ar1 <- predictive_moments(fit_salmonella_ar1.sub, t_condition = 260,
                                   lgt = 52, return_Sigma = TRUE)
# if return_Sigma = TRUE, the full predictive covariance matrix is returned;
# this may be costly in terms of storage for complex models and long forecasts.
pred_mom_geom <- predictive_moments(fit_salmonella_geom.sub, t_condition = 260,
                                    lgt = 52, return_Sigma = TRUE)
plot(fit_salmonella_ar1.sub, names = "")
fanplot_prediction(pred_mom_geom, add = TRUE)
```
The fanplots shown here are based on negative binomial approximations of the predictive distributions via the first two moments. We can also use these predictive moments to evaluate the Dawid-Sebastiani score of the path forecasts (a proper scoring rule for predictive model assessment, Held et al 2017).
```r
# Note that ds_score_hhh4 requires the full predictive covariance matrix, i.e.
# return_Sigma = TRUE needs to be set in predictive_moments.
ds_score_hhh4(pred_mom_ar1)
ds_score_hhh4(pred_mom_geom)
```
The Dawid-Sebastiani score is negatively oriented here, too, i.e. the prediction from the model `fit_salmonella_geom.sub` with geometric lags is slightly better than the one from the simpler `fit_salmonella_ar1.sub`.
The `predictive_moments` function can also be used to check that the simulation functions are doing the right thing. Here we can re-use the previously simulated data:
```r
cond_mom_geom_260 <- predictive_moments(fit_salmonella_geom, t_condition = 260, lgt = 10)
plot(261:270, apply(sim, 1:2, mean), xlab = "week", ylab = "predictive mean")
lines(261:270, cond_mom_geom_260$mu_matrix)
legend("topright", pch = c(1, NA), lty = c(NA, 1),
       legend = c("simulation-based", "analytical"), bty = "n")
plot(261:270, apply(sim, 1:2, sd), xlab = "week", ylab = "predictive standard deviation")
lines(261:270, sqrt(cond_mom_geom_260$var_matrix))
```
The agreement between the analytical results and the simulation-based estimates of the predictive moments is relatively poor with 1,000 simulated paths, but becomes very good with 10,000 (not applied here as this is costly in terms of storage and computation time).
The function `stationary_moments` can be applied in the same way (without specifying `t_condition` and `lgt`) to obtain the marginal moments of a fitted model. Note that this is only possible for models with either time-constant parameters (stationary moments) or periodically varying parameters (periodically stationary moments). For horizons exceeding a few weeks, the predictive and marginal moments are indistinguishable (as the forecast converges to the periodically stationary behaviour of the model).
```r
marg_mom_geom <- stationary_moments(fit_salmonella_geom)
fanplot_stationary(marg_mom_geom, timepoints = 1:52, add_legend = TRUE)
```
## References

Bracher, J. and L. Held (2019). Endemic-epidemic models with discrete-time serial interval distributions for infectious disease prediction. Preprint: https://arxiv.org/abs/1901.03090.

Bracher, J. and L. Held (2017). Periodically stationary multivariate autoregressive models. Preprint: https://arxiv.org/abs/1707.04635.

Held, L., S. Meyer, and J. Bracher (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Statistics in Medicine 36 (22), 3443–3460.

Meyer, S., L. Held, and M. Höhle (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. Journal of Statistical Software 77 (11), 1–55.