In HIDDA/forecasting: Forecasting Based on Surveillance Data

knitr::opts_chunk$set(message = FALSE, warning = FALSE, error = FALSE,
                      fig.align = "center", dev.args = list(pointsize = 10))

options(digits = 4)  # for more compact numerical outputs
library("HIDDA.forecasting")
library("ggplot2")

In this vignette, we use modelling and forecasting methods provided by:

library("surveillance")

The corresponding software reference is:

cat("<blockquote>")
print(citation(package = "surveillance", auto = TRUE), style = "html")
cat("</blockquote>\n")

Data

We use age-stratified norovirus surveillance data from Berlin, Germany, together with an age-structured social contact matrix from the POLYMOD survey, as provided in the R package hhh4contacts:

library("hhh4contacts")

These data have been originally analyzed in:

Meyer S and Held L (2017): "Incorporating social contact data in spatio-temporal models for infectious disease spread". Biostatistics, 18(2), pp. 338--351. DOI: 10.1093/biostatistics/kxw051

Here we only consider models for spatially aggregated counts, i.e., without additional stratification by city district. More specifically, we will analyse age-stratified weekly counts $Y_{gt}$ in a similar way as for the reference model 6 in:

Held L, Meyer S and Bracher J (2017): "Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture". Statistics in Medicine, 36(22), pp. 3443--3460. DOI: 10.1002/sim.7363

Berlin norovirus counts

BNV <- noroBE(by = "agegroup", agegroups = c(1, 2, 2, 4, 4, 2),
              timeRange = c("2011-w27", "2016-w26"))
BNV
(NGROUPS <- ncol(BNV))
(GROUPS <- colnames(BNV))

We can plot the observed age-stratified counts using the default plot method for "sts" objects:

plot(BNV)

We will use the first four seasons, from week 2011/27 to week 2015/26, as training data, and assess forecasts during the following year (2015/27 to 2016/26).

TRAIN <- 2:(4*52)
TEST <- max(TRAIN) + 1:52

There is also an autoplot variant based on ggplot2, which we use to plot the age-specific incidence based on the population fractions contained in the BNV object,

(popfracsBE <- population(BNV)[1,])

and Berlin's total population,

(popBE <- sum(pop2011))  # also from the "hhh4contacts" package, see ?pop2011

as follows:

autoplot(BNV, population = 100000/popBE) +
    ## population: divides observed(BNV) by population(BNV)/(100000/popBE)
    ## Mod 1: highlight the Christmas period in each year
    geom_col(aes(fill = epochInYear %in% c(52, 1)),
             width = 7, show.legend = FALSE) +
    scale_fill_manual(values = c("black", "red")) +
    ## Mod 2: separate training and test periods by a vertical line
    geom_vline(aes(xintercept = as.numeric(date)[4*52] + .5), linetype = 2) +
    ylab("Incidence [per 100 000 inhabitants]") + scale_y_sqrt()

Age-structured contact matrix

The neighbourhood slot of the BNV object contains a social contact matrix derived from the German subset of the POLYMOD survey, aggregated to match the age groups of the surveillance data:

neighbourhood(BNV)

Each entry gives the average number of contact persons of a certain age group a participant (of a certain age group) reports on a randomly assigned day, see Mossong et al. (2008, PLoS Medicine, DOI: 10.1371/journal.pmed.0050074). The "agedistri" attribute gives the age distribution of the participants.

We will employ an improved estimate of this contact matrix, which ensures reciprocity on the population level, i.e., the overall number of contacts of age group i with age group j should be the same as vice versa, see Wallinga et al (2006, American Journal of Epidemiology, DOI: 10.1093/aje/kwj317):

(C_reci <- contactmatrix(which = "reciprocal", grouping = c(1, 2, 2, 4, 4, 2)))

We can check reciprocity with respect to Berlin's age distribution (also given in the "agedistri" attribute):

is_reciprocal <- function (C, population, tol = 0.001) {
    Cpop <- C * population
    all.equal(Cpop, t(Cpop), tolerance = tol, check.attributes = FALSE)
}
stopifnot(is_reciprocal(C_reci, popfracsBE))

The hhh4contacts package provides a simple plotting function for such contact matrices:

plotC(C_reci, scales = list(cex = 0.8, x = list(rot = 45)))

NB: A general implementation to extract social contact matrices from surveys, including from POLYMOD, is available via the dedicated R package socialmixr, and I recommend to use that package in future projects.

C <- contactmatrix(grouping = c(1, 2, 2, 4, 4, 2))

## reproduce that matrix using socialmixr's POLYMOD data and contact_matrix()
library("socialmixr")
data("polymod")
cite(polymod)
C2 <- contact_matrix(polymod, countries = "Germany", age.limits = c(0,5,15,25,45,65),
                     missing.participant.age = "remove", missing.contact.age = "keep")
round(C2$matrix, 2)
round(C, 2)
stopifnot(all.equal(C2$matrix[,-ncol(C2$matrix)], C, check.attributes = FALSE))

## reciprocity is obtained differently in socialmixr
polymod2 <- polymod
## remove contacts with unknown age for same data approach
polymod2$contacts <- subset(polymod2$contacts, !is.na(cnt_age_exact) |
                                               (!is.na(cnt_age_est_min) & !is.na(cnt_age_est_max)))
C2_reci <- contact_matrix(polymod2, countries = "Germany", age.limits = c(0,5,15,25,45,65),
                          missing.participant.age = "remove", #missing.contact.age = "keep",
                          survey.pop = data.frame(lower.age.limit = c(0,5,15,25,45,65),
                                                  population = popfracsBE * popBE),
                          symmetric = TRUE)
round(C2_reci$matrix, 2)
stopifnot(is_reciprocal(C2_reci$matrix, popfracsBE))
round(C_reci, 2)

Modelling

Given the counts from the previous week, $Y_{.,t-1}$, we assume $Y_{gt}$ to follow a negative binomial distribution with a group-specific overdispersion parameter and mean $$ \mu_{gt} = \nu_{gt} + \phi_{gt} \sum_{g'} c_{g'g} Y_{g',t-1} . $$ The endemic log-linear predictor $\nu_{gt}$ contains group-specific intercepts, a Christmas effect (via a simple indicator for the calendar weeks 52 and 1), and group-specific seasonal effects of $\sin(\omega t)$ and $\cos(\omega t)$ terms, $\omega=2\pi/52$. The epidemic log-linear predictor $\phi_{gt}$ also contains group-specific intercept, but shared seasonality and no Christmas effect. For the contact matrix we use C_reci from above, normalized to a transition matrix C_reci/rowSums(C_reci), and compare this to models assuming homogeneous or no mixing between age groups, and a model where we estimate a power transformation $C^\kappa$ via profile likelihood (hhh4contacts::fitC()) as in Meyer and Held (2017, see demo("hhh4contacts") for a spatially disaggregated version of the model).

DATAt <- list(t = epoch(BNV) - 1,
              christmas = as.integer(epochInYear(BNV) %in% c(52, 1)))
mg_Creci <- hhh4(BNV, list(
    end = list(f = addSeason2formula(~0 + fe(1, unitSpecific = TRUE) + christmas,
                                     S = rep(1, NGROUPS))),
    ne = list(f = addSeason2formula(~0 + fe(1, unitSpecific = TRUE)),
              weights = matrix(1, NGROUPS, NGROUPS),
              scale = C_reci, normalize = TRUE),
    family = "NegBinM", data = DATAt, subset = TRAIN))

Alternative 1: assuming homogeneous mixing between age groups

mg_Chom <- update(mg_Creci, ne = list(scale = NULL))

Alternative 2: assuming no mixing between age groups

mg_Cdiag <- update(mg_Creci, ne = list(scale = diag(NGROUPS)))

Alternative 3: with a power transformation of the contact matrix

mg_Cpower <- fitC(mg_Creci, C_reci, normalize = TRUE, truncate = TRUE)

Simple AIC comparison of the model fits to the training period:

AIC(mg_Creci, mg_Chom, mg_Cdiag, mg_Cpower)

Parameter estimates from the model with power-adjusted contact matrix:

summary(mg_Cpower, maxEV = TRUE, reparamPsi = TRUE,
        amplitudeShift = TRUE, idx2Exp = TRUE)

## plot estimated endemic-epidemic decomposition
##plot(mg_Cpower, units = NULL, pch = 20,
##     legend = 2, legend.args = list(legend = c("epidemic", "endemic")))

## additional decomposition into AR effects and effects of other age groups
plotHHH4_fitted_groups(mg_Cpower, groups = GROUPS, units = NULL, pch = 20,
  legend = 2, legend.args = list(legend = c("from other groups", "within group", "endemic")))

par(mfrow = c(2,1), mar = c(0,5,1,1), las = 1)
plotHHH4_season_groups(mg_Cpower, component = "end", seasonStart = 27,
                       col = c("#D53E4F", "#FC8D59", "#FEE08B", "#E6F598", "#99D594", "#3288BD"),
                       conf.level = NULL, xaxt = "n", xlab = "", ylim = c(0, 5), yaxs = "i",
                       ylab = "endemic seasonal effects")
par(mar = c(3,5,1,1))
with(data.frame(time = epochInYear(BNV) + (year(BNV)-2011)*52,
                maxEV = getMaxEV(mg_Cpower))[1:52,],
     plot(maxEV ~ time, ylim = c(0, 1), type = "l", lwd = 3, yaxs = "i", xlab = "",
          ylab = "epidemic proportion", panel.first = quote(abline(v=52.5, lty=3))))

One-week-ahead forecasts

Parameters are updated sequentially in the one-step-ahead procedure. However, the power parameter of C is held fixed at the initial estimate to reduce the runtime.

For each model, we compute the r length(TEST) "rolling" one-week-ahead forecasts during the last season. This takes roughly 4 seconds per model (we could parallelize using the cores argument of oneStepAhead()).

fits <- list("reciprocal" = mg_Creci,
             "homogeneous" = mg_Chom,
             "no mixing" = mg_Cdiag,
             "power-adjusted" = mg_Cpower)

owas <- lapply(fits, oneStepAhead, tp = range(TEST)-1, type = "rolling",
               which.start = "final", verbose = FALSE)
save(owas, file = "BNV_owa.RData")

load("BNV_owa.RData")

par(mfrow = c(2,2), mar = c(5,5,1,1), las = 1)
mapply(function(x, main) pit(x, plot = list(main = main, ylab = "Density")),
       x = owas, main = names(owas))

unlist(sapply(owas, calibrationTest, which = "dss")["p.value",])
unlist(sapply(owas, calibrationTest, which = "logs")["p.value",])

owas_scores <- lapply(owas, scores, which = c("dss", "logs"), individual = TRUE, reverse = FALSE)
lapply(owas_scores, function(x) cbind(apply(x, 3:2, mean), overall=apply(x, 3, mean)))

set.seed(1)
sapply(c("dss", "logs"), function (score)
    permutationTest(owas_scores[["reciprocal"]][,,score],
                    owas_scores[["power-adjusted"]][,,score],
                    nPermutation = 999))

par(las = 1, mar = c(0,4,1,0)+.5)
owas_quantiles <- lapply(owas, quantile, probs = 1:99/100)
.osaplot <- function (model) {
    sapply(seq_along(GROUPS), function (group) {
        osaplot(quantiles = owas_quantiles[[model]][,group,], probs = 1:99/100,
                observed = owas[[model]]$observed[,group],
                scores = owas_scores[[model]][,group,],
                start = TEST[1], xlab = "", main = GROUPS[group],
                fan.args = list(ln = c(0.1,0.9), rlab = NULL),
                key.args = if(group==length(GROUPS)) list(start=max(TEST)-2, rcex=0.6),
                scores.args = list(xaxt = "n", ylim = range(owas_scores[[model]]),
                                   lab = c(7,3,0), panel.first = grid(nx=NA,ny=NULL,lty=1)),
                legend.args = if(group==length(GROUPS)) list())
    })
}
.osaplot("no mixing")