twinSIR: Fit an Additive-Multiplicative Intensity Model for SIR Data

Description Usage Arguments Details Value Note Author(s) References See Also Examples

Description

twinSIR is used to fit additive-multiplicative intensity models for epidemics as described in Höhle (2009). Estimation is driven by (penalized) maximum likelihood in the point process frame work. Optimization (maximization) of the (penalized) likelihood function is performed by means of optim.

Usage

1
2
3
twinSIR(formula, data, weights, subset,
        knots = NULL, nIntervals = 1, lambda.smooth = 0, penalty = 1,
        optim.args = list(), model = TRUE, keep.data = FALSE)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the intensity model to be estimated. The details of model specification are given under Details.

data

an object inheriting from class "epidata".

weights

an optional vector of weights to be used in the fitting process. Should be NULL (the default, i.e. all observations have unit weight) or a numeric vector.

subset

an optional vector specifying a subset of observations to be used in the fitting process. The subset atRiskY == 1 is automatically chosen, because the likelihood only depends on those observations.

knots

numeric vector or NULL (the default). Specification of the knots, where we suppose a step of the log-baseline. With the current implementation, these must be existing "stop" time points in subset(data, atRiskY == 1). The intervals of constant log-baseline hazard rate then are (minTime;knots_1], (knots_1;knots_2], ..., (knots_K;maxTime]. By default, the knots are automatically chosen at the quantiles of the infection time points such that nIntervals intervals result. Non-NULL knots take precedence over nIntervals.

nIntervals

the number of intervals of constant log-baseline hazard. Defaults to 1, which means an overall constant log-baseline hazard will be fitted.

lambda.smooth

numeric, the smoothing parameter λ. By default it is 0 which leads to unpenalized likelihood inference. In case lambda.smooth=-1, the automatic smoothing parameter selection based on a mixed model approach is used (cf. Höhle, 2009).

penalty

either a single number denoting the order of the difference used to penalize the log-baseline coefficients (defaults to 1), or a more specific penalty matrix K for the parameter sub-vector β. In case of non-equidistant knots – usually the case when using quantile based knot locations – a 1st order differences penalty matrix as in Fahrmeir and Lang (2001) is available. For non-equidistant knots higher orders than one are not implemented.

optim.args

a list with arguments passed to the optim function. Especially useful are the following ones:

par:

to specify initial parameter values. Those must be in the order c(alpha, h0, beta), i.e. first the coefficients of the epidemic covariates in the same order as they appear in the formula, then the log-baseline levels in chronological order and finally the coefficients of the endemic covariates in the same order as they appear in the cox terms of the formula. The default is to start with 1's for alpha and 0's for h0 and beta.

control:

for more detailed trace-ing (default: 1), another REPORT-ing frequency if trace is positive (default: 10), higher maxit (maximum number of iterations, default: 300) or another factr value (default: 1e7, a lower value means higher precision).

method:

the optimization algorithm defaults to "L-BFGS-B" (for box-constrained optimization), if there are any epidemic (non-cox) variables in the model, and to "BFGS" otherwise.

lower:

if method = "L-BFGS-B" this defines the lower bounds for the model coefficients. By default, all effects α of epidemic variables are restricted to be non-negative. Normally, this is exactly what one would like to have, but there might be reasons for other lower bounds, see the Note below.

hessian:

An estimation of the Expected Fisher Information matrix is always part of the return value of the function. It might be interesting to see the Observed Fisher Information (= negative Hessian at the maximum), too. This will be additionally returned if hessian = TRUE.

model

logical indicating if the model frame, the weights, lambda.smooth, the penalty matrix K and the list of used distance functions f (from attributes(data)) should be returned for further computation. This defaults to TRUE as this information is necessary e.g. in the profile and plot methods.

keep.data

logical indicating if the "epidata" object (data) should be part of the return value. This is only necessary for use of the simulate-method for "twinSIR" objects. The reason is that the twinSIR function only uses and stores the rows with atRiskY == 1 in the model component, but for the simulation of new epidemic data one needs the whole data set with all individuals in every time block. The default value is FALSE, so if you intent to use simulate.twinSIR, you have to set this to TRUE.

Details

A model is specified through the formula, which has the form

~ epidemicTerm1 + epidemicTerm2 + cox(endemicVar1) * cox(endemicVar2),

i.e. the right hand side has the usual form as in lm with some variables marked as being endemic by the special function cox. The left hand side of the formula is empty and will be set internally to cbind(start, stop, event), which is similar to Surv(start, stop, event, type="counting").

Basically, the additive-multiplicative model for the infection intensity λ_i(t) for individual i is

λ_i(t) = Y_i(t) * (e_i(t) + h_i(t))

where

Y\_i(t)

is the at-risk indicator, indicating if individual i is “at risk” of becoming infected at time point t. This variable is part of the event history data.

e\_i(t)

is the epidemic component of the infection intensity, defined as

e_i(t) = ∑_{j \in I(t)} f(||s_i - s_j||)

where I(t) is the set of infectious individuals just before time point t, s_i is the coordinate vector of individual i and the function f is defined as

f(u) = ∑_{m=1}^p α_m B_m(u)

with unknown transmission parameters α and known distance functions B_m. This set of distance functions results in the set of epidemic variables normally calculated by the converter function as.epidata, considering the equality

e_i(t) = ∑_{m=1}^p α_m x_{im}(t)

with x_{im}(t) = ∑_{j \in I(t)} B_m(||s_i - s_j||) being the m'th epidemic variable for individual i.

h\_i(t)

is the endemic (cox) component of the infection intensity, defined as

h_i(t) = \exp(h_0(t) + z_i(t)' β)

where h_0(t) is the log-baseline hazard function, z_i(t) is the vector of endemic covariates of individual i and β is the vector of unknown coefficients. To fit the model, the log-baseline hazard function is approximated by a piecewise constant function with known knots, but unknown levels, which will be estimated. The approximation is specified by the arguments knots or nIntervals.

If a big number of knots (or nIntervals) is chosen, the corresponding log-baseline parameters can be rendered identifiable by the use of penalized likelihood inference. At present, it is the job of the user to choose an adequate value of the smoothing parameter lambda.smooth. Alternatively, a data driven lambda.smooth smoothing parameter selection based on a mixed model representation of an equivalent truncated power spline is offered (see reference for further details). The following two steps are iterated until converegence:

  1. Given fixed smoothing parameter, the penalized likelihood is optimized for the regression components using a L-BFGS-B approach

  2. Given fixed regression parameters, a Laplace approximation of the marginal likelihood for the smoothing parameter is numerically optimized.

Depending on the data convergence might take a couple of iterations.

Note also that it is unwise to include endemic covariates with huge values, as they affect the intensities on the exponential scale after having been multiplied by the parameter vector β. With big covariates the optim method "L-BFGS-B" will likely terminate due to an infinite log-likelihood or score function in some iteration.

Value

twinSIR returns an object of class "twinSIR". An object of this class is a list containing the following components:

coefficients

a named vector of coefficients.

loglik

the maximum of the (penalized) log-likelihood function.

counts

the number of log-likelihood and score function evaluations.

converged

logical indicating convergence of the optimization algorithm.

fisherinfo.observed

if requested, the negative Hessian from optim.

fisherinfo

an estimation of the Expected Fisher Information matrix.

method

the optimization algorithm used.

intervals

a numeric vector (c(minTime, knots, maxTime)) representing the consecutive intervals of constant log-baseline.

nEvents

a numeric vector containing the number of infections in each of the above intervals.

model

if requested, the model information used. This is a list with components "survs" (data.frame with the id, start, stop and event columns), "X" (matrix of the epidemic variables), "Z" (matrix of the endemic variables), "weights" (the specified weights), "lambda.smooth" (the specified lambda.smooth), "K" (the penalty matrix used), and "f" and "w" (the functions to generate the used epidemic covariates). Be aware that the model only contains those rows with atRiskY == 1!

data

if requested, the supplied "epidata" data.

call

the matched call.

formula

the specified formula.

terms

the terms object used.

Note

There are some restrictions to modelling the infection intensity without a baseline hazard rate, i.e. without an intercept in the formula. Reason: At some point, the optimization algorithm L-BFGS-B tries to set all transmission parameters α to the boundary value 0 and to calculate the (penalized) score function with this set of parameters (all 0). The problem then is that the values of the infection intensities lambda_i(t) are 0 for all i and t and especially at observed event times, which is impossible. Without a baseline, it is not allowed to have all alpha's set to 0, because then we would not observe any infections. Unfortunately, L-BFGS-B can not consider this restriction. Thus, if one wants to fit a model without baseline hazard, the control parameter lower must be specified in optim.args so that some alpha is strictly positive, e.g. optim.args = list(lower = c(0,0.001,0.001,0)) and the initial parameter vector par must not be the zero vector.

Author(s)

Michael Höhle and Sebastian Meyer

References

Höhle, M. (2009), Additive-Multiplicative Regression Models for Spatio-Temporal Epidemics, Biometrical Journal, 51(6):961-978.

See Also

as.epidata for the necessary data input structure, plot.twinSIR for plotting the path of the infection intensity, profile.twinSIR for profile likelihood estimation. and simulate.twinSIR for the simulation of epidemics following the fitted model.

Furthermore, the standard extraction methods vcov, logLik, AIC and extractAIC are implemented for objects of class "twinSIR".

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
data("fooepidata")
summary(fooepidata)

# fit an overall constant baseline hazard rate
fit1 <- twinSIR(~ B1 + B2 + cox(z2), data = fooepidata)
fit1
summary(fit1)

# fit1 is what is used as data("foofit") in other examples
data("foofit")
stopifnot(all.equal(fit1, foofit))

# fit a piecewise constant baseline hazard rate with 3 intervals using 
# _un_penalized ML and estimated coefs from fit1 as starting values 
fit2 <- twinSIR(~ B1 + B2 + cox(z2), data = fooepidata, nIntervals = 3,
    optim.args = list(par=c(coef(fit1)[1:2],rep(coef(fit1)[3],3),coef(fit1)[4])))
fit2
summary(fit2)

# fit a piecewise constant baseline hazard rate with 9 intervals
# using _penalized_ ML and estimated coefs from fit1 as starting values 
fit3 <- twinSIR(~ B1 + B2 + cox(z2), data = fooepidata, nIntervals = 9,
    lambda.smooth = 0.1, penalty = 1, optim.args = list(
        par=c(coef(fit1)[1:2], rep(coef(fit1)[3],9), coef(fit1)[4])))
fit3
summary(fit3)
# plot of the 9 log-baseline levels
plot(x=fit3$intervals, y=coef(fit3)[c(3,3:11)], type="S")

### -> for more sophisticated intensity plots, see 'plot.twinSIR'
plot(fit3)

jimhester/surveillance documentation built on May 19, 2019, 10:33 a.m.