In jameshay218/lazymcmc: MCMC for the lazy

library(lazymcmc)
knitr::opts_chunk$set(echo = TRUE, fig.width=7,fig.height=6,
                      message=FALSE, results=FALSE,warning=FALSE)

Foreword

This vignette is intended to show users how to fit an SIR model (ie. estimate parameter values) to some artificial data. Once you've understood this, it should be fairly straightforward to write your own generic model and fit to it. There are some tips at the bottom of this vignette to anticipate potential questions.

The SIR model

The first step is to specify our really simple SIR model. See here, or any other reference for a summary. In particular, we'll implement the Kermack-McKendrick Model. We'll ignore vital dynamics so that all we have to worry about is infections and recoveries. The key here is that the SIR model is defined by a set of nonlinear ordinary differential equations, so we need to numerically solve these.

## Define the set of ODEs for the model. Return such that we can solve with deSolve
SIR_odes <- function(t, x, params) {
    S <- x[1]
    I <- x[2]
    R <- x[3]
    beta <- params[1]
    gamma <- params[2]
    dS <- -beta*S*I
    dI <- beta*S*I - gamma*I
    dR <- gamma*I
    list(c(dS,dI,dR))
}
## Make an R0 of something flu like
R0 <- 1.5

## People recover after 5 days, so recovery rate is 1/5
gamma <- 1/5

## So contact rate for this would be beta = gamma*R0
beta <- R0*gamma

## Times to solve model over
t <- seq(0,365,by=0.1)

## Note starting conditions for population size - we're working per capita here
results <- as.data.frame(deSolve::ode(y=c(S=1,I=0.0001,R=0),
                             times=t, func=SIR_odes,
                             parms=c(beta,gamma)))

## Plot the results
plot(results$I~results$time,type='l',col="red",
     ylim=c(0,1),xlab="Time (days)", ylab="Per capita",
     main="SIR dynamics")
lines(results$R~results$time,col="blue")
lines(results$S~results$time,col="green")
legend("topright", title="Key",
       legend=c("Susceptible","Infected","Recovered"),
       col=c("green","red","blue"),
       lty=c(1,1,1))

Artificial data

Let's pretend that this model generated some real data that was observed, so we need to move to real numbers rather than per capita. Let's say it's a population of 1000 people, and 1 person was infected.

Of course, there was some noise introduced when observing the data, and only infected people came into the healthcare system to be observed. We could use the "Infected" column directly, but this is actually a measure of disease prevalence. Incidence makes more sense, as we observe individuals when they become ill (with some reporting delay which we'll ignore).

Our observed data will therefore be the cumulative difference of the susceptible column. Let's also assume that we observe the total incidence every 7 days, and that the observations are drawn from a poisson distribution where the mean is the true incidence. Our artificial data is generated as:

## Starting population size
S0 <- 999
I0 <- 1
R0 <- 0

## Changing variable name because of clash with recovereds
basic_repro_number <- 1.5
## So contact rate for this would be beta = gamma*basic_repro_number/N
beta <- basic_repro_number*gamma/sum(S0, I0, R0)

## Times to solve model over
t <- seq(1,52*7,by=1)

## Note starting conditions for population size - we're working per capita here
results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0),
                             times=t, func=SIR_odes,
                             parms=c(beta,gamma)))

## Real data is discrete and has errors. 
## Let's generate this from the infected population:
incidence <- c(0,-diff(results$S))
## Get weekly incidence
weekly_incidence <- colSums(matrix(incidence,nrow=7))
## 
real_data <- data.frame(times=seq(1,52*7,by=7),inc=rpois(length(weekly_incidence),weekly_incidence))

## Plot these data with the observations (dots) overlaid
plot(real_data,type='l',col="red",
     xlab="Time (days)", ylab="Number of new infected people",
     main="Observed weekly incidence")

Likelihood function

Now that we have our data generating process, model, and assumptions about the error, we can write a likelihood function. The basic idea is that we assume the model generates the true incidence, and we make an observation from this with some error ie. the poisson distribution. The likelihood function alongside the model solver can be written as:

dat <- real_data
## Note that S0, I0, R0, t and SIR_odes are declared outside of
## this function, but are still accessible in the wider R
## environment
likelihood_func <- function(pars){
  beta <- pars[1]
  gamma <- pars[2]
  results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0),
                             times=t, func=SIR_odes,
                             parms=c(beta,gamma)))
  incidence <- c(0,-diff(results$S))

  ## Get weekly incidence
  weekly_incidence <- colSums(matrix(incidence,nrow=7))

  lik <- sum(dpois(dat$inc,weekly_incidence,log=TRUE))
  lik
}
real_pars <- c(beta,gamma)

## Likelihood of true pars
print(likelihood_func(real_pars))

lazymcmc

We now have everything we need to run the MCMC and reestimate the parameters of the SIR model conditional on the observed data.

We need to make some changes to make parameter exploration a bit easier. This is where some more complicated packages might help you - lazymcmc leaves more to the user.

Firstly, let's work in the parameter space of R0 and gamma rather than beta and gamma. The reason being is that deSolve struggles with SIR models for silly values of R0, which is difficult to manage if we are working with beta. This way, we can bound R0 from below at 1 (and above by something reasonable, say 10 here). We'll make some pretty strong assumptions about the limits for gamma as well, just to get the ball rolling here. Let's assume that our prior on gamma is that recovery time is somewhere between 1 and Inf days.

The only thing we haven't looked at for lazymcmc is the parTab setup, which is straightforward. We just need columns for the parameter starting values, names, whether they are fixed or to be fitted, initial proposal step size, lower and upper bounds.

parTab <- data.frame(names=c("R0","gamma"),
                     values=c(1.5,0.2),
                     fixed=c(0,0),
                     steps=c(0.1,0.1),
                     lower_bound=c(1,0),
                     upper_bound=c(10,1))

mcmcPars <- c("iterations"=2000,"popt"=0.44,"opt_freq"=100,
              "thin"=1,"adaptive_period"=1000,"save_block"=100)


## Putting model solving code in a function for later use
solve_model <- function(pars, t, S0, I0, R0, SIR_odes){
  N <- sum(S0, I0, R0)
  basic_repro_number <- pars["R0"]
  gamma <- pars["gamma"]
  beta <- basic_repro_number*gamma/N
  results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0),
                               times=t, func=SIR_odes,
                               parms=c(beta,gamma)))
  incidence <- c(0,-diff(results$S))
  return(incidence)
}

## We need to put our likleihood function in a closure environment
## It's important to write the function in this form!
create_lik <- function(parTab, data, PRIOR_FUNC,...){
  par_names <- parTab$names

  ## Extract observed incidence
  inc <- data$inc

  ## Get times to solve model over
  tstep <- 1
  max_weeks <- 52
  t <- seq(tstep,max_weeks*7,by=tstep)

  ## We will pass S0, I0, R0 and SIR_odes 
  N <- S0 + I0 + R0
  ## using the `...` bit.
  likelihood_func <- function(pars){
    names(pars) <- par_names
    incidence <- solve_model(pars, t, S0, I0, R0, SIR_odes)
    ## Get weekly incidence
    weekly_incidence <- colSums(matrix(incidence,nrow=7/tstep))

    lik <- sum(dpois(x=inc,lambda=weekly_incidence,log=TRUE))
    if(!is.null(PRIOR_FUNC)) lik <- lik + PRIOR_FUNC(pars)
    lik
  }
}

## Starting points, chosen pseudo at random
## seeding chains for SIR models is hard with deSolve,
## so I've chosen points near the true values.
startTab <- parTab
startTab$values <- c(2.3, 0.17)

output <- run_MCMC(parTab=startTab, data=dat, mcmcPars=mcmcPars, filename="SIR_fitting",
                   CREATE_POSTERIOR_FUNC = create_lik, mvrPars = NULL, PRIOR_FUNC=NULL,
                   S0=999,I0=1,R0=0,SIR_odes=SIR_odes)
chain <- read.csv(output$file)
plot(coda::as.mcmc(chain[,c("R0","gamma")]))

We've restimated our original R0 and gamma, great! Convergence isn't amazing yet. Because R0 as a term contains gamma, there is probably some correlation which makes a univariate sampler quite inefficient:

plot(chain$R0~chain$gamma)

A sampler that takes covariance into account is going to be more efficient here. Let's add a simple prior function as well, just to show how it works. Let's run the chain again using the multivariate sampler this time:

## Prior function
my_prior <- function(pars){
  ## Diffuse gaussian priors on both parameters
  r0_prior <- dnorm(pars[1],1.5,100,1)
  gamma_prior <- dnorm(pars[2],0.2,100,1)
  return(r0_prior + gamma_prior)
}

## Use the previous chain to get a good starting
## covariance matrix
startTab$values <- get_best_pars(chain)
chain <- chain[chain$sampno >= mcmcPars["adaptive_period"],2:(ncol(chain)-1)]
covMat <- cov(chain)
mvrPars <- list(covMat,2.38/sqrt(nrow(parTab[parTab$fixed==0,])),w=0.8)

## Remove restrictions on parameter bounds -
## not important with the multivariate sampler
startTab$lower_bound <- -Inf
startTab$upper_bound <- Inf

## Run with mvrPars
mcmcPars <- c("iterations"=20000,"popt"=0.234,"opt_freq"=500,
              "thin"=1,"adaptive_period"=7500,"save_block"=100)

output2 <- run_MCMC(parTab=startTab, data=dat, mcmcPars=mcmcPars, filename="SIR_fitting",
                   CREATE_POSTERIOR_FUNC = create_lik, mvrPars = mvrPars, PRIOR_FUNC=my_prior,
                   S0=999,I0=1,R0=0,SIR_odes=SIR_odes)
chain2 <- read.csv(output2$file)
chain2 <- chain2[chain2$sampno >= mcmcPars["adaptive_period"],]
plot(coda::as.mcmc(chain2[,c("R0","gamma")]))

Convergence looks much nicer - we've reestimated our R0 of 1.5 and gamma of 0.2. We could do some diagnostics with the coda package on the chain object if we want, but that's beyond the scope of this vignette. The last thing to do is to plot our estimates against the data to make sure that our answer makes sense.

## Get the maximum likelihood parameters and estimate
## the model trajectory for each week
best_pars <- get_best_pars(chain2)
times <- seq(1,52*7,by=1)
best_trajectory <- solve_model(best_pars,times,S0, I0, R0, SIR_odes)
best_incidence <- colSums(matrix(best_trajectory,nrow=7))
best_dat <- data.frame(t=seq(1,52,by=1), y=best_incidence)

## Bit of code to generate prediction intervals
n_samps <- 100
trajectories <- matrix(nrow=n_samps,ncol=52)
samps <- sample(nrow(chain2),n_samps)
for(i in 1:n_samps){
  pars <- as.numeric(chain2[samps[i],c("R0","gamma")])
  names(pars) <- c("R0","gamma")
  tmp <- solve_model(pars, times, S0, I0, R0, SIR_odes)
  trajectories[i,] <- colSums(matrix(tmp, nrow=7))
}
bounds <- apply(trajectories,2,function(x) quantile(x, c(0.025,0.975)))

plot(dat$inc~seq(1,52,by=1),col="red",pch=16, 
     xlab="Time (weeks)",ylab="Incidence",ylim=c(0,100))
lines(best_dat,col="blue")
lines(bounds[1,],col="blue",lty=2,lwd=0.5)
lines(bounds[2,],col="blue",lty=2,lwd=0.5)
legend("topright",c("Data","Model","95% credible intervals"),
       col=c("red","blue","blue"),lty=c(0,1,2),pch=c(16,NA,NA))

The fit looks good, so we can be pretty confident that our model fitting has worked correctly.

Summary

The key points to remember when using the package are:

Make sure that your likelihood calculating function returns a single value (the log likelihood), and takes only parTab, data, PRIOR_FUNC, and then any other arguments using .... It's helpful to abstract away as much of your solving code as possible.
The key bit of syntax is creating a function that creates another function taking only a single vector of model parameters, pars as in the example.
From the MCMC output, you can read in the MCMC and do any manipulation/tests that you like.
The default univariate sampler doesn't work so well when we have correlated parameters. I tend to use the univariate version to get an initial covariance matrix, and then run the multivariate sampler.

Tips

Starting points for your MCMC can be tricky. Ideally, you'd start at any random point from your prior. For the SIR model here, there's actually quite a narrow band of parameter values that are easily solvable with deSolve, so we put fairly strong bounds in the parTab argument. It's left to the user to decide the best way to seed your chain.
The steps column of parTab indicates the size of the univariate proposals. It is important to note that the sampler converts your model parameters to a unit scale between 0 and 1. If your true parameter range is -50 to 50, then a step size of 0.1 is actually performing jumps of size 1 in terms of real values. Have a look at the code for univ_proposal if this is confusing. A REALLY important result of this is that when using a univariate sampler, using lower and upper bounds of Inf will really mess up the interpretation of step size. You therefore need to have finite bounds with the univariate sampler.
It's really easy to change the flag for fixed in parTab - set it to 1 to fix a parameter.
There's minimal error checking with the MCMC code - I would advise testing your model solving and likelihood calculating functions before running the MCMC. It's not happy when your functions return NaN or NA!
You need to be aware that the parTab puts implicit uniform priors on your parameters via the bounds. Be aware of this when specifying additional prior functions (ie. avoid double specification of priors).