library(lazymcmc) knitr::opts_chunk$set(echo = TRUE, fig.width=7,fig.height=6, message=FALSE, results=FALSE,warning=FALSE)
This vignette is intended to show users how to fit an SIR model (ie. estimate parameter values) to some artificial data. Once you've understood this, it should be fairly straightforward to write your own generic model and fit to it. There are some tips at the bottom of this vignette to anticipate potential questions.
The first step is to specify our really simple SIR model. See here, or any other reference for a summary. In particular, we'll implement the Kermack-McKendrick Model. We'll ignore vital dynamics so that all we have to worry about is infections and recoveries. The key here is that the SIR model is defined by a set of nonlinear ordinary differential equations, so we need to numerically solve these.
## Define the set of ODEs for the model. Return such that we can solve with deSolve SIR_odes <- function(t, x, params) { S <- x[1] I <- x[2] R <- x[3] beta <- params[1] gamma <- params[2] dS <- -beta*S*I dI <- beta*S*I - gamma*I dR <- gamma*I list(c(dS,dI,dR)) } ## Make an R0 of something flu like R0 <- 1.5 ## People recover after 5 days, so recovery rate is 1/5 gamma <- 1/5 ## So contact rate for this would be beta = gamma*R0 beta <- R0*gamma ## Times to solve model over t <- seq(0,365,by=0.1) ## Note starting conditions for population size - we're working per capita here results <- as.data.frame(deSolve::ode(y=c(S=1,I=0.0001,R=0), times=t, func=SIR_odes, parms=c(beta,gamma))) ## Plot the results plot(results$I~results$time,type='l',col="red", ylim=c(0,1),xlab="Time (days)", ylab="Per capita", main="SIR dynamics") lines(results$R~results$time,col="blue") lines(results$S~results$time,col="green") legend("topright", title="Key", legend=c("Susceptible","Infected","Recovered"), col=c("green","red","blue"), lty=c(1,1,1))
Let's pretend that this model generated some real data that was observed, so we need to move to real numbers rather than per capita. Let's say it's a population of 1000 people, and 1 person was infected.
Of course, there was some noise introduced when observing the data, and only infected people came into the healthcare system to be observed. We could use the "Infected" column directly, but this is actually a measure of disease prevalence. Incidence makes more sense, as we observe individuals when they become ill (with some reporting delay which we'll ignore).
Our observed data will therefore be the cumulative difference of the susceptible column. Let's also assume that we observe the total incidence every 7 days, and that the observations are drawn from a poisson distribution where the mean is the true incidence. Our artificial data is generated as:
## Starting population size S0 <- 999 I0 <- 1 R0 <- 0 ## Changing variable name because of clash with recovereds basic_repro_number <- 1.5 ## So contact rate for this would be beta = gamma*basic_repro_number/N beta <- basic_repro_number*gamma/sum(S0, I0, R0) ## Times to solve model over t <- seq(1,52*7,by=1) ## Note starting conditions for population size - we're working per capita here results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0), times=t, func=SIR_odes, parms=c(beta,gamma))) ## Real data is discrete and has errors. ## Let's generate this from the infected population: incidence <- c(0,-diff(results$S)) ## Get weekly incidence weekly_incidence <- colSums(matrix(incidence,nrow=7)) ## real_data <- data.frame(times=seq(1,52*7,by=7),inc=rpois(length(weekly_incidence),weekly_incidence)) ## Plot these data with the observations (dots) overlaid plot(real_data,type='l',col="red", xlab="Time (days)", ylab="Number of new infected people", main="Observed weekly incidence")
Now that we have our data generating process, model, and assumptions about the error, we can write a likelihood function. The basic idea is that we assume the model generates the true incidence, and we make an observation from this with some error ie. the poisson distribution. The likelihood function alongside the model solver can be written as:
dat <- real_data ## Note that S0, I0, R0, t and SIR_odes are declared outside of ## this function, but are still accessible in the wider R ## environment likelihood_func <- function(pars){ beta <- pars[1] gamma <- pars[2] results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0), times=t, func=SIR_odes, parms=c(beta,gamma))) incidence <- c(0,-diff(results$S)) ## Get weekly incidence weekly_incidence <- colSums(matrix(incidence,nrow=7)) lik <- sum(dpois(dat$inc,weekly_incidence,log=TRUE)) lik } real_pars <- c(beta,gamma) ## Likelihood of true pars print(likelihood_func(real_pars))
We now have everything we need to run the MCMC and reestimate the parameters of the SIR model conditional on the observed data.
We need to make some changes to make parameter exploration a bit easier. This is where some more complicated packages might help you - lazymcmc leaves more to the user.
Firstly, let's work in the parameter space of R0
and gamma
rather than beta
and gamma
. The reason being is that deSolve
struggles with SIR models for silly values of R0
, which is difficult to manage if we are working with beta
. This way, we can bound R0
from below at 1 (and above by something reasonable, say 10 here). We'll make some pretty strong assumptions about the limits for gamma as well, just to get the ball rolling here. Let's assume that our prior on gamma is that recovery time is somewhere between 1 and Inf
days.
The only thing we haven't looked at for lazymcmc
is the parTab
setup, which is straightforward. We just need columns for the parameter starting values, names, whether they are fixed or to be fitted, initial proposal step size, lower and upper bounds.
parTab <- data.frame(names=c("R0","gamma"), values=c(1.5,0.2), fixed=c(0,0), steps=c(0.1,0.1), lower_bound=c(1,0), upper_bound=c(10,1)) mcmcPars <- c("iterations"=2000,"popt"=0.44,"opt_freq"=100, "thin"=1,"adaptive_period"=1000,"save_block"=100) ## Putting model solving code in a function for later use solve_model <- function(pars, t, S0, I0, R0, SIR_odes){ N <- sum(S0, I0, R0) basic_repro_number <- pars["R0"] gamma <- pars["gamma"] beta <- basic_repro_number*gamma/N results <- as.data.frame(deSolve::ode(y=c(S=S0,I=I0,R=R0), times=t, func=SIR_odes, parms=c(beta,gamma))) incidence <- c(0,-diff(results$S)) return(incidence) } ## We need to put our likleihood function in a closure environment ## It's important to write the function in this form! create_lik <- function(parTab, data, PRIOR_FUNC,...){ par_names <- parTab$names ## Extract observed incidence inc <- data$inc ## Get times to solve model over tstep <- 1 max_weeks <- 52 t <- seq(tstep,max_weeks*7,by=tstep) ## We will pass S0, I0, R0 and SIR_odes N <- S0 + I0 + R0 ## using the `...` bit. likelihood_func <- function(pars){ names(pars) <- par_names incidence <- solve_model(pars, t, S0, I0, R0, SIR_odes) ## Get weekly incidence weekly_incidence <- colSums(matrix(incidence,nrow=7/tstep)) lik <- sum(dpois(x=inc,lambda=weekly_incidence,log=TRUE)) if(!is.null(PRIOR_FUNC)) lik <- lik + PRIOR_FUNC(pars) lik } } ## Starting points, chosen pseudo at random ## seeding chains for SIR models is hard with deSolve, ## so I've chosen points near the true values. startTab <- parTab startTab$values <- c(2.3, 0.17) output <- run_MCMC(parTab=startTab, data=dat, mcmcPars=mcmcPars, filename="SIR_fitting", CREATE_POSTERIOR_FUNC = create_lik, mvrPars = NULL, PRIOR_FUNC=NULL, S0=999,I0=1,R0=0,SIR_odes=SIR_odes) chain <- read.csv(output$file) plot(coda::as.mcmc(chain[,c("R0","gamma")]))
We've restimated our original R0
and gamma
, great! Convergence isn't amazing yet. Because R0
as a term contains gamma
, there is probably some correlation which makes a univariate sampler quite inefficient:
plot(chain$R0~chain$gamma)
A sampler that takes covariance into account is going to be more efficient here. Let's add a simple prior function as well, just to show how it works. Let's run the chain again using the multivariate sampler this time:
## Prior function my_prior <- function(pars){ ## Diffuse gaussian priors on both parameters r0_prior <- dnorm(pars[1],1.5,100,1) gamma_prior <- dnorm(pars[2],0.2,100,1) return(r0_prior + gamma_prior) } ## Use the previous chain to get a good starting ## covariance matrix startTab$values <- get_best_pars(chain) chain <- chain[chain$sampno >= mcmcPars["adaptive_period"],2:(ncol(chain)-1)] covMat <- cov(chain) mvrPars <- list(covMat,2.38/sqrt(nrow(parTab[parTab$fixed==0,])),w=0.8) ## Remove restrictions on parameter bounds - ## not important with the multivariate sampler startTab$lower_bound <- -Inf startTab$upper_bound <- Inf ## Run with mvrPars mcmcPars <- c("iterations"=20000,"popt"=0.234,"opt_freq"=500, "thin"=1,"adaptive_period"=7500,"save_block"=100) output2 <- run_MCMC(parTab=startTab, data=dat, mcmcPars=mcmcPars, filename="SIR_fitting", CREATE_POSTERIOR_FUNC = create_lik, mvrPars = mvrPars, PRIOR_FUNC=my_prior, S0=999,I0=1,R0=0,SIR_odes=SIR_odes) chain2 <- read.csv(output2$file) chain2 <- chain2[chain2$sampno >= mcmcPars["adaptive_period"],] plot(coda::as.mcmc(chain2[,c("R0","gamma")]))
Convergence looks much nicer - we've reestimated our R0
of 1.5 and gamma
of 0.2. We could do some diagnostics with the coda
package on the chain
object if we want, but that's beyond the scope of this vignette. The last thing to do is to plot our estimates against the data to make sure that our answer makes sense.
## Get the maximum likelihood parameters and estimate ## the model trajectory for each week best_pars <- get_best_pars(chain2) times <- seq(1,52*7,by=1) best_trajectory <- solve_model(best_pars,times,S0, I0, R0, SIR_odes) best_incidence <- colSums(matrix(best_trajectory,nrow=7)) best_dat <- data.frame(t=seq(1,52,by=1), y=best_incidence) ## Bit of code to generate prediction intervals n_samps <- 100 trajectories <- matrix(nrow=n_samps,ncol=52) samps <- sample(nrow(chain2),n_samps) for(i in 1:n_samps){ pars <- as.numeric(chain2[samps[i],c("R0","gamma")]) names(pars) <- c("R0","gamma") tmp <- solve_model(pars, times, S0, I0, R0, SIR_odes) trajectories[i,] <- colSums(matrix(tmp, nrow=7)) } bounds <- apply(trajectories,2,function(x) quantile(x, c(0.025,0.975))) plot(dat$inc~seq(1,52,by=1),col="red",pch=16, xlab="Time (weeks)",ylab="Incidence",ylim=c(0,100)) lines(best_dat,col="blue") lines(bounds[1,],col="blue",lty=2,lwd=0.5) lines(bounds[2,],col="blue",lty=2,lwd=0.5) legend("topright",c("Data","Model","95% credible intervals"), col=c("red","blue","blue"),lty=c(0,1,2),pch=c(16,NA,NA))
The fit looks good, so we can be pretty confident that our model fitting has worked correctly.
The key points to remember when using the package are:
parTab
, data
, PRIOR_FUNC
, and then any other arguments using ...
. It's helpful to abstract away as much of your solving code as possible.pars
as in the example.deSolve
, so we put fairly strong bounds in the parTab
argument. It's left to the user to decide the best way to seed your chain.steps
column of parTab
indicates the size of the univariate proposals. It is important to note that the sampler converts your model parameters to a unit scale between 0 and 1. If your true parameter range is -50 to 50, then a step size of 0.1 is actually performing jumps of size 1 in terms of real values. Have a look at the code for univ_proposal
if this is confusing. A REALLY important result of this is that when using a univariate sampler, using lower and upper bounds of Inf
will really mess up the interpretation of step size. You therefore need to have finite bounds with the univariate sampler.fixed
in parTab
- set it to 1 to fix a parameter.NaN
or NA
!parTab
puts implicit uniform priors on your parameters via the bounds. Be aware of this when specifying additional prior functions (ie. avoid double specification of priors).Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.