library(EpiGenR) library(GGally) library(network) library(ape) library(ggplot2) library(grid) library(gridExtra) fig.counter <- list() knitr::opts_chunk$set(warning=FALSE, error=TRUE, message=FALSE, echo=TRUE)
In the example here, I simulate an epidemic according to a stochastic SIR model, which is a state space model with 3 state variables: Susceptible, Infected, and Removed. Two events can occur to change the state variable values: infection and recovery. Simulation takes in discrete steps indexed by $t$, where each step size is $dt$. During each small time interval $[(t-1)\cdot dt, t\cdot dt]$, the number of recovery events given $I_t$ infected individuals and a recovery rate of $\gamma$ is approximate binomial $\textrm{recoveries}_t\sim Bin(I_t,\gamma \Delta t)$. Assuming all onward transmissions occur at recovery, the number of infection events during a time interval $[(t-1)\cdot dt, t\cdot dt]$ follows the offspring distribution which I model using the negative binomial $\textrm{infections}_t \sim NBin(\textrm{recoveries}_t\times R_t,\textrm{recoveries}_t \times k)$.
we assume an S->I->R model of disease progression in which susceptible individuals become infected and capable of infecting others, and later recover and stop being infectious. The time to recovery is exponentially distributed with rate $\gamma$. Upon recovery, an infector infects $b_{0}$ number of individuals. The number of onward infections, i.e. `offspring', caused by each infected individual is a random variable drawn from a negative binomial offspring distribution $B \sim NBin(R, k)$ with mean $R=\frac{\sum_{i=0}^{N-1}b_i}{N}$, dispersion parameter $k$, and variance $\sigma^2 = R(1+\frac{R}{k})$. The mean of the offspring distribution is the reproductive number of the infectious disease, and is related to the basic reproduction number $R_0$ via the proportion of susceptible individuals in the population: $R=R_0\frac{S}{N}$. The parameter $k$ determines the level of overdispersion in the population. At smaller values of $k$, most individuals do not cause any further infections while a few contribute to most of the transmission events.
R0 <- 2 k <- 0.5 Tg <- 5 N <- 5000 S <- 4999 dt <- 0.1 total_dt <- 1500 min_epi_size <- 20 max_attempts <- 100 params <- c(R0=R0, k=k, Tg=Tg, N=N, S=S, I=N-S)
Setting $N=$ r prettyNum(N, ",")
, $R_0=$ r round(R0, 2)
, $k=$ r k
, and duration of infectiousness $\frac{1}{\gamma}=$ r Tg
days, we can simulate the outbreak using
seed.num <- 1010113 set.seed(seed.num) sim.outbreak <- simulate_sir(params, dt, total_dt, min_epi_size, max_attempts, TRUE)
The offspring distribution of the simulated epidemic follows a negative binomial distribution
r fig.counter.offspring <- length(fig.counter) + 1; fig.counter$offspring <- fig.counter.offspring
par(mar=c(5.1, 4.1, 0.25, 0.25)) offspring <- rnbinom(10000, mu=R0, size=k) hist(offspring, xlab="Number of onward infections", main="") legend("topright", legend=paste0("R0=", round(R0, 2), " and k=", k))
r fig.counter.sim.traj <- length(fig.counter) + 1; fig.counter$sim.traj <- fig.counter.sim.traj
The final epidemic size was r prettyNum(sim.outbreak$total_infected, ",")
.
The epidemic trajectories denoted by the incidence and prevalence curves are shown in the Figure r fig.counter.sim.traj
below. Assuming that infectious individuals are reported at the time of recovery, the incidence curve shows the daily number of reported cases.
P1 <- ggplot(data.frame(time_series_from_line_list(sim.outbreak))) + theme_bw() + geom_bar(aes(x=time, y=incidence), stat="identity") + xlab("Days since start of epidemic") + ylab("Incidence per day") + ggtitle("A") P2 <- ggplot(data.frame(x=(1:sim.outbreak$total_dt)*dt, Prevalence=sim.outbreak$prevalence)) + theme_bw() + geom_line(aes(x=x, y=Prevalence)) + xlab("Days since start of epidemic") + ggtitle("B") grid.arrange(P1, P2, ncol=1)
By setting \texttt{track_transmissions} to \texttt{TRUE} we can track who infected whom in the outbreak and thus reconstruct the transmission tree. From the transmission tree, we can infer the pathogen phylogeny which describes the ancestral relationship between pathogen isolates from infected individuals.
sim.transmission.tree <- as.data.frame(get_transmission_tree(sim.outbreak$infected)) sim.transmission.tree$from <- as.factor(sim.transmission.tree$from) sim.transmission.tree$to <- as.factor(sim.transmission.tree$to) fig.counter.sim.graph <- fig.counter
We can visualise the transmission network using the \texttt{get_transmission_tree} function. Below is the transmission network of the first 100 infected people.
r fig.counter.sim.graph <- length(fig.counter) + 1; fig.counter$sim.graph <- fig.counter.sim.graph
sim.graph <- network(sim.transmission.tree[1:100, 1:2], directed=TRUE) sim.graph %e% "length" <- sim.transmission.tree[1:100, 3] ggnet2(sim.graph, arrow.size = 9, node.alpha=.5, label=TRUE)
r fig.counter.tree.plot <- length(fig.counter) + 1; fig.counter$tree.plot <- fig.counter.tree.plot
The phylogenetic tree is related to the transmission tree. In the case of the latter, parents are represented by internal nodes whereas in the case of phylogenies, parents are represented by an external node (tip). The \texttt{get_phylo} function produces the phylogenetic tree for a given outbreak. Figure r fig.counter.tree.plot
is the phylogenetic tree of the first 100 individuals to be infected during the epidemic.
tree <- get_phylo(sim.outbreak$infected)
par(mar=c(0.5, 0.5, 0.5, 0.5)) not.sampled.tips <- 101:length(tree$tip.label) subtree <- drop.tip(tree, not.sampled.tips) plot(subtree)
Inferring parameters of dynamic disease models such as the SIR require data to be in time-series format, i.e. a quantity per time step. For epidemiologic data, this could be the number of reported cases per day. If data collected during an outbreak is in the form of a line list where each line contains information about an infected individual, this can be converted to time-series format using the function \texttt{time_series_from_line_list}. The first column should contain the ID of the infected individual and the second column the time of reporting. Here we assumed that an individual was reported upon recovery.
First, down-sample the outbreaks:
sampling.prob <- 0.01 data.dt <- 1 set.seed(seed.num) sampled.sim.outbreak <- downsample(sim.outbreak, strategy="proportional", prob=sampling.prob)
r fig.counter.phylo2 <- length(fig.counter) + 1; fig.counter$phylo2 <- fig.counter.phylo2
The phylogeny of the randomly sampled individuals is given in Figure r fig.counter.phylo2
.
subtree <- drop.tip(reorder.phylo(tree, "postorder"), which(!(1:length(tree$tip.label) %in% sampled.sim.outbreak$sampled_individuals))) plot(subtree, show.tip.label=FALSE)
Usually, epidemiological data appears in the form of a line list. To get the epidemiological data for the down-sampled outbreak (observed incidence over time), use the 'time_series_from_line_list' function which takes the sampling times of individuals, and the temporal resolution (step_size) as arguments. For example, if step_size=0.1 days, then the number of individuals in each 0.1 day interval is counted.
epi.sampling.times <- sampled.sim.outbreak$infected_sampled[, 3] epi_data <- time_series_from_line_list(epi.sampling.times, step_size=dt) head(epi_data)
In reality, data are unlikely to be reported at a more-than-daily frequency. In the inference algorithm, there is another parameter that controls how much to aggregate the epidemiological data. See the 'mcmc_options <- .....' line in Section 3, Subsection: "Create input files for EpiGenMCMC program". By setting the 'pfilter_every" argument to a value greater than 1/0.1=10, the incidence time series will be aggregated at a daily level. E.g. when pfilter_every=70, likelihood is calculated by comparing the simulated incidence over the course of 7 days to the observed incidence over the same 7 day period.
The 'coalescent.intervals.datedPhylo' function can be used generate a list object containing information on lineages through time and time intervals between coalescent events. This list object can then be passed to the 'coal.intervals.in.discrete.time' function to create the object
subtree.ci <- coalescent.intervals.datedPhylo(subtree) gen_data <- coal.intervals.in.discrete.time(subtree.ci, dt=dt)
If the TMRCA of the phylogeny starts earlier than the time series, or if the most recent sequence is not from the last individual that appears in the incidence time series, then the two sets of data need to be aligned such that they start and end on the same dates. The 'align_epi_gen_data' function can be used for this purpose. The 'last_tip_time' argument refers to the number of days (whatever the simulation time unit is) between the start of the epidemic time series and the time of the last sampled sequence.
both_data <- align_epi_gen_data(epi=epi_data, gen=gen_data, dt=dt, last_tip_time=with(sampled.sim.outbreak, max(infected[sampled_individuals, 3])))
param_list <- create_params_list( param_names=c("R0", "k", "rateI2R", "N", "S", "reporting", "time_before_data"), # All parameter values init_param_values=c(8e-5, k, 1/Tg, N, S, sampling.prob, 0), # Initial parameter values params_to_estimate=c("R0", "k", "rateI2R", "reporting"), # Names of parameters to be estimated transform=c(NA, NA, "inverse", NA), # The algorithm will estimate the value of the transformed parameter prior=c("unif", "gamma", "unif", "beta"), # Prior distribution prior_params=list(c(0.0, 0.01), c(2.5, 1/2.5), c(1.0, 30.0), c(1.0, 3.0)), # Parameters for the prior distribution proposal_params=list(c(5e-6, 0.0, 0.01), c(0.05, 0.001, 5), c(1.0, 1.0, 30.0), c(0.05, 0.0, 1.0)), # proposal_params: each element of the list includes the standard deviation # of proposal distribution, and the range of parameter values to be explored optimal_acceptance=0.234, lower_acceptance=0.1, upper_acceptance=0.8, adapt_every=20, max_adapt_times=100 ) mcmc_options <- create_mcmc_options ( particles=5000, iterations=1000, log_every=1, # how often to save parameter estimates pfilter_every=10, # resample particles every 10*0.1 days (i.e. 1 day), where each simulation time step is 0.1 days which_likelihood=0, # 0= use both epi and genetic data, 1=use only epi data, 2=use only genetic data pfilter_threshold=1.0, # filter particles when ESS < this threshold num_threads=4, # number of cores to use during parallelised particle filtering log_filename="log.txt", traj_filename="traj.txt") input_dir <- tempdir() input_files <- EpiGenR::generate_cpp_input_files( dt=dt, params=param_list, initial_states=c(S=S, I=1, R=0), data=both_data, mcmc_options=mcmc_options, params_file = paste0(input_dir, "/param.txt"), mcmc_options_file = paste0(input_dir, "/mcmc_options.txt"), initial_states_file = paste0(input_dir, "/initial_states.txt"), data_file = paste0(input_dir, "/data") )
As the default clang compiler on Mac OSX does not support OpenMP, a separate compiler needs to be installed on the computer in order to compile code that uses OpenMP. The gcc compiler can be downloaded from here: https://gcc.gnu.org/wiki/openmp
Usually to compile C++ code, the gcc compiler to use is 'g++-6'
cd /path/to/EpiGenMCMC/ g++-6 -fopenmp -lgsl -Imodels *.cpp models/SIR_offspring_distribution.cpp -o SIRmodel
I would recommend running the program from the terminal:
cat(paste("cd", input_dir, ";\n /path/to/EpiGenMCMC/./SIRmodel ", gsub(paste0(input_dir, "/"), "", input_files)))
However it can also be run from the R console using:
EpiGenR::run_pMCMC("/path/to/EpiGenMCMC/src/SIRmodel", input_files, wait=TRUE)
To check that the model has been coded correctly in C++, you might want to simulate epidemics from the model. To do this, another C++ program needs to be compiled:
g++-6 -fopenmp -lgsl simulate/main.cpp model.cpp parameter.cpp trajectory.cpp models/SIR_offspring_distribution.cpp -o simulate
The arguments for the simulation program are (in order)
./simulate param.txt 10 1000 0.1 10 1 718247194 4 initial_states.txt simulation_output.txt
model_params(argv[1]);
std::vector
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.