In lucymli/EpiGenR: Combined analysis of epidemiologic and genetic data to infer infectious disease dynamics

library(EpiGenR)
library(GGally)
library(network)
library(ape)
library(ggplot2)
library(grid)
library(gridExtra)
fig.counter <- list()
knitr::opts_chunk$set(warning=FALSE, error=TRUE, message=FALSE, echo=TRUE)

Simulate epidemic data
Convert line list data and pathogen phylogeny into list objects
Construct input objects for inference
Compile and call the EpiGenMCMC program to estimate parameters

1. Simulate epidemic data

In the example here, I simulate an epidemic according to a stochastic SIR model, which is a state space model with 3 state variables: Susceptible, Infected, and Removed. Two events can occur to change the state variable values: infection and recovery. Simulation takes in discrete steps indexed by $t$, where each step size is $dt$. During each small time interval $[(t-1)\cdot dt, t\cdot dt]$, the number of recovery events given $I_t$ infected individuals and a recovery rate of $\gamma$ is approximate binomial $\textrm{recoveries}_t\sim Bin(I_t,\gamma \Delta t)$. Assuming all onward transmissions occur at recovery, the number of infection events during a time interval $[(t-1)\cdot dt, t\cdot dt]$ follows the offspring distribution which I model using the negative binomial $\textrm{infections}_t \sim NBin(\textrm{recoveries}_t\times R_t,\textrm{recoveries}_t \times k)$.

we assume an S->I->R model of disease progression in which susceptible individuals become infected and capable of infecting others, and later recover and stop being infectious. The time to recovery is exponentially distributed with rate $\gamma$. Upon recovery, an infector infects $b_{0}$ number of individuals. The number of onward infections, i.e. `offspring', caused by each infected individual is a random variable drawn from a negative binomial offspring distribution $B \sim NBin(R, k)$ with mean $R=\frac{\sum_{i=0}^{N-1}b_i}{N}$, dispersion parameter $k$, and variance $\sigma^2 = R(1+\frac{R}{k})$. The mean of the offspring distribution is the reproductive number of the infectious disease, and is related to the basic reproduction number $R_0$ via the proportion of susceptible individuals in the population: $R=R_0\frac{S}{N}$. The parameter $k$ determines the level of overdispersion in the population. At smaller values of $k$, most individuals do not cause any further infections while a few contribute to most of the transmission events.

R0 <- 2
k <- 0.5
Tg <- 5
N <- 5000
S <- 4999
dt <- 0.1
total_dt <- 1500
min_epi_size <- 20
max_attempts <- 100
params <- c(R0=R0, k=k, Tg=Tg, N=N, S=S, I=N-S)

Setting $N=$ r prettyNum(N, ","), $R_0=$ r round(R0, 2), $k=$ r k, and duration of infectiousness $\frac{1}{\gamma}=$ r Tg days, we can simulate the outbreak using

seed.num <- 1010113
set.seed(seed.num)
sim.outbreak <- simulate_sir(params, dt, total_dt, min_epi_size, max_attempts, TRUE)

The offspring distribution of the simulated epidemic follows a negative binomial distribution

r fig.counter.offspring <- length(fig.counter) + 1; fig.counter$offspring <- fig.counter.offspring

par(mar=c(5.1, 4.1, 0.25, 0.25))
offspring <- rnbinom(10000, mu=R0, size=k)
hist(offspring, xlab="Number of onward infections", main="")
legend("topright", legend=paste0("R0=", round(R0, 2), " and k=", k))

r fig.counter.sim.traj <- length(fig.counter) + 1; fig.counter$sim.traj <- fig.counter.sim.traj

The final epidemic size was r prettyNum(sim.outbreak$total_infected, ",").

Simulated epidemic trajectories

The epidemic trajectories denoted by the incidence and prevalence curves are shown in the Figure r fig.counter.sim.traj below. Assuming that infectious individuals are reported at the time of recovery, the incidence curve shows the daily number of reported cases.

P1 <- ggplot(data.frame(time_series_from_line_list(sim.outbreak))) + theme_bw() +
  geom_bar(aes(x=time, y=incidence), stat="identity") +
  xlab("Days since start of epidemic") +
  ylab("Incidence per day") + ggtitle("A")
P2 <- ggplot(data.frame(x=(1:sim.outbreak$total_dt)*dt, Prevalence=sim.outbreak$prevalence)) +
  theme_bw() + geom_line(aes(x=x, y=Prevalence)) +
  xlab("Days since start of epidemic") + ggtitle("B")
grid.arrange(P1, P2, ncol=1)

2. Convert line list data and pathogen phylogeny into list objects

Transmission Tree

By setting \texttt{track_transmissions} to \texttt{TRUE} we can track who infected whom in the outbreak and thus reconstruct the transmission tree. From the transmission tree, we can infer the pathogen phylogeny which describes the ancestral relationship between pathogen isolates from infected individuals.

sim.transmission.tree <- as.data.frame(get_transmission_tree(sim.outbreak$infected))
sim.transmission.tree$from <- as.factor(sim.transmission.tree$from)
sim.transmission.tree$to <- as.factor(sim.transmission.tree$to)
fig.counter.sim.graph <- fig.counter

We can visualise the transmission network using the \texttt{get_transmission_tree} function. Below is the transmission network of the first 100 infected people.

r fig.counter.sim.graph <- length(fig.counter) + 1; fig.counter$sim.graph <- fig.counter.sim.graph

sim.graph <- network(sim.transmission.tree[1:100, 1:2], directed=TRUE)
sim.graph %e% "length" <- sim.transmission.tree[1:100, 3]
ggnet2(sim.graph, arrow.size = 9, node.alpha=.5, label=TRUE)

Phylogeny

r fig.counter.tree.plot <- length(fig.counter) + 1; fig.counter$tree.plot <- fig.counter.tree.plot

The phylogenetic tree is related to the transmission tree. In the case of the latter, parents are represented by internal nodes whereas in the case of phylogenies, parents are represented by an external node (tip). The \texttt{get_phylo} function produces the phylogenetic tree for a given outbreak. Figure r fig.counter.tree.plot is the phylogenetic tree of the first 100 individuals to be infected during the epidemic.

tree <- get_phylo(sim.outbreak$infected)

par(mar=c(0.5, 0.5, 0.5, 0.5))
not.sampled.tips <- 101:length(tree$tip.label)
subtree <- drop.tip(tree, not.sampled.tips)
plot(subtree)

Producing time-series data from simulation

Inferring parameters of dynamic disease models such as the SIR require data to be in time-series format, i.e. a quantity per time step. For epidemiologic data, this could be the number of reported cases per day. If data collected during an outbreak is in the form of a line list where each line contains information about an infected individual, this can be converted to time-series format using the function \texttt{time_series_from_line_list}. The first column should contain the ID of the infected individual and the second column the time of reporting. Here we assumed that an individual was reported upon recovery.

First, down-sample the outbreaks:

sampling.prob <- 0.01
data.dt <- 1
set.seed(seed.num)
sampled.sim.outbreak <- downsample(sim.outbreak, strategy="proportional", prob=sampling.prob)

r fig.counter.phylo2 <- length(fig.counter) + 1; fig.counter$phylo2 <- fig.counter.phylo2

The phylogeny of the randomly sampled individuals is given in Figure r fig.counter.phylo2.

subtree <- drop.tip(reorder.phylo(tree, "postorder"), which(!(1:length(tree$tip.label) %in% sampled.sim.outbreak$sampled_individuals)))
plot(subtree, show.tip.label=FALSE)

3. Construct input objects for inference

Generate data files

Usually, epidemiological data appears in the form of a line list. To get the epidemiological data for the down-sampled outbreak (observed incidence over time), use the 'time_series_from_line_list' function which takes the sampling times of individuals, and the temporal resolution (step_size) as arguments. For example, if step_size=0.1 days, then the number of individuals in each 0.1 day interval is counted.

epi.sampling.times <- sampled.sim.outbreak$infected_sampled[, 3]
epi_data <- time_series_from_line_list(epi.sampling.times, step_size=dt)
head(epi_data)

In reality, data are unlikely to be reported at a more-than-daily frequency. In the inference algorithm, there is another parameter that controls how much to aggregate the epidemiological data. See the 'mcmc_options <- .....' line in Section 3, Subsection: "Create input files for EpiGenMCMC program". By setting the 'pfilter_every" argument to a value greater than 1/0.1=10, the incidence time series will be aggregated at a daily level. E.g. when pfilter_every=70, likelihood is calculated by comparing the simulated incidence over the course of 7 days to the observed incidence over the same 7 day period.

The 'coalescent.intervals.datedPhylo' function can be used generate a list object containing information on lineages through time and time intervals between coalescent events. This list object can then be passed to the 'coal.intervals.in.discrete.time' function to create the object

subtree.ci <- coalescent.intervals.datedPhylo(subtree)
gen_data <- coal.intervals.in.discrete.time(subtree.ci, dt=dt)

If the TMRCA of the phylogeny starts earlier than the time series, or if the most recent sequence is not from the last individual that appears in the incidence time series, then the two sets of data need to be aligned such that they start and end on the same dates. The 'align_epi_gen_data' function can be used for this purpose. The 'last_tip_time' argument refers to the number of days (whatever the simulation time unit is) between the start of the epidemic time series and the time of the last sampled sequence.

both_data <- 
  align_epi_gen_data(epi=epi_data, gen=gen_data, dt=dt, 
                     last_tip_time=with(sampled.sim.outbreak, max(infected[sampled_individuals, 3])))

Create input files for EpiGenMCMC program

param_list <- create_params_list(
  param_names=c("R0", "k", "rateI2R", "N", "S", "reporting", "time_before_data"), # All parameter values
  init_param_values=c(8e-5, k, 1/Tg, N, S, sampling.prob, 0), # Initial parameter values
  params_to_estimate=c("R0", "k", "rateI2R", "reporting"), # Names of parameters to be estimated
  transform=c(NA, NA, "inverse", NA), # The algorithm will estimate the value of the transformed parameter
  prior=c("unif", "gamma", "unif", "beta"), # Prior distribution
  prior_params=list(c(0.0, 0.01), c(2.5, 1/2.5), c(1.0, 30.0), c(1.0, 3.0)), # Parameters for the prior distribution
  proposal_params=list(c(5e-6, 0.0, 0.01), c(0.05, 0.001, 5), c(1.0, 1.0, 30.0), c(0.05, 0.0, 1.0)),  
  # proposal_params: each element of the list includes the standard deviation 
  # of proposal distribution, and the range of parameter values to be explored
  optimal_acceptance=0.234,
  lower_acceptance=0.1, upper_acceptance=0.8,
  adapt_every=20,
  max_adapt_times=100
)


mcmc_options <- create_mcmc_options (
  particles=5000, iterations=1000, 
  log_every=1, # how often to save parameter estimates
  pfilter_every=10, # resample particles every 10*0.1 days (i.e. 1 day), where each simulation time step is 0.1 days
  which_likelihood=0, # 0= use both epi and genetic data, 1=use only epi data, 2=use only genetic data
  pfilter_threshold=1.0, # filter particles when ESS < this threshold
  num_threads=4, # number of cores to use during parallelised particle filtering
  log_filename="log.txt", traj_filename="traj.txt")

input_dir <- tempdir()

input_files <- EpiGenR::generate_cpp_input_files(
  dt=dt, params=param_list, initial_states=c(S=S, I=1, R=0),
  data=both_data,
  mcmc_options=mcmc_options,
  params_file = paste0(input_dir, "/param.txt"), 
  mcmc_options_file = paste0(input_dir, "/mcmc_options.txt"), 
  initial_states_file = paste0(input_dir, "/initial_states.txt"), 
  data_file = paste0(input_dir, "/data")
  )

4. Compile C++ program (EpiGenMCMC) and use to to estimate model parameters

As the default clang compiler on Mac OSX does not support OpenMP, a separate compiler needs to be installed on the computer in order to compile code that uses OpenMP. The gcc compiler can be downloaded from here: https://gcc.gnu.org/wiki/openmp

Usually to compile C++ code, the gcc compiler to use is 'g++-6'

cd /path/to/EpiGenMCMC/
g++-6 -fopenmp -lgsl -Imodels *.cpp models/SIR_offspring_distribution.cpp -o SIRmodel

I would recommend running the program from the terminal:

cat(paste("cd", input_dir, ";\n /path/to/EpiGenMCMC/./SIRmodel ", gsub(paste0(input_dir, "/"), "", input_files)))

However it can also be run from the R console using:

EpiGenR::run_pMCMC("/path/to/EpiGenMCMC/src/SIRmodel", input_files, wait=TRUE)

5. Simulate from model using C++

To check that the model has been coded correctly in C++, you might want to simulate epidemics from the model. To do this, another C++ program needs to be compiled:

g++-6 -fopenmp -lgsl simulate/main.cpp model.cpp parameter.cpp trajectory.cpp models/SIR_offspring_distribution.cpp -o simulate

The arguments for the simulation program are (in order)

name of parameter file ("param.txt")
number of repeated simulations
total number of simulation time steps
dt, size of simulation time step
size of aggregation (e.g. 10 -> report the number of newly infected individuals every 10 simulation time steps)
number of subpopulations (should set this to 1)
Random number seed
Number of cores to split the particle filtering
Initial states ("initial_states.txt")
Name of output file

./simulate param.txt 10 1000 0.1 10 1 718247194 4 initial_states.txt simulation_output.txt

model_params(argv[1]); std::vector values = model_params.get_values_vector(); std::vector param_names = model_params.get_names_vector(); std::cout << "Read in a total of " << model_params.get_total_params() << " parameters." << std::endl; int replicates = std::stoi(argv[2]); int total_dt = std::stoi(argv[3]); double dt_size = std::stod(argv[4]); int sum_every = std::stoi(argv[5]); int num_groups = std::stoi(argv[6]); int seed_num = std::stoi(argv[7]); int num_threads = std::stoi(argv[8]); std::string traj_input = argv[9]; std::string traj_output = argv[10];

lucymli/EpiGenR documentation built on Aug. 27, 2020, 7:32 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com