{bpmodels}
is now retired and will no longer be maintained. We recommend using{epichains}
instead. If you need help converting your code to use{epichains}
, please open a discussion on epichains.
knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE, collapse = TRUE, comment = "#>" )
Branching processes can be used to project infectious disease trends in time
provided we can characterise the distribution of times between the symptom onset of
successive cases (serial interval), and the distribution of secondary cases
produced by a single individual (offspring distribution). Such simulations can
be achieved in bpmodels with the chain_sim()
function and @pearson2020, and
@abbott2020 illustrate its application to COVID-19.
The purpose of this vignette is to use early data on COVID-19 in South Africa [@marivate2020] to illustrate how bpmodels can be used to project incidence in an outbreak.
Let's load the required packages
library("bpmodels") library("dplyr") library("ggplot2") library("lubridate")
Included in bpmodels is a cleaned time series of the first 15 days of the COVID-19 outbreak in South Africa. This can be loaded into memory as follows:
data("covid19_sa", package = "bpmodels")
Let us examine the first 6 entries of the dataset.
head(covid19_sa)
chain_sim()
requires a vector of onset times, t0
, for each
chain/individual/simulation.
The covid19_sa
dataset above is aggregated, so we will have to disaggregate
it into a linelist with each row representing a case and their onset time.
To achieve this, we will first use the date of the index case as the reference and find the difference between each date and the reference.
days_since_index <- as.integer(covid19_sa$date - min(covid19_sa$date)) days_since_index
Using the vector of start times for the time series, we will then create the linelist by disaggregating the time series so that each case has a corresponding start time.
start_times <- unlist(mapply( function(x, y) rep(x, times = ifelse(y == 0, 1, y)), days_since_index, covid19_sa$cases )) start_times
The log-normal distribution is commonly used in epidemiology to characterise quantities such as the serial interval because it has a large variance and can only be positive-valued [@nishiura2007; @limpert2001].
In this example, we will assume based on COVID-19 literature that the serial interval, S, is log-normal distributed with parameters, $\mu = 4.7$ and $\sigma = 2.9$ [@pearson2020]. Note that when the distribution is described this way, it means $\mu$ and $\sigma$ are the expected value and standard deviation of the natural logarithm of the serial interval. Hence, in order to sample the "back-transformed" measured serial interval with expectation/mean, $E[S]$ and standard deviation, $SD [S]$, we can use the following parametrisation:
\begin{align} E[S] &= \ln \left( \dfrac{\mu^2}{(\sqrt{\mu^2 + \sigma^2}} \right) \
SD [S] &= \sqrt {\ln \left(1 + \dfrac{\sigma^2}{\mu^2} \right)}
\end{align}
See "log-normal_distribution" on Wikipedia for a detailed explanation of this parametrisation.
We will now set up the serial interval function with the appropriate inputs.
We adopt R's random lognormal distribution generator (rlnorm()
) that
takes meanlog
and sdlog
as arguments, which we define with the
parametrisation above as log_mean()
and log_sd()
respectively and wrap it in
the serial_interval()
function. Moreover, serial_interval()
takes one
argument sample_size
as is required by bpmodels
(See ?bpmodels::chain_sim
), which is further passed to rlnorm()
as the
first argument to determine the number of observations to sample
(See ?rlnorm
).
mu <- 4.7 sgma <- 2.9 log_mean <- log((mu^2) / (sqrt(sgma^2 + mu^2))) # log mean log_sd <- sqrt(log(1 + (sgma / mu)^2)) # log sd #' serial interval function serial_interval <- function(sample_size) { si <- rlnorm(sample_size, meanlog = log_mean, sdlog = log_sd) return(si) }
The negative binomial distribution is commonly used in epidemiology to account for individual variation in transmissibility, also known as superspreading [@lloyd-smith2005].
For this example, we will assume that the offspring distribution is characterised by a negative binomial with $R = 2.5$ [@abbott2020] and $k = 0.58$ [@wang2020]. In this parameterization, $R$ represents the $R_0$, which is defined as the average number of cases produced by a single individual in an entirely susceptible population. The parameter $k$ represents superspreading, that is, the degree of heterogeneity in transmission by single individuals.
chain_sim()
also requires the end time for the simulations. For this
example, we will simulate outbreaks that end 14 days after the last date
of observations in covid19_sa
.
#' Date to end simulation (14 day projection in this case) projection_window <- 14 # 14 days/ 2-week ahead projection projection_end_day <- max(days_since_index) + projection_window projection_end_day
chain_sim()
is stochastic, meaning the results are different every
time it is run for the same set of parameters, so we will run the simulations
many times and summarise the results.
We will, therefore, run each simulation $100$ times.
#' Number of simulations sim_rep <- 100
chain_sim()
makes the following simplifying assumptions:
To summarise the whole set up so far, we are going to simulate
each chain r sim_rep
times, projecting COVID-19 cases over
r projection_window
days after the first $15$ days.
We will use the function lapply()
to run the simulations and bind them
by rows with dplyr::bind_rows()
.
set.seed(1234) sim_chain_sizes <- lapply( seq_len(sim_rep), function(sim) { chain_sim( n = length(start_times), offspring = "nbinom", mu = 2.5, size = 0.58, stat = "size", serial = serial_interval, t0 = start_times, tf = projection_end_day, tree = TRUE ) %>% mutate(sim = sim) } ) sim_output <- bind_rows(sim_chain_sizes)
Let us view the first few rows of the simulation results.
head(sim_output)
Now, we will summarise the simulation results.
We want to plot the individual simulated daily time series and show the median cases per day aggregated over all simulations.
First, we will create the daily time series per simulation by aggregating the number of cases per day of each simulation.
# Daily number of cases for each simulation incidence_ts <- sim_output %>% mutate(day = ceiling(time)) %>% group_by(sim, day) %>% summarise(cases = n()) %>% ungroup() head(incidence_ts)
Next, we will add a date column to the results of each simulation set. We will use the date of the first case in the observed data as the reference start date.
# Get start date from the observed data index_date <- min(covid19_sa$date) index_date # Add a dates column to each simulation result incidence_ts_by_date <- incidence_ts %>% group_by(sim) %>% mutate(date = index_date + days(seq(0, n() - 1))) %>% ungroup() head(incidence_ts_by_date)
Now we will aggregate the simulations by day and evaluate the median daily cases across all simulations.
# Median daily number of cases aggregated across all simulations median_daily_cases <- incidence_ts_by_date %>% group_by(date) %>% summarise(median_cases = median(cases)) %>% ungroup() %>% arrange(date) head(median_daily_cases)
We will now plot the individual simulation results alongside the median of the aggregated results.
## since all simulations may end at a different date, we will find the minimum ## final date for all simulations for the purposes of visualisation final_date <- incidence_ts_by_date %>% group_by(sim) %>% summarise(final_date = max(date), .groups = "drop") %>% summarise(min_final_date = min(final_date)) %>% pull(min_final_date) incidence_ts_by_date <- incidence_ts_by_date %>% filter(date <= final_date) median_daily_cases <- median_daily_cases %>% filter(date <= final_date) ggplot(data = incidence_ts_by_date) + geom_line( aes( x = date, y = cases, group = sim ), color = "grey", linewidth = 0.2, alpha = 0.25 ) + geom_line( data = median_daily_cases, aes( x = date, y = median_cases ), color = "tomato3", linewidth = 1.8 ) + geom_point( data = covid19_sa, aes( x = date, y = cases ), color = "black", size = 1.75, shape = 21 ) + geom_line( data = covid19_sa, aes( x = date, y = cases ), color = "black", linewidth = 1 ) + scale_x_continuous( breaks = seq( min(incidence_ts_by_date$date), max(incidence_ts_by_date$date), 5 ), labels = seq( min(incidence_ts_by_date$date), max(incidence_ts_by_date$date), 5 ) ) + scale_y_continuous( breaks = seq( 0, max(incidence_ts_by_date$cases) + 200, 250 ), labels = seq( 0, max(incidence_ts_by_date$cases) + 200, 250 ) ) + geom_vline( mapping = aes(xintercept = max(covid19_sa$date)), linetype = "dashed" ) + labs(x = "Date", y = "Daily cases")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.