knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
serosolver
jointly infers antibody kinetics, infection histories and attack rates from cross-sectional or longitudinal serological data, originally described by Kucharski et al. @kucharski2018 and developed further by Hay et al. @hay2020 . serosolver
exploits the predictability of antibody kinetics over time following exposure, using one or more antibody level measurements combined with a parameterised antibody kinetics model to back-calculate (i.e., extrapolate backwards) the timing of individual infections.
What distinguishes serosolver
from other such time-since-infection models is a) jointly inferring an antibody kinetics model and infection histories within the same framework; b) the ability to infer multiple infections per person; and c) supporting multi-antigen systems as well as single-antigen problems.
The core of the package is the serosolver
function, which takes all of the data and model assumptions as inputs and saves Markov chain Monte Carlo (MCMC) outputs to disk. serosolver
implements a flexible, hierarchical model, meaning that users must be aware of a number of model design decisions. In this vignette, we will show how a serological dataset can be cleaned for use in serosolver
, and walk through the core functionality of the package and post-processing functions.
This guide provides an overview of the key components for fitting a serosolver
model. For more in depth instructions on specific features, see the case study vignettes. We will discuss the following workflow:
knitr::include_graphics("figures/guide_fig1.png")
The following code runs the entire pipeline for a simulation-recovery experiment with serosolver
. We will step through each of these sections in more detail in this guide.
library(serosolver) library(tidyverse) library(doParallel) ## Get all possible infection times possible_exposure_times <- seq(2000,2024,by=1) ## Vector of antigens that have biomarker measurements (note only one representative antigen per time) sampled_antigens <- seq(min(possible_exposure_times), max(possible_exposure_times), by=2) ## Times at which serum samples can be taken sampling_times <- 2020:2024 ## Number of serum samples taken n_samps <- 5 ## Simulate some random attack rates attack_rates <- runif(length(possible_exposure_times), 0.05, 0.15) ## Use example antigenic map (can also be left as NULL) antigenic_map <- example_antigenic_map[1:25,] antigenic_map$inf_times <- possible_exposure_times ## Simulate a full serosurvey with these parameters all_simulated_data <- simulate_data(par_tab=example_par_tab, group=1, n_indiv=50, possible_exposure_times=possible_exposure_times, measured_biomarker_ids = sampled_antigens, sampling_times=sampling_times, nsamps=n_samps, antigenic_map=antigenic_map, age_min=10,age_max=75, attack_rates=attack_rates, repeats=2, data_type=c(1)) ## Pull out the simulated titre data and infection histories antibody_data <- all_simulated_data$antibody_data true_inf_hist <- all_simulated_data$infection_histories plot_antibody_data(antibody_data,possible_exposure_times,1:4,infection_histories = true_inf_hist) output <- serosolver::serosolver(example_par_tab, antibody_data, antigenic_map, filename="readme", n_chains=3,parallel=TRUE, mcmc_pars=c(adaptive_iterations=100000, iterations=500000),verbose=TRUE) chains <- load_mcmc_chains(location=getwd(),par_tab=example_par_tab,burnin = 100000,unfixed=TRUE) plot_model_fits(chain = chains$theta_chain, infection_histories = chains$inf_chain, known_infection_history = true_inf_hist, antibody_data = antibody_data,individuals=1:4, antigenic_map=antigenic_map, par_tab=example_par_tab, orientation="cross-sectional")
Install serosolver
using the devtools
package:
devtools::install_github("seroanalytics/serosolver")
Note, that serosolver
uses Rcpp
code, including RcppArmadillo
. You will therefore need a working C++ compiler such as clang
or gcc
. The easiest way to set this up is up for Windows users is the Rtools
package. For Mac users, you will need to install a compiler such as Xcode
or clang
.
There are a number of additional package requirements, all of which can be installed from cran
:
library(serosolver) library(ggplot2) library(plyr) library(dplyr) library(tidyr) library(data.table) library(doParallel) library(coda)
serosolver
expects a data frame in long format (i.e., there is a row for each observation), referred to as antibody_data
in most function arguments, giving the individual ID, sample time, biomarkers or antigens measured, serological measurement, and demographic information:
data(example_antibody_data) knitr::kable(head(example_antibody_data))
Description of variables:
individual
: consecutive integer ID of individuals from $1,\dots,N$, where $N$ is the number of individuals in the sample.sample_time
: integer value of the time period in which a sample was collected.biomarker_id
: integer ID of the measured biomarker or antigen.biomarker_group
: optional, integer ID of the measurement type.measurement
: the measurement for this individual
against biomarker_id
at sample_time
.repeat_number
: integer value starting from 1 where repeat measurements are available per sample and biomarker_id
combination.population_group
: integer ID separating individuals within a sample into different groups.birth
: integer value giving the time period (matching the supported range of sample_time
) in which an individual was born.Data can be cross-sectional samples with multiple biomarkers measured per sample (e.g., multiple influenza strains), such that different biomarker_id
s correspond to different time periods, or longitidinal samples with one or more biomarkers measured per sample (e.g., measuring measles IgG titers over time).
plot_antibody_data(example_antibody_data,example_antigenic_map$inf_times,n_indivs=1:3,study_design = "cross-sectional")
plot_antibody_data(example_antibody_data %>% filter(biomarker_id %in% c(1968,2002,2012)),2010:2015,n_indivs=1:3,study_design = "longitudinal")
Most applications of serosolver
involve fitting to antibody titre or concentration data. Make a note of whether your data are continuous or discrete, and the upper and lower bounds of the assay. These will need to be specified in the model later using the par_tab
object. serosolver
also supports multiple measurement types per sample, for example, measuring both antibody titre and avidity. These observation types are distinguished by the biomarker_group
variable --- all observations within a biomarker_group
are underpinned by their own antibody kinetics model parameters, but arising from the same individual-level infection histories. These will be specified using the data_type
argument later.
## Example of two biomarker_groups, one has discrete data one has continuous data ## Set bounds of the assay par_tab[par_tab$names=="min_measurement" & par_tab$biomarker_group == 1,values] <- lower_bound1 par_tab[par_tab$names=="max_measurement" & par_tab$biomarker_group == 1,values] <- upper_bound1 par_tab[par_tab$names=="min_measurement" & par_tab$biomarker_group == 2,values] <- lower_bound2 par_tab[par_tab$names=="max_measurement" & par_tab$biomarker_group == 2,values] <- upper_bound2 ## Set observation type, discrete = 1, continuous = 2 data_type <- c(1,2) ## used in `serosolver::serosolver`, `serosolver::simulate_data` etc
serosolver
was designed to treat biomarker_id
as interchangeable with the time variables, such that one antigen ID is assumed to circulate in one time period. For example, biomarker_id=1968
refers to the antigen assumed to be circulating in 1968. Note that at present, serosolver
supports only one antigen/biomarker per time period. However, a single biomarker_id
can be assumed to circulate and dominate for multiple time periods, which can be specified using the antigenic_map
argument.
The antigenic_map
object tells serosolver
how different biomarker_id
s are related to each other (using coordinates on an antigenic map), as well as the time periods during which the biomarker_id
is assumed to be circulating. Setting all antigenic coordinates to the same values tells serosolver
that there are no antigenic differences between biomarker_id
s. The same thing can be achieved by setting antigenic_map=NULL
and instead specifying the possible_exposure_times
vector.
## Example antigenic map data(example_antigenic_map) knitr::kable(head(example_antigenic_map)) ## Antigenic map with no antigenic variation data.frame(x_coord=1,y_coord=1,inf_times=1:10) ## Antigenic map with clusters data.frame(x_coord = rep(seq(1,3,by=1),each=3), y_coord=rep(seq(1,3,by=1),each=3),inf_times=1:9) ## Antigenic map with just one biomarker ID data.frame(x_coord=1,y_coord=1,inf_times=1) ## Antigenic map with two biomarker groups data.frame(x_coord=1,y_coord=1,inf_times=rep(1:10,2),biomarker_group=rep(c(1,2),each=10))
Individuals can't be infected before they're born, and we can't infer infections which occurred after the final sample from each individual. serosolver
uses the birth
and sample_time
variables to constrain which biomarker_id
s an individual can have been infected by.
The population_group
variable is used to split individuals into different groups. These might represent different geographic groups, which can be used to estimate separate attack rates, or demographic groups, which can be used to estimate demographic-specific antibody kinetics (or both). For most models, this variable can be ignored or set to 1 for all individuals. See \<\< PLACEHOLDER >> for more details.
For new users, the most confusing model design decision in serosolver
is setting the infection history model. First, we need to understand the epidemiological question we seek to address. If our main objective is to estimate attack rates, we need to think about the power/information we have to estimate attack rates and infection histories at a chosen time resolution (e.g., annual attack rates or per month). If instead our main objective is to understand antibody kinetics, we might wish to simplify the infection history model to allow a more complex antibody kinetics model.
\<\< ADD IMAGE ON INFECTION HISTORY VECTOR WITH ATTACK RATES ETC. Could compare prior plot at different time resolutions >>
All time variables in a serosolver
model must be at the same time resolution, including sample_time
, biomarker_id
, birth
, and possible_exposure_times
. This can cause confusion, as these variables must be integers, and thus we have to do some transformation before running serosolver
:
For annual samples, the sample time is simply the year each sample was collected in. For example, data collected between 2009 and 2011 would have values of 2009, 2010, or 2011.
For semi-annual samples, the sample time is the year the sample was collected multiplied by 2 (since there are two semesters in every year). For example, data collected between 2009 and 2011 would have values of $(2009 \times 2) + 1 = 4019$, $(2009.5 \times 2) + 1 = 4020$, $(2010 \times 2) + 1 = 4021$, $(2010.5 \times 2) + 1 = 4022$, $(2011 \times 2) + 1 = 4023$, $(2011.5 \times 2) + 1 = 4024$.
Similarly, for quarterly or monthly samples, the sample time follows the same pattern. For example for a sample collected in July of 2009, the quarterly sample time would be $(2009.75 \times 4) + 1 = 8040$ (because July is in the third quarter of the year) and the monthly sample time would be $(2009.5833 \times 12) + 1 = 24116$ (because July is the seventh month and $7/12=0.5833$).
A useful feature of serosolver
is to use mixed time resolutions. For example, we might estimate infection histories at an annual resolution up to 2009, and then at a per-quarter resolution thereafter. Another example might be to estimate per-cluster infections for historic periods of time to establish long-term infection histories, and then per-quarter or per-year infection histories during the study period to model shorter-term dynamics. This can be useful for reducing parameter space when we have limited information to draw inferences about particular time periods, but rich information for others. See \<\< PLACEHOLDER >> for more details.
It is possible to estimate infection states just for the duration of the serostudy, or to estimate infection histories for an arbitrary number of time periods prior to the study. For example, we might use serosolver to estimate infections that occurred between two serum sample collection times in the same way we would measure seroconversion rates. Alternatively, we might only have serum samples for recent time periods, but aim to estimate infection histories all the way back to the time of birth.
serosolver
estimates an infection state for each element of the vector possible_exposure_times
, which is either passed directly or extract automatically from antigenic_map$inf_times
. The range and entries of these vectors therefore determines the model time period and resolution of the infection history model.
Version 1, 2 or 3? Link to paper
Several input files are stored within the serosolver
package in the 'inputs' directory, however users can create their own. The default parameter file 'parTab_base.csv' is shown below:
options(scipen=999) par_tab_path <- system.file("extdata", "par_tab_base.csv", package = "serosolver") par_tab <- read.csv(par_tab_path, stringsAsFactors=FALSE) par_tab[par_tab$names %in% c("alpha","beta"),"values"] <- c(1,1) ## Remove phi term for infection history prior version 2 par_tab <- par_tab[par_tab$type != 2,] knitr::kable(par_tab)
In the parameter file, the column 'fixed' indicates if each parameter value is fixed or should be estimated. For values that are not fixed, starting values need to be generated. The below code generates a random starting value for unfixed parameter values.
seroslver
uses an Adaptive Metropolis-within-Gibbs algorithm. Given a starting point and the necessary MCMC parameters, serosolver()
performs a random-walk of the posterior space to produce an MCMC chain that can be used to generate MCMC density and iteration plots (see section 4).
Using the default options, the MCMC can be run using the following.
# run MCMC res <- serosolver(par_tab = start_tab, titre_dat = titre_dat, antigenic_map = antigenic_map, start_inf_hist = start_inf, CREATE_POSTERIOR_FUNC = create_posterior_func, version = 2)
If you want to change any default options, such as increasing or decreasing the number of iterations and the adaptive period you can pass a vector of function argument values to serosolver()
.
mcmc_pars <- c("iterations"=50000,"target_acceptance_rate_theta"=0.44,"target_acceptance_rate_inf_hist"=0.44, "adaptive_frequency"=1000,"thin"=1, "adaptive_iterations"=10000, "save_block"=1000, "thin_inf_hist"=10, "proposal_inf_hist_indiv_prop"=0.5, "proposal_ratio"=2, "proposal_inf_hist_time_prop"=0.5, "proposal_inf_hist_distance"=5, "proposal_inf_hist_adaptive"=0, "proposal_inf_hist_indiv_swap_ratio"=0.5,"proposal_inf_hist_group_swap_ratio"=0.5, "proposal_inf_hist_group_swap_prop"=0.5) res <- serosolver(par_tab = start_tab, titre_dat = titre_dat, antigenic_map = antigenic_map, mcmc_pars = mcmc_pars, start_inf_hist = start_inf, CREATE_POSTERIOR_FUNC = create_posterior_func, version = 2)
serosolver
contains numerous plotting functions for visualising results and evaluating model performance. Below are examples of several useful plotting functions.
To evaluate model performance, trace and density plots can be produced by plotting the MCMC chains. In the below example the adaptive period and burn-in from the MCMC are excluded in the diagnostic plots.
Plot inferred attack rates.
Plot inferred infection histories.
Plot inferred antibody titres.
Generate cumulative incidence plots.
The following code produces a table of the parameter estimates.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.