In adamkucharski/serosolver: Inference Framework for Serological Data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Overview

serosolver jointly infers antibody kinetics, infection histories and attack rates from cross-sectional or longitudinal serological data, originally described by Kucharski et al. @kucharski2018 and developed further by Hay et al. @hay2020 . serosolver exploits the predictability of antibody kinetics over time following exposure, using one or more antibody level measurements combined with a parameterised antibody kinetics model to back-calculate (i.e., extrapolate backwards) the timing of individual infections.

What distinguishes serosolver from other such time-since-infection models is a) jointly inferring an antibody kinetics model and infection histories within the same framework; b) the ability to infer multiple infections per person; and c) supporting multi-antigen systems as well as single-antigen problems.

The core of the package is the serosolver function, which takes all of the data and model assumptions as inputs and saves Markov chain Monte Carlo (MCMC) outputs to disk. serosolver implements a flexible, hierarchical model, meaning that users must be aware of a number of model design decisions. In this vignette, we will show how a serological dataset can be cleaned for use in serosolver, and walk through the core functionality of the package and post-processing functions.

This guide provides an overview of the key components for fitting a serosolver model. For more in depth instructions on specific features, see the case study vignettes. We will discuss the following workflow:

knitr::include_graphics("figures/guide_fig1.png")

Entire workflow

The following code runs the entire pipeline for a simulation-recovery experiment with serosolver. We will step through each of these sections in more detail in this guide.

library(serosolver)
library(tidyverse)
library(doParallel)

## Get all possible infection times
possible_exposure_times <- seq(2000,2024,by=1)

## Vector of antigens that have biomarker measurements (note only one representative antigen per time)
sampled_antigens <- seq(min(possible_exposure_times), max(possible_exposure_times), by=2)

## Times at which serum samples can be taken
sampling_times <- 2020:2024

## Number of serum samples taken
n_samps <- 5

## Simulate some random attack rates
attack_rates <- runif(length(possible_exposure_times), 0.05, 0.15)

## Use example antigenic map (can also be left as NULL)
antigenic_map <- example_antigenic_map[1:25,]
antigenic_map$inf_times <- possible_exposure_times

## Simulate a full serosurvey with these parameters
all_simulated_data <- simulate_data(par_tab=example_par_tab, group=1, n_indiv=50,
                                    possible_exposure_times=possible_exposure_times,
                                    measured_biomarker_ids = sampled_antigens,
                                    sampling_times=sampling_times, nsamps=n_samps,
                                    antigenic_map=antigenic_map,
                                    age_min=10,age_max=75,
                                    attack_rates=attack_rates, repeats=2,
                                    data_type=c(1))

## Pull out the simulated titre data and infection histories
antibody_data <- all_simulated_data$antibody_data
true_inf_hist <- all_simulated_data$infection_histories
plot_antibody_data(antibody_data,possible_exposure_times,1:4,infection_histories = true_inf_hist)

output <- serosolver::serosolver(example_par_tab, antibody_data, 
                                 antigenic_map,
                filename="readme", n_chains=3,parallel=TRUE,
                mcmc_pars=c(adaptive_iterations=100000, iterations=500000),verbose=TRUE)

chains <- load_mcmc_chains(location=getwd(),par_tab=example_par_tab,burnin = 100000,unfixed=TRUE)
plot_model_fits(chain = chains$theta_chain,
                infection_histories = chains$inf_chain,
                known_infection_history = true_inf_hist,
                antibody_data = antibody_data,individuals=1:4,
                antigenic_map=antigenic_map,
                par_tab=example_par_tab,
                orientation="cross-sectional")

Installation and dependencies

Install serosolver using the devtools package:

devtools::install_github("seroanalytics/serosolver")

Note, that serosolver uses Rcpp code, including RcppArmadillo. You will therefore need a working C++ compiler such as clang or gcc. The easiest way to set this up is up for Windows users is the Rtools package. For Mac users, you will need to install a compiler such as Xcode or clang.

There are a number of additional package requirements, all of which can be installed from cran:

library(serosolver)
library(ggplot2)
library(plyr)
library(dplyr)
library(tidyr)
library(data.table)
library(doParallel)
library(coda)

1. Input data

1.1 Required variables

serosolver expects a data frame in long format (i.e., there is a row for each observation), referred to as antibody_data in most function arguments, giving the individual ID, sample time, biomarkers or antigens measured, serological measurement, and demographic information:

data(example_antibody_data)
knitr::kable(head(example_antibody_data))

Description of variables:

individual: consecutive integer ID of individuals from $1,\dots,N$, where $N$ is the number of individuals in the sample.
sample_time: integer value of the time period in which a sample was collected.
biomarker_id: integer ID of the measured biomarker or antigen.
biomarker_group: optional, integer ID of the measurement type.
measurement: the measurement for this individual against biomarker_id at sample_time.
repeat_number: integer value starting from 1 where repeat measurements are available per sample and biomarker_id combination.
population_group: integer ID separating individuals within a sample into different groups.
birth: integer value giving the time period (matching the supported range of sample_time) in which an individual was born.

Data can be cross-sectional samples with multiple biomarkers measured per sample (e.g., multiple influenza strains), such that different biomarker_ids correspond to different time periods, or longitidinal samples with one or more biomarkers measured per sample (e.g., measuring measles IgG titers over time).

plot_antibody_data(example_antibody_data,example_antigenic_map$inf_times,n_indivs=1:3,study_design = "cross-sectional")

plot_antibody_data(example_antibody_data %>% filter(biomarker_id %in% c(1968,2002,2012)),2010:2015,n_indivs=1:3,study_design = "longitudinal")

1.2 Serological data

Most applications of serosolver involve fitting to antibody titre or concentration data. Make a note of whether your data are continuous or discrete, and the upper and lower bounds of the assay. These will need to be specified in the model later using the par_tab object. serosolver also supports multiple measurement types per sample, for example, measuring both antibody titre and avidity. These observation types are distinguished by the biomarker_group variable --- all observations within a biomarker_group are underpinned by their own antibody kinetics model parameters, but arising from the same individual-level infection histories. These will be specified using the data_type argument later.

## Example of two biomarker_groups, one has discrete data one has continuous data
## Set bounds of the assay
par_tab[par_tab$names=="min_measurement" & par_tab$biomarker_group == 1,values] <- lower_bound1
par_tab[par_tab$names=="max_measurement" & par_tab$biomarker_group == 1,values] <- upper_bound1

par_tab[par_tab$names=="min_measurement" & par_tab$biomarker_group == 2,values] <- lower_bound2
par_tab[par_tab$names=="max_measurement" & par_tab$biomarker_group == 2,values] <- upper_bound2

## Set observation type, discrete = 1, continuous = 2
data_type <- c(1,2) ## used in `serosolver::serosolver`, `serosolver::simulate_data` etc

1.3 Antigenic map

serosolver was designed to treat biomarker_id as interchangeable with the time variables, such that one antigen ID is assumed to circulate in one time period. For example, biomarker_id=1968 refers to the antigen assumed to be circulating in 1968. Note that at present, serosolver supports only one antigen/biomarker per time period. However, a single biomarker_id can be assumed to circulate and dominate for multiple time periods, which can be specified using the antigenic_map argument.

The antigenic_map object tells serosolver how different biomarker_ids are related to each other (using coordinates on an antigenic map), as well as the time periods during which the biomarker_id is assumed to be circulating. Setting all antigenic coordinates to the same values tells serosolver that there are no antigenic differences between biomarker_ids. The same thing can be achieved by setting antigenic_map=NULL and instead specifying the possible_exposure_times vector.

## Example antigenic map
data(example_antigenic_map)
knitr::kable(head(example_antigenic_map))

## Antigenic map with no antigenic variation
data.frame(x_coord=1,y_coord=1,inf_times=1:10)

## Antigenic map with clusters
data.frame(x_coord = rep(seq(1,3,by=1),each=3), y_coord=rep(seq(1,3,by=1),each=3),inf_times=1:9)

## Antigenic map with just one biomarker ID
data.frame(x_coord=1,y_coord=1,inf_times=1)

## Antigenic map with two biomarker groups
data.frame(x_coord=1,y_coord=1,inf_times=rep(1:10,2),biomarker_group=rep(c(1,2),each=10))

1.4 Covariate data

Individuals can't be infected before they're born, and we can't infer infections which occurred after the final sample from each individual. serosolver uses the birth and sample_time variables to constrain which biomarker_ids an individual can have been infected by.

The population_group variable is used to split individuals into different groups. These might represent different geographic groups, which can be used to estimate separate attack rates, or demographic groups, which can be used to estimate demographic-specific antibody kinetics (or both). For most models, this variable can be ignored or set to 1 for all individuals. See \<\< PLACEHOLDER >> for more details.

2 Model decisions

2.1 Infection history model and priors

For new users, the most confusing model design decision in serosolver is setting the infection history model. First, we need to understand the epidemiological question we seek to address. If our main objective is to estimate attack rates, we need to think about the power/information we have to estimate attack rates and infection histories at a chosen time resolution (e.g., annual attack rates or per month). If instead our main objective is to understand antibody kinetics, we might wish to simplify the infection history model to allow a more complex antibody kinetics model.

\<\< ADD IMAGE ON INFECTION HISTORY VECTOR WITH ATTACK RATES ETC. Could compare prior plot at different time resolutions >>

2.1.1 Model time resolution

All time variables in a serosolver model must be at the same time resolution, including sample_time, biomarker_id, birth, and possible_exposure_times. This can cause confusion, as these variables must be integers, and thus we have to do some transformation before running serosolver:

For annual samples, the sample time is simply the year each sample was collected in. For example, data collected between 2009 and 2011 would have values of 2009, 2010, or 2011.
For semi-annual samples, the sample time is the year the sample was collected multiplied by 2 (since there are two semesters in every year). For example, data collected between 2009 and 2011 would have values of $(2009 \times 2) + 1 = 4019$, $(2009.5 \times 2) + 1 = 4020$, $(2010 \times 2) + 1 = 4021$, $(2010.5 \times 2) + 1 = 4022$, $(2011 \times 2) + 1 = 4023$, $(2011.5 \times 2) + 1 = 4024$.
Similarly, for quarterly or monthly samples, the sample time follows the same pattern. For example for a sample collected in July of 2009, the quarterly sample time would be $(2009.75 \times 4) + 1 = 8040$ (because July is in the third quarter of the year) and the monthly sample time would be $(2009.5833 \times 12) + 1 = 24116$ (because July is the seventh month and $7/12=0.5833$).

A useful feature of serosolver is to use mixed time resolutions. For example, we might estimate infection histories at an annual resolution up to 2009, and then at a per-quarter resolution thereafter. Another example might be to estimate per-cluster infections for historic periods of time to establish long-term infection histories, and then per-quarter or per-year infection histories during the study period to model shorter-term dynamics. This can be useful for reducing parameter space when we have limited information to draw inferences about particular time periods, but rich information for others. See \<\< PLACEHOLDER >> for more details.

2.1.2 Model time period

It is possible to estimate infection states just for the duration of the serostudy, or to estimate infection histories for an arbitrary number of time periods prior to the study. For example, we might use serosolver to estimate infections that occurred between two serum sample collection times in the same way we would measure seroconversion rates. Alternatively, we might only have serum samples for recent time periods, but aim to estimate infection histories all the way back to the time of birth.

serosolver estimates an infection state for each element of the vector possible_exposure_times, which is either passed directly or extract automatically from antigenic_map$inf_times. The range and entries of these vectors therefore determines the model time period and resolution of the infection history model.

2.2.3 Infection history priors

Version 1, 2 or 3? Link to paper

2.2 Antibody kinetics model and priors

Along with a data file, serosolver expects several additional inputs namely, a parameter input file for starting values of antibody kinetics parameters, an antigenic map, an age mask, and starting infection histories. The following section will walk the users through creating each of these inputs.

Several input files are stored within the serosolver package in the 'inputs' directory, however users can create their own. The default parameter file 'parTab_base.csv' is shown below:

  options(scipen=999)
  par_tab_path <- system.file("extdata", "par_tab_base.csv", package = "serosolver")
  par_tab <- read.csv(par_tab_path, stringsAsFactors=FALSE)
  par_tab[par_tab$names %in% c("alpha","beta"),"values"] <- c(1,1)

  ## Remove phi term for infection history prior version 2
  par_tab <- par_tab[par_tab$type != 2,]
  knitr::kable(par_tab)

In the parameter file, the column 'fixed' indicates if each parameter value is fixed or should be estimated. For values that are not fixed, starting values need to be generated. The below code generates a random starting value for unfixed parameter values.

2.3 Observation model

3. Running serosolver

seroslver uses an Adaptive Metropolis-within-Gibbs algorithm. Given a starting point and the necessary MCMC parameters, serosolver() performs a random-walk of the posterior space to produce an MCMC chain that can be used to generate MCMC density and iteration plots (see section 4).

Using the default options, the MCMC can be run using the following.

# run MCMC  
  res <- serosolver(par_tab = start_tab, titre_dat = titre_dat, antigenic_map = antigenic_map, 
                  start_inf_hist = start_inf, CREATE_POSTERIOR_FUNC = create_posterior_func,
                  version = 2)

If you want to change any default options, such as increasing or decreasing the number of iterations and the adaptive period you can pass a vector of function argument values to serosolver().

  mcmc_pars <- c("iterations"=50000,"target_acceptance_rate_theta"=0.44,"target_acceptance_rate_inf_hist"=0.44,
                "adaptive_frequency"=1000,"thin"=1,
                "adaptive_iterations"=10000, "save_block"=1000, 
                "thin_inf_hist"=10, "proposal_inf_hist_indiv_prop"=0.5,
                "proposal_ratio"=2, "proposal_inf_hist_time_prop"=0.5, 
                "proposal_inf_hist_distance"=5, "proposal_inf_hist_adaptive"=0,
                "proposal_inf_hist_indiv_swap_ratio"=0.5,"proposal_inf_hist_group_swap_ratio"=0.5,
                "proposal_inf_hist_group_swap_prop"=0.5)
  res <- serosolver(par_tab = start_tab, titre_dat = titre_dat, antigenic_map = antigenic_map, 
                mcmc_pars = mcmc_pars, start_inf_hist = start_inf,
                CREATE_POSTERIOR_FUNC = create_posterior_func, version = 2)

4. Output posteriors

serosolver contains numerous plotting functions for visualising results and evaluating model performance. Below are examples of several useful plotting functions.

5. Output diagnostics

To evaluate model performance, trace and density plots can be produced by plotting the MCMC chains. In the below example the adaptive period and burn-in from the MCMC are excluded in the diagnostic plots.

6. Output estimates

Plot inferred attack rates.

Plot inferred infection histories.

Plot inferred antibody titres.

Generate cumulative incidence plots.

Parameter estimates

The following code produces a table of the parameter estimates.

References

adamkucharski/serosolver documentation built on Aug. 16, 2024, 3:31 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

adamkucharski/serosolver
Inference Framework for Serological Data

In adamkucharski/serosolver: Inference Framework for Serological Data

Overview

Contents

Entire workflow

Installation and dependencies

1. Input data

1.1 Required variables

1.2 Serological data

1.3 Antigenic map

1.4 Covariate data

2 Model decisions

2.1 Infection history model and priors

2.1.1 Model time resolution

2.1.2 Model time period

2.2.3 Infection history priors

2.2 Antibody kinetics model and priors

2.3 Observation model

3. Running serosolver

4. Output posteriors

5. Output diagnostics

6. Output estimates

Parameter estimates

References

R Package Documentation

Browse R Packages

We want your feedback!

adamkucharski/serosolver Inference Framework for Serological Data

In adamkucharski/serosolver: Inference Framework for Serological Data

Overview

Contents

Entire workflow

Installation and dependencies

1. Input data

1.1 Required variables

1.2 Serological data

1.3 Antigenic map

1.4 Covariate data

2 Model decisions

2.1 Infection history model and priors

2.1.1 Model time resolution

2.1.2 Model time period

2.2.3 Infection history priors

2.2 Antibody kinetics model and priors

2.3 Observation model

3. Running serosolver

4. Output posteriors

5. Output diagnostics

6. Output estimates

Parameter estimates

References

R Package Documentation

Browse R Packages

We want your feedback!

adamkucharski/serosolver
Inference Framework for Serological Data