In seroanalytics/serosolver: Inference Framework for Serological Data

knitr::opts_chunk$set(
  collapse = TRUE,
  echo=TRUE,
  comment = "#>",
  eval=TRUE
)

Overview

This vignette provides all of the analysis for case study 2 in the accompanying package and paper. Briefly, the aim is to infer antibody kinetics and historical attack rates using cross-sectional haemagglutination inhibition titre data on a panel of recent and historical A/H3N2 strains. All of the functions used here are well documented and have many tunable arguments, and we therefore encourage users to refer to the help files.

This vignette demonstrates only how to reproduce the MCMC chains, simulate data, assess model fits and assess chain convergence. Code to reproduce figures from the main text in the accompanying paper can be found in the inst/extdata/scripts folder of the package.

Setup

Installation and requirements

serosolver may be installed from GitHub using the devtools package. A number of additional packages are also needed for this analysis.

# Required to run serosolver
devtools::install_github("seroanalytics/serosolver")
library(serosolver)
library(plyr)
library(data.table)

## Required for this analysis
library(reshape2)
library(foreach)
library(doParallel)
library(bayesplot)
library(coda)
library(ggplot2)
library(viridis)

# set up cluster
set.seed(1234)
cl <- makeCluster(5)
registerDoParallel(cl)

## Note that this vignette was generated on a Windows machine,
## and the setup for parallelisation is different on a Linux machine
## for Linux machine:
#  library(doMC)
#  library(doRNG)
#  registerDoMC(cores=5)

Assumptions

In this analysis, all serological samples were taken in 2009 and therefore all time variables are relative to this year. We are interested in inferring infections and attack rates at an annual resolution, and therefore set resolution to 1. Our primary outcome of interest is to infer unbiased historical attack rates, and we therefore use the version of the code with a Beta prior on per-time attack rates, prior_version=2. Furthermore, we have found that when the number of possible infections to infer is large but the data are relatively sparse, identifiability is poor when using reference priors (e.g. a uniform Beta or Jeffreys prior). We instead opted to use a weakly informative prior for annual attack rates with mode = 0.15 but with high variance, corresponding to prior observations of annual influenza attack rates. We set these parameters at the start of the analysis.

filename <- "case_study_2"
resolution <- 1 ## eg. this would be set to 12 for monthly resolution
sample_year <- 2009

serosolver::describe_priors()
prior_version <- 2

Preparing the data

The data used in this analysis are haemagglutination inhibition (HI) titres against a number of A/H3N2 strains that have circulated since 1968. The raw data are in a wide format, providing the highest two-fold dilution of serum at which haemagglutination is inhibited. The first step of the analysis is therefore to clean the titre data and convert the data frame to the long format, as described in the quickstart vignette.

## Read in data
raw_dat_path <- system.file("extdata", "Fluscape_HI_data.csv", package = "serosolver")
raw_dat <- read.csv(file = raw_dat_path, stringsAsFactors = FALSE)
print(head(raw_dat))

## Add indexing column for each individual
raw_dat$individual <- 1:nrow(raw_dat)

## Convert data to long format
melted_dat <- reshape2::melt(raw_dat, id.vars=c("individual","Age"),stringsAsFactors=FALSE)

## Modify column names to meet serosolver's expectations
colnames(melted_dat) <- c("individual","DOB","virus","titre")
melted_dat$virus <- as.character(melted_dat$virus)

## Extract circulation years for each virus code, which will be used 
## by serosolver as the circulation time
melted_dat$virus <- as.numeric(sapply(melted_dat$virus, function(x) strsplit(x,split = "HI.H3N2.")[[1]][2]))

## Clean and log transform the data
melted_dat <- melted_dat[complete.cases(melted_dat),]
melted_dat[melted_dat$titre == 0,"titre"] <- 5
melted_dat$titre <- log2(melted_dat$titre/5)

## Convert ages to DOB
melted_dat$DOB <- sample_year - melted_dat$DOB

## All samples taken at the same time
melted_dat$samples <- sample_year

## Add column for titre repeats, enumerating for each measurement for the same virus/sample/individual
melted_dat <- plyr::ddply(melted_dat,.(individual,virus,samples),function(x) cbind(x,"run"=1:nrow(x),"group"=1))

## Rename to data expected by serosolver
titre_dat <- melted_dat
print(head(titre_dat))
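
The titre transformation and year extraction above can be illustrated on a toy example. This is a minimal sketch in base R, assuming column names of the form HI.H3N2.<year> as in the code above; the titre values are made up for illustration and are not taken from the Fluscape data.

```r
## Toy raw titres (made-up values); 0 denotes an undetectable titre
raw_titres <- c(0, 10, 20, 40, 1280)

## Undetectable titres are set to 5, then titres are put on the log2 scale
raw_titres[raw_titres == 0] <- 5
log_titres <- log2(raw_titres / 5)
print(log_titres)

## Extract the circulation year from column names of the assumed form.
## fixed = TRUE treats the dots literally rather than as regex wildcards.
col_names <- c("HI.H3N2.1968", "HI.H3N2.2005")
years <- as.numeric(sub("HI.H3N2.", "", col_names, fixed = TRUE))
print(years)
```

On this scale each unit corresponds to one two-fold dilution step, and a raw titre of 1280 maps to a log titre of 8.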

Given that this analysis uses titres from multiple, antigenically related viruses, it is necessary to define an antigenic map describing the antigenic distance between all of the viruses here. We use coordinates based on the antigenic map created by Fonville et al. Generating the antigenic map involves fitting a smoothing spline through the provided coordinates to give a representative virus for each time point (in this case, each year) at which an individual could be infected. This process also imputes antigenic coordinates for time points for which we do not have a measured virus.

## Read in raw coordinates
antigenic_coords_path <- system.file("extdata", "fonville_map_approx.csv", package = "serosolver")
antigenic_coords <- read.csv(file = antigenic_coords_path, stringsAsFactors=FALSE)
print(head(antigenic_coords))

## Convert to form expected by serosolver
antigenic_map <- generate_antigenic_map(antigenic_coords, resolution)
print(head(antigenic_map))

## More flexible version of the above function
virus_key <- c(
    "HK68" = 1968, "EN72" = 1972, "VI75" = 1975, "TX77" = 1977, 
    "BK79" = 1979, "SI87" = 1987, "BE89" = 1989, "BJ89" = 1989,
    "BE92" = 1992, "WU95" = 1995, "SY97" = 1997, "FU02" = 2002, 
    "CA04" = 2004, "WI05" = 2005, "PE06" = 2006
  )
antigenic_coords$Strain <- virus_key[antigenic_coords$Strain]
antigenic_map <- generate_antigenic_map_flexible(antigenic_coords)

## Restrict entries to years of interest. Entries in antigenic_map determine
## the times that individual can be infected ie. the dimensions of the infection
## history matrix.
antigenic_map <- antigenic_map[antigenic_map$inf_times >= 1968 & antigenic_map$inf_times <= sample_year,]
strain_isolation_times <- unique(antigenic_map$inf_times)
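
The idea of smoothing through measured coordinates to obtain one representative virus per year can be sketched with base R. This is only an illustration under stated assumptions: the toy coordinates are made up, the column names x_coord and y_coord are hypothetical, and the combination of a linear trend with stats::smooth.spline is one plausible approach rather than a description of how generate_antigenic_map is implemented.

```r
## Toy antigenic coordinates for a handful of measured strains (made-up values)
toy <- data.frame(
  year = c(1968, 1972, 1975, 1977, 1979, 1987, 1989, 1992),
  x = c(0, 3.1, 5.2, 7.0, 8.1, 13.5, 15.2, 17.9),
  y = c(0, 1.2, 0.4, 2.0, 2.9, 3.1, 4.4, 5.0)
)

## Assumption: antigenic drift along x is roughly linear in time
x_trend <- lm(x ~ year, data = toy)

## Smoothing spline of y on x through the measured strains
y_smooth <- smooth.spline(toy$x, toy$y)

## Predict coordinates for every year, including years with no measured virus
years <- seq(1968, 1992, by = 1)
x_pred <- as.numeric(predict(x_trend, newdata = data.frame(year = years)))
y_pred <- predict(y_smooth, x = x_pred)$y

toy_map <- data.frame(x_coord = x_pred, y_coord = y_pred, inf_times = years)
print(head(toy_map))
```

Each row of toy_map is one possible infection time, so the number of rows sets the number of columns in the infection history matrix.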

NOTE: generate_antigenic_map expects the provided file fonville_map_approx.csv. Users should refer to generate_antigenic_map_flexible for more generic antigenic map generation.

Finally, we must specify the par_tab data frame, which controls which parameters are included in the model, which are fixed, and their uniform prior ranges. Given that we are integrating out the probability of infection terms under prior version 2, we must remove these parameters from par_tab. Furthermore, given that we are interested in long-term dynamics with relatively sparse data, we remove parameters relating to the short-term antibody kinetics phase to avoid identifiability issues. We set alpha and beta of the Beta prior to give a mode of 0.15, assuming that our prior belief carries the same weight as 4 observed individuals.

par_tab_path <- system.file("extdata", "par_tab_base.csv", package = "serosolver")
par_tab <- read.csv(file = par_tab_path, stringsAsFactors=FALSE)

## Set parameters for Beta prior on infection histories
beta_pars <- find_beta_prior_mode(0.15,4)
par_tab[par_tab$names == "alpha","values"] <- beta_pars$alpha
par_tab[par_tab$names == "beta","values"] <- beta_pars$beta
## Maximum recordable log titre in these data is 8
par_tab[par_tab$names == "MAX_TITRE","values"] <- 8

## Remove phi parameters, as these are integrated out under prior version 2
par_tab <- par_tab[par_tab$names != "phi",]

## Fix all short term parameters to 0
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"fixed"] <- 1 # mu_short, waning and sigma2 are fixed
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"values"] <- 0 # set these values to 0
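
To make the Beta prior choice concrete, the sketch below computes shape parameters from a mode and a pseudo-observation count using the standard identity mode = (alpha - 1)/(alpha + beta - 2). The helper beta_from_mode is hypothetical, and whether find_beta_prior_mode uses exactly this parameterisation is an assumption; the values it returns may differ.

```r
## Hypothetical helper: Beta shape parameters with a given mode, where k acts
## as the number of pseudo-observations (assumed parameterisation)
beta_from_mode <- function(mode, k) {
  list(alpha = mode * k + 1, beta = (1 - mode) * k + 1)
}

pars <- beta_from_mode(0.15, 4)

## Check the implied mode and inspect the spread of the prior
implied_mode <- (pars$alpha - 1) / (pars$alpha + pars$beta - 2)
total <- pars$alpha + pars$beta
prior_sd <- sqrt(pars$alpha * pars$beta / (total^2 * (total + 1)))
print(c(alpha = pars$alpha, beta = pars$beta, mode = implied_mode, sd = prior_sd))
```

A small k keeps the prior weak relative to the data, which is the intent of the "high variance" choice described in the assumptions.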

Summary

\begin{enumerate}
\item Choose resolution, attack rate priors and reference time for "the present"
\item Convert titre data to long format
\item Generate antigenic map for all viruses that an individual could be exposed to
\item Generate parameter control table for MCMC
\end{enumerate}

Running the MCMC

We are now ready to fit our model. We will fit multiple chains in parallel, though the below analysis could easily be replicated by running chains sequentially. Starting conditions for the MCMC chain must be generated that return a finite likelihood. The user may modify many of the MCMC control parameters, though the defaults are fine for most purposes. We have made some minor tweaks in this case study to improve convergence on infection history estimates. Step sizes for parameters in par_tab are tuned automatically, and some automated tuning of the infection history proposals takes place for prior version 3. However, for other attack rate priors, it is necessary for the user to do some manual tuning of: a) the number of individuals sampled at each step, proposal_inf_hist_indiv_prop; b) the number of time points sampled at each step, proposal_inf_hist_time_prop; c) the frequency of individual infection history swapping steps (i.e. for an individual, choose two time points and swap their contents), proposal_inf_hist_indiv_swap_ratio; d) the proportion of infection history sampling steps that should be the alternative swapping step, where the contents of infection histories at two time points are swapped, proposal_inf_hist_group_swap_ratio; e) the proportion of infection histories to swap with each alternative swapping step, proposal_inf_hist_group_swap_prop. For example, in this case study, attack rates are likely to be highly correlated in adjacent years (as we have limited data to distinguish between infections in years close in time), and we therefore increase the frequency of the alternative infection history swapping step with proposal_inf_hist_group_swap_ratio.
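
As a rough illustration of the alternative swapping step described in (d) and (e), the sketch below swaps the contents of two randomly chosen time points for a proportion of individuals in a toy infection history matrix. The function swap_time_points and its argument swap_prop are hypothetical names, and this is not the sampler used by serosolver; it only shows why such a move helps when adjacent years are hard to tell apart.

```r
## Hypothetical proposal: swap infection states between two time points
## for a proportion of individuals (rows = individuals, columns = times)
swap_time_points <- function(inf_hist, swap_prop = 1) {
  n_indiv <- nrow(inf_hist)
  times <- sample(ncol(inf_hist), 2)              # two distinct time points
  n_swap <- max(1, floor(swap_prop * n_indiv))    # how many individuals to swap
  indivs <- sample(n_indiv, n_swap)
  proposed <- inf_hist
  proposed[indivs, times[1]] <- inf_hist[indivs, times[2]]
  proposed[indivs, times[2]] <- inf_hist[indivs, times[1]]
  proposed
}

set.seed(1)
inf_hist <- matrix(rbinom(10 * 6, 1, 0.2), nrow = 10)
proposed <- swap_time_points(inf_hist, swap_prop = 0.5)

## The move relocates infections in time without changing anyone's total
print(rbind(before = colSums(inf_hist), after = colSums(proposed)))
```

Because each individual's total number of infections is unchanged, a move like this can shift attack rate mass between correlated years in a single step.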

Changing the number of iterations and the length of the adaptive period is often desirable. More crucially, the amount of chain thinning should be specified to ensure that users are not saving a large number of MCMC iterations (as this will rapidly fill disk space!). Thinning should be set such that at least 1000 iterations are saved (i.e. both iterations/thin and iterations/thin_inf_hist should be at least 1000). Users are encouraged to pay extra attention to thin_inf_hist, which dictates the thinning of the infection history chain, and can generate a very large file if left unchecked.
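
The thinning arithmetic can be checked before launching a long run. This is a minimal sketch that assumes one sample is saved every thin iterations and ignores the adaptive period; how serosolver counts adaptive iterations is not covered here. The helper thin_for_target is a hypothetical name.

```r
## Settings of the kind passed to mcmc_pars
iterations <- 500000
thin <- 1000
thin_inf_hist <- 1000
no_chains <- 5

## Approximate number of saved samples per chain and across all chains
saved_theta <- iterations / thin
saved_inf_hist <- iterations / thin_inf_hist
print(c(per_chain = saved_theta, all_chains = saved_theta * no_chains))

## Hypothetical helper: largest thinning interval that still saves at least
## `target` samples per chain
thin_for_target <- function(iterations, target) floor(iterations / target)
print(thin_for_target(iterations, target = 1000))
```

Running this kind of check for both thin and thin_inf_hist makes it easy to see whether the saved sample count is met per chain or only after pooling chains.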

## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)
chain_path <- sub("par_tab_base.csv","",par_tab_path)
chain_path_real <- paste0(chain_path, "cs2_real/")
chain_path_sim <- paste0(chain_path, "cs2_sim/")

## Create the posterior solving function that will be used in the MCMC framework 
model_func <- create_posterior_func(par_tab=par_tab,
                                titre_dat=titre_dat,
                                antigenic_map=antigenic_map,
                                version=prior_version) # function in posteriors.R

## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generating starting antibody kinetics parameters
    start_tab <- generate_start_tab(par_tab)

    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(titre_dat, strain_isolation_times, 
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }


  res <- serosolver(par_tab = start_tab, 
                  titre_dat = titre_dat,
                  antigenic_map = antigenic_map,
                  start_inf_hist = start_inf, 
                  mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                "thin_inf_hist"=1000,"save_block"=1000,
                                "proposal_inf_hist_time_prop"=1, "proposal_inf_hist_indiv_prop"=1,
                                "proposal_inf_hist_group_swap_ratio"=0.8, "proposal_inf_hist_group_swap_prop"=1),
                  filename = paste0(chain_path_real,x), 
                  CREATE_POSTERIOR_FUNC = create_posterior_func, 
                  version = prior_version)
}

Post-run analyses

Once the MCMC chains are run, serosolver provides a number of simple functions to generate standard outputs and MCMC diagnostics. The saved MCMC chains are compatible with the coda and bayesplot packages, and users are encouraged to use these. First, read in the MCMC chains. The below function distinguishes between posterior samples for the infection history matrix and for the process parameters. The function searches the specified directory for all files with the filenames generated by serosolver, and returns a list containing these chains both concatenated and separated.

## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
#all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
#                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)

## Alternatively, load the included MCMC chains rather than re-running
data(cs2_chains_real)
all_chains <- cs2_chains_real

print(summary(all_chains))

Chains should then be checked for the usual MCMC diagnostics: $\hat{R}$ and effective sample size. First, looking at the antibody kinetics process parameters:

## Get the MCMC chains as a list
list_chains <- all_chains$theta_list_chains
## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) as.mcmc(x[,c("mu","sigma1","error",
                                                     "tau","total_infections",
                                                     "lnlike","prior_prob")]))

## Gelman-Rubin diagnostics to assess between-chain convergence for each parameter
print(gelman.diag(as.mcmc.list(list_chains1)))
gelman.plot(as.mcmc.list(list_chains1))

## Effective sample size for each parameter
print(effectiveSize(as.mcmc.list(list_chains1)))

## Posterior estimates for each parameter
print(summary(as.mcmc.list(list_chains1)))


## Plot the MCMC trace using the `bayesplot` package
color_scheme_set("viridis")
p_theta_trace <- mcmc_trace(list_chains1)
print(p_theta_trace)
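
For readers unfamiliar with $\hat{R}$, the sketch below computes a basic version by hand on toy chains. It compares between-chain and within-chain variance and omits the degrees-of-freedom correction applied by coda's gelman.diag, so values will not match that function exactly. The function name basic_rhat and the toy chains are illustrative assumptions.

```r
## Basic potential scale reduction factor for one parameter
## chains: matrix with one column per chain and one row per saved sample
basic_rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_hat <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_hat / W)
}

set.seed(1)
## Four toy chains sampling the same distribution
mixed <- matrix(rnorm(4 * 1000), ncol = 4)
## The same chains shifted apart, mimicking chains stuck in different regions
stuck <- sweep(mixed, 2, c(0, 1, 2, 3), "+")

rhat_mixed <- basic_rhat(mixed)
rhat_stuck <- basic_rhat(stuck)
print(c(mixed = rhat_mixed, stuck = rhat_stuck))
```

Values close to 1 indicate that the chains agree, while values well above 1 indicate that they have not converged to the same distribution.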

Next, looking at the infection histories:

## Extract infection history chain
inf_chain <- all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, titre_dat, strain_isolation_times, pad_chain=TRUE,
                          plot_den = TRUE,prior_pars=list(prior_version=prior_version, 
                                            alpha=par_tab[par_tab$names=="alpha","values"],
                                            beta=par_tab[par_tab$names=="beta","values"])) 
print(p_ar)

## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
n_alive <- get_n_alive_group(titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain, 
                                      years=strain_isolation_times, 
                                      n_alive=n_alive)


## Posterior mean, median, 95% credible intervals and effective sample size
## on per time attack rates
print(head(ps_infhist[["estimates"]]$by_year))

## Posterior mean, median, 95% credible intervals and effective sample size
## on per individual total number of infections
print(head(ps_infhist[["estimates"]]$by_indiv))

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
## Each subplot shows one year
print(ps_infhist[["by_time_trace"]][[1]])

## MCMC trace plots of total number of infections per individual
## Each subplot shows one individual
print(ps_infhist[["by_indiv_trace"]][[1]])

## Distribution of total number of infections
print(ps_infhist[["indiv_infections"]])

## Check for agreement between inferred cumulative infection histories 
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                  strain_isolation_times = strain_isolation_times,
                                                  number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])

## Posterior probability that infections occurred at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])
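
The reason for scaling by the number alive can be seen in a small base R example. This is a sketch on made-up data, assuming that an individual counts as alive in a year when their DOB is less than or equal to that year, as in the code later in this vignette; it does not call get_n_alive_group.

```r
## Toy cohort: dates of birth and a handful of possible infection years
dob <- c(1960, 1975, 1990, 2000)
years <- c(1970, 1980, 1995, 2005)

## Toy infection history matrix (rows = individuals, columns = years);
## nobody is infected before birth
inf_hist <- rbind(
  c(1, 0, 1, 0),
  c(0, 1, 0, 0),
  c(0, 0, 1, 1),
  c(0, 0, 0, 1)
)

## Number of individuals alive in each year
n_alive <- sapply(years, function(y) sum(dob <= y))

## Attack rate among those alive, compared with naive scaling by cohort size
attack_rate <- colSums(inf_hist) / n_alive
naive_rate <- colSums(inf_hist) / nrow(inf_hist)
print(data.frame(years, n_alive, attack_rate, naive_rate))
```

Dividing by the full cohort size understates attack rates in early years, when few sampled individuals had been born.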

Mixing can sometimes be very poor for per-time attack rates when adjacent times are highly correlated. This is often the case when the data are relatively sparse. A cruder time resolution (e.g. per two years) may be advisable, and mixing may benefit from increasing the proposal_inf_hist_group_swap_ratio and proposal_inf_hist_indiv_prop parameters in the mcmc_pars list in serosolver. proposal_inf_hist_group_swap_ratio determines how frequently the MCMC sampler uses a proposal step that swaps the infection states of a proportion of individuals (set by proposal_inf_hist_group_swap_prop) between two time points.

Users may also easily check the inferred antibody landscapes at the time each sample was taken. Black dots show observations, shaded regions show the 95% and 50% credible intervals, and the black line shows the posterior median.

## get_titre_predictions expects only a single MCMC chain, so
## subset for only one chain
chain <- as.data.frame(all_chains$theta_chain)
chain1 <- chain[chain$chain_no == 1,]
inf_chain1 <- inf_chain[inf_chain$chain_no == 1,]

titre_preds <- get_titre_predictions(chain = chain1, 
                                     infection_histories = inf_chain1, 
                                     titre_dat = titre_dat, 
                                     individuals = unique(titre_dat$individual),
                                     antigenic_map = antigenic_map, 
                                     par_tab = par_tab,expand_titredat=FALSE)
to_use <- titre_preds$predictions
print(head(to_use))

## Using ggplot
## Shaded regions show 95% and 50% credible intervals, 
## line shows posterior median.
## Each subplot shows one individual
titre_pred_p <- ggplot(to_use[to_use$individual %in% 1:9,])+
  geom_ribbon(aes(x=virus,ymin=lower, ymax=upper),fill="gray90")+
  geom_ribbon(aes(x=virus,ymin=lower_50, ymax=upper_50),fill="gray70")+
  geom_line(aes(x=virus, y=median))+
  geom_point(aes(x=virus, y=titre))+
  coord_cartesian(ylim=c(0,8))+
  ylab("log titre") +
  xlab("Time of virus circulation") +
  theme_classic() +
  facet_wrap(~individual)
titre_pred_p
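
The credible intervals plotted above are quantiles of posterior draws. The sketch below shows the calculation on toy draws in base R; the draws are simulated for illustration and are not output from get_titre_predictions, although the summary column names mirror those used in the plotting code.

```r
set.seed(1)
## Toy example: 500 posterior draws of predicted log titre against 3 viruses
draws <- cbind(rnorm(500, 3, 0.5), rnorm(500, 2, 0.5), rnorm(500, 1, 0.5))

## One row of quantiles per virus: 95% interval, 50% interval and median
summ <- t(apply(draws, 2, quantile, probs = c(0.025, 0.25, 0.5, 0.75, 0.975)))
colnames(summ) <- c("lower", "lower_50", "median", "upper_50", "upper")
summ <- data.frame(virus = c(1968, 1987, 2005), summ)
print(summ)
```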

Further analyses

Figures in the main text can be readily generated from the MCMC output from above. The source code to generate these figures has been hidden, but can be found in the original .Rmd file for this vignette.

First, we are interested in calculating the number of infections experienced by individuals over time as a function of their age. We see that individuals are infected less frequently as they become older.

inf_chain <- all_chains$inf_chain

## The MCMC framework saves only present infections (ie. entries for the infection 
## history matrix of 1) to save space ie. sparse format. The output chain
## should therefore be filled with the missing 0s before extensive analysis 
## to avoid bias.
inf_chain <- pad_inf_chain(inf_chain)

## Using data tables to get total number of infections
## per individual
data.table::setkey(inf_chain, "i", "samp_no","chain_no")
n_inf_chain_i <- inf_chain[, list(V1 = sum(x)), by = key(inf_chain)]
setkey(n_inf_chain_i, "i")
n_inf_chain <- n_inf_chain_i[,list(median_infs=median(V1)), 
                             by=key(n_inf_chain_i)]
colnames(n_inf_chain)[1] <- "individual"
setkey(n_inf_chain, "individual")

## Merge with titre data to recover individual 
## id
titre_dat1 <- data.table(titre_dat)
setkey(titre_dat1, "individual")
titre_dat1 <- merge(n_inf_chain, titre_dat1)

## Split individuals into age groups and plot summaries
titre_dat1$age <- sample_year - titre_dat1$DOB
titre_dat1$infs_per_life <- titre_dat1$median_infs/titre_dat1$age
titre_dat1$age_group <- cut(titre_dat1$age, breaks = c(0,19,40,65,90))

age_dist <- ggplot(titre_dat1) + 
  geom_boxplot(aes(group=age_group,y=infs_per_life*10,x=age_group)) +
  theme_classic() +
  ylab("Median number of infections\n per 10 years alive") +
  xlab("Age group")
age_dist
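
The need to pad the sparse infection history chain with zeros before summarising can be shown with a toy example. This is a base R sketch using the same column names (samp_no, i, j, x) as the chain above; it illustrates the idea only and is not how pad_inf_chain is implemented.

```r
## Sparse chain: only infections (x = 1) are stored
## samp_no = MCMC sample, i = individual, j = time index
sparse <- data.frame(samp_no = c(1, 1, 2), i = c(1, 2, 1), j = c(3, 1, 3), x = 1)

## Every sample/individual/time combination that should exist
full <- expand.grid(samp_no = 1:2, i = 1:2, j = 1:3)

## Left join and fill the missing entries with 0
padded <- merge(full, sparse, by = c("samp_no", "i", "j"), all.x = TRUE)
padded$x[is.na(padded$x)] <- 0

## Total infections per individual per sample, now including zero totals
totals <- aggregate(x ~ samp_no + i, data = padded, FUN = sum)
print(totals)
```

Without padding, individual 2 would have no entry for sample 2, so a median over samples would ignore that zero and be biased upwards.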

Given the sparsity of data here, the default attack rate plot is difficult to interpret. Below is an alternative visualisation of the attack rate, with the 95% and 50% credible intervals shown in red, the posterior median shown in black and the posterior maximum likelihood estimate shown as a dashed green line.

## Find samples that were in both theta and inf hist chains
chain <- as.data.frame(all_chains$theta_chain)
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]

## Take subset of chain for computational speed, as do not need all samples
samps <- unique(inf_chain[,c("samp_no","chain_no")])
n_samps <- sample(1:nrow(samps), 100)
samps <- samps[n_samps,]
## Append the MLE sample last, so that after re-indexing below
## it has the largest samp_no
samps <- rbind(samps, which_mle)

## Create new index variables for simplicity
samps$samp_no1 <- 1:nrow(samps)
samps$chain_no1 <- 1

## Inner join to return only our subset of samples
## Reformat samp_no and chain_no identifiers so that code
## sees samples as coming from one chain
inf_chain <- merge(inf_chain, samps, by=c("samp_no","chain_no"))
inf_chain <- inf_chain[,c("samp_no1","chain_no1","i","j","x")]
colnames(inf_chain)[1:2] <- c("samp_no","chain_no")
inf_chain <- pad_inf_chain(inf_chain)
## Rename columns to be more informative

## Column names expected by code below
colnames(inf_chain) <- c("samp_no","chain_no","individual","year","infected","group")

## Data on which strains belong to which cluster
cluster_path <- system.file("extdata", "fonville_clusters.csv", package = "serosolver")
clusters <- read.csv(file = cluster_path, stringsAsFactors=FALSE)
clusters <- clusters[clusters$year <= sample_year,]

## j=1 corresponds to the year 1968
inf_chain$year <- inf_chain$year + 1967

## Merge cluster data and infection history data
inf_chain <- merge(inf_chain, clusters[,c("year","cluster1")],by="year")

## Calculate ages and age groups of all individuals
titre_dat$age <- max(strain_isolation_times) - titre_dat$DOB
titre_dat$age_group <- cut(titre_dat$age,breaks=c(0,20,100),include.lowest=TRUE)
ages <- unique(titre_dat[,c("individual","age_group","DOB","age")])

## Merge infection histories with individual data
inf_chain<- merge(inf_chain, data.table(ages), by=c("individual"))

## Alive status for each individual for each time,
## only interested in individuals that were alive 
## when a virus circulated
inf_chain$alive <- inf_chain$DOB <= inf_chain$year
inf_chain <- inf_chain[inf_chain$alive,]

## Find out number of infections per year
inf_chain$potential_infection <- 1
setkey(inf_chain, "samp_no","chain_no","year")
inf_chain_ar <- inf_chain[,list(no_infected=sum(infected),
                                potential_infection=sum(potential_infection)),
                         by=key(inf_chain)]

## Calculate posterior median and credible intervals on the per year attack rates
setkey(inf_chain_ar, "year")
y <- inf_chain_ar[,list(median_ar=median(no_infected/potential_infection),
             lower_quantile=quantile(no_infected/potential_infection,0.025),
             upper_quantile=quantile(no_infected/potential_infection,0.975),
             lower_quantile_50=quantile(no_infected/potential_infection,0.25),
             upper_quantile_50=quantile(no_infected/potential_infection,0.75)),
             by=key(inf_chain_ar)]

## Extract the MLE attack rate estimates
inf_chain_mle <- inf_chain_ar[inf_chain_ar$samp_no == max(inf_chain_ar$samp_no),]

p_ar <- ggplot(data=y) + 
  geom_ribbon(aes(x=year, ymin=lower_quantile,ymax=upper_quantile), fill=viridis(8)[3],alpha=0.2) +
  geom_ribbon(aes(x=year, ymin=lower_quantile_50,ymax=upper_quantile_50), fill=viridis(8)[3],alpha=0.5) +
  geom_line(aes(x=year,y=median_ar)) +
  geom_line(data=inf_chain_mle, aes(x=year, y=no_infected/potential_infection), linetype="dashed", size=0.8,col="forestgreen")+
  scale_x_continuous(limits=c(1968,sample_year),expand=c(0,0))+
  scale_y_continuous(limits=c(0,1),expand=c(0,0)) +
  ylab("Attack rate") +
  xlab("Year of circulation") +
  theme_classic()+
  theme(panel.grid=element_blank())

print(p_ar)

Finally, inferring individual infection histories allows us to investigate age-specific patterns of incidence. Here, we show the proportion of individuals that were infected at least once within a single antigenic cluster, finding that clusters that circulate for longer tend to infect a far higher proportion of the population. Furthermore, we see that a far higher proportion of the younger age group is infected in more recent years.

## Find out number of infections per cluster for each individual
setkey(inf_chain, "individual","samp_no","chain_no","cluster1","age_group")
inf_chain_cluster <- inf_chain[,list(infected1=sum(infected), 
                               potential_infection=sum(potential_infection)),
                         by=key(inf_chain)]
inf_chain_cluster_once <- inf_chain_cluster
inf_chain_cluster_once[inf_chain_cluster_once$infected1 >=1,"infected1"] <- 1
inf_chain_cluster_once[inf_chain_cluster_once$potential_infection >=1,"potential_infection"] <- 1

setkey(inf_chain_cluster_once, "samp_no","chain_no","cluster1","age_group")
inf_chain_cluster_once <- inf_chain_cluster_once[,list(total_infected=sum(infected1), 
                               total_potential_infection=sum(potential_infection)),
                         by=key(inf_chain_cluster_once)]


## For which years was each cluster circulating?
year_ranges <- ddply(clusters, ~cluster1, function(x){
  c(min(x$year), max(x$year))
})
year_ranges$width <- year_ranges$V2 - year_ranges$V1 + 1
colnames(year_ranges) <- c("cluster1","start_year","end_year","width")

## We want to know number of people alive per cluster per age group
n_alive_clusters <- ddply(year_ranges, ~cluster1, function(x){
  ddply(ages, ~age_group, function(y){
    nrow(y[y$DOB <= x$end_year,])
  })
})
colnames(n_alive_clusters)[3] <- "n_alive"
n_alive_clusters <- merge(n_alive_clusters, year_ranges)
y <- merge(inf_chain_cluster_once, data.table(n_alive_clusters), by=c("age_group","cluster1"))

y$cluster1 <- as.factor(y$cluster1)
colnames(y)[which(colnames(y)=="width")] <- "Years of circulation"

p_once <- ggplot(y) + 
  geom_violin(aes(x=cluster1,y=total_infected/n_alive,
                   fill=`Years of circulation`, group=cluster1),
              adjust=1.2, 
              draw_quantiles=c(0.025,0.5,0.975), scale="width"
              ) + 
  scale_fill_gradient2(low="blue",high=viridis(8)[3]) +
  ylab("Proportion of individuals\n infected at least once") +
  xlab("Cluster index") +
  theme_classic() +
  theme(legend.position = "bottom",
        panel.grid=element_blank()) +
  facet_wrap(~age_group, ncol=1)

print(p_once)

Simulation recovery

We finish the vignette by presenting a simulation-recovery experiment to test the ability of the framework to recover known infection histories and antibody kinetics parameters using simulated data that matches the real dataset.

Extract MLE parameters from fits

We simulate infection histories and antibody titre data based on the "real" parameters inferred from fitting the model above. First, we extract the maximum posterior probability antibody kinetics parameters and attack rates.

## Read in MCMC chains from fitting
#all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
#                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=FALSE)

## Alternatively, load the included MCMC chains rather than re-running
data(cs2_chains_real_b)
all_chains <- cs2_chains_real_b

## Find samples that were in both theta and inf hist chains
chain <- all_chains$theta_chain
inf_chain <- all_chains$inf_chain
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]
mle_theta_pars <- chain[chain$samp_no == which_mle$samp_no & chain$chain_no == which_mle$chain_no,]

## Store total infections to compare later
mle_total_infs <- mle_theta_pars[,"total_infections"]
mle_theta_pars <- mle_theta_pars[,par_tab$names]
mle_inf_hist <- inf_chain[inf_chain$samp_no == which_mle$samp_no & inf_chain$chain_no == which_mle$chain_no,]

## Generate full infection history matrix using provided function
mle_inf_hist <- expand_summary_inf_chain(mle_inf_hist[,c("samp_no","j","i","x")])
## Find number of infections per year from this infection history
no_infs <- colSums(mle_inf_hist[,3:ncol(mle_inf_hist)])

## If missing time points in simulated attack rates
if(length(no_infs) < length(strain_isolation_times)){
  diff_lengths <- length(strain_isolation_times) - length(no_infs)
  no_infs <- c(no_infs, rep(0, diff_lengths))
}

## Find attack rate per year
n_alive <- get_n_alive(titre_dat, strain_isolation_times)
attack_rates <- no_infs/n_alive

Functions are provided to simulate antibody titre data under a given serosurvey design. The antibody kinetics parameters and attack rates estimated above are used to simulate titres from the model. The simulate_data function is well documented, and users should refer to the help file to customise the simulated serosurvey design.

set.seed(1234)

sim_par_tab <- par_tab
sim_par_tab$values <- as.numeric(mle_theta_pars)
sim_par_tab[sim_par_tab$names %in% c("alpha","beta"),"values"] <- c(1/3,1/3)

age_min <- 2009 - max(titre_dat$DOB)
age_max <- 2009 - min(titre_dat$DOB)
n_indiv <- length(unique(titre_dat$individual))
dat <- simulate_data(par_tab=sim_par_tab,
                     n_indiv=n_indiv, 
                     buckets=resolution, 
                     strain_isolation_times=strain_isolation_times,
                     sampling_times=2009, 
                     nsamps=1, 
                     antigenic_map=antigenic_map, 
                     age_min=age_min,
                     age_max=age_max,
                     attack_rates=attack_rates,
                     repeats=1)


## Inspect simulated antibody titre data and infection histories
sim_titre_dat <- dat[["data"]]
sim_infection_histories <- dat[["infection_histories"]]

## Store total infections to compare later
actual_total_infections <- sum(sim_infection_histories)

plot_data(sim_titre_dat, sim_infection_histories, 
          strain_isolation_times,n_indivs = 5)

## Use titres only against same viruses tested in real data
viruses <- unique(titre_dat$virus)
sim_titre_dat <- sim_titre_dat[sim_titre_dat$virus %in% viruses, ]
sim_ages <- dat[["ages"]]
sim_titre_dat <- merge(sim_titre_dat, sim_ages)
sim_ar <- dat[["attack_rates"]]
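
The core of simulating infection histories from per-year attack rates can be sketched in a few lines of base R. This is an illustration under simple assumptions (independent Bernoulli infections each year, no infections before birth) with made-up inputs; simulate_data handles the full serosurvey design and titre model, and its internals may differ.

```r
set.seed(1)
## Toy inputs: possible infection years, a constant attack rate and three DOBs
years <- 2000:2009
attack_rates <- rep(0.15, length(years))
dob <- c(1995, 2003, 2008)

## One row per individual: independent Bernoulli draw for each year,
## with years before birth forced to 0
inf_hist <- t(sapply(dob, function(b) {
  infected <- rbinom(length(years), 1, attack_rates)
  infected[years < b] <- 0
  infected
}))
colnames(inf_hist) <- years
print(inf_hist)
```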

Simulation fitting

Once these simulated data have been generated, the workflow is exactly the same as with the real data above.

filename <- "case_study_2_sim"

## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)

## Create the posterior solving function that will be used in the MCMC framework 
model_func <- create_posterior_func(par_tab=sim_par_tab,
                                titre_dat=sim_titre_dat,
                                antigenic_map=antigenic_map,
                                version=prior_version) # function in posteriors.R

## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generate starting values for theta
    start_tab <- generate_start_tab(par_tab)
    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(sim_titre_dat, strain_isolation_times, 
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }

  res <- serosolver(par_tab = start_tab, 
                  titre_dat = sim_titre_dat,
                  antigenic_map = antigenic_map,
                  start_inf_hist = start_inf, 
                  mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                "thin_inf_hist"=1000,"save_block"=1000,
                                "proposal_inf_hist_group_swap_ratio"=0.8, "proposal_inf_hist_indiv_prop"=1),
                  filename = paste0(chain_path_sim,x), 
                  CREATE_POSTERIOR_FUNC = create_posterior_func, 
                  version = prior_version)
}
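The starting-value loop above is a simple rejection scheme: draw random starting conditions, evaluate the posterior, and redraw until the result is finite. A minimal base R sketch of the same pattern on a toy model, assuming a normal likelihood that returns `-Inf` for invalid parameters (`toy_log_post` and `draw_start` are hypothetical helpers, not part of serosolver):

```r
set.seed(42)

toy_obs <- rnorm(50, mean = 2, sd = 1)

## Toy log-posterior: -Inf whenever sigma is outside its support
toy_log_post <- function(pars) {
  if (pars[["sigma"]] <= 0) return(-Inf)
  sum(dnorm(toy_obs, mean = pars[["mu"]], sd = pars[["sigma"]], log = TRUE))
}

## Random starting values; sigma is deliberately allowed to be invalid sometimes
draw_start <- function() c(mu = runif(1, -5, 5), sigma = runif(1, -1, 3))

start_prob <- -Inf
n_tries <- 0
while (!is.finite(start_prob)) {
  start_pars <- draw_start()
  start_prob <- toy_log_post(start_pars)
  n_tries <- n_tries + 1
}

print(start_pars)
print(n_tries)
```

Drawing the starting point independently inside each worker also means the chains begin from different positions, which is what makes the between-chain convergence diagnostics informative.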

Simulation analysis

MCMC chains should be checked for convergence using the usual diagnostics. We also compare the inferred posterior distributions to the known true parameter values. We see that convergence and between-chain agreement are good and that the model recovers reasonably unbiased estimates for some parameters. However, under this sampling strategy the model slightly underestimates the amount of long-term antibody boosting elicited by a single infection and overestimates the total number of infections. This is driven by the contribution of the attack rate prior relative to the contribution of the likelihood (the data). Increasing the number of measured titres (for example, measuring titres against 40 viruses rather than 9) or using a more informative attack rate prior would help reduce this bias.

## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
#sim_all_chains <- load_mcmc_chains(location=chain_path_sim,thin=1,burnin=100000,
#                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)

## Alternatively, load the included MCMC chains rather than re-running
data(cs2_chains_sim)
sim_all_chains <- cs2_chains_sim

theta_chain <- sim_all_chains$theta_chain
## Get the MCMC chains as a list
list_chains <- sim_all_chains$theta_list_chains
## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) as.mcmc(x[,c("mu","sigma1","error",
                                                     "tau","total_infections",
                                                     "lnlike","prior_prob")]))

## Gelman-Rubin diagnostics and effective sample size
print(gelman.diag(as.mcmc.list(list_chains1)))
print(effectiveSize(as.mcmc.list(list_chains1)))

melted_theta_chain <- reshape2::melt(as.data.frame(theta_chain), id.vars=c("samp_no","chain_no"))
estimated_pars <- c(sim_par_tab[sim_par_tab$fixed == 0,"names"],"total_infections")
melted_theta_chain <- melted_theta_chain[melted_theta_chain$variable %in% estimated_pars,]
colnames(melted_theta_chain)[3] <- "names"

add_row <- data.frame("total_infections",actual_total_infections,0,0.1,0,10000,0,0,1)
colnames(add_row) <- colnames(sim_par_tab)
sim_par_tab1 <- rbind(sim_par_tab, add_row)

ggplot(melted_theta_chain) + 
  geom_density(aes(x=value,fill=as.factor(chain_no)),alpha=0.5) +
  geom_vline(data=sim_par_tab1[sim_par_tab1$fixed == 0,],aes(xintercept=values),linetype="dashed") +
  facet_wrap(~names,scales="free") + 
  theme_classic() +
  theme(legend.position="bottom")
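To make the Gelman-Rubin output from `gelman.diag` easier to interpret, the basic potential scale reduction factor can be computed by hand. The sketch below uses base R and toy chains, and assumes the textbook formula that compares between-chain and within-chain variance. coda applies an additional degrees-of-freedom correction, so its values will differ slightly. The function `rhat` is a hypothetical helper.

```r
set.seed(123)

n_iter <- 1000
n_chains <- 5

## Well-mixed toy chains: every column is an independent draw from the same target
chains <- replicate(n_chains, rnorm(n_iter, mean = 2, sd = 0.5))

## Basic potential scale reduction factor for a matrix with one column per chain
rhat <- function(x) {
  n <- nrow(x)
  W <- mean(apply(x, 2, var))        # mean within-chain variance
  B <- n * var(colMeans(x))          # between-chain variance
  var_hat <- (n - 1) / n * W + B / n # pooled estimate of the posterior variance
  sqrt(var_hat / W)
}

rhat_good <- rhat(chains)

## Shift one chain to mimic a chain stuck in a different region
bad_chains <- chains
bad_chains[, 1] <- bad_chains[, 1] + 2
rhat_bad <- rhat(bad_chains)

print(c(good = rhat_good, bad = rhat_bad))
```

Values close to 1 indicate that the chains are sampling the same distribution, while values well above 1 indicate that at least one chain disagrees with the others.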

Recovery of known attack rates is also reasonably accurate, though the posterior distributions are only weakly constrained for the many years where identifiability is poor. Again, more titre data or more individuals would improve inferential power. One particularly reassuring plot is the comparison of known individual cumulative infection histories (the cumulative sum of infections over time for an individual) against the estimated posterior distribution of cumulative infection histories. We see that the 95% credible intervals capture the true cumulative infection histories in almost all cases.

## Extract infection history chain
inf_chain <- sim_all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, sim_titre_dat, strain_isolation_times, pad_chain=FALSE,
                            plot_den = TRUE,prior_pars=list(prior_version=prior_version, 
                                                          alpha=par_tab[par_tab$names=="alpha","values"],
                                                          beta=par_tab[par_tab$names=="beta","values"]))  +
  geom_point(data=sim_ar,aes(x=year,y=AR),col="purple")
print(p_ar)

## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
sim_n_alive <- get_n_alive_group(sim_titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain, 
                                      years=strain_isolation_times, 
                                      n_alive=sim_n_alive)

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
print(ps_infhist[["by_time_trace"]][[1]])
## MCMC trace plots of total number of infections per individual
print(ps_infhist[["by_indiv_trace"]][[1]])

## Check for agreement between inferred cumulative infection histories 
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                   real_inf_hist=sim_infection_histories,
                                                  strain_isolation_times = strain_isolation_times,
                                                  number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])
## Posterior probability that infections occurred at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])
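The comparison of true and inferred cumulative infection histories can also be summarised numerically as credible interval coverage. The following is a minimal base R sketch for a single toy individual. It assumes posterior draws are available as a binary matrix with one row per MCMC sample and one column per time point, which is not the format of `inf_chain`. All objects here are hypothetical, and the toy "posterior" is just the true history with random entries flipped.

```r
set.seed(2009)

n_times <- 20
n_samp <- 500

## Hypothetical true infection history for one individual
true_hist <- rbinom(n_times, 1, 0.2)

## Toy posterior draws: each entry of the true history is flipped with probability 0.05
flips <- matrix(rbinom(n_samp * n_times, 1, 0.05), nrow = n_samp)
post_hist <- abs(sweep(flips, 2, true_hist, FUN = "-"))

## Cumulative infections over time for each posterior draw (rows = samples)
cum_post <- t(apply(post_hist, 1, cumsum))

## Pointwise median and 95% credible interval at each time point
ci <- apply(cum_post, 2, quantile, probs = c(0.025, 0.5, 0.975))

## Proportion of time points where the interval contains the true cumulative count
true_cum <- cumsum(true_hist)
covered <- true_cum >= ci[1, ] & true_cum <= ci[3, ]
coverage <- mean(covered)
print(coverage)
```

Repeating this over all individuals gives a single coverage figure to report alongside the cumulative infection history plots.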

seroanalytics/serosolver documentation built on April 12, 2025, 7:49 p.m.
