knitr::opts_chunk$set( collapse = TRUE, echo=TRUE, comment = "#>", eval=TRUE )
This vignette provides all of the analysis for case study 2 in the accompanying package and paper. Briefly, the aim is to infer antibody kinetics and historical attack rates using cross-sectional haemagglutination inhibition titre data on a panel of recent and historical A/H3N2 strains. All of the functions used here are documented and have many tunable arguments, so we encourage users to refer to the help files.
This vignette demonstrates only how to reproduce the MCMC chains, simulate data, assess model fits and assess chain convergence. Code to reproduce figures from the main text in the accompanying paper can be found in the `inst/extdata/scripts` folder of the package.
`serosolver` may be installed from GitHub using the `devtools` package. There are a number of additional packages that we need for this analysis.
# Required to run serosolver
devtools::install_github("seroanalytics/serosolver")
library(serosolver)
library(plyr)
library(data.table)

## Required for this analysis
library(reshape2)
library(foreach)
library(doParallel)
library(bayesplot)
library(coda)
library(ggplot2)
library(viridis)

# set up cluster
set.seed(1234)
cl <- makeCluster(5)
registerDoParallel(cl)

## Note that this vignette was generated on a Windows machine,
## and the setup for parallelisation is different on a Linux machine
## for Linux machine:
# library(doMC)
# library(doRNG)
# registerDoMC(cores=5)
In this analysis, all serological samples were taken in 2009, so all time variables are relative to this year. We are interested in inferring infections and attack rates at an annual resolution, and therefore set `resolution` to 1. Our primary outcome of interest is to infer unbiased historical attack rates, and we therefore use the version of the code with a Beta prior on per-time attack rates, `prior_version=2`. Furthermore, we have found that when the number of possible infections to infer is large but the data are relatively sparse, identifiability is poor under reference priors (e.g. a uniform Beta or Jeffreys prior). We instead opted to use a weakly informative prior for annual attack rates with mode 0.15 but high variance, corresponding to prior observations of annual influenza attack rates. We set these parameters at the start of the analysis.
filename <- "case_study_2"
resolution <- 1 ## eg. this would be set to 12 for monthly resolution
sample_year <- 2009
serosolver::describe_priors()
prior_version <- 2
The data used in this analysis are haemagglutination inhibition (HI) titres against a number of A/H3N2 strains that have circulated since 1968. The raw data are in a wide format, providing the highest two-fold dilution of serum at which haemagglutination is inhibited. The first step of the analysis is therefore to clean the titre data and convert the data frame to the long format, as described in the quickstart vignette.
## Read in data
raw_dat_path <- system.file("extdata", "Fluscape_HI_data.csv", package = "serosolver")
raw_dat <- read.csv(file = raw_dat_path, stringsAsFactors = FALSE)
print(head(raw_dat))

## Add indexing column for each individual
raw_dat$individual <- 1:nrow(raw_dat)

## Convert data to long format
melted_dat <- reshape2::melt(raw_dat, id.vars=c("individual","Age"),stringsAsFactors=FALSE)

## Modify column names to meet serosolver's expectations
colnames(melted_dat) <- c("individual","DOB","virus","titre")
melted_dat$virus <- as.character(melted_dat$virus)

## Extract circulation years for each virus code, which will be used
## by serosolver as the circulation time
melted_dat$virus <- as.numeric(sapply(melted_dat$virus, function(x) strsplit(x,split = "HI.H3N2.")[[1]][2]))

## Clean and log transform the data
melted_dat <- melted_dat[complete.cases(melted_dat),]
melted_dat[melted_dat$titre == 0,"titre"] <- 5
melted_dat$titre <- log2(melted_dat$titre/5)

## Convert ages to DOB
melted_dat$DOB <- sample_year - melted_dat$DOB

## All samples taken at the same time
melted_dat$samples <- sample_year

## Add column for titre repeats, enumerating for each measurement for the same virus/sample/individual
melted_dat <- plyr::ddply(melted_dat,.(individual,virus,samples),function(x) cbind(x,"run"=1:nrow(x),"group"=1))

## Rename to data expected by serosolver
titre_dat <- melted_dat
print(head(titre_dat))
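The log transformation used above can be illustrated on a handful of made-up titres. This is a minimal sketch in base R; the helper `to_log_titre` and the input values are hypothetical and are not part of `serosolver`.

```r
# Hypothetical raw HI titres; 0 denotes an undetectable titre
raw_titres <- c(0, 10, 20, 40, 1280)

# Treat undetectable titres as 5, then convert to the log2 scale
# so that each unit corresponds to one two-fold dilution step
to_log_titre <- function(titre) {
  titre[titre == 0] <- 5
  log2(titre / 5)
}

log_titres <- to_log_titre(raw_titres)
print(log_titres)
```

On this scale a raw titre of 1280 maps to 8, which matches the maximum recordable log titre set later in the parameter table.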
Given that this analysis uses titres from multiple, antigenically related viruses, it is necessary to define an antigenic map describing the antigenic distance between all of the viruses here. We use coordinates based on the antigenic map created by Fonville et al. Generating the antigenic map involves fitting a smoothing spline through the provided coordinates to give a representative virus for each time point (in this case, each year) at which an individual could be infected. This process also imputes antigenic coordinates for time points for which we do not have a measured virus.
## Read in raw coordinates
antigenic_coords_path <- system.file("extdata", "fonville_map_approx.csv", package = "serosolver")
antigenic_coords <- read.csv(file = antigenic_coords_path, stringsAsFactors=FALSE)
print(head(antigenic_coords))

## Convert to form expected by serosolver
antigenic_map <- generate_antigenic_map(antigenic_coords, resolution)
print(head(antigenic_map))

## More flexible version of the above function
virus_key <- c(
  "HK68" = 1968, "EN72" = 1972, "VI75" = 1975, "TX77" = 1977, "BK79" = 1979,
  "SI87" = 1987, "BE89" = 1989, "BJ89" = 1989, "BE92" = 1992, "WU95" = 1995,
  "SY97" = 1997, "FU02" = 2002, "CA04" = 2004, "WI05" = 2005, "PE06" = 2006
)
antigenic_coords$Strain <- virus_key[antigenic_coords$Strain]
antigenic_map <- generate_antigenic_map_flexible(antigenic_coords)

## Restrict entries to years of interest. Entries in antigenic_map determine
## the times that individual can be infected ie. the dimensions of the infection
## history matrix.
antigenic_map <- antigenic_map[antigenic_map$inf_times >= 1968 & antigenic_map$inf_times <= sample_year,]
strain_isolation_times <- unique(antigenic_map$inf_times)
NOTE: `generate_antigenic_map` expects the provided file `fonville_map_approx.csv`. Users should refer to `generate_antigenic_map_flexible` for more generic antigenic map generation.
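To make the idea of imputing coordinates for unmeasured years concrete, the sketch below fits one smoothing spline per coordinate against time using base R's `smooth.spline`. It is a simplified illustration only: the coordinates are invented, the column names `x_coord` and `y_coord` are assumptions, and the package functions may construct the spline differently.

```r
# Hypothetical antigenic coordinates for strains isolated in known years
coords <- data.frame(
  year    = c(1968, 1972, 1975, 1977, 1979, 1987),
  x_coord = c(0.0, 3.1, 5.2, 7.4, 8.9, 15.3),
  y_coord = c(0.0, 1.2, 1.9, 2.4, 3.8, 6.1)
)

# Every year in which an infection could occur, measured or not
years <- 1968:1987

# Fit one smoothing spline per coordinate as a function of time
fit_x <- smooth.spline(coords$year, coords$x_coord)
fit_y <- smooth.spline(coords$year, coords$y_coord)

# Predict a representative coordinate for each year
toy_map <- data.frame(
  x_coord   = predict(fit_x, years)$y,
  y_coord   = predict(fit_y, years)$y,
  inf_times = years
)
print(head(toy_map))

# Pairwise antigenic distances between the yearly representative viruses
toy_dist <- as.matrix(dist(toy_map[, c("x_coord", "y_coord")]))
```

Each row of `toy_map` then plays the role of one possible infection time, so the number of rows sets the number of columns in the infection history matrix.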
Finally, we must specify the `par_tab` data frame, which controls which parameters are included in the model, which are fixed, and their uniform prior ranges. Given that we are integrating out the probability of infection terms under prior version 2, we must remove these parameters from `par_tab`. Furthermore, given that we are interested in long-term dynamics with relatively sparse data, we remove parameters relating to the short-term antibody kinetics phase to avoid identifiability issues. We set alpha and beta of the Beta prior to give a mode of 0.15, assuming that our prior belief has the equivalent weighting to 4 observed individuals.
par_tab_path <- system.file("extdata", "par_tab_base.csv", package = "serosolver")
par_tab <- read.csv(file = par_tab_path, stringsAsFactors=FALSE)

## Set parameters for Beta prior on infection histories
beta_pars <- find_beta_prior_mode(0.15,4)
par_tab[par_tab$names == "alpha","values"] <- beta_pars$alpha
par_tab[par_tab$names == "beta","values"] <- beta_pars$beta

## Maximum recordable log titre in these data is 8
par_tab[par_tab$names == "MAX_TITRE","values"] <- 8

## Remove phi parameters, as these are integrated out under prior version 2
par_tab <- par_tab[par_tab$names != "phi",]

## Fix all short term parameters to 0
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"fixed"] <- 1 # mu_short, waning and sigma2 are fixed
par_tab[par_tab$names %in% c("mu_short","sigma2","wane"),"values"] <- 0 # set these values to 0
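One common way to turn a prior mode and an equivalent number of observations into Beta parameters is sketched below. The helper `beta_from_mode` is hypothetical, and it is an assumption that `find_beta_prior_mode` uses the same parameterisation; the sketch only shows the standard relationship mode = (alpha - 1) / (alpha + beta - 2) for alpha, beta > 1.

```r
# Hypothetical helper: Beta parameters from a mode and a concentration
# (concentration = alpha + beta, read as an equivalent prior sample size)
beta_from_mode <- function(mode, concentration) {
  stopifnot(mode > 0, mode < 1, concentration > 2)
  alpha <- mode * (concentration - 2) + 1
  beta <- (1 - mode) * (concentration - 2) + 1
  list(alpha = alpha, beta = beta)
}

pars <- beta_from_mode(mode = 0.15, concentration = 4)
print(pars)

# Check that the implied mode is recovered
implied_mode <- (pars$alpha - 1) / (pars$alpha + pars$beta - 2)
print(implied_mode)
```

A small concentration keeps the prior wide, so the data can still pull the attack rate estimates well away from 0.15.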
\begin{enumerate}
\item Choose resolution, attack rate priors and reference time for "the present"
\item Convert titre data to long format
\item Generate antigenic map for all viruses that an individual could be exposed to
\item Generate parameter control table for MCMC
\end{enumerate}
We are now ready to fit our model. We will fit multiple chains in parallel, though the analysis below could easily be replicated by running chains sequentially. Starting conditions that return a finite likelihood must be generated for each MCMC chain. The user may modify many of the MCMC control parameters, though the defaults are fine for most purposes. We have made some minor tweaks in this case study to improve convergence on infection history estimates. Step sizes for parameters in `par_tab` are tuned automatically, and some automated tuning of the infection history proposals takes place for prior version 3. However, for other attack rate priors, the user must do some manual tuning of: a) the number of individuals sampled at each step, `proposal_inf_hist_indiv_prop`; b) the number of time points sampled at each step, `proposal_inf_hist_time_prop`; c) the frequency of individual infection history swapping steps (i.e. for an individual, choose two time points and swap their contents), `proposal_inf_hist_indiv_swap_ratio`; d) the proportion of infection history sampling steps that should be the alternative swapping step, where the contents of infection histories at two time points are swapped, `proposal_inf_hist_group_swap_ratio`; e) the proportion of infection histories to swap with each alternative swapping step, `proposal_inf_hist_group_swap_prop`. For example, in this case study, attack rates are likely to be highly correlated in adjacent years (as we have limited data to distinguish between infections in years close in time), and we therefore increase the frequency of the alternative infection history swapping step with `proposal_inf_hist_group_swap_ratio`.
It is often desirable to change the number of iterations and the length of the adaptive period. More crucially, the amount of chain thinning should be specified so that users do not save a large number of MCMC iterations (as this will rapidly fill disk space!). Thinning should be set such that at least 1000 iterations are saved (i.e. `iterations`/`thin` and `iterations`/`thin_inf_hist`). Users are encouraged to pay extra attention to `thin_inf_hist`, which dictates the thinning of the infection history chain, and can generate a very large file if left unchecked.
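The thinning arithmetic can be checked before launching a long run. The helper `n_saved` below is hypothetical and only performs the division described above; how `serosolver` treats adaptive iterations when saving is not covered here.

```r
# Hypothetical helper: number of samples written to disk per chain
n_saved <- function(iterations, thin) floor(iterations / thin)

iterations <- 500000

# Samples kept per chain for two candidate thinning values
saved_coarse <- n_saved(iterations, thin = 1000)
saved_fine <- n_saved(iterations, thin = 100)
print(c(coarse = saved_coarse, fine = saved_fine))

# Largest thinning value that still keeps at least 1000 samples per chain
max_thin <- floor(iterations / 1000)
print(max_thin)
```

The same check applies to `thin_inf_hist`, where each saved sample is a whole infection history matrix and therefore far more expensive to store.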
## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)
chain_path <- sub("par_tab_base.csv","",par_tab_path)
chain_path_real <- paste0(chain_path, "cs2_real/")
chain_path_sim <- paste0(chain_path, "cs2_sim/")

## Create the posterior solving function that will be used in the MCMC framework
model_func <- create_posterior_func(par_tab=par_tab,
                                    titre_dat=titre_dat,
                                    antigenic_map=antigenic_map,
                                    version=prior_version) # function in posteriors.R
## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generating starting antibody kinetics parameters
    start_tab <- generate_start_tab(par_tab)

    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(titre_dat, strain_isolation_times,
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }

  res <- serosolver(par_tab = start_tab,
                    titre_dat = titre_dat,
                    antigenic_map = antigenic_map,
                    start_inf_hist = start_inf,
                    mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                  "thin_inf_hist"=1000,"save_block"=1000,
                                  "proposal_inf_hist_time_prop"=1,
                                  "proposal_inf_hist_indiv_prop"=1,
                                  "proposal_inf_hist_group_swap_ratio"=0.8,
                                  "proposal_inf_hist_group_swap_prop"=1),
                    filename = paste0(chain_path_real,x),
                    CREATE_POSTERIOR_FUNC = create_posterior_func,
                    version = prior_version)
}
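The retry loop used to find valid starting conditions follows a generic pattern, shown here in isolation with a toy likelihood. Both `toy_loglik` and `draw_start` are hypothetical stand-ins for the posterior function and the starting-value generators used above.

```r
set.seed(1)

# Toy log-likelihood: returns -Inf when sigma is not positive
toy_loglik <- function(pars) {
  if (pars["sigma"] <= 0) return(-Inf)
  sum(dnorm(c(1.2, 0.8, 1.1), mean = pars["mu"], sd = pars["sigma"], log = TRUE))
}

# Random starting values, some of which fall outside the valid region
draw_start <- function() c(mu = runif(1, 0, 4), sigma = runif(1, -1, 2))

# Keep drawing until the starting point has a finite log-likelihood
start_prob <- -Inf
n_tries <- 0
while (!is.finite(start_prob)) {
  start_pars <- draw_start()
  start_prob <- toy_loglik(start_pars)
  n_tries <- n_tries + 1
}
print(start_pars)
```

If such a loop runs for a very long time in practice, that usually signals that the random starting ranges are too wide for the model rather than bad luck.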
Once the MCMC chains are run, `serosolver` provides a number of simple functions to generate standard outputs and MCMC diagnostics. The saved MCMC chains are compatible with the `coda` and `bayesplot` packages, and users are encouraged to use these. First, read in the MCMC chains. The function below distinguishes between posterior samples for the infection history matrix and for the process parameters. It searches the specified directory for all files with the filenames generated by `serosolver`, and returns a list containing both the concatenated chains and the separate per-chain data structures.
## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
#all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
#                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)

## Alternative, load the included MCMC chains rather than re-running
data(cs2_chains_real)
all_chains <- cs2_chains_real
print(summary(all_chains))
Chains should then be checked for the usual MCMC diagnostics: $\hat{R}$ and effective sample size. First, looking at the antibody kinetics process parameters:
## Get the MCMC chains as a list
list_chains <- all_chains$theta_list_chains

## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) as.mcmc(x[,c("mu","sigma1","error",
                                                             "tau","total_infections",
                                                             "lnlike","prior_prob")]))

## Gelman-Rubin diagnostics to assess between-chain convergence for each parameter
print(gelman.diag(as.mcmc.list(list_chains1)))
gelman.plot(as.mcmc.list(list_chains1))

## Effective sample size for each parameter
print(effectiveSize(as.mcmc.list(list_chains1)))

## Posterior estimates for each parameter
print(summary(as.mcmc.list(list_chains1)))

## Plot the MCMC trace using the `bayesplot` package
color_scheme_set("viridis")
p_theta_trace <- mcmc_trace(list_chains1)
print(p_theta_trace)
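For intuition about what $\hat{R}$ measures, the sketch below computes the basic potential scale reduction factor by hand for simulated chains. The function `rhat` is a hypothetical teaching version; `coda::gelman.diag` applies additional corrections, so its values may differ slightly.

```r
# Basic potential scale reduction factor
# chains: matrix with one column per chain and one row per iteration
rhat <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  B <- n * var(colMeans(chains))        # between-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(1)

# Four chains sampling the same distribution
good <- matrix(rnorm(4000), ncol = 4)

# Four chains where two are stuck around a different value
bad <- good + matrix(rep(c(0, 0, 5, 5), each = 1000), ncol = 4)

print(c(good = rhat(good), bad = rhat(bad)))
```

Values close to 1 indicate that between-chain and within-chain variability agree, whereas values well above 1 indicate that the chains have not converged to the same distribution.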
and at the infection histories:
## Extract infection history chain
inf_chain <- all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, titre_dat, strain_isolation_times, pad_chain=TRUE,
                          plot_den = TRUE,
                          prior_pars=list(prior_version=prior_version,
                                          alpha=par_tab[par_tab$names=="alpha","values"],
                                          beta=par_tab[par_tab$names=="beta","values"]))
print(p_ar)

## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
n_alive <- get_n_alive_group(titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain,
                                      years=strain_isolation_times,
                                      n_alive=n_alive)

## Posterior mean, median, 95% credible intervals and effective sample size
## on per time attack rates
print(head(ps_infhist[["estimates"]]$by_year))

## Posterior mean, median, 95% credible intervals and effective sample size
## on per individual total number of infections
print(head(ps_infhist[["estimates"]]$by_indiv))

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
## Each subplot shows one year
print(ps_infhist[["by_time_trace"]][[1]])

## MCMC trace plots of total number of infections per individual
## Each subplot shows one individual
print(ps_infhist[["by_indiv_trace"]][[1]])

## Distribution of total number of infections
print(ps_infhist[["indiv_infections"]])

## Check for agreement between inferred cumulative infection histories
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                   strain_isolation_times = strain_isolation_times,
                                                   number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])

## Posterior probability that infections occurred at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])
Mixing can sometimes be very poor for per-time attack rates when adjacent times are highly correlated. This is often the case when the data are relatively sparse. A cruder time resolution (e.g. per two years) may be advisable, and mixing may benefit from increasing the `proposal_inf_hist_group_swap_ratio` and `proposal_inf_hist_indiv_prop` parameters in the `mcmc_pars` list in `serosolver`. `proposal_inf_hist_group_swap_ratio` determines how frequently the MCMC sampler uses a proposal step that swaps a proportion `proposal_inf_hist_group_swap_prop` of individuals' infection states between two time points.
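The swapping proposal can be pictured on a small binary infection history matrix. The function `swap_step` below is a hypothetical illustration of the move itself; it omits the accept/reject step of the sampler and is not the package's implementation.

```r
set.seed(1)

# Toy infection history matrix: rows are individuals, columns are time points
inf_hist <- matrix(rbinom(6 * 5, 1, 0.3), nrow = 6, ncol = 5)

# Swap the contents of two randomly chosen time points
# for a proportion `swap_prop` of individuals
swap_step <- function(inf_hist, swap_prop) {
  n_indiv <- nrow(inf_hist)
  times <- sample(ncol(inf_hist), 2)
  n_swap <- max(1, floor(swap_prop * n_indiv))
  indivs <- sample(n_indiv, n_swap)
  tmp <- inf_hist[indivs, times[1]]
  inf_hist[indivs, times[1]] <- inf_hist[indivs, times[2]]
  inf_hist[indivs, times[2]] <- tmp
  inf_hist
}

proposed <- swap_step(inf_hist, swap_prop = 0.5)
print(proposed)
```

Because the move only relocates infections between two time points, each individual's total number of infections is unchanged, which is why it helps when the data cannot distinguish between adjacent years.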
Users may also easily check the inferred antibody landscapes at the time each sample was taken. Black dots show observations, shaded regions show the 95% and 50% credible intervals, and the black line shows the posterior median.
## get_titre_predictions expects only a single MCMC chain, so
## subset for only one chain
chain <- as.data.frame(all_chains$theta_chain)
chain1 <- chain[chain$chain_no == 1,]
inf_chain1 <- inf_chain[inf_chain$chain_no == 1,]
titre_preds <- get_titre_predictions(chain = chain1,
                                     infection_histories = inf_chain1,
                                     titre_dat = titre_dat,
                                     individuals = unique(titre_dat$individual),
                                     antigenic_map = antigenic_map,
                                     par_tab = par_tab,expand_titredat=FALSE)
to_use <- titre_preds$predictions
print(head(to_use))

## Using ggplot
## Shaded regions show 95% and 50% credible intervals,
## line shows posterior median.
## Each subplot shows one individual
titre_pred_p <- ggplot(to_use[to_use$individual %in% 1:9,])+
  geom_ribbon(aes(x=virus,ymin=lower, ymax=upper),fill="gray90")+
  geom_ribbon(aes(x=virus,ymin=lower_50, ymax=upper_50),fill="gray70")+
  geom_line(aes(x=virus, y=median))+
  geom_point(aes(x=virus, y=titre))+
  coord_cartesian(ylim=c(0,8))+
  ylab("log titre") +
  xlab("Time of virus circulation") +
  theme_classic() +
  facet_wrap(~individual)
titre_pred_p
Figures in the main text can be readily generated from the MCMC output from above. The source code to generate these figures has been hidden, but can be found in the original .Rmd file for this vignette.
First, we are interested in calculating the number of infections experienced by individuals over time as a function of their age. We see that individuals are infected less frequently as they become older.
inf_chain <- all_chains$inf_chain

## The MCMC framework saves only present infections (ie. entries for the infection
## history matrix of 1) to save space ie. sparse format. The output chain
## should therefore be filled with the missing 0s before extensive analysis
## to avoid bias.
inf_chain <- pad_inf_chain(inf_chain)

## Using data tables to get total number of infections
## per individual
data.table::setkey(inf_chain, "i", "samp_no","chain_no")
n_inf_chain_i <- inf_chain[, list(V1 = sum(x)), by = key(inf_chain)]
setkey(n_inf_chain_i, "i")
n_inf_chain <- n_inf_chain_i[,list(median_infs=median(V1)), by=key(n_inf_chain_i)]
colnames(n_inf_chain)[1] <- "individual"
setkey(n_inf_chain, "individual")

## Merge with titre data to recover individual
## id
titre_dat1 <- data.table(titre_dat)
setkey(titre_dat1, "individual")
titre_dat1 <- merge(n_inf_chain, titre_dat1)

## Split individuals into age groups and plot summaries
titre_dat1$age <- sample_year - titre_dat1$DOB
titre_dat1$infs_per_life <- titre_dat1$median_infs/titre_dat1$age
titre_dat1$age_group <- cut(titre_dat1$age, breaks = c(0,19,40,65,90))
age_dist <- ggplot(titre_dat1) +
  geom_boxplot(aes(group=age_group,y=infs_per_life*10,x=age_group)) +
  theme_classic() +
  ylab("Median number of infections\n per 10 years alive") +
  xlab("Age group")
age_dist
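The need to pad the sparse infection history chain with zeros can be seen on a tiny example. This base R sketch uses a made-up sparse table with the column names `i`, `j` and `x` from the chain above; it illustrates the idea and is not how `pad_inf_chain` is implemented.

```r
# Hypothetical sparse storage: only entries equal to 1 are saved
# i = individual, j = time index, x = infection state
sparse <- data.frame(i = c(1, 1, 2), j = c(2, 4, 1), x = 1)

# All individual-by-time combinations, including an individual
# (i = 3) who has no recorded infections at all
full_grid <- expand.grid(i = 1:3, j = 1:4)

# Left join onto the full grid and fill the gaps with zeros
padded <- merge(full_grid, sparse, by = c("i", "j"), all.x = TRUE)
padded$x[is.na(padded$x)] <- 0

# Infections per individual now includes the zero for individual 3
infs_per_indiv <- tapply(padded$x, padded$i, sum)
print(infs_per_indiv)
```

Without padding, individual 3 would be missing from any per-individual summary, biasing estimates such as the median number of infections upwards.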
Given the sparsity of data here, the default attack rate plot is difficult to interpret. Below is an alternative visualisation of the attack rate, with the 95% and 50% credible intervals shown in red, the posterior median shown in black and the posterior maximum likelihood estimate shown as a dashed green line.
## Find samples that were in both theta and inf hist chains
chain <- as.data.frame(all_chains$theta_chain)
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]

## Take subset of chain for computational speed, as do not need all samples
samps <- unique(inf_chain[,c("samp_no","chain_no")])
n_samps <- sample(1:nrow(samps), 100)
samps <- samps[n_samps,]
samps <- rbind(samps, which_mle) ## Plus MLE estimate

## Append the MLE estimate, note that this is max(samp_no)
## Create new index variables for simplicity
samps$samp_no1 <- 1:nrow(samps)
samps$chain_no1 <- 1

## Inner join to return only our subset of samples
## Reformat samp_no and chain_no identifiers so that code
## sees samples as coming from one chain
inf_chain <- merge(inf_chain, samps, by=c("samp_no","chain_no"))
inf_chain <- inf_chain[,c("samp_no1","chain_no1","i","j","x")]
colnames(inf_chain)[1:2] <- c("samp_no","chain_no")
inf_chain <- pad_inf_chain(inf_chain)

## Rename columns to be more informative
## Column names expected by code below
colnames(inf_chain) <- c("samp_no","chain_no","individual","year","infected","group")

## Data on which strains belong to which cluster
cluster_path <- system.file("extdata", "fonville_clusters.csv", package = "serosolver")
clusters <- read.csv(file = cluster_path, stringsAsFactors=FALSE)
clusters <- clusters[clusters$year <= sample_year,]

## j=1 corresponds to the year 1968
inf_chain$year <- inf_chain$year + 1967

## Merge cluster data and infection history data
inf_chain <- merge(inf_chain, clusters[,c("year","cluster1")],by="year")

## Calculate ages and age groups of all individuals
titre_dat$age <- max(strain_isolation_times) - titre_dat$DOB
titre_dat$age_group <- cut(titre_dat$age,breaks=c(0,20,100),include.lowest=TRUE)
ages <- unique(titre_dat[,c("individual","age_group","DOB","age")])

## Merge infection histories with individual data
inf_chain<- merge(inf_chain, data.table(ages), by=c("individual"))

## Alive status for each individual for each time,
## only interested in individuals that were alive
## when a virus circulated
inf_chain$alive <- inf_chain$DOB <= inf_chain$year
inf_chain <- inf_chain[inf_chain$alive,]
## Find out number of infections per year
inf_chain$potential_infection <- 1
setkey(inf_chain, "samp_no","chain_no","year")
inf_chain_ar <- inf_chain[,list(no_infected=sum(infected),
                                potential_infection=sum(potential_infection)),
                          by=key(inf_chain)]

## Calculate posterior median and credible intervals on the per year attack rates
setkey(inf_chain_ar, "year")
y <- inf_chain_ar[,list(median_ar=median(no_infected/potential_infection),
                        lower_quantile=quantile(no_infected/potential_infection,0.025),
                        upper_quantile=quantile(no_infected/potential_infection,0.975),
                        lower_quantile_50=quantile(no_infected/potential_infection,0.25),
                        upper_quantile_50=quantile(no_infected/potential_infection,0.75)),
                  by=key(inf_chain_ar)]

## Extract the MLE attack rate estimates
inf_chain_mle <- inf_chain_ar[inf_chain_ar$samp_no == max(inf_chain_ar$samp_no),]

p_ar <- ggplot(data=y) +
  geom_ribbon(aes(x=year, ymin=lower_quantile,ymax=upper_quantile),
              fill=viridis(8)[3],alpha=0.2) +
  geom_ribbon(aes(x=year, ymin=lower_quantile_50,ymax=upper_quantile_50),
              fill=viridis(8)[3],alpha=0.5) +
  geom_line(aes(x=year,y=median_ar)) +
  geom_line(data=inf_chain_mle, aes(x=year, y=no_infected/potential_infection),
            linetype="dashed", size=0.8,col="forestgreen")+
  scale_x_continuous(limits=c(1968,sample_year),expand=c(0,0))+
  scale_y_continuous(limits=c(0,1),expand=c(0,0)) +
  ylab("Attack rate") +
  xlab("Year of circulation") +
  theme_classic()+
  theme(panel.grid=element_blank())
print(p_ar)
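The summary step behind this plot, dividing the number infected by the number alive for each posterior sample and then taking quantiles across samples, can be sketched with simulated counts. All values below are invented for illustration.

```r
set.seed(1)

# Hypothetical number of individuals alive in each of three years
n_alive <- c(50, 60, 70)
n_samp <- 200

# Hypothetical posterior samples of infections per year
# (rows = posterior samples, columns = years)
no_infected <- sapply(n_alive, function(n) rbinom(n_samp, size = n, prob = 0.15))

# Attack rate per sample and year: infections divided by number alive
attack_rate <- sweep(no_infected, 2, n_alive, "/")

# Posterior median with 50% and 95% credible intervals, one row per year
ar_summary <- t(apply(attack_rate, 2, quantile,
                      probs = c(0.025, 0.25, 0.5, 0.75, 0.975)))
print(ar_summary)
```

Dividing within each sample before summarising keeps the credible intervals on the attack rate scale, bounded between 0 and 1.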
Finally, inferring individual infection histories allows us to investigate age-specific patterns of incidence. Here, we show the proportion of individuals that were infected at least once within a single antigenic cluster, finding that clusters that circulate for longer tend to infect a far higher proportion of the population. Furthermore, we see that a far higher proportion of the younger age group is infected in more recent years.
## Find out number of infections per cluster for each individual
setkey(inf_chain, "individual","samp_no","chain_no","cluster1","age_group")
inf_chain_cluster <- inf_chain[,list(infected1=sum(infected),
                                     potential_infection=sum(potential_infection)),
                               by=key(inf_chain)]
inf_chain_cluster_once <- inf_chain_cluster
inf_chain_cluster_once[inf_chain_cluster_once$infected1 >=1,"infected1"] <- 1
inf_chain_cluster_once[inf_chain_cluster_once$potential_infection >=1,"potential_infection"] <- 1
setkey(inf_chain_cluster_once, "samp_no","chain_no","cluster1","age_group")
inf_chain_cluster_once <- inf_chain_cluster_once[,list(total_infected=sum(infected1),
                                                       total_potential_infection=sum(potential_infection)),
                                                 by=key(inf_chain_cluster_once)]

## For which years was each cluster circulating?
year_ranges <- ddply(clusters, ~cluster1, function(x){
  c(min(x$year), max(x$year))
})
year_ranges$width <- year_ranges$V2 - year_ranges$V1 + 1
colnames(year_ranges) <- c("cluster1","start_year","end_year","width")

## We want to know number of people alive per cluster per age group
n_alive_clusters <- ddply(year_ranges, ~cluster1, function(x){
  ddply(ages, ~age_group, function(y){
    nrow(y[y$DOB <= x$end_year,])
  })
})
colnames(n_alive_clusters)[3] <- "n_alive"
n_alive_clusters <- merge(n_alive_clusters, year_ranges)
y <- merge(inf_chain_cluster_once, data.table(n_alive_clusters), by=c("age_group","cluster1"))
y$cluster1 <- as.factor(y$cluster1)
colnames(y)[which(colnames(y)=="width")] <- "Years of circulation"

p_once <- ggplot(y) +
  geom_violin(aes(x=cluster1,y=total_infected/n_alive,
                  fill=`Years of circulation`, group=cluster1),
              adjust=1.2, draw_quantiles=c(0.025,0.5,0.975), scale="width") +
  scale_fill_gradient2(low="blue",high=viridis(8)[3]) +
  ylab("Proportion of individuals\n infected at least once") +
  xlab("Cluster index") +
  theme_classic() +
  theme(legend.position = "bottom", panel.grid=element_blank()) +
  facet_wrap(~age_group, ncol=1)
print(p_once)
We finish the vignette by presenting a simulation-recovery experiment to test the ability of the framework to recover known infection histories and antibody kinetics parameters using simulated data that matches the real dataset.
We simulate infection histories and antibody titre data based on the "real" parameters inferred from fitting the model above. First, we extract the maximum posterior probability antibody kinetics parameters and attack rates.
## Read in MCMC chains from fitting
#all_chains <- load_mcmc_chains(location=chain_path_real,thin=1,burnin=100000,
#                               par_tab=par_tab,unfixed=FALSE,convert_mcmc=FALSE)

## Alternative, load the included MCMC chains rather than re-running
data(cs2_chains_real_b)
all_chains <- cs2_chains_real_b

## Find samples that were in both theta and inf hist chains
chain <- all_chains$theta_chain
inf_chain <- all_chains$inf_chain
intersect_samps <- intersect(unique(inf_chain$samp_no), unique(chain$samp_no))
chain <- chain[chain$samp_no %in% intersect_samps,]

## Find the parameter values that gave the highest posterior probability
which_mle <- chain[which.max(chain$lnlike),c("samp_no","chain_no")]
mle_theta_pars <- chain[chain$samp_no == which_mle$samp_no & chain$chain_no == which_mle$chain_no,]

## Store total infections to compare later
mle_total_infs <- mle_theta_pars[,"total_infections"]
mle_theta_pars <- mle_theta_pars[,par_tab$names]
mle_inf_hist <- inf_chain[inf_chain$samp_no == which_mle$samp_no &
                            inf_chain$chain_no == which_mle$chain_no,]

## Generate full infection history matrix using provided function
mle_inf_hist <- expand_summary_inf_chain(mle_inf_hist[,c("samp_no","j","i","x")])

## Find number of infections per year from this infection history
no_infs <- colSums(mle_inf_hist[,3:ncol(mle_inf_hist)])

## If missing time points in simulated attack rates
if(length(no_infs) < length(strain_isolation_times)){
  diff_lengths <- length(strain_isolation_times) - length(no_infs)
  no_infs <- c(no_infs, rep(0, diff_lengths))
}

## Find attack rate per year
n_alive <- get_n_alive(titre_dat, strain_isolation_times)
attack_rates <- no_infs/n_alive
Functions are provided to simulate antibody titre data under a given serosurvey design. The antibody kinetics parameters and attack rates estimated above are used to simulate titres from the model. The `simulate_data` function is documented in detail, and users should refer to the help file to customise the simulated serosurvey design.
set.seed(1234)
sim_par_tab <- par_tab
sim_par_tab$values <- as.numeric(mle_theta_pars)
sim_par_tab[sim_par_tab$names %in% c("alpha","beta"),"values"] <- c(1/3,1/3)
age_min <- 2009 - max(titre_dat$DOB)
age_max <- 2009 - min(titre_dat$DOB)
n_indiv <- length(unique(titre_dat$individual))
dat <- simulate_data(par_tab=sim_par_tab,
                     n_indiv=n_indiv,
                     buckets=resolution,
                     strain_isolation_times=strain_isolation_times,
                     sampling_times=2009,
                     nsamps=1,
                     antigenic_map=antigenic_map,
                     age_min=age_min,
                     age_max=age_max,
                     attack_rates=attack_rates,
                     repeats=1)

## Inspect simulated antibody titre data and infection histories
sim_titre_dat <- dat[["data"]]
sim_infection_histories <- dat[["infection_histories"]]

## Store total infections to compare later
actual_total_infections <- sum(sim_infection_histories)
plot_data(sim_titre_dat, sim_infection_histories, strain_isolation_times,n_indivs = 5)

## Use titres only against same viruses tested in real data
viruses <- unique(titre_dat$virus)
sim_titre_dat <- sim_titre_dat[sim_titre_dat$virus %in% viruses, ]
sim_ages <- dat[["ages"]]
sim_titre_dat <- merge(sim_titre_dat, sim_ages)
sim_ar <- dat[["attack_rates"]]
Once these simulated data have been generated, the workflow is exactly the same as with the real data above.
filename <- "case_study_2_sim"

## Distinct filename for each chain
no_chains <- 5
filenames <- paste0(filename, "_",1:no_chains)

## Create the posterior solving function that will be used in the MCMC framework
model_func <- create_posterior_func(par_tab=sim_par_tab,
                                    titre_dat=sim_titre_dat,
                                    antigenic_map=antigenic_map,
                                    version=prior_version) # function in posteriors.R
## Generate results in parallel
res <- foreach(x = filenames, .packages = c('serosolver','data.table','plyr')) %dopar% {
  ## Not all random starting conditions return finite likelihood, so for each chain generate random
  ## conditions until we get one with a finite likelihood
  start_prob <- -Inf
  while(!is.finite(start_prob)){
    ## Generate starting values for theta
    start_tab <- generate_start_tab(par_tab)

    ## Generate starting infection history
    start_inf <- setup_infection_histories_titre(sim_titre_dat, strain_isolation_times,
                                                 space=3,titre_cutoff=4)
    start_prob <- sum(model_func(start_tab$values, start_inf)[[1]])
  }

  res <- serosolver(par_tab = start_tab,
                    titre_dat = sim_titre_dat,
                    antigenic_map = antigenic_map,
                    start_inf_hist = start_inf,
                    mcmc_pars = c("iterations"=500000,"adaptive_iterations"=100000,"thin"=1000,
                                  "thin_inf_hist"=1000,"save_block"=1000,
                                  "proposal_inf_hist_group_swap_ratio"=0.8,
                                  "proposal_inf_hist_indiv_prop"=1),
                    filename = paste0(chain_path_sim,x),
                    CREATE_POSTERIOR_FUNC = create_posterior_func,
                    version = prior_version)
}
MCMC chains should be checked for convergence under the usual diagnostics. We also compare the inferred posterior distributions to the known true parameter values. We see that convergence and between-chain agreement are good and that the model recovers reasonably unbiased estimates for some parameters. However, under this sampling strategy the model slightly underestimates the amount of long-term antibody boosting elicited by a single infection and overestimates the total number of infections. This is driven by the contribution of the attack rate prior relative to the contribution of the likelihood (the data). Increasing the number of measured titres (for example, measuring titres against 40 viruses rather than 9) or using a more informative attack rate prior would help reduce this bias.
## Read in the MCMC chains
## Note that `thin` here is in addition to any thinning done during the fitting
#sim_all_chains <- load_mcmc_chains(location=chain_path_sim,thin=1,burnin=100000,
#                                   par_tab=par_tab,unfixed=FALSE,convert_mcmc=TRUE)

## Alternative, load the included MCMC chains rather than re-running
data(cs2_chains_sim)
sim_all_chains <- cs2_chains_sim
theta_chain <- sim_all_chains$theta_chain

## Get the MCMC chains as a list
list_chains <- sim_all_chains$theta_list_chains

## Look at diagnostics for the free parameters
list_chains1 <- lapply(list_chains, function(x) as.mcmc(x[,c("mu","sigma1","error",
                                                             "tau","total_infections",
                                                             "lnlike","prior_prob")]))

## Gelman-Rubin diagnostics and effective sample size
print(gelman.diag(as.mcmc.list(list_chains1)))
print(effectiveSize(as.mcmc.list(list_chains1)))

melted_theta_chain <- reshape2::melt(as.data.frame(theta_chain), id.vars=c("samp_no","chain_no"))
estimated_pars <- c(sim_par_tab[sim_par_tab$fixed == 0,"names"],"total_infections")
melted_theta_chain <- melted_theta_chain[melted_theta_chain$variable %in% estimated_pars,]
colnames(melted_theta_chain)[3] <- "names"
add_row <- data.frame("total_infections",actual_total_infections,0,0.1,0,10000,0,0,1)
colnames(add_row) <- colnames(sim_par_tab)
sim_par_tab1 <- rbind(sim_par_tab, add_row)

ggplot(melted_theta_chain) +
  geom_density(aes(x=value,fill=as.factor(chain_no)),alpha=0.5) +
  geom_vline(data=sim_par_tab1[sim_par_tab1$fixed == 0,],aes(xintercept=values),linetype="dashed") +
  facet_wrap(~names,scales="free") +
  theme_classic() +
  theme(legend.position="bottom")
Recovery of known attack rates is also reasonably accurate, though the posterior distribution is only weakly constrained for many years where identifiability is poor. Again, more titre data or more individuals would improve inferential power. One particularly reassuring plot is the comparison of known individual cumulative infection histories (the cumulative sum of infections over time for an individual) against the estimated posterior distribution of cumulative infection histories. We see that the 95% credible intervals capture the true cumulative infection histories in almost all cases.
## Extract infection history chain
inf_chain <- sim_all_chains$inf_chain

## Look at inferred attack rates
p_ar <- plot_attack_rates(inf_chain, sim_titre_dat, strain_isolation_times, pad_chain=FALSE,
                          plot_den = TRUE,
                          prior_pars=list(prior_version=prior_version,
                                          alpha=par_tab[par_tab$names=="alpha","values"],
                                          beta=par_tab[par_tab$names=="beta","values"])) +
  geom_point(data=sim_ar,aes(x=year,y=AR),col="purple")
print(p_ar)

## Calculate convergence diagnostics and summary statistics on infection histories
## Important to scale all infection estimates by number alive from titre_dat
sim_n_alive <- get_n_alive_group(sim_titre_dat, strain_isolation_times,melt=TRUE)

## This function generates a number of MCMC outputs
ps_infhist <- plot_posteriors_infhist(inf_chain=inf_chain,
                                      years=strain_isolation_times,
                                      n_alive=sim_n_alive)

## Check convergence of infection history summary statistics
## MCMC trace plots of attack rates
print(ps_infhist[["by_time_trace"]][[1]])

## MCMC trace plots of total number of infections per individual
print(ps_infhist[["by_indiv_trace"]][[1]])

## Check for agreement between inferred cumulative infection histories
## for some individuals
p_indiv_inf_hists <- generate_cumulative_inf_plots(inf_chain,indivs=1:9,pad_chain=FALSE,
                                                   real_inf_hist=sim_infection_histories,
                                                   strain_isolation_times = strain_isolation_times,
                                                   number_col=3)
## Each subplot shows one individual
print(p_indiv_inf_hists[[1]])

## Posterior probability that infections occurred at given times per individual
## Each subplot shows one individual
print(p_indiv_inf_hists[[2]])
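The comparison of true and inferred cumulative infection histories can also be summarised numerically as a coverage proportion. The sketch below does this for one simulated individual; the "posterior samples" are fabricated by randomly perturbing the true history and stand in for real MCMC output.

```r
set.seed(1)

# Hypothetical true infection history for one individual over 10 time points
n_times <- 10
true_hist <- rbinom(n_times, 1, 0.2)
true_cumu <- cumsum(true_hist)

# Fake posterior samples: the true history with each entry flipped
# with probability 0.1 (rows = samples, columns = time points)
n_samp <- 500
samples <- t(replicate(n_samp, {
  flip <- rbinom(n_times, 1, 0.1)
  abs(true_hist - flip)
}))

# Cumulative infections over time within each sample
cumu_samples <- t(apply(samples, 1, cumsum))

# 95% credible interval at each time point
lower <- apply(cumu_samples, 2, quantile, probs = 0.025)
upper <- apply(cumu_samples, 2, quantile, probs = 0.975)

# Proportion of time points where the interval contains the truth
coverage <- mean(true_cumu >= lower & true_cumu <= upper)
print(coverage)
```

Averaging this proportion over individuals gives a single number to track when comparing sampling designs or attack rate priors in a simulation-recovery experiment.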