knitr::opts_chunk$set(fig.width=7, fig.height=4)
In this document, we present a full analysis of the use of the package Rsero, taking as an example a serological survey of chikungunya in Vietnam. Data were obtained in Quan et al., Evidence of Previous but Not Current Transmission of Chikungunya Virus in Southern and Central Vietnam: Results from a Systematic Review and a Seroprevalence Study in Four Locations, Plos NTD, 2018.
The Rsero package is built upon the rstan package, the R interface to Stan. Stan is based on C++ and its necessary to install Rtools on Windows computers. Rtools version 4.0 and more recent can be found here
To install the package Rsero, use the devtools command
devtools::install_github("nathoze/Rsero")
Installation may take a few minutes due to the compilation of the stan files that encode the serocatalytic models.
First, we load the package
library(Rsero)
We load the data and organize them in the SeroData format, which will be later used for fitting. Required parameters are age_at_sampling, which must be given as positive integers, the seropositivity of each individuals (0, 1, or booleans). Additional parameters are the category to which each individual belongs, a reference.category, the sampling_year.
data('chikv_vietnam') chikv_vietnam$location = as.character(chikv_vietnam$location) chik.sero = SeroData(age_at_sampling = chikv_vietnam$age, Y = chikv_vietnam$Y, category = chikv_vietnam$location, reference.category = 'Hue', sampling_year = 2017)
For a first analysis, we aggregate the data all across Vietnam, simply by not specifying the category
data('chikv_vietnam') chikv_vietnam$location = as.character(chikv_vietnam$location) chik.sero.aggregated = SeroData(age_at_sampling = chikv_vietnam$age, Y = chikv_vietnam$Y, sampling_year = 2017)
We estimate the seroprevalence and plot the seroprevalence by age on the aggregatedd dataset.
seroprevalence(chik.sero.aggregated) # Value of the seroprevalence
seroprevalence.plot(chik.sero.aggregated) # plots of the seroprevalence vs age
To define a model of pathogen circulation we use the method FOImodel(). Some models have additional parameters, for instance the outbreak models require that we specify the number of peaks K. In addition, we can define the prior distributions, and additional parameters such as the seroreversion
model = list() model[[1]] = FOImodel(type='constant') model[[2]] = FOImodel(type='piecewise', K=2) model[[3]] = FOImodel(type='piecewise', K=3) model[[4]] = FOImodel(type='outbreak', K=1) model[[5]] = FOImodel(type='outbreak', K=2) model[[6]] = FOImodel(type='constantoutbreak', K=1) model[[7]] = FOImodel(type='constantoutbreak', K=2) model[[8]] = FOImodel(type='independent') model[[9]] = FOImodel(type='constant', sp=0.95, se=0.99) model[[10]] = FOImodel(type='outbreak', K=1, sp=0.95, se=0.99) model[[11]] = FOImodel(type='constant', seroreversion = 1)
Our objective is to infer the parameters for the differents models, assess whether the models are adequate at reproducing the data, and select the best model(s). Doing these steps for all models is time consuming and we just compare two models here. For each model, we run 4 independent chains of 5,000 iterations each, where by default the first 2,500 are warmups. By specifying the option cores = 4 the chains are run in parallel on 4 CPUs.
First, we test the model of a constant force of infection and imperfect sensibility and specificity (model 9). In this model, the user has to provide the values of the sensibility and specificity, here se = 0.99 and sp = 0.95.
print(model[[9]])
Fit_1 = fit(data = chik.sero.aggregated, model = model[[9]], chain=4, cores=4, iter=5000)
Second, we test the model of one outbreak and imperfect sensibility and specificity (model 10).
print(model[[10]])
Fit_2 = fit(data = chik.sero.aggregated, model = model[[10]], chain=4, cores=4, iter=5000)
The Rsero packages includes a certain number of functions to analyze the MCMC chains and the posterior distributions of the parameters. We first compare the different models and evaluate the goodness of fit. The compute_information_criteria function implemented in Rsero evaluates the AIC, DIC, WAIC, and the PSIS-LOO. The AIC, which requires an estimation of the maximum likelihood, is obtained for the ensemble of parameters in the chain that maximizes the likelihood.
C_1 = compute_information_criteria(Fit_1) C_2 = compute_information_criteria(Fit_2) print(C_1) print(C_2)
The DIC is significantly lower for model 10 than for model 9. This indicates that model 10 (outbreak model with imperfect sensitivity) is more adequate at fitting the data.
This can be checked visually when plotting the observed and fitted seroprevalance. In the plot, the black segments are the mean and 95% confidence interval, the blue line and envelopes are mean and 95% credible interval of the seroprevalence.
For model 9,
p = seroprevalence.fit(Fit_1) print(p)
And for model 10,
p = seroprevalence.fit(Fit_2) print(p)
According to this model, an outbreak occurred around 40 years before the survey. The non-zero seroprevalence in the younger individuals is explained here by false positives (implied by the sp parameter).
The posterior distribution can be plotted using the command
plot_posterior(Fit_2)
Numerical values are given by the the parameters_credible_intervals function, which gives the mean and 2.5%, 50% and 97.5% quantiles of the posterior distributions of the most relevant parameters.
parameters_credible_intervals(Fit_2)
The plot of the annual force of infection is obtained by typing
plot(Fit_2, YLIM =0.3)
We show here how to do the MCMC chains diagnosis using the tools provided in the rstan package. The output object of the fit using the package Rsero contains the data, the model, and the fit. These various objects can be extracted using the commands
Fit_2$data Fit_2$model Fit_2$fit
Using this fit, it is possible first to obtain the traceplots using the rstan::traceplot function, providing information on chain convergence and mixing. The Rsero::traceplot_Rsero() function restricts the plots to the most relevant parameters. In the case of model 10, these are the force of infection (alpha) and timing of the outbreak (T).
traceplot_Rsero(Fit_2)
Additional diagnostics for the HMC and NUTS are available using standard rstan functions. The following functions are not intended for a precise diagnostic of individual parameters but rather to quickly identify if acceptable values are reached for all parameters. The Rhat compares the between and within-chain estimates and is a measure of good mixing should be lower than 1.1. The effective sample size is a measure of "how much independent information there is in autocorrelated chains" (Kruschke 2015) and it should be larger than 200 to provide a stable parameter estimate. For more detailed diagnostics the use of the posterior package is recommanded.
stan_diag(Fit_2$fit,info = c("sample")) # shows diagnostic (loglikelihood, acceptance rate,etc) stan_rhat(Fit_2$fit) # show the Rhat statistic stan_ess(Fit_2$fit) # show the ratio of effective sample size over total sample size
Finally the chains are available using the rstan::extract() function
Chains= rstan::extract(Fit_2$fit)
We fit model 10 to the dataset "chik.sero" which contains information the survey participant sampling locations.
Fit_3 = fit(data = chik.sero, model = model[[10]], chain=4, cores=4, iter=5000)
The fits are obtained using the same function seroprevalence.fit
p = seroprevalence.fit(Fit_3) print(p)
Default parameters for the fit are 5,000 iterations (including 2,500 warmups), four chains, and a random set of initial parameters. Rstan allows more parameters to be specified by the user. We show a few ones here, but more information is available in the documentation of the rstan::sampling() function here. Additional parameters such as the thinning, the algorithm (HMC, NUTS), and important parameters to control the sampler behavior (control) can be modified. A list of the parameters for the HMC algorithm (Hamiltonian Monte-Carlo) can be found here.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.