knitr::opts_chunk$set(echo = TRUE)
The aim of the STOCfree
package is to predict herd level probabilities of freedom from infection from longitudinal data collected as part of surveillance programmes. A full description of the model principles is available as an article. This article has been peer-reviewed as part of a publishing initiative called Peer Community In Animal Science. The reviews can be accessed by clicking this link. Some aspects of the model were evaluated on simulated data by Mercat et al. (2022).
The model has been developed as part of the EFSA funded project STOC free. An overall description of the project can be found in a 2019 article by van Roon et al. published in frontiers in Veterinary Science.
The problem addressed is the following. For some major infectious diseases of cattle, there exist local surveillance programmes aimed at identifying infected herds or animals with the ultimate goal of controlling or eradicating the disease. Although such programmes usually lead to disease control or eradication, they can create some difficulty when trading animals. Herds considered as free from infection within one programme may not be considered free under another programme. This lack of comparability can be problematic. This is due to the differences in the programmes in terms of the frequency of testing, the categories of the animals tested, the tests used... as well as in the way freedom from infection is defined from the data collected. The aim of the STOC free project is to provide a framework allowing the comparison of the outputs generated by different surveillance programmes. This is known as output-based surveillance.
In the STOC free project, infection by the Bovine Viral Diarrhoea Virus (BVDV) in cattle is taken as an example. A description of BVDV surveillance programmes in the countries involved in the project was published by van Roon et al. in the Journal of Dairy Science in 2020 and shows the variety in existing programmes.
The statistical framework described on this page is meant to model surveillance programmes in which all herds in the programme are tested at regular time intervals. The model can be described as a Hidden Markov Model, running at the month level. The variable of interest is a latent status regarding infection that is imperfectly measured with tests and that can be predicted by risk factors. The model returns a herd level posterior distribution for the probability of being status positive given a sequence of test results and risk factors. This probability of being positive to the status (usually infection) is predicted for the last month in the dataset. Data collected before are used as historical data to train the model. Risk factors of new infection (more broadly of becoming status positive) are considered. The model is run in a Bayesian framework with estimation and prediction performed in either Stan or JAGS. The model runs faster and better in Stan.
The next sections describe how to install the package, set up and run the model.
Before installing the STOCfree package, you need to install either Stan or JAGS.
The Stan implementation relies on an interface called CmdStan which communicates with R through the CmdStanR R package. The instructions to install both CmdStan and CmdStanR can be found on the following page: https://mc-stan.org/cmdstanr/articles/cmdstanr.html. Running the Stan model also requires installing the posterior R package.
JAGS can be installed from the following website: https://sourceforge.net/projects/mcmc-jags/files/. If R cannot find JAGS upon calling STOCfree_JAGS()
try running the following in the console Sys.setenv(JAGS_HOME=“PATH/JAGS installation”)
.
The STOCfree
package needs to be installed from Github. This requires installing the remotes
(or devtools
) package first. You will need R version 4.0 or later to install the package. You may be asked to install or update several packages during the installation.
install.packages("remotes")
In order to install (or update) the STOCfree package, run the following line:
remotes::install_github("AurMad/STOCfree")
The STOCfree
package needs to be attached.
library(STOCfree)
The list of available functions and datasets can be accessed by typing
help(package="STOCfree")
We also attach the following packages that will be used later:
library(ggplot2)
Modelling will usually consist in the following steps:
We demonstrate the use of the package by using a toy dataset called herdBTM
which is included in the package. These are not real data and should not be used to draw any conclusion regarding BVD. This dataset contains the results of tests performed in 100 herds. Two different tests are used. The first test is an antibody ELISA performed on bulk tank milk. When this test is positive, a confirmatory test with a high specificity is performed one month later. The result of the ELISA test is an optical density ratio (continuous) which is converted into a negative or positive test result based on a threshold of 35. The dataset looks as follows:
head(herdBTM)
The columns of the dataset correspond to:
Farm
: farm IDDateOfTest
: date when the test was performedODR
: optical density ratio associated with ELISA testTest
: type of test used, either routine ELISA on bulk tank milk ("BTM_ODR") or confirmatory test ("confirm")TestResult
: test result. 0 for negative; 1 for positiveLocalSeroPrev
: local seroprevalence, a risk factor of new infectionSTOCfree_data
objectsAll the later analyses rely on the construction of a STOCfree_data
object. As an example, a STOCfree_data
object is constructed from the herdBTM
dataset. To create this object, we use the STOCfree_data()
function. Type ?STOCfree_data
in the console to see the list of arguments that can be used.
sfd0 <- STOCfree_data(test_data = herdBTM, test_herd_col = "Farm", test_date_col = "DateOfTest", test_res_col = "TestResult", test_name_col = "Test", status_dynamics_scale = "proba")
In this example, the function will gather the test data from the herdBTM
dataset, which must be a data.frame or a tibble. Herd identifiers are looked up in the Farm
column of the herdBTM
dataset, dates of test in the DateOfTest
column...
One argument worth mentioning is the status_dynamics_scale
argument. By default, infection dynamics parameters, i.e. probabilities of infection on the first time step, acquiring and eliminating the infection between time steps, are modelled on the probability scale using Beta distributions (status_dynamics_scale = "proba"
). Changing to status_dynamics_scale = "logit"
, these probabilities are modelled on the logit scale and normal prior distributions are used. In order to illustrate this point, we create a second STOCfree_data object with dynamics parameters on the logit scale.
sfd <- STOCfree_data(test_data = herdBTM, test_herd_col = "Farm", test_date_col = "DateOfTest", test_res_col = "TestResult", test_name_col = "Test", status_dynamics_scale = "logit")
The STOCfree_data()
function returns an object of class STOCfree_data
.
class(sfd)
A STOCfree_data
object is in fact a list of data.frames
. Below we provide a brief explanation on the content of this list.
str(sfd)
The list components are:
var_names
: correspondence between column names in the original datasets and relevant data used in the modelsherd_id_corresp
: for the analysis herds are numbered from 1 to the number of herds. This table contains the correspondence between these IDs and the original IDstest_data
: data with test resultsherd_test_data
: herd level indices in the sequence of all months for all herds used in the model. For each herd, the model needs the indices associated with the first, second, penultimate and last tests in the sequence.test_perf_prior
: table with parameters for the prior distributions for test characteristics.risk_factors
: list and type of risk factors that will be included in the modelrisk_factor_data
: dataset containing risk factor values for all herds and all months in the analysisinf_dyn_priors
: table with priors for infection dynamics: probability of being status positive on the first month, probability of becoming status positive between consecutive months when applicable, probability of not eliminating the infection between consecutive monthsThe model accounts for the fact that the sensitivity and specificity of the different tests used can be below 1. Hypotheses about these parameters are included through the use of Beta distributions as priors for the different sensitivities and specificities. We can check the current values for the parameters associated with these distributions using the show_tests()
function.
show_tests(sfd)
Column names ending with _a
and _b
contain the alpha and beta values for Beta distributions. For example Se_a = 20
and Se_b = 2
define the prior distribution for sensitivity as Se \~ Beta(20, 2).
As can be seen, no value has been defined yet. This can be done as follows:
sfd <- set_priors_tests(sfd, test = "BTM_ODR", Se_a = 5000, Se_b = 260, Sp_a = 20, Sp_b = 2) sfd <- set_priors_tests(sfd, test = "confirm", Se_a = 20, Se_b = 2, Sp_a = 10000, Sp_b = 1)
We can check that the priors have changed:
show_tests(sfd)
The prior distributions can be plotted with the plot_priors_tests()
function.
plot_priors_tests(sfd)
In order to help selecting appropriate parameter values for the Beta distributions, we have designed a Shiny app which is available from Github. See https://github.com/AurMad/betadistapp
These parameters are the probability of being status positive on the first time step (pi1), the probability of becoming status positive between consecutive time steps (tau1) and the probability of remaining status positive between consecutive time steps (tau2). When risk factors are included in the model, tau1 is modelled as a function of these risk factors and the parameters for the prior distribution of tau1 will not be taken into account.
For STOCfree_data
objects constructed with the argument status_dynamics_scale = "proba"
, the prior distributions for status dynamics are provided as Beta distributions on the probability scale. The show_priors_status_dyn()
shows the current parameter values for these Beta distributions and gives the syntax the use to change them.
show_priors_status_dyn(sfd0)
Parameter values are provided for these prior distributions as follows:
sfd0 <- set_priors_status_dyn(sfd0, pi1_a = 1, pi1_b = 1, tau1_a = 1.5, tau1_b = 10, tau2_a = 10, tau2_b = 1.5)
These distributions can be plotted.
plot_priors_status_dyn(sfd0)
To demonstrate how to set dynamics parameters on the logit scale, we use the sfd1
object created above. As can be seen, the parameter values requested consist of means and standard deviations. In order to help selecting appropriate parameter values for the normal distributions on the logit scale, we have designed a Shiny app which is available from Github. See https://github.com/AurMad/logitnormdistapp
show_priors_status_dyn(sfd)
Values are provided.
sfd <- set_priors_status_dyn(sfd, logit_pi1_mean = 0, logit_pi1_sd = 1.5, logit_tau1_mean = -3, logit_tau1_sd = 1, logit_tau2_mean = 3, logit_tau2_sd = 1)
These distributions can be plotted.
plot_priors_status_dyn(sfd)
The Stan implementation uses the forward algorithm for the estimation. The code was adapted from a tutorial on Hidden Markov models in Stan by Damiano et al. (2017). The model is run using the STOCfree_Stan()
function. The main argument is a STOCfree_data
object as created above. By default, the model outputs are stored in a folder called 'STOCfree_files'. This output folder can be changed using through out_path
argument.
sfm_stan <- STOCfree_Stan(sfd, n_chains = 3, n_iter = 1000, n_thin = 1, out_path = "STOCfree_Stan_1")
The STOC free model is run using the STOCfree_JAGS()
function. This function is a wrapper for the run.jags()
function from the runjags
package. The main argument is a STOCfree_data
object as created above. By default, the models outputs are stored in a folder called 'STOCfree_files'. This output folder can be changed using through out_path
argument.
sfm_jags <- STOCfree_JAGS(sfd, n_chains = 3, n_burnin = 1000, n_iter = 1000, n_thin = 1, out_path = "STOCfree_JAGS_1")
The Stan model returns an object of classes CmdStanMCMC
, CmdStanFit
and R6
. Therefore, all methods and functions working with objects of these classes will work with outputs created by STOCfree_Stan()
.
The JAGS model returns an object of class runjags
. Therefore, all methods and functions working with runjags objects will work with outputs created by STOCfree_JAGS()
.
The STOCfree
package contains functions to explore the model outputs that work in the same way, regardless of whether the output was generated with Stan or JAGS.
Two specific types of model outputs are of interest and can be extracted, either directly from the model output or loaded from the disk when STOCfree_model()
was run using save_output = TRUE
. These model outputs are the model parameters (Se, Sp, $\tau_1$, $\ldots$) and predicted probabilities of being status positive. Additionally, the JAGS model returns estimated monthly prevalences during the historical period.
By model parameters, we mean test characteristics, parameters related to infection dynamics and association with risk factors. These can be extracted from the model output:
param <- extract_STOCfree_param(sfm_stan)
or loaded from the disk
param <- read_STOCfree_param(out_path = "STOCfree_Stan_1")
These 2 functions create objects of class STOCfree_param
for which print and plot methods are defined in the package.
print(param)
Traceplots can be generated to check convergence visually.
plot(param, parameter = "Se1", type = "traceplot")
Density plots can be generated.
plot(param, parameter = "Sp1", type = "density")
Summary values for each parameter can be obtained as follows:
summary(param)
MCMC samples for the herd level predicted probabilities of infection can also be extracted or loaded.
pred <- extract_STOCfree_pred(sfm_stan, sfd)
or
pred <- read_STOCfree_pred("STOCfree_Stan_1")
Calling print on these objects will display the list of herds.
print(pred)
Calling plot on these predicted probabilities will plot the density of predicted probabilities for all herds across all iterations.
plot(pred)
It is also possible to plot the predicted probabilities of infection for specific herds:
plot(pred, herd = c("FR001", "FR002"), type = "individual", legend = TRUE)
A summary method is implemented for these predicted probabilities. It is possible to specify for which herds a summary is required.
summary(pred, herd = c("FR001", "FR003"))
This is only available for the JAGS model. The predicted monthly prevalences of infection can be extracted or loaded using:
prev <- extract_STOCfree_month_prev(sfm_jags, sfd)
or
prev <- read_STOCfree_month_prev("STOCfree_JAGS_1")
Calling plot on this object will create boxplots of estimated monthly infection prevalences.
plot(prev)
Risk factors are included with the objective of detecting infection earlier when the interval between tests is long, as is usually the case. They can also improve the performance of the prediction of the probability of being status positive when test performance is not good.
The association between the probability of becoming status positive (tau1) and risk factors is modelled with logistic regression. The uncertainty in status as measured by test result is accounted for in the regression through test sensitivity and specificity.
Bayesian inference in JAGS can take a lot of time. This time increases with the size of the dataset modelled as well as with the number of MCMC iterations. One way to overcome this problem is to force covariates in the model based on what is known about the disease. In this case, but is also possible to include informative prior distributions for the coefficients of the logistic regression between tau1 and risk factors.
The STOCfree package contains a set of functions designed to help in the selection of risk factors from the data. For this purpose, it is considered that all herds with a negative test result are eligible for becoming test positive on the next test. The probability of becoming test positive (versus remaining test negative) is modelled with logistic regression. A further thing that is considered is the fact that risk factors occurring at one point in time may be associated with becoming test positive later.
To show how to use the available functions, we use the intro
dataset which is included in the package.
intro
This dataset contains the number of animals purchased as well as the dates of introduction for the 100 herds in the herdBTM
data. Using these datasets, the probability of seroconversion (becoming test positive between 2 tests in a given herd) is modelled as a function of the number of animals purchased between the month of seroconversion (month 0) and 24 months earlier. The time of seroconversion is taken as the midpoint between the 2 test dates considered.
nAnim_lagged <- logit_nwinf_lagged( sf_data = sfd, rf_data = intro, rf_date_col = "dateIntr", rf_col = "nAnim", time_of_inf = "mid", lag1 = 0, lag2 = 24)
The function returns a table with the AICs of all the models evaluated.
str(nAnim_lagged)
This can be plotted using the code below. From the Figure below it can be seen that between the worst (AIC > 350) and the best (AIC \< 342) model, the difference in AIC is rather limited. There is a drop in AIC for intervals starting from 8 months before seroconversion.
ggplot(data = nAnim_lagged, aes(x = lag2, y = lag1, fill = AIC)) + geom_tile() + xlab("Time Lag 2 (months)") + ylab("Time Lag 1 (months)") + scale_fill_gradient(low = "red", high = "yellow", aesthetics = "fill") + ggtitle("Number of animals purchased")
When the function is used to evaluate several candidate variables, all these variables can be included in a multivariate logistic model. For this, we create a dataset with the different variables of interest.
nwinf <- make_nwinf_data(sfd, time_of_inf = "mid")
Here, we include the number of animals purchased 8 months before the test of interest. This is done by calling the add_risk_factor()
.
nwinf <- add_risk_factor(nwinf, intro, rf_col = "nAnim", rf_date_col = "dateIntr", lag1 = 8, lag2 = 8)
The multivariate logistic model is estimated with R glm()
function called by logit_nwinf()
function from the STOCfree
package:
modl <- logit_nwinf(nwinf, risk_factors = "nAnim_8_8")
The model results are:
summary(modl)
The STOCfree_data()
object is updated to include the risk factors selected for the analysis. This is done with the sf_add_risk_factor()
function.
sfd <- sf_add_risk_factor( sfd = sfd, risk_factor_data = intro, risk_herd_col = "Farm", risk_date_col = "dateIntr", risk_factor_col = "nAnim", risk_factor_type = "continuous", lag1 = 8, lag2 = 8, FUN = sum)
Doing this will result in the priors required for infection related parameters to be different from above. The priors for tau1 are not needed any more. They are replaced by prior distributions for the logistic regression.
show_priors_status_dyn(sfd)
The Bayesian model models the probability of becoming status positive (usually getting infected) as a function of one or several risk factors, with logistic regression. Priors for the coefficients in the logistic regression need to be provided. The prior distributions used are normal distributions on the logit scale. In our case, we need to provide prior distributions for the intercept as well as for coefficient associated with the number of animals purchased 8 months before the month of surveillance. Below is the list of covariates included in the STOCfree_data
so far.
show_rf(sfd)
sfd <- set_priors_rf(sfd, risk_factor = "Intercept", mean = -3, sd = 1) sfd <- set_priors_rf(sfd, risk_factor = "nAnim_8_8", mean = 0, sd = 2)
This has updated the list of covariates with the normal distribution parameters.
show_rf(sfd)
These priors can be plotted on the probability scale. The prior distribution for the intercept implies that in herds buying no cows, the probability of becoming positive is most likely below 0.4, with most of the density below 0.2.
plot_priors_rf(sfd)
This is done as explained above by calling the STOCfree_Stan()
or STOCfree_JAGS()
functions. In this case, estimates for the logistic regression parameters will appear as theta1 (model intercept) and theta2 (coefficient for the nAnim_8_8 variable).
Below is a complete example.
## Creation of the STOCfree_data object sfd <- STOCfree_data(test_data = herdBTM, test_herd_col = "Farm", test_date_col = "DateOfTest", test_res_col = "TestResult", test_name_col = "Test", status_dynamics_scale = "logit") ## Priors for test characteristics sfd <- set_priors_tests(sfd, test = "BTM_ODR", Se_a = 5000, Se_b = 260, Sp_a = 20, Sp_b = 2) sfd <- set_priors_tests(sfd, test = "confirm", Se_a = 20, Se_b = 2, Sp_a = 10000, Sp_b = 1) ## Prior for infection dynamics sfd <- set_priors_status_dyn(sfd, logit_pi1_mean = 0, logit_pi1_sd = 4, logit_tau2_mean = 3, logit_tau2_sd = 1) ## Adding a risk factor to the TSOCfree_data object sfd <- sf_add_risk_factor( sfd = sfd, risk_factor_data = intro, risk_herd_col = "Farm", risk_date_col = "dateIntr", risk_factor_col = "nAnim", risk_factor_type = "continuous", lag1 = 8, lag2 = 8, FUN = sum) ## Priors for risk factors association with log-odds of becoming positive sfd <- set_priors_rf(sfd, risk_factor = "Intercept", mean = -3, sd = 1) sfd <- set_priors_rf(sfd, risk_factor = "nAnim_8_8", mean = 0, sd = 2)
Stan version:
set.seed(124) sfm_stan_rf <- STOCfree_Stan(sfd, n_chains = 3, n_warmup = 10, n_iter = 100, n_thin = 1, out_path = "STOCfree_Stan_2")
JAGS version
sfm_jags_rf <- STOCfree_JAGS(sfd, n_chains = 3, n_burnin = 1000, n_iter = 1000, n_thin = 1, out_path = "STOCfree_JAGS_2")
Summary of parameters estimated with the Stan model.
param <- extract_STOCfree_param(sfm_stan_rf) summary(param)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.