README.md

STOCfree: prediction of probabilities of freedom from infection from longitudinal data ================

Overview

The aim of the STOCfree package is to predict herd level probabilities of freedom from infection from longitudinal data collected as part of surveillance programmes. A full description of the model principles is available as an article. This article has been peer-reviewed as part of a publishing initiative called Peer Community In Animal Science. The reviews can be accessed by clicking this link. Some aspects of the model were evaluated on simulated data by Mercat et al. (2022).

The model has been developed as part of the EFSA funded project STOC free. An overall description of the project can be found in a 2019 article by van Roon et al. published in frontiers in Veterinary Science.

The problem addressed is the following. For some major infectious diseases of cattle, there exist local surveillance programmes aimed at identifying infected herds or animals with the ultimate goal of controlling or eradicating the disease. Although such programmes usually lead to disease control or eradication, they can create some difficulty when trading animals. Herds considered as free from infection within one programme may not be considered free under another programme. This lack of comparability can be problematic. This is due to the differences in the programmes in terms of the frequency of testing, the categories of the animals tested, the tests used… as well as in the way freedom from infection is defined from the data collected. The aim of the STOC free project is to provide a framework allowing the comparison of the outputs generated by different surveillance programmes. This is known as output-based surveillance.

In the STOC free project, infection by the Bovine Viral Diarrhoea Virus (BVDV) in cattle is taken as an example. A description of BVDV surveillance programmes in the countries involved in the project was published by van Roon et al. in the Journal of Dairy Science in 2020 and shows the variety in existing programmes.

The statistical framework described on this page is meant to model surveillance programmes in which all herds in the programme are tested at regular time intervals. The model can be described as a Hidden Markov Model, running at the month level. The variable of interest is a latent status regarding infection that is imperfectly measured with tests and that can be predicted by risk factors. The model returns a herd level posterior distribution for the probability of being status positive given a sequence of test results and risk factors. This probability of being positive to the status (usually infection) is predicted for the last month in the dataset. Data collected before are used as historical data to train the model. Risk factors of new infection (more broadly of becoming status positive) are considered. The model is run in a Bayesian framework with estimation and prediction performed in either Stan or JAGS. The model runs faster and better in Stan.

The next sections describe how to install the package, set up and run the model.

Package installation and update

Before installing the STOCfree package, you need to install either Stan or JAGS.

The STOCfree package needs to be installed from Github. This requires installing the remotes (or devtools) package first. You will need R version 4.0 or later to install the package. You may be asked to install or update several packages during the installation.

install.packages("remotes")

In order to install (or update) the STOCfree package, run the following line:

remotes::install_github("AurMad/STOCfree")

Attaching packages

The STOCfree package needs to be attached.

library(STOCfree)

The list of available functions and datasets can be accessed by typing

help(package="STOCfree")

We also attach the following packages that will be used later:

library(ggplot2)

Steps of the analysis

Modelling will usually consist in the following steps:

  1. Set up the test data
  2. Define the prior distributions for test characteristics
  3. Define the prior distributions for the model parameters related to status
  4. Set up the risk factor data
  5. Define the prior distributions for the association between risk factors and probability of becoming infected (status positive)
  6. Run the STOC free model
  7. Analyse the model outputs

Test data

Example dataset

We demonstrate the use of the package by using a toy dataset called herdBTM which is included in the package. These are not real data and should not be used to draw any conclusion regarding BVD. This dataset contains the results of tests performed in 100 herds. Two different tests are used. The first test is an antibody ELISA performed on bulk tank milk. When this test is positive, a confirmatory test with a high specificity is performed one month later. The result of the ELISA test is an optical density ratio (continuous) which is converted into a negative or positive test result based on a threshold of 35. The dataset looks as follows:

head(herdBTM)
## # A tibble: 6 × 6
##   Farm  DateOfTest   ODR Test    TestResult LocalSeroPrev
##   <chr> <date>     <dbl> <chr>        <dbl>         <dbl>
## 1 FR001 2014-02-04  57.6 BTM_ODR          1          0.12
## 2 FR001 2014-03-01  NA   confirm          1          0.12
## 3 FR001 2014-09-10  66.5 BTM_ODR          1          0.1 
## 4 FR001 2014-10-01  NA   confirm          1          0.1 
## 5 FR001 2015-02-01  52.5 BTM_ODR          1          0.08
## 6 FR001 2015-03-01  NA   confirm          1          0.08

The columns of the dataset correspond to:

STOCfree_data objects

All the later analyses rely on the construction of a STOCfree_data object. As an example, a STOCfree_data object is constructed from the herdBTM dataset. To create this object, we use the STOCfree_data() function. Type ?STOCfree_data in the console to see the list of arguments that can be used.

sfd0 <- STOCfree_data(test_data = herdBTM,
                     test_herd_col = "Farm",
                     test_date_col = "DateOfTest",
                     test_res_col = "TestResult",
                     test_name_col = "Test",
                     status_dynamics_scale = "proba")

In this example, the function will gather the test data from the herdBTM dataset, which must be a data.frame or a tibble. Herd identifiers are looked up in the Farm column of the herdBTM dataset, dates of test in the DateOfTest column…

One argument worth mentioning is the status_dynamics_scale argument. By default, infection dynamics parameters, i.e. probabilities of infection on the first time step, acquiring and eliminating the infection between time steps, are modelled on the probability scale using Beta distributions (status_dynamics_scale = "proba"). Changing to status_dynamics_scale = "logit", these probabilities are modelled on the logit scale and normal prior distributions are used. In order to illustrate this point, we create a second STOCfree_data object with dynamics parameters on the logit scale.

sfd <- STOCfree_data(test_data = herdBTM,
                     test_herd_col = "Farm",
                     test_date_col = "DateOfTest",
                     test_res_col = "TestResult",
                     test_name_col = "Test",
                     status_dynamics_scale = "logit")

The STOCfree_data() function returns an object of class STOCfree_data.

class(sfd)
## [1] "herd_dynLogit" "STOCfree_data"

A STOCfree_data object is in fact a list of data.frames. Below we provide a brief explanation on the content of this list.

str(sfd)
## List of 8
##  $ var_names       : Named chr [1:4] "Farm" "DateOfTest" "TestResult" "Test"
##   ..- attr(*, "names")= chr [1:4] "test_herd_col" "test_date_col" "test_res_col" "test_name_col"
##  $ herd_id_corresp :'data.frame':    100 obs. of  2 variables:
##   ..$ Farm   : chr [1:100] "FR001" "FR002" "FR003" "FR004" ...
##   ..$ herd_id: int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
##  $ test_data       :'data.frame':    924 obs. of  6 variables:
##   ..$ status_id  : int [1:924] 1 2 8 9 13 14 20 21 25 26 ...
##   ..$ herd_id    : int [1:924] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ month_id   : int [1:924] 1 2 8 9 13 14 20 21 25 26 ...
##   ..$ status_type: num [1:924] 1 2 2 2 2 2 2 2 2 2 ...
##   ..$ test_id    : int [1:924] 1 2 1 2 1 2 1 2 1 2 ...
##   ..$ test_res   : num [1:924] 1 1 1 1 1 1 1 0 1 0 ...
##  $ herd_test_data  :'data.frame':    100 obs. of  3 variables:
##   ..$ herd_id: int [1:100] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ ind_i  : int [1:100] 1 34 67 100 133 166 199 232 265 298 ...
##   ..$ ind_p  : int [1:100] 33 66 99 132 165 198 231 264 297 330 ...
##  $ test_perf_prior :'data.frame':    2 obs. of  6 variables:
##   ..$ test   : chr [1:2] "BTM_ODR" "confirm"
##   ..$ test_id: int [1:2] 1 2
##   ..$ Se_a   : logi [1:2] NA NA
##   ..$ Se_b   : logi [1:2] NA NA
##   ..$ Sp_a   : logi [1:2] NA NA
##   ..$ Sp_b   : logi [1:2] NA NA
##  $ risk_factors    :'data.frame':    1 obs. of  3 variables:
##   ..$ risk_factor: chr "Intercept"
##   ..$ type       : chr "intercept"
##   ..$ modality   : logi NA
##  $ risk_factor_data:'data.frame':    3300 obs. of  4 variables:
##   ..$ status_id: int [1:3300] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ herd_id  : int [1:3300] 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ month_id : int [1:3300] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ intercept: num [1:3300] 1 1 1 1 1 1 1 1 1 1 ...
##  $ inf_dyn_priors  : Named logi [1:6] NA NA NA NA NA NA
##   ..- attr(*, "names")= chr [1:6] "logit_pi1_mean" "logit_pi1_sd" "logit_tau1_mean" "logit_tau1_sd" ...
##  - attr(*, "level")= chr "herd"
##  - attr(*, "status dynamics scale")= chr "logit"
##  - attr(*, "number of herds")= int 100
##  - attr(*, "number of tests")= int 2
##  - attr(*, "month first test")= chr "2014-02"
##  - attr(*, "month last test")= chr "2016-10"
##  - attr(*, "number of risk factors")= num 0
##  - attr(*, "class")= chr [1:2] "herd_dynLogit" "STOCfree_data"

The list components are:

Priors for test characteristics

The model accounts for the fact that the sensitivity and specificity of the different tests used can be below 1. Hypotheses about these parameters are included through the use of Beta distributions as priors for the different sensitivities and specificities. We can check the current values for the parameters associated with these distributions using the show_tests() function.

show_tests(sfd)
##      test test_id Se_a Se_b Sp_a Sp_b
## 1 BTM_ODR       1   NA   NA   NA   NA
## 2 confirm       2   NA   NA   NA   NA

Column names ending with _a and _b contain the alpha and beta values for Beta distributions. For example Se_a = 20 and Se_b = 2 define the prior distribution for sensitivity as Se \~ Beta(20, 2).

As can be seen, no value has been defined yet. This can be done as follows:

sfd <- set_priors_tests(sfd,
                 test = "BTM_ODR",
                 Se_a = 5000,
                 Se_b = 260,
                 Sp_a = 20,
                 Sp_b = 2)

sfd <- set_priors_tests(sfd,
                 test = "confirm",
                 Se_a = 20,
                 Se_b = 2,
                 Sp_a = 10000,
                 Sp_b = 1)

We can check that the priors have changed:

show_tests(sfd)
##      test test_id Se_a Se_b  Sp_a Sp_b
## 1 BTM_ODR       1 5000  260    20    2
## 2 confirm       2   20    2 10000    1

The prior distributions can be plotted with the plot_priors_tests() function.

plot_priors_tests(sfd)

In order to help selecting appropriate parameter values for the Beta distributions, we have designed a Shiny app which is available from Github. See https://github.com/AurMad/betadistapp

Priors for the model parameters related to status dynamics

These parameters are the probability of being status positive on the first time step (pi1), the probability of becoming status positive between consecutive time steps (tau1) and the probability of remaining status positive between consecutive time steps (tau2). When risk factors are included in the model, tau1 is modelled as a function of these risk factors and the parameters for the prior distribution of tau1 will not be taken into account.

Probability scale

For STOCfree_data objects constructed with the argument status_dynamics_scale = "proba", the prior distributions for status dynamics are provided as Beta distributions on the probability scale. The show_priors_status_dyn() shows the current parameter values for these Beta distributions and gives the syntax the use to change them.

show_priors_status_dyn(sfd0)
##  pi1_a  pi1_b tau1_a tau1_b tau2_a tau2_b 
##     NA     NA     NA     NA     NA     NA

## Set prior distributions for status dynamics using:
## sfd0 <- set_priors_status_dyn(sfd0, pi1_a = , pi1_b = , tau1_a = , tau1_b = , tau2_a = , tau2_b = )

Parameter values are provided for these prior distributions as follows:

sfd0 <- set_priors_status_dyn(sfd0,
                   pi1_a = 1,
                   pi1_b = 1,
                   tau1_a = 1.5,
                   tau1_b = 10,
                   tau2_a = 10,
                   tau2_b = 1.5)

These distributions can be plotted.

plot_priors_status_dyn(sfd0)

Logit scale

To demonstrate how to set dynamics parameters on the logit scale, we use the sfd1 object created above. As can be seen, the parameter values requested consist of means and standard deviations. In order to help selecting appropriate parameter values for the normal distributions on the logit scale, we have designed a Shiny app which is available from Github. See https://github.com/AurMad/logitnormdistapp

show_priors_status_dyn(sfd)
##  logit_pi1_mean    logit_pi1_sd logit_tau1_mean   logit_tau1_sd logit_tau2_mean 
##              NA              NA              NA              NA              NA 
##   logit_tau2_sd 
##              NA

## Set prior distributions for status dynamics using:
## sfd <- set_priors_status_dyn(sfd, logit_pi1_mean = , logit_pi1_sd = , logit_tau1_mean = , logit_tau1_sd = , logit_tau2_mean = , logit_tau2_sd = )

Values are provided.

sfd <- set_priors_status_dyn(sfd, 
                              logit_pi1_mean = 0, 
                              logit_pi1_sd = 1.5, 
                              logit_tau1_mean = -3, 
                              logit_tau1_sd = 1, 
                              logit_tau2_mean = 3, 
                              logit_tau2_sd = 1)

These distributions can be plotted.

plot_priors_status_dyn(sfd)

Running the STOC free model in Stan

The Stan implementation uses the forward algorithm for the estimation. The code was adapted from a tutorial on Hidden Markov models in Stan by Damiano et al. (2017). The model is run using the STOCfree_Stan() function. The main argument is a STOCfree_data object as created above. By default, the model outputs are stored in a folder called ‘STOCfree_files’. This output folder can be changed using through out_path argument.

sfm_stan <- STOCfree_Stan(sfd,
                      n_chains = 3,
                      n_iter = 1000,
                      n_thin = 1,
                      out_path = "STOCfree_Stan_1")
## Running MCMC with 3 sequential chains...
## 
## Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 1 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 1 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 1 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 1 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 1 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 1 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 1 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 1 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 1 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 1 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 1 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 1 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 1 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 1 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 1 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 1 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 1 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 1 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 1 finished in 46.6 seconds.
## Chain 2 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 2 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 2 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 2 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 2 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 2 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 2 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 2 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 2 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 2 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 2 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 2 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 2 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 2 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 2 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 2 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 2 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 2 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 2 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 2 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 2 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 2 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 2 finished in 48.7 seconds.
## Chain 3 Iteration:    1 / 2000 [  0%]  (Warmup) 
## Chain 3 Iteration:  100 / 2000 [  5%]  (Warmup) 
## Chain 3 Iteration:  200 / 2000 [ 10%]  (Warmup) 
## Chain 3 Iteration:  300 / 2000 [ 15%]  (Warmup) 
## Chain 3 Iteration:  400 / 2000 [ 20%]  (Warmup) 
## Chain 3 Iteration:  500 / 2000 [ 25%]  (Warmup) 
## Chain 3 Iteration:  600 / 2000 [ 30%]  (Warmup) 
## Chain 3 Iteration:  700 / 2000 [ 35%]  (Warmup) 
## Chain 3 Iteration:  800 / 2000 [ 40%]  (Warmup) 
## Chain 3 Iteration:  900 / 2000 [ 45%]  (Warmup) 
## Chain 3 Iteration: 1000 / 2000 [ 50%]  (Warmup) 
## Chain 3 Iteration: 1001 / 2000 [ 50%]  (Sampling) 
## Chain 3 Iteration: 1100 / 2000 [ 55%]  (Sampling) 
## Chain 3 Iteration: 1200 / 2000 [ 60%]  (Sampling) 
## Chain 3 Iteration: 1300 / 2000 [ 65%]  (Sampling) 
## Chain 3 Iteration: 1400 / 2000 [ 70%]  (Sampling) 
## Chain 3 Iteration: 1500 / 2000 [ 75%]  (Sampling) 
## Chain 3 Iteration: 1600 / 2000 [ 80%]  (Sampling) 
## Chain 3 Iteration: 1700 / 2000 [ 85%]  (Sampling) 
## Chain 3 Iteration: 1800 / 2000 [ 90%]  (Sampling) 
## Chain 3 Iteration: 1900 / 2000 [ 95%]  (Sampling) 
## Chain 3 Iteration: 2000 / 2000 [100%]  (Sampling) 
## Chain 3 finished in 46.6 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 47.3 seconds.
## Total execution time: 142.6 seconds.

Running the STOC free model in JAGS

The STOC free model is run using the STOCfree_JAGS() function. This function is a wrapper for the run.jags() function from the runjags package. The main argument is a STOCfree_data object as created above. By default, the models outputs are stored in a folder called ‘STOCfree_files’. This output folder can be changed using through out_path argument.

sfm_jags <- STOCfree_JAGS(sfd,
                      n_chains = 3,
                      n_burnin = 1000,
                      n_iter = 1000,
                      n_thin = 1,
                      out_path = "STOCfree_JAGS_1")
## Warning: You attempted to start parallel chains without setting different PRNG
## for each chain, which is not recommended. Different .RNG.name values have been
## added to each set of initial values.

## Calling 3 simulations using the parallel method...
## Following the progress of chain 1 (the program will wait for all chains
## to finish before continuing):
## Welcome to JAGS 4.3.1 on Mon Sep 12 11:16:13 2022
## JAGS is free software and comes with ABSOLUTELY NO WARRANTY
## Loading module: basemod: ok
## Loading module: bugs: ok
## . . Reading data file data.txt
## . Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 824
##    Unobserved stochastic nodes: 3406
##    Total graph size: 27912
## . Reading parameter file inits1.txt
## . Initializing model
## . Adapting 1000
## -------------------------------------------------| 1000
## ++++++++++++++++++++++++++++++++++++++++++++++++++ 100%
## Adaptation successful
## . Updating 1000
## -------------------------------------------------| 1000
## ************************************************** 100%
## . . . . . . . Updating 1000
## -------------------------------------------------| 1000
## ************************************************** 100%
## . . . . Updating 0
## . Deleting model
## . 
## All chains have finished
## Simulation complete.  Reading coda files...
## Coda files loaded successfully
## Note: Summary statistics were not produced as there are >50 monitored
## variables
## [To override this behaviour see ?add.summary and ?runjags.options]
## FALSEFinished running the simulation

## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.

## Warning: `spread_()` was deprecated in tidyr 1.2.0.
## Please use `spread()` instead.

Model results

Structure of the returned object

The Stan model returns an object of classes CmdStanMCMC, CmdStanFit and R6. Therefore, all methods and functions working with objects of these classes will work with outputs created by STOCfree_Stan().

The JAGS model returns an object of class runjags. Therefore, all methods and functions working with runjags objects will work with outputs created by STOCfree_JAGS().

The STOCfree package contains functions to explore the model outputs that work in the same way, regardless of whether the output was generated with Stan or JAGS.

STOCfree_model outputs

Two specific types of model outputs are of interest and can be extracted, either directly from the model output or loaded from the disk when STOCfree_model() was run using save_output = TRUE. These model outputs are the model parameters (Se, Sp, $\tau_1$, $\ldots$) and predicted probabilities of being status positive. Additionally, the JAGS model returns estimated monthly prevalences during the historical period.

Model parameters

By model parameters, we mean test characteristics, parameters related to infection dynamics and association with risk factors. These can be extracted from the model output:

param <- extract_STOCfree_param(sfm_stan)

or loaded from the disk

param <- read_STOCfree_param(out_path = "STOCfree_Stan_1")

These 2 functions create objects of class STOCfree_param for which print and plot methods are defined in the package.

print(param)
## MCMC samples from STOC free model parameters
## 
## Parameters: Se1, Se2, Sp1, Sp2, tau1, tau2 
## Number of chains: 3 
## Number of iterations per chain: 1000 
## Number of draws: 3000

Traceplots can be generated to check convergence visually.

plot(param, parameter = "Se1", type = "traceplot")

Density plots can be generated.

plot(param, parameter = "Sp1", type = "density")

Summary values for each parameter can be obtained as follows:

summary(param)
## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

##            mean           sd     median      2.5%      97.5%      ess
## Se1  0.95021170 2.944234e-03 0.95028050 0.9443617 0.95558250 4531.024
## Se2  0.88369697 2.002153e-02 0.88420750 0.8427434 0.92128647 4754.958
## Sp1  0.98564383 9.752482e-03 0.98770150 0.9609376 0.99824453 3694.541
## Sp2  0.99990148 9.649939e-05 0.99992900 0.9996370 0.99999700 5730.500
## pi1  0.52999931 5.037150e-02 0.52984800 0.4306047 0.62888923 4787.779
## tau1 0.04605133 6.832599e-03 0.04581995 0.0339436 0.06070055 4186.314
## tau2 0.96378436 6.547002e-03 0.96410600 0.9499575 0.97548610 3980.593

Predicted probabilities of infection

MCMC samples for the herd level predicted probabilities of infection can also be extracted or loaded.

pred  <- extract_STOCfree_pred(sfm_stan, sfd)

or

pred <- read_STOCfree_pred("STOCfree_Stan_1")

Calling print on these objects will display the list of herds.

print(pred)
## MCMC samples from STOC free model herd level predicted probabilities of infection
## 
## Number of herds: 100 
## 
## Herds: FR001 FR002 FR003 FR004 FR005 FR006 FR007 FR008 FR009 FR010 FR011 FR012 FR013 FR014 FR015 FR016 FR017 FR018 FR019 FR020 FR021 FR022 FR023 FR024 FR025 FR026 FR027 FR028 FR029 FR030 FR031 FR032 FR033 FR034 FR035 FR036 FR037 FR038 FR039 FR040 FR041 FR042 FR043 FR044 FR045 FR046 FR047 FR048 FR049 FR050 FR051 FR052 FR053 FR054 FR055 FR056 FR057 FR058 FR059 FR060 FR061 FR062 FR063 FR064 FR065 FR066 FR067 FR068 FR069 FR070 FR071 FR072 FR073 FR074 FR075 FR076 FR077 FR078 FR079 FR080 FR081 FR082 FR083 FR084 FR085 FR086 FR087 FR088 FR089 FR090 FR091 FR092 FR093 FR094 FR095 FR096 FR097 FR098 FR099 FR100

Calling plot on these predicted probabilities will plot the density of predicted probabilities for all herds across all iterations.

plot(pred)

It is also possible to plot the predicted probabilities of infection for specific herds:

plot(pred, herd = c("FR001", "FR002"), type = "individual", legend = TRUE)

A summary method is implemented for these predicted probabilities. It is possible to specify for which herds a summary is required.

summary(pred, herd = c("FR001", "FR003"))
##    herd  mean    sd median  2.5% 97.5%
## 1 FR001 0.129 0.019  0.128 0.097 0.172
## 2 FR003 0.062 0.009  0.062 0.045 0.082

Monthly prevalences

This is only available for the JAGS model. The predicted monthly prevalences of infection can be extracted or loaded using:

prev <- extract_STOCfree_month_prev(sfm_jags, sfd)

or

prev <- read_STOCfree_month_prev("STOCfree_JAGS_1")

Calling plot on this object will create boxplots of estimated monthly infection prevalences.

plot(prev)

Inclusion of risk factors

Risk factors are included with the objective of detecting infection earlier when the interval between tests is long, as is usually the case. They can also improve the performance of the prediction of the probability of being status positive when test performance is not good.

The association between the probability of becoming status positive (tau1) and risk factors is modelled with logistic regression. The uncertainty in status as measured by test result is accounted for in the regression through test sensitivity and specificity.

Selection of risk factors to include in the Bayesian model

Bayesian inference in JAGS can take a lot of time. This time increases with the size of the dataset modelled as well as with the number of MCMC iterations. One way to overcome this problem is to force covariates in the model based on what is known about the disease. In this case, but is also possible to include informative prior distributions for the coefficients of the logistic regression between tau1 and risk factors.

The STOCfree package contains a set of functions designed to help in the selection of risk factors from the data. For this purpose, it is considered that all herds with a negative test result are eligible for becoming test positive on the next test. The probability of becoming test positive (versus remaining test negative) is modelled with logistic regression. A further thing that is considered is the fact that risk factors occurring at one point in time may be associated with becoming test positive later.

To show how to use the available functions, we use the intro dataset which is included in the package.

intro
## # A tibble: 1,324 × 3
##    Farm  dateIntr   nAnim
##    <chr> <fct>      <int>
##  1 FR001 2010-01-22     1
##  2 FR001 2015-05-04    14
##  3 FR002 2011-02-11     2
##  4 FR002 2010-12-23     5
##  5 FR003 2012-11-07     3
##  6 FR003 2016-05-25     5
##  7 FR005 2016-03-31    13
##  8 FR005 2016-04-20    17
##  9 FR008 2012-06-04     2
## 10 FR008 2010-04-30     1
## # … with 1,314 more rows

This dataset contains the number of animals purchased as well as the dates of introduction for the 100 herds in the herdBTM data. Using these datasets, the probability of seroconversion (becoming test positive between 2 tests in a given herd) is modelled as a function of the number of animals purchased between the month of seroconversion (month 0) and 24 months earlier. The time of seroconversion is taken as the midpoint between the 2 test dates considered.

nAnim_lagged <- logit_nwinf_lagged(
  sf_data = sfd,
  rf_data = intro,
  rf_date_col = "dateIntr",
  rf_col = "nAnim",
  time_of_inf = "mid",
  lag1 = 0,
  lag2 = 24)

The function returns a table with the AICs of all the models evaluated.

str(nAnim_lagged)
## 'data.frame':    325 obs. of  4 variables:
##  $ lag1: int  0 0 1 0 1 2 0 1 2 3 ...
##  $ lag2: int  0 1 1 2 2 2 3 3 3 3 ...
##  $ l   : int  0 1 0 2 1 0 3 2 1 0 ...
##  $ AIC : num  366 365 365 365 365 ...
##  - attr(*, "out.attrs")=List of 2
##   ..$ dim     : Named int [1:4] 25 25 1 1
##   .. ..- attr(*, "names")= chr [1:4] "lag1" "lag2" "l" "AIC"
##   ..$ dimnames:List of 4
##   .. ..$ lag1: chr [1:25] "lag1= 0" "lag1= 1" "lag1= 2" "lag1= 3" ...
##   .. ..$ lag2: chr [1:25] "lag2= 0" "lag2= 1" "lag2= 2" "lag2= 3" ...
##   .. ..$ l   : chr "l=NA"
##   .. ..$ AIC : chr "AIC=NA"

This can be plotted using the code below. From the Figure below it can be seen that between the worst (AIC > 350) and the best (AIC \< 342) model, the difference in AIC is rather limited. There is a drop in AIC for intervals starting from 8 months before seroconversion.

ggplot(data = nAnim_lagged, aes(x = lag2, y = lag1, fill = AIC)) +
  geom_tile() +
  xlab("Time Lag 2 (months)") +
  ylab("Time Lag 1 (months)") +
  scale_fill_gradient(low = "red", high = "yellow", aesthetics = "fill") +
  ggtitle("Number of animals purchased")

When the function is used to evaluate several candidate variables, all these variables can be included in a multivariate logistic model. For this, we create a dataset with the different variables of interest.

nwinf <- make_nwinf_data(sfd,
                         time_of_inf = "mid")

Here, we include the number of animals purchased 8 months before the test of interest. This is done by calling the add_risk_factor().

nwinf <- add_risk_factor(nwinf,
                         intro,
                         rf_col = "nAnim",
                         rf_date_col = "dateIntr",
                         lag1 = 8,
                         lag2 = 8)

The multivariate logistic model is estimated with R glm() function called by logit_nwinf() function from the STOCfree package:

modl <- logit_nwinf(nwinf,
                    risk_factors = "nAnim_8_8")

The model results are:

summary(modl)
## 
## Call:
## glm(formula = formula, family = binomial(link = "logit"), data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.631  -0.849  -0.849   1.546   1.546  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.83502    0.13206  -6.323 2.56e-10 ***
## nAnim_8_8    0.12384    0.05958   2.078   0.0377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 361.60  on 288  degrees of freedom
## Residual deviance: 356.94  on 287  degrees of freedom
## AIC: 360.94
## 
## Number of Fisher Scoring iterations: 4

Inclusion of risk factors in the Bayesian model

The STOCfree_data() object is updated to include the risk factors selected for the analysis. This is done with the sf_add_risk_factor() function.

sfd <- sf_add_risk_factor(
  sfd = sfd,
  risk_factor_data = intro,
  risk_herd_col = "Farm",
  risk_date_col = "dateIntr",
  risk_factor_col = "nAnim",
  risk_factor_type = "continuous",
  lag1 = 8,
  lag2 = 8,
  FUN = sum)
## Joining, by = "Farm"
## Joining, by = "date__1"
## Joining, by = c("herd_id", "lagged_month")

Doing this will result in the priors required for infection related parameters to be different from above. The priors for tau1 are not needed any more. They are replaced by prior distributions for the logistic regression.

show_priors_status_dyn(sfd)
##  logit_pi1_mean    logit_pi1_sd logit_tau2_mean   logit_tau2_sd 
##             0.0             1.5             3.0             1.0

Priors for the logistic regression

The Bayesian model models the probability of becoming status positive (usually getting infected) as a function of one or several risk factors, with logistic regression. Priors for the coefficients in the logistic regression need to be provided. The prior distributions used are normal distributions on the logit scale. In our case, we need to provide prior distributions for the intercept as well as for coefficient associated with the number of animals purchased 8 months before the month of surveillance. Below is the list of covariates included in the STOCfree_data so far.

show_rf(sfd)
##   risk_factor       type modality
## 1   Intercept  intercept       NA
## 2   nAnim_8_8 continuous       NA
sfd <- set_priors_rf(sfd,
                   risk_factor = "Intercept",
                   mean = -3, sd = 1)


sfd <- set_priors_rf(sfd,
                   risk_factor = "nAnim_8_8",
                   mean = 0, sd = 2)

This has updated the list of covariates with the normal distribution parameters.

show_rf(sfd)
##   risk_factor       type modality mean_prior sd_prior
## 1   Intercept  intercept       NA         -3        1
## 2   nAnim_8_8 continuous       NA          0        2

These priors can be plotted on the probability scale. The prior distribution for the intercept implies that in herds buying no cows, the probability of becoming positive is most likely below 0.4, with most of the density below 0.2.

plot_priors_rf(sfd)

Running the Bayesian model

This is done as explained above by calling the STOCfree_Stan() or STOCfree_JAGS() functions. In this case, estimates for the logistic regression parameters will appear as theta1 (model intercept) and theta2 (coefficient for the nAnim_8_8 variable).

Below is a complete example.

## Creation of the STOCfree_data object
sfd <- STOCfree_data(test_data = herdBTM,
                     test_herd_col = "Farm",
                     test_date_col = "DateOfTest",
                     test_res_col = "TestResult",
                     test_name_col = "Test",
                     status_dynamics_scale = "logit")

## Priors for test characteristics
sfd <- set_priors_tests(sfd,
                 test = "BTM_ODR",
                 Se_a = 5000,
                 Se_b = 260,
                 Sp_a = 20,
                 Sp_b = 2)

sfd <- set_priors_tests(sfd,
                 test = "confirm",
                 Se_a = 20,
                 Se_b = 2,
                 Sp_a = 10000,
                 Sp_b = 1)

## Prior for infection dynamics
sfd <- set_priors_status_dyn(sfd, 
                              logit_pi1_mean = 0, 
                              logit_pi1_sd = 4, 
                              logit_tau2_mean = 3, 
                              logit_tau2_sd = 1)
## Set prior distributions for status dynamics using:
## sfd <- set_priors_status_dyn(sfd, logit_tau1_mean = , logit_tau1_sd = )
## Adding a risk factor to the TSOCfree_data object
sfd <- sf_add_risk_factor(
  sfd = sfd,
  risk_factor_data = intro,
  risk_herd_col = "Farm",
  risk_date_col = "dateIntr",
  risk_factor_col = "nAnim",
  risk_factor_type = "continuous",
  lag1 = 8,
  lag2 = 8,
  FUN = sum)
## Joining, by = "Farm"

## Joining, by = "date__1"
## Joining, by = c("herd_id", "lagged_month")
## Priors for risk factors association with log-odds of becoming positive
sfd <- set_priors_rf(sfd,
                   risk_factor = "Intercept",
                   mean = -3, sd = 1)


sfd <- set_priors_rf(sfd,
                   risk_factor = "nAnim_8_8",
                   mean = 0, sd = 2)

Stan version:

set.seed(124)
sfm_stan_rf <- STOCfree_Stan(sfd,
                      n_chains = 3,
                      n_warmup = 10,
                      n_iter = 100,
                      n_thin = 1,
                      out_path = "STOCfree_Stan_2")
## Running MCMC with 3 sequential chains...
## 
## Chain 1 WARNING: No variance estimation is 
## Chain 1          performed for num_warmup < 20 
## Chain 1 Iteration:   1 / 110 [  0%]  (Warmup) 
## Chain 1 Iteration:  11 / 110 [ 10%]  (Sampling) 
## Chain 1 Iteration: 110 / 110 [100%]  (Sampling) 
## Chain 1 finished in 11.9 seconds.
## Chain 2 WARNING: No variance estimation is 
## Chain 2          performed for num_warmup < 20 
## Chain 2 Iteration:   1 / 110 [  0%]  (Warmup) 
## Chain 2 Iteration:  11 / 110 [ 10%]  (Sampling) 
## Chain 2 Iteration: 110 / 110 [100%]  (Sampling) 
## Chain 2 finished in 20.1 seconds.
## Chain 3 WARNING: No variance estimation is 
## Chain 3          performed for num_warmup < 20 
## Chain 3 Iteration:   1 / 110 [  0%]  (Warmup) 
## Chain 3 Iteration:  11 / 110 [ 10%]  (Sampling) 
## Chain 3 Iteration: 110 / 110 [100%]  (Sampling) 
## Chain 3 finished in 9.2 seconds.
## 
## All 3 chains finished successfully.
## Mean chain execution time: 13.7 seconds.
## Total execution time: 41.6 seconds.

## Warning: 9 of 300 (3.0%) transitions ended with a divergence.
## See https://mc-stan.org/misc/warnings for details.

JAGS version

sfm_jags_rf <- STOCfree_JAGS(sfd,
                      n_chains = 3,
                      n_burnin = 1000,
                      n_iter = 1000,
                      n_thin = 1,
                      out_path = "STOCfree_JAGS_2")
## Warning: You attempted to start parallel chains without setting different PRNG
## for each chain, which is not recommended. Different .RNG.name values have been
## added to each set of initial values.

## Calling 3 simulations using the parallel method...
## Following the progress of chain 1 (the program will wait for all chains
## to finish before continuing):
## Welcome to JAGS 4.3.1 on Mon Sep 12 11:18:53 2022
## JAGS is free software and comes with ABSOLUTELY NO WARRANTY
## Loading module: basemod: ok
## Loading module: bugs: ok
## . . Reading data file data.txt
## . Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 824
##    Unobserved stochastic nodes: 3407
##    Total graph size: 37768
## . Reading parameter file inits1.txt
## . Initializing model
## . Adapting 1000
## -------------------------------------------------| 1000
## ++++++++++++++++++++++++++++++++++++++++++++++++++ 100%
## Adaptation successful
## . Updating 1000
## -------------------------------------------------| 1000
## ************************************************** 100%
## . . . . . . . Updating 1000
## -------------------------------------------------| 1000
## ************************************************** 100%
## . . . . Updating 0
## . Deleting model
## . 
## All chains have finished
## Simulation complete.  Reading coda files...
## Coda files loaded successfully
## Note: Summary statistics were not produced as there are >50 monitored
## variables
## [To override this behaviour see ?add.summary and ?runjags.options]
## FALSEFinished running the simulation

Summary of parameters estimated with the Stan model.

param <- extract_STOCfree_param(sfm_stan_rf)

summary(param)
## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

## Warning: Dropping 'draws_df' class as required metadata was removed.

##              mean           sd      median       2.5%       97.5%       ess
## Se1     0.9499266 3.138467e-03  0.95014050  0.9437878  0.95555128 534.92502
## Se2     0.8822056 2.033061e-02  0.88216400  0.8436933  0.91806948 234.79601
## Sp1     0.9852370 9.566092e-03  0.98677450  0.9643835  0.99876630  35.12823
## Sp2     0.9998894 9.548877e-05  0.99992050  0.9996453  0.99999500  48.62949
## pi1     0.5348903 5.309744e-02  0.53516300  0.4268829  0.63639030 311.47774
## tau2    0.9645711 6.932421e-03  0.96434450  0.9511997  0.97729753 246.65370
## theta1 -3.0481417 1.714629e-01 -3.04740000 -3.3983385 -2.71569500 281.12591
## theta2 -0.5137474 9.223785e-01 -0.09879875 -3.3216173  0.02813575  53.13678


AurMad/STOCfree documentation built on Sept. 13, 2022, 3:20 a.m.