HaldDP: Builds a HaldDP source attribution model

Description Usage Arguments Format Value Description HaldDP Object Methods Details Author(s) References Examples

View source: R/interface.R

Description

Builds a HaldDP source attribution model

Usage

1
HaldDP(y, x, k, priors, a_q, inits = NULL)

Arguments

y

a Y object containing case observations

x

an X object containing source observations

k

a Prev object containing source prevalences

priors

priors list with elements named a_r, a_alpha, a_theta and b_theta, corresponding to the prior parameters for the r, alpha, and base distribution for the DP parameters respectively.

Parameter Prior Distribution Prior Parameters
a_r Dirichlet(concentration) A single positive number or an X
object containing the prior values for each source,
time and type. If a single number is supplied,
it will be used for all times, sources and types.
a_alpha Dirichlet(concentration) A single positive number or an Alpha
object containing the prior values for each source,
time and location. If a single number is supplied,
it will be used for all times, sources and locations.
Type effects base Gamma(shape, rate) Single number for each of the shape (a_theta) and
distribution parameters rate (b_theta) of the Gamma base distribution.
a_q

the Dirichlet Process concentration parameter.

inits

initial values for the mcmc algorithm. This is an optional list that may contain any of the following items: alpha,q, and r.

Parameter Description
r An object of type X giving the initial values for $R$ matrix,
If not specified defaults to the element-wise maximum likelihood
estimates of r from the source matrix.
Source effects (alpha) An object of type Alpha specifying alpha value for each source/time/location.
If not specified, default initial values
for the source effects are drawn from the prior distribution.
Type effects (q) An object of type Q giving the initial clustering and values for q
If not specified, defaults to a single group with a theta value calculated as
θ = sum(y_itl) / sum_l=1^L(sum_t=1^T(sum_i=1^n(sum_j=1^m(alpha_jtl * r_ijt * k_jt)))).
i.e. theta = sum(y_itl) / sum(lambda_ijtl / theta)

Format

Object of R6Class with methods for creating a HaldDP model, running the model, and accessing and plotting the results.

Value

Object of HaldDP with methods for creating a HaldDP model, running the model, and accessing and plotting the results.

Description

This function fits a non-parametric Poisson source attribution model for human cases of disease. It supports multiple types, sources, times and locations. The number of human cases for each type, time and location follow a Poisson likelihood.

HaldDP Object Methods

mcmc_params(n_iter = 1000, burn_in = 0, thin = 1, n_r = ceiling(private$nTypes * 0.2), update_schema = c('q','alpha','r'))

when called, sets the mcmc parameters.

n_iter sets the number of iterations returned (after removing burn_in and thinning results by thin i.e. a total of (n_iter * thin) + burn_in iterations are run)

n_r is a positive integer that sets the total number of r_{ijtl} parameters to be updated at each time-location-source combination (the default is 20 percent updated per iteration)

update_schema a character vector containing the parameters to update (any of 'q','alpha','r').

update(n_iter, append = TRUE)

when called, updates the HaldDP model by running n_iter iterations.

If missing n_iter, the n_iter last set using mcmc_params() or update() is used.

append is a logical value which determines whether the next n_iter iterations are appended to any previous iterations, or overwrites them. When append = TRUE, the starting values are the last iteration and no burn_in is removed. Running the model for the first time, or changing any model or fitting parameters will set append = FALSE.

get_data

returns a list containing the human data y (an array y[types, times, locations]), the source data X (an array X[types, sources, times]), the prevalence data (an array k[sources, times]), the type names, source names, time names, location names and number of different types, sources, times and locations.

get_priors

returns a list containing the DP concentration parameter a_q, and the priors (R6 class with members named a_alpha (members are array a_alpha[sources, times, locations]), a_r (an array a_r[types, sources, times]), a_theta and b_theta).

get_inits

returns an R6 class holding the initial values (members are alpha (an array alpha[sources, times, locations]), theta (an array theta[types, iters]), s (an array s[types, iters]), and r (an array r[types, sources, times])).

get_mcmc_params

returns a list of fitting parameters (n_iter, append, burn_in, thin, update_schema (R6 class with members alpha, q, r)).

get_acceptance

returns an R6 class containing the acceptance rates for each parameter (members are alpha (an array alpha[sources, times, locations]), and r (an array r[types, sources, times])).

extract(params = c("alpha", "q", "s", "r", "lambda_i", "xi", "xi_prop"), times = NULL, locations = NULL, sources = NULL, types = NULL, iters = NULL, flatten = FALSE, drop = TRUE)

returns a list contining a subset of the parameters (determined by the params vector, times, locations, sources, types and iters).

If flatten is set to TRUE, it returns a dataframe with 1 column per parameter, otherwise it returns a list containing params containing a subset of the following arrays: alpha[Sources, Times, Locations, iters], q[Types, iters], s[Types, iters], r[Types, Sources, Times, iters], lambda_i[Types, Times, Locations, iters], xi[Sources, Times, Locations, iters].

drop determines whether to delete the dimensions of an array which have only one level when flatten = FALSE.

summary(alpha = 0.05, params = c("alpha", "q", "s", "r", "lambda_i", "xi" ,"xi_prop"), times = NULL, locations = NULL, sources = NULL, types = NULL, iters = NULL, flatten = FALSE, drop = TRUE, CI_type = "chen-shao")

returns a list contining the median and credible intervals for a subset of the parameters. The default credible interval type are Chen-Shao ("chen-shao") highest posterior density intervals (alternatives are "percentiles" and "spin"). See extract for details on the subsetting. xi_prop returns the proportion of cases attributed to each source j and is calculated by dividing each iteration of lambda_{jtl} values by their sum within each time t and location l.

plot_heatmap(iters, cols = c("blue","white"), hclust_method = "complete")

Creates a dendrogram and heatmap for the type effect groupings (s parameter in the model). This uses the heatmap.2 function from gplots.

iters is a vector containing the iterations to be used in constructing the graph. Default is all iterations in posterior.

hclust_method allows the user to select the method used by stats::hclust to cluster the type effect groupings s.

cols gives the colours for completely dissimilar (dissimilarity value of 1), and identical (dissimilarity value of 0). All other values will be in between the two chosen colours. See ?colorRampPalette for more details..

Details

This function fits a source attribution model for human cases of disease. It supports multiple types, sources, times and locations. The number of human cases for each type, time and location follows a Poisson or Negative Binomial likelihood. Model

y_{itl}\sim\textsf{Poisson}(λ_{itl})

where

λ_{itl}=∑_{j=1}^{m}λ_{ijtl}=q_{k(i)}∑_{j=1}^{m}(r_{ijt}\cdot k_{j}\cdot alpha_{jtl})

The parameters are defined as follows:

a_{jtl}

is the unknown source effect for source j, time t, location l

q_{s(i)}

is the unknown type effect for type i in group s.

x_{ij}

is the known number of positive samples for each source j typei combination

n_{ij}

is the known total number of samples for each source j type i combination

k_{j}

is the fixed prevalence in source (i.e. the number of positive samples divided by the number of negative samples) j

r_{ijt}

is the unknown relative occurrence of type i on source j.

Priors

r_{.jt}\sim Dirichlet(a\_r_{1jt},..., a\_r_{njt})

a_{tl}\sim Dirichlet(a\_alpha_{1tl},..., a\_alpha_{mtl})

q\sim DP(a_q, Gamma(a_{theta},b_{theta}))

Author(s)

Chris Jewell and Poppy Miller p.miller at lancaster.ac.uk

References

Chen, M.-H. and Shao, Q.-M. (1998). Monte Carlo estimation of Bayesian credible and HPD intervals, Journal of Computational and Graphical Statistics, 7.

Liu Y, Gelman A, Zheng T (2015). Simulation-efficient shortest probability intervals, Statistics and Computing.

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
#### Format data using Y, X, and Prev functions #############################
## Input data must be in long format
y <- Y(                      # Cases
  data = sim_SA$cases,
  y = "Human",
  type = "Type",
  time = "Time",
  location = "Location"
)

x <- X(                      # Sources
  data = sim_SA$sources,
  x = "Count",
  type = "Type",
  time = "Time",
  source = "Source"
)

k <- Prev(                   # Prevalences
  data = sim_SA$prev,
  prev = "Value",
  time = "Time",
  source = "Source"
)

#### Create Dirichlet(1) priors #############################################

## Create alpha prior data frame
prior_alpha_long <- expand.grid(
  Source   = unique(sim_SA$sources$Source),
  Time     = unique(sim_SA$sources$Time),
  Location = unique(sim_SA$cases$Location),
  Alpha    = 1
)
# Use the Alpha() constructor to specify alpha prior
prior_alpha <- Alpha(
  data     = prior_alpha_long,
  alpha    = 'Alpha',
  source   = 'Source',
  time     = 'Time',
  location = 'Location'
)

## Create r prior data frame
prior_r_long <- expand.grid(
  Type   = unique(sim_SA$sources$Type),
  Source = unique(sim_SA$sources$Source),
  Time   = unique(sim_SA$sources$Time),
  Value  = 0.1
)
# Use X() constructor to specify r prior
prior_r <- X(
  data   = prior_r_long,
  x      = 'Value',
  type   = 'Type',
  time   = 'Time',
  source = 'Source'
)

## Pack all priors into a list
priors <- list(
  a_theta = 0.01,
  b_theta = 0.00001,
  a_alpha = prior_alpha,
  a_r     = prior_r
)

## If all prior values are the same, they can be specified in shorthand
## Equivalent result to the longform priors specified above
priors <- list(
  a_theta = 0.01,
  b_theta = 0.00001,
  a_alpha = 1,
  a_r     = 0.1
)

#### Set initial values (optional) ##########################################
types  <- unique(sim_SA$cases$Type)
q_long <- data.frame(q=rep(15, length(types)), Type=types)
init_q <- Q(q_long, q = 'q', type = 'Type')
inits <- list(q = init_q) # Pack starting values into a list

#### Construct model ########################################################
my_model <- HaldDP(y = y, x = x, k = k, priors = priors, inits = inits, a_q = 0.1)

#### Set mcmc parameters ####################################################
my_model$mcmc_params(n_iter = 2, burn_in = 2, thin = 1)

#### Update model ###########################################################
my_model$update()
## Add an additional 10 iterations
my_model$update(n_iter = 2, append = TRUE)

#### Extract posterior ######################################################
## returns the posterior for the r, alpha, q, c,
## lambda_i, xi and xi_prop parameters,
## for all times, locations, sources and types
## the posterior is returned as a list or arrays
## Not run: str(my_model$extract())

## returns the posterior for the r and alpha parameters,
## for time 1, location B, sources Source3, and Source4,
## types 5, 25, and 50, and iterations 200:300
## the posterior is returned as a list of dataframes
## Not run: 
str(my_model$extract(params = c("r", "alpha"),
                 times = "1", location = "B",
                 sources = c("Source3", "Source4"),
                 types = c("5", "25", "50"),
                 iters = 5:15,
                 flatten = TRUE))

## End(Not run)

#### Calculate medians and credible intervals ###############################
## Not run: my_model$summary(alpha = 0.05, CI_type = "chen-shao")
## subsetting is done in the same way as extract()
## Not run: my_model$summary(alpha = 0.05, CI_type = "chen-shao",
                 params = c("r", "alpha"),
                 times = "1", location = "B",
                 sources = c("Source3", "Source4"),
                 types = c("5", "25", "50"),
                 iters = 5:15,
                 flatten = TRUE)

## End(Not run)

#### Plot heatmap and dendrogram of the type effect grouping ################
my_model$plot_heatmap()

#### Extract data, initial values, prior values, acceptance
## rates for the mcmc algorithm and mcmc parameters
my_model$get_data()
my_model$get_inits()
my_model$get_priors()
my_model$get_acceptance()
my_model$get_mcmc_params()

sourceR documentation built on Aug. 31, 2020, 5:06 p.m.

Related to HaldDP in sourceR...