knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE
)
devtools::load_all()
library(knitr)
library(ggplot2)
library(kableExtra)
## Only need to do this step once per session
remove.packages("TBornotTB")
devtools::install_github("skgallagher/TBornotTB")  ## Get the latest version from github
library(TBornotTB)

Purpose

In this vignette, we explore the generation scheme and corresponding likelihood for clusters where the root node is unobserved.

Background

In the previous vignette, we explored the significance of a covariate when we assumed that all infections within a cluster could be traced back to an observed infection also within that cluster. That is, we assumed our transmission tree had an observed root node to which a path could always be traced.

We now assume that the root node is unobserved.

Generation scheme

We simulate outbreak data via the following procedure.

For each cluster $k = 1, \dots, K$:

  1. Randomly draw a covariate for the index case (+/-) and assign it to generation 0, index 1.
  2. Flip a weighted coin based on the covariate and record the number $J_{i,j}$, where $J_{i,j}$ is the number of individuals infected by the person at generation $i$, index $j$. Here, $J_{i,j}$ is a geometric draw with probability of successful infection $p_{ij}$, where $p_{ij}$ is a function of the covariate status (+/-) of the individual at generation $i$, index $j$, $$ J_{i,j} \sim \mathrm{Geometric}(p_{ij}) $$
  3. Assign these individuals indices $(1, j)$ for $j = 1, \dots, J_{0,1}$.
  4. While ($1 + \sum_{i,j} J_{i,j} <$ max cluster size):
For i = 1, 2, ...
  For j = 1, 2, ...
    cov[i,j] <- draw_covariate()
    p[i,j]   <- prob_from_covariate(cov[i,j])
    J[i,j]   <- Geometric(p[i,j])
    if(J[i,j] > 0){
      if(j == 1){
        Generate_individuals (i+1, 1), ..., (i+1, J[i,j])
      } else {
        Generate_individuals (i+1, J[i,j-1] + 1), ..., (i+1, J[i,j-1] + J[i,j])
      }
      Store_infector_ID()
    }
return (indices, infector IDs, covariates)
  5. Remove individual (0,1) from the cluster.
  6. Summarize the (root-less) cluster as $(n,x)$, where $n$ is the size of the cluster and $x$ is the number of positive smears in the cluster.

Essentially, this is a repeated coin-flipping procedure in which we create a new individual for every successful flip. Once a flip is unsuccessful, we move to the next indexed individual within the same generation. If there are no more individuals within the generation, we move on to the first individual of the next generation. We continue until no new individuals are created.
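The coin-flipping scheme above can be sketched as a small breadth-first simulation. All function and argument names here are illustrative, not part of TBornotTB's API, and the within-generation indexing is simplified:

```r
## Toy sketch of the generation scheme: each individual draws a
## Geometric number of offspring whose success probability depends on
## their (+/-) covariate; individuals are processed breadth-first
## (within a generation, then on to the next) until the cluster cap.
simulate_cluster <- function(p_plus, p_minus, max_size = 50) {
  covariate <- sample(c("+", "-"), 1)  # index case covariate
  cluster <- data.frame(gen = 0, index = 1, cov = covariate,
                        infector = NA, stringsAsFactors = FALSE)
  frontier <- 1  # row numbers of individuals yet to transmit (FIFO)
  while (length(frontier) > 0 && nrow(cluster) < max_size) {
    row <- frontier[1]
    frontier <- frontier[-1]
    p <- if (cluster$cov[row] == "+") p_plus else p_minus
    ## P(J = j) = p^j (1 - p): a geometric count of successful flips
    n_offspring <- rgeom(1, prob = 1 - p)
    n_offspring <- min(n_offspring, max_size - nrow(cluster))
    if (n_offspring > 0) {
      kids <- data.frame(gen = cluster$gen[row] + 1,
                         index = seq_len(n_offspring),  # simplified indexing
                         cov = sample(c("+", "-"), n_offspring, replace = TRUE),
                         infector = row, stringsAsFactors = FALSE)
      cluster <- rbind(cluster, kids)
      frontier <- c(frontier,
                    seq.int(nrow(cluster) - n_offspring + 1, nrow(cluster)))
    }
  }
  cluster
}

set.seed(1)
cl <- simulate_cluster(p_plus = 0.6, p_minus = 0.3)
```

Removing row 1 of the returned data frame then yields the root-less cluster summarized in steps 5-6.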

Likelihood

The likelihood for an observed cluster of size $n$ with $x$ positive smears, where the outside generator is unobserved, depends on the smear status of the outside generator and the number of root nodes it generates,

$$ L_O(p_+, p_-;n,x) = \sum_{x_0}P(x_0) \sum_{n_0=1}^n P(n_0)\sum_{\mathrm{parts}\; \boldsymbol{n}^\prime, \boldsymbol{x}^\prime} \sum_{i=1}^{n_0} L_1(p_+,p_-;n_i^\prime,x_i^\prime) $$

This is a recursive expression where $L_1(\cdot; n, x)$ is the likelihood contribution of a fully observed tree of size $n$ and positive smears $x$ where the $1$ subscript denotes that the generator is observed.

Intuitively, each separate infection created by the outside generator forms its own infection sub-tree, and the sizes of these sub-trees sum to the total cluster size $n$.

In the no-outsider case, we were able to approximate $L_1(\cdot; n,x)$ by averaging the likelihood contribution from samples of uniform trees. We are able to extend this sampling/averaging process here to the outside generator scenario.

That is, to approximate $L_O(\cdot; n,x)$, we repeatedly draw samples of uniform trees of the form $(x_0, n_0, \boldsymbol{n}^\prime, \boldsymbol{x}^\prime)^{(b)}$, $b = 1, \dots, B$,

$$ E[L_O(p_+, p_-;n,x)] \approx \frac{1}{B} \sum_{b=1}^{B} P(x_0^{(b)}) P(n_0^{(b)}) \sum_{i=1}^{\mathrm{length}(\boldsymbol{x}^{\prime, (b)})} \bar{L}_1(p_+,p_-;n_i^{\prime, (b)},x_i^{\prime, (b)}), $$

where a smear-positive root occurs with the same probability as a smear-negative root,

$$ P(x_0 = +) = \frac{1}{2}, $$

and the number of root infections is a doubly truncated geometric random draw,

$$ P(n_0 = y;p_0, n,x) = \frac{(1-p_0)p_0^y}{\sum_{i=1}^{n}(1-p_0)p_0^i}, \quad y = 1, \dots, n, $$

where $p_0$ is the transmission probability corresponding to the root's smear status $x_0$.
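The doubly truncated geometric draw can be implemented directly from its pmf. This is a hypothetical sketch; `dtrunc_geom` and `rtrunc_geom` are illustrative names, not the package's:

```r
## pmf of the doubly truncated geometric number of root infections,
## n0 in {1, ..., n}; p0 is the transmission probability attached to
## the outside generator's smear status. Illustrative names.
dtrunc_geom <- function(y, p0, n) {
  (1 - p0) * p0^y / sum((1 - p0) * p0^(1:n))
}

## draw one value of n0 from that pmf
rtrunc_geom <- function(p0, n) {
  sample(1:n, size = 1, prob = dtrunc_geom(1:n, p0, n))
}

dtrunc_geom(1:5, p0 = 0.4, n = 5)  # probabilities over {1, ..., 5}, sum to 1
```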

For the sampling, then, we have the following procedure, where $\mathrm{RUP}(n,m)$ stands for a random uniform partition of $n$ into $m$ parts; the parts are non-empty if the subscript is $>0$ and possibly empty if the subscript is $\ge 0$: $$ x_0 \sim \mathrm{Discrete\ Uniform}\{+, -\} $$ $$ n_0 \sim \mathrm{Discrete\ Uniform}\{1,\dots, n\} $$ $$ \boldsymbol{n}^\prime \sim \mathrm{RUP}_{>0}(n, n_0) $$ $$ \boldsymbol{x}^\prime \sim \mathrm{RUP}_{\ge 0}(x, n_0). $$
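One way to draw these random uniform partitions is through the stars-and-bars correspondence: choosing $m-1$ of the $n-1$ gaps between $n$ items gives a uniform draw over ordered compositions with positive parts, and shifting handles the possibly-empty case. The sketch below uses illustrative names (not the package's sampler) and treats the parts as ordered:

```r
## RUP_{>0}(n, m): uniform ordered composition of n into m positive
## parts, via m - 1 cut points among the n - 1 gaps
rup_pos <- function(n, m) {
  cuts <- sort(sample(seq_len(n - 1), m - 1))
  diff(c(0, cuts, n))
}

## RUP_{>=0}(x, m): uniform ordered composition of x into m
## non-negative parts, via the stars-and-bars shift x -> x + m
rup_nonneg <- function(x, m) {
  rup_pos(x + m, m) - 1
}

set.seed(2)
n_parts <- rup_pos(8, 3)     # sub-tree sizes, summing to n = 8
x_parts <- rup_nonneg(3, 3)  # positive smears per sub-tree, summing to x = 3
```

Note the sketch does not enforce $x_i^\prime \le n_i^\prime$ within each part; as in the scheme above, the two partitions are drawn independently.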


Example

Data

sim_list <- readRDS(system.file("tree_sampling", "outside_sim.RDS", package = "TBornotTB"))

kable(sim_list$data)

There are a total of `r sum(sim_list$data$freq)` clusters, where 5,000 were initially generated (meaning that `r 5000 - sum(sim_list$data$freq)` infection processes died out before being observed).

The best parameters are `r kable(sim_list$best_params$par, digits = 2)` and the true values were `r kable(sim_list$true_params, digits = 2)`.

Here is the likelihood profile for $\beta_1$, the impact of smear status.

sim_list$like_prof_plot

The 95% CI for $\beta_1$ is

kable(sim_list$beta1[2:3], digits = 2)


Computationally

Computationally, we can reuse much of our past work. Specifically, for each size $n$ and number of smear positives $x$, we store the result of sampling $B$ random uniform trees, namely the number of positive transmissions $i$ in each tree. Any $\bar{L}_1(\cdot;n,x)$ can then be computed quickly on the fly as a function of only the parameters to be optimized.

## Proposed structure

sampled_trees # a pre-processed structure of tables with the form
## n | x | i_pos | freq

like1(par, n, x, trees) # average contribution of L1

freq_mat <- sample_smears(par, B)

prob_geo_dt(par, freq_mat)
prob_smear_pos(w_pos)
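For concreteness, the pre-processed `sampled_trees` table sketched above might look like the following; the values are made up for illustration:

```r
## Each row records how often, out of B sampled uniform trees with
## size n and x smear positives, exactly i_pos of the tree's
## transmissions came from a smear-positive infector.
## (Illustrative values with B = 100 per (n, x) pair.)
sampled_trees <- data.frame(
  n     = c(3, 3, 3, 4, 4),
  x     = c(1, 1, 1, 2, 2),
  i_pos = c(0, 1, 2, 0, 1),
  freq  = c(42, 37, 21, 55, 45)
)
```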

like_out_cond <- function(p_plus, p_neg, n, x, B, sampled_trees, w_pos){
    sampled_vars_list <- sample_outside_gen_vars(n, x, B)
    ## pass the parameters through to each conditional likelihood
    like <- sapply(sampled_vars_list, like_in_cond,
                   p_plus = p_plus, p_neg = p_neg,
                   sampled_trees = sampled_trees, w_pos = w_pos)
    return(mean(like))
}

like_in_cond <- function(sampled_vars_list,
    p_plus, p_neg,
    sampled_trees,
    w_pos){
        n_vec <- sampled_vars_list$n
        x_vec <- sampled_vars_list$x
        x0 <- sampled_vars_list$x0
        n0 <- sampled_vars_list$n0
        ## probabilities of the root's smear status and of the
        ## number of root infections
        p_x0 <- prob_x0(w_pos)
        p_n0 <- prob_n0(p_plus, p_neg, x0, n0)
        ## summed average L1 contributions of the sub-trees
        p_like1 <- sum(sapply(seq_along(n_vec), function(ii){
            like1(n_vec[ii], x_vec[ii], sampled_trees, p_plus, p_neg)
        }))
        like <- p_x0 * p_n0 * p_like1
        return(like)
}

like1 <- function(n, x, sampled_trees,
    p_plus, p_neg){
    ## restrict to the stored trees of this size and smear count;
    ## !! forces the function arguments so they are not masked by
    ## the table's own n and x columns
    sub_trees <- sampled_trees %>%
        filter(.data$x == !!x, .data$n == !!n)

    ## a tree of size n has n - 1 transmissions, so the number of
    ## smear-negative transmissions is (n - 1) - i_pos
    sub_trees <- sub_trees %>%
        mutate(like = (1 - p_plus)^x *
            (1 - p_neg)^(n - x) *
            p_plus^(.data$i_pos) *
            p_neg^((n - 1) - .data$i_pos) *
            .data$freq)

    ## frequency-weighted average likelihood over the sampled trees
    avg_like <- sum(sub_trees$like) / sum(sub_trees$freq)
    return(avg_like)
}




skgallagher/TBornotTB documentation built on April 21, 2020, 1:19 p.m.