```r
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE)
devtools::load_all()
library(dplyr)
library(knitr)
library(kableExtra)
set.seed(2020)
```
The observed data are of the following format, where each cluster is described by $n$, the total number of people in the cluster, and $n_+$, the number of positive smears in that cluster. For example, the data may look like Table \@ref(tab:tab1):
```r
df <- data.frame(id = 1:5,
                 n = c(3, 2, 1, 5, 2),
                 n_pos = c(0, 2, 1, 3, 2))
```
```r
kable(df, row.names = FALSE, format = "latex", booktabs = TRUE,
      col.names = c("Cluster ID", "n", "$n_+$"),
      escape = FALSE, caption = "Example data set.") %>%
  kable_styling(latex_options = c("striped", "hold_position"))
```
which has 5 clusters with varying cluster sizes $n$ and numbers of positive smears $n_+$. We are interested in two questions:
1. Does having a positive smear increase the chance that an infected individual will pass on the disease?
2. What does the transmission tree look like for a given cluster with total size $n$ and positive smears $n_+$?
Issues (1) and (2) are, however, difficult to answer separately, because the probability of onward transmission may depend on the individual's position within the tree, and because the tree's generations, nodes, and branches depend on the probability of forward transmission.
Assume we are examining the individuals in a single cluster.
Let $p_i = \frac{1}{ 1 + e^{-(\beta_0 + \beta_1 x_i)}}$ be the probability of transmitting the disease for individual $i$ with smear status $x_i$ (0/1 for $-$/$+$). Since $x_i$ is a binary variable, let $p_+ = \frac{1}{ 1 + e^{-(\beta_0 + \beta_1)}}$ and $p_- = \frac{1}{ 1 + e^{-\beta_0}}$.
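As a quick illustration (with arbitrary, assumed coefficient values rather than estimates from the data), these two probabilities can be computed directly:

```r
## Illustration only: arbitrary coefficient values
beta_0 <- -2
beta_1 <- 1.3
p_plus  <- 1 / (1 + exp(-(beta_0 + beta_1)))  # smear positive (x_i = 1)
p_minus <- 1 / (1 + exp(-beta_0))             # smear negative (x_i = 0)
c(p_plus = p_plus, p_minus = p_minus)
```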
We also assume that all infections in a cluster are observed. That is, there exists a spanning tree $T_{n}$ (i.e., no cycles, and every node is connected to every other node via a path) with a single root node and $n$ total nodes. We denote the root node/individual as $t_1$. The generation $g$ of an individual is then the length of the minimal path from the root node to that individual, plus one. We say there are $n_g$ individuals in generation $g$. Let $G$ be the maximum generation in the cluster. Then $n_g > 0$ for $g = 1, \dots, G$ and $n_1 + \dots + n_{G} = n$.
We arbitrarily index the individuals in generation $g$ as $t_{g,1}, \dots, t_{g,n_g}$.
So the tree $T_{n}$ consists of the nodes $\{t_{g,i_g}: g = 1, \dots, G;\ i_g = 1, \dots, n_g\}$ along with a set of edges $(t_{g, i}, t_{g+1,j})$ for infector and infectee pairs in adjacent generations. Let each individual $t_{g,i}$ be associated with smear status $x_{g,i}$ (0/1).
Finally, let $i_+$ be the number of infections transmitted by smear-positive individuals, that is, the number of edges $(t_{g, i}, t_{g+1,j})$ in the tree such that $x_{g,i} = 1$. Since this is a spanning tree with a single root node, the total number of transmissions in the tree is $n-1$.
To model transmission, we assume an infected individual $i$ flips a coin with probability $p_i$ of heads. If the coin lands heads, the individual infects one new person and flips again; this continues until the coin lands tails, at which point the individual's infectious period is over.
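Under this rule, the number of new infections caused by individual $i$ is the number of heads before the first tail, i.e., a Geometric random variable with success probability $1 - p_i$. A minimal sketch of one individual's offspring draw (illustrative only; `draw_offspring()` is a hypothetical helper, not a package function):

```r
## Sketch: number of offspring for one individual under flip-until-failure
draw_offspring <- function(p_i) {
  n_offspring <- 0
  while (runif(1) < p_i) {        # heads: one new infection, flip again
    n_offspring <- n_offspring + 1
  }
  n_offspring                     # tails: infectious period is over
}
draw_offspring(p_i = 0.4)
```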
Thus, in each tree of size $n$ with $n_+$ positive smears, there are exactly $n-1$ successful transmissions and $n$ failures (one final tails per individual). Then, for a given cluster with $n$ individuals and $n_+$ positive smears, the likelihood is
\begin{align} \mathcal{L}(\beta_0, \beta_1; n, n_+, i_+) &= \left(p_+^{i_+}\right)\left(p_-^{n-1 - i_+}\right) \left( (1-p_+)^{n_+}\right )\left((1-p_-)^{n - n_+} \right). \end{align} where $i_+$ is the number of transmissions given by a person with a positive smear status.
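As a sketch, this likelihood can be written directly as a function of the coefficients (`cluster_like()` is a hypothetical helper for illustration, not part of the package):

```r
## Sketch: likelihood of one cluster given n, n_plus, and i_plus
cluster_like <- function(beta_0, beta_1, n, n_plus, i_plus) {
  p_plus  <- 1 / (1 + exp(-(beta_0 + beta_1)))
  p_minus <- 1 / (1 + exp(-beta_0))
  p_plus^i_plus * p_minus^(n - 1 - i_plus) *
    (1 - p_plus)^n_plus * (1 - p_minus)^(n - n_plus)
}
```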
In this example, we have more data than were presented in Table \@ref(tab:tab1): we now have information about the individuals within the cluster.
df <- data.frame(id = c("g1-n1", "g2-n1", "g2-n1", "g3-n1", "g3-n1", "g4-n1"), smear = c("+", "+", "-", "+", "-", "+"), infector = c(NA, "g1-n1", "g1-n1", "g2-n1", "g2-n2", "g3-n1"), n_inf = c(2, 1, 1, 1, 0,0)) kable(df, row.names = FALSE, format = "latex", booktabs = TRUE, align = "c", col.names = c("ID", "Smear", "Infector ID", "\\# Trans."), escape = FALSE, caption = "Example of individuals in a cluster.", linesep = "") %>% kable_styling(latex_options = c("striped", "hold_position"))
Here, we already know the infector of each individual and hence the entire tree. From Table \@ref(tab:tab2), $i_+ = 2 + 1 + 1 = 4$.
The likelihood is then \begin{align} \mathcal{L}(\beta_0, \beta_1; n, n_+, i_+) &= \left(p_+^{4}\right)\left(p_-^{1}\right) \left( (1-p_+)^{4}\right )\left((1-p_-)^{2} \right). \end{align}
We can find the arguments $\beta_0$ and $\beta_1$ that maximize this likelihood.
```r
## Negative likelihood (optim() minimizes by default)
like <- function(par = c("beta_0" = 0, "beta_1" = 1)) {
  p_plus <- 1 / (1 + exp(-par[1] - par[2]))
  p_minus <- 1 / (1 + exp(-par[1]))
  -p_plus^4 * p_minus * (1 - p_plus)^4 * (1 - p_minus)^2
}
out <- optim(par = c("beta_0" = 0, "beta_1" = 1), fn = like)
print(out$par, digits = 2)
```
which makes sense, as the probability of transmission from a smear positive is 1/2: there were 4 successful transmissions and 4 failures. The probability of infection from a smear negative is `r round(100 / (1 + exp(.69)), digits = 0)`%, which again makes sense, as there was one successful infection from a smear-negative individual and two failures.
In the above example, the likelihood was easy to calculate, and hence optimize, because we knew $i_+$, the number of infections transmitted by smear-positive individuals. In fact, we knew the entire transmission tree!
The problem is that, for a given cluster, we know neither the transmission tree nor $i_+$. We do know that the total number of successful transmissions in a cluster of size $n$ is $n-1$.
One possible solution is to marginalize over $i_+$. The likelihood is then
\begin{align}\label{eq:sum-likes} \mathcal{L}(\beta_0, \beta_1; n, n_+) &= \sum_{i=0}^{n-1}\left(p_+^{i}\right)\left(p_-^{n-1 - i}\right) \left( (1-p_+)^{n_+}\right )\left((1-p_-)^{n - n_+} \right)P(i_+ = i \mid n, n_+), \end{align} which sums the likelihood over all possibilities for $i_+$. Here $P(i_+ = i \mid n, n_+)$, the probability that the number of smear-positive transmissions is $i$ given the cluster size and the number of smear positives, may still be a difficult term to calculate.
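If $P(i_+ = i \mid n, n_+)$ were available, this marginalized likelihood could be evaluated directly. A sketch, assuming a hypothetical vector `prob_i` whose $(i+1)$-th entry is $P(i_+ = i \mid n, n_+)$:

```r
## Sketch: marginal likelihood, given prob_i = P(i_+ = i | n, n_plus) for i = 0, ..., n - 1
marginal_like <- function(beta_0, beta_1, n, n_plus, prob_i) {
  i_vals  <- 0:(n - 1)
  p_plus  <- 1 / (1 + exp(-(beta_0 + beta_1)))
  p_minus <- 1 / (1 + exp(-beta_0))
  sum(p_plus^i_vals * p_minus^(n - 1 - i_vals) *
        (1 - p_plus)^n_plus * (1 - p_minus)^(n - n_plus) * prob_i)
}
```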
Another solution is to sample from the set of trees, which requires an algorithm or generating mechanism for sampling from that set. If we sample $M$ trees of size $n$ with $n_+$ positives, then we can approximate the likelihood contribution over all such trees for $\beta_0$ and $\beta_1$. This gives a good approximation of Eq. \ref{eq:sum-likes}, provided the generating mechanism is such that $\tilde{P}(i_+ = i) \approx P(i_+ = i)$, i.e., the generating mechanism has approximately the same distribution of positive transmissions as the actual mechanism.
To this end, we propose the generating mechanism described by the following pseudocode.
```
## Generate a tree of size n with n+ positive smears
g ~ Binomial(N = n, p = .5)              # draw number of generations
n_1 <- 1                                 # always one person (the root) in generation 1
(n_2, ..., n_g) | g ~ Multinomial(n - 1, (1/(g-1), ..., 1/(g-1)))  # partition the rest into generations
for (ii in 2:g) {
  ## assign an infector from the previous generation to each individual
  for (jj in 1:n_ii) {
    kk <- Multinomial(1, (1/n_(ii-1), ..., 1/n_(ii-1)))
    add.edge(T.edges) <- ((ii, jj), (ii-1, kk))
  }
}
(x_11, ..., x_g,n_g) <- randomly.permute.smears(n+)  # assign smears
```
The above is a way to sample uniformly from the set of trees of size $n$ with $n_+$ smear positives. We first draw the number of generations $g$. The number of ways to partition $n$ indistinguishable objects into $k$ non-empty parts is given by the classic stars-and-bars problem in combinatorics, so drawing $g$ is equivalent to drawing a Binomial variable with success probability 1/2 and size $n$.
We then uniformly draw the partition of size $g$, which tells us how many individuals are in each generation.
At this point, we index each individual by $ii,jj$ where $ii$ is the generation and $jj$ is the index of the individual in that generation.
We assign infectors for each individual by uniformly picking an infector from the previous generation. Individuals in generation 1 have no outside infector.
Finally, we assign smears to the individuals by drawing a uniform permutation of the vector $(1, \dots, 1, 0, \dots, 0)$, where there are $n_+$ 1s and $n - n_+$ 0s.
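A minimal, self-contained R sketch of this generating mechanism is below. It is not the package's `simulate_many_cond_bp()`; in particular, the choice `2 + rbinom(1, n - 2, 0.5)` for the number of generations is an assumption made here so that the draw always lies between 2 and $n$.

```r
## Sketch: sample one tree of size n with n_pos smear positives (hypothetical helper)
sample_cond_tree <- function(n, n_pos) {
  stopifnot(n >= 1, n_pos >= 0, n_pos <= n)
  if (n == 1) {
    gen_sizes <- 1
  } else {
    ## 1. Number of non-empty generations (assumed form; always between 2 and n)
    g <- 2 + rbinom(1, size = n - 2, prob = 0.5)
    ## 2. Uniformly partition the n - 1 non-root individuals into g - 1
    ##    non-empty generations via stars-and-bars cut points
    cuts <- sort(sample(seq_len(n - 2), size = g - 2))
    gen_sizes <- c(1, diff(c(0, cuts, n - 1)))
  }
  gen <- rep(seq_along(gen_sizes), gen_sizes)   # generation of each individual
  id  <- seq_len(n)
  ## 3. Each non-root individual gets a uniformly chosen infector from the
  ##    previous generation; the root's infector is NA
  infector <- rep(NA_integer_, n)
  for (ii in which(gen > 1)) {
    prev <- id[gen == gen[ii] - 1]
    infector[ii] <- prev[sample.int(length(prev), 1)]
  }
  ## 4. Randomly permute n_pos smear positives among the n individuals
  smear <- rep(c(1L, 0L), c(n_pos, n - n_pos))[sample.int(n)]
  data.frame(id = id, gen = gen, infector = infector, smear = smear)
}

tree <- sample_cond_tree(n = 6, n_pos = 4)
## Number of transmissions from smear-positive infectors in this sampled tree
i_pos <- sum(tree$smear[na.omit(tree$infector)])
```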
```r
part_list <- generate_part_list(n = 10)
g_weight_list <- get_weight_list(part_list)
```
```r
set.seed(2020)
n_trials <- 2
n <- 6
n_pos <- 4
n_vec <- rep(n, each = n_trials)
n_pos_vec <- rep(n_pos, each = n_trials)
## Sample trees of n people with n_pos positive smears
sampled_trees <- simulate_many_cond_bp(K = length(n_vec), n_vec = n_vec,
                                        n_pos_vec = n_pos_vec,
                                        part_list = part_list,
                                        g_weight_list = g_weight_list,
                                        one_init = TRUE)
sampled_trees %>%
  kable(booktabs = TRUE, format = "latex", linesep = "") %>%
  kable_styling(latex_options = c("striped", "hold_position"))
```
From here we can estimate the parameters $(\beta_0, \beta_1)$.
We first simulate data from our flip-until-failure model, where the root individual flips a weighted coin until it lands tails; the number of heads is the number of infections in the next generation. We then proceed to the first person in the second generation and flip until failure, and repeat for all others in the second generation; the sum of their heads constitutes the third generation. We repeat the process for subsequent generations until either there are no more heads or a maximum cluster size is reached.
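A rough sketch of simulating one cluster this way is shown here. It is not the package's `simulate_flip_til_failure()`, which handles $K$ clusters and returns the full forest; the argument names mirror those used in the next chunk but are assumptions in this sketch. Each individual's number of heads is drawn in one step as a Geometric variable.

```r
## Sketch: simulate a single cluster under flip-until-failure (hypothetical helper)
simulate_one_cluster <- function(beta_0, beta_1, smear_pos_prob, max_size) {
  smear <- rbinom(1, 1, smear_pos_prob)     # root individual's smear status
  current_gen <- smear                      # smear statuses of the current generation
  n <- 1
  while (length(current_gen) > 0 && n < max_size) {
    next_gen <- integer(0)
    for (x in current_gen) {
      p_i <- 1 / (1 + exp(-(beta_0 + beta_1 * x)))
      n_off <- rgeom(1, prob = 1 - p_i)     # heads before the first tail
      n_off <- min(n_off, max_size - n - length(next_gen))  # cap at max cluster size
      next_gen <- c(next_gen, rbinom(n_off, 1, smear_pos_prob))
      if (n + length(next_gen) >= max_size) break
    }
    smear <- c(smear, next_gen)
    n <- length(smear)
    current_gen <- next_gen
  }
  c(n = n, n_pos = sum(smear))
}
simulate_one_cluster(beta_0 = -2, beta_1 = 1.3, smear_pos_prob = 0.5, max_size = 26)
```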
Below we show a subset of the summary of this tree-generation process: the cluster size, the number of smear positives, and the number of times that combination occurred among the $K = 1000$ simulated clusters of infections. We show the top 8 most frequent combinations here.
```r
## Flip until failure
K <- 1000
forest <- simulate_flip_til_failure(K = K,
                                    inf_params = c("beta_0" = -2, "beta_1" = 1.3),
                                    smear_pos_prob = .5,
                                    max_size = 26)
bark <- summarize_clusters(forest)
kable(bark %>% arrange(desc(freq)) %>% head(8),
      linesep = "", format = "latex", booktabs = TRUE,
      col.names = c("Frequency", "Cluster Size", "# Pos.")) %>%
  kableExtra::kable_styling(position = "center", latex_options = "striped")
```
In this infection group there are `r nrow(bark)` different combinations of cluster size and number of positives. The maximum cluster size is `r max(bark$n)`.
In our data (above), we ostensibly do not know the number of infections from smear-positive persons. We call this number $i$. We enumerate the unique trees of size $n$ with $x$ smear positives as $1, \dots, m(n,x)$.
We draw trees with fixed size $n$ and $x$ smear positives via the following process.
We assume there is a single root node, so the first generation is of size 1. Thus, the number of non-empty generations is $N_G = 1$ if and only if $n=1$. For $n > 1$, we first select the number of (non-empty) generations according to a Binomial draw with success probability 1/2.
From there we draw the number of individuals in each generation, denoted $\mathbf{G}$, a vector of length $N_G$ such that the entries of $\mathbf{G}$ sum to $n$ and the first entry is always 1 (the root node). The generations $\mathbf{G}$ come from the distribution $F(n, N_G)$, the uniform draw over all unique $N_G$-tuples $\mathbf{G}$ as described above. (Practically, this is done using `partitions::rparts()` and then uniquely permuting the vectors.)
At this point we index the individuals by generation and index within the generation, $(g,j)$.
We then assign infectors to individuals in generation $g$ from the previous generation $g-1$. This is done by selecting, uniformly at random, one of the individuals from the previous generation as the infector of the individual with generation $g$ and index $j$. The root node is assigned an infector of `NA`.
Finally, we assign the smears in a vector $\mathbf{x}$, where the first entry corresponds to generation 1, individual 1; the second corresponds to generation 2, individual 1; the third to generation 2, individual 2; and so on. We sample, uniformly at random, a permutation of $x$ 1s and $(n-x)$ 0s to assign to $\mathbf{x}$. We call this distribution $H(n,x)$. This randomly assigns $x$ smear positives to individuals in the tree.
\begin{align}
N_G &\sim 2 + \textnormal{Binomial}\left(n-1, .5\right) \\
\mathbf{G} &\sim F(n, N_G) \\
\textnormal{inf}_{g,j} &\sim \left(g-1, \textnormal{Discrete Uniform}\left(1, \dots, G_{g-1}\right)\right) \\
\mathbf{x} &\sim H(n, x)
\end{align}
This process completely defines a tree of size $n$ with $x$ smear positives, which we call $T(n,x)_k$. Moreover, this (we hope) is a uniform draw from the trees with distinguishable nodes and all possible assignments of $x$ smear positives. From this tree, we can ascertain $i_k$, the number of smear-positive transmissions in the tree.
Then the likelihood contribution from all such trees is
\begin{align} \mathcal{L}(p_+, p_-; n, x) &= \sum_{k=1}^{m(n,x)} (1-p_+)^{x}(1-p_-)^{n-x}\, p_+^{i_k}\, p_-^{n - 1 - i_k}. \end{align}
Note that whenever two trees have the same number of smear-positive transmissions ($i_j = i_k$), their likelihood contributions are the same. We say such trees are isomorphic to each other.
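Since each sampled tree enters the likelihood only through its $i$ value, isomorphic trees can be collapsed. A sketch of the resulting Monte Carlo contribution for one $(n, x)$ cell, assuming a hypothetical vector `i_pos` of smear-positive transmission counts from the sampled trees:

```r
## Sketch: Monte Carlo likelihood contribution for one (n, x) cell,
## collapsing isomorphic trees (equal i) into a single weighted term
approx_like <- function(beta_0, beta_1, n, x, i_pos) {
  tab     <- table(i_pos)
  i_vals  <- as.numeric(names(tab))
  p_plus  <- 1 / (1 + exp(-(beta_0 + beta_1)))
  p_minus <- 1 / (1 + exp(-beta_0))
  sum(as.numeric(tab) / length(i_pos) *
        (1 - p_plus)^x * (1 - p_minus)^(n - x) *
        p_plus^i_vals * p_minus^(n - 1 - i_vals))
}
```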
```r
part_list <- generate_part_list(n = 25)
g_weight_list <- get_weight_list(part_list)
```
```r
n_trials <- 100
cluster_summary <- bark
n <- cluster_summary$n
n_pos <- cluster_summary$n_pos
n_vec <- rep(n, each = n_trials)
n_pos_vec <- rep(n_pos, each = n_trials)
par <- c("beta_0" = -3, "beta_1" = 2)
one_init <- TRUE
sampled_trees <- simulate_many_cond_bp(K = length(n_vec), n_vec = n_vec,
                                       n_pos_vec = n_pos_vec,
                                       part_list = part_list,
                                       g_weight_list = g_weight_list,
                                       one_init = one_init)
sum_trees <- summarize_cond_trees(sampled_trees)
```
An example of sampling from the conditional trees is shown below. We use data simulated from the true distribution with $(\beta_0 = -2, \beta_1 = 1.3)$ and maximize the likelihood. The first line gives the parameter estimates and the second the standard errors.
```r
optim_results <- optim(par = c("beta_0" = 0, "beta_1" = 0),
                       fn = loglike_obs_data,
                       data = cluster_summary,
                       sampled_trees_summary = sum_trees,
                       return_neg = TRUE,
                       hessian = TRUE)
round(optim_results$par, 3)
round(sqrt(diag(solve(optim_results$hessian))), 3)
```
We see that our parameter estimates are fairly accurate.
We can only generate unique partitions for trees up to a maximum size of 25. We need to handle censoring.
Increasing the sample size of either the data or the number of trees sampled for each conditional cluster does not necessarily yield better estimates, because of our maximum-size truncation.
What is the distribution of $i$, the number of positive smear transmissions from the uniform sampling of trees? It is decidedly not uniform.
For example:
```r
library(ggplot2)
n <- 8
n_pos <- 4
n_trials <- 1000
n_vec <- rep(n, each = n_trials)
n_pos_vec <- rep(n_pos, each = n_trials)
one_init <- TRUE
sampled_trees <- simulate_many_cond_bp(K = length(n_vec), n_vec = n_vec,
                                       n_pos_vec = n_pos_vec,
                                       part_list = part_list,
                                       g_weight_list = g_weight_list,
                                       one_init = one_init)
sum_trees <- summarize_cond_trees(sampled_trees)
```
```r
p <- mean(sum_trees$i_pos) / n
ggplot(data = sum_trees) +
  geom_bar(aes(x = i_pos)) +
  labs(x = "# Pos. Transmissions", y = "Count",
       title = "Uniform sampling of trees of size 8 and 4 smear +",
       subtitle = latex2exp::TeX(sprintf("$\\hat{p} = %.3f$", p)))

## Skewed example
n <- 4
n_pos <- 1
n_trials <- 10000
n_vec <- rep(n, each = n_trials)
n_pos_vec <- rep(n_pos, each = n_trials)
one_init <- TRUE
sampled_trees2 <- simulate_many_cond_bp(K = length(n_vec), n_vec = n_vec,
                                        n_pos_vec = n_pos_vec,
                                        part_list = part_list,
                                        g_weight_list = g_weight_list,
                                        one_init = one_init)
sum_trees2 <- summarize_cond_trees(sampled_trees2)
p2 <- mean(sum_trees2$i_pos) / n
ggplot(data = sum_trees2) +
  geom_bar(aes(x = i_pos, y = ..prop..), stat = "count") +
  labs(x = "# Pos. Transmissions", y = "Prob.",
       title = paste0("Uniform sampling of trees of size ", n, " and ", n_pos, " smear +"),
       subtitle = latex2exp::TeX(sprintf("$\\hat{p} = %.3f$", p2))) +
  theme_bw() +
  scale_y_continuous(labels = scales::percent)
```
The above method can be inefficient and time consuming for estimating $\beta_0$ and $\beta_1$. If there were a way to sample from a more informative distribution, we could speed up the estimation and potentially move to a larger maximum cluster size.
Fix $n$ and $x$, the cluster size and number of smear positives, respectively. At iteration $\ell$, we sample a number of positive transmissions $i_b^{(\ell)}$ for $b=1, \dots, B$:
\begin{align}
i_b^{(\ell)} &\sim \textnormal{Binomial}\left(n-1, \hat{p}^{(\ell-1)}\right) \\
w_b^{(\ell)} &= \binom{n-1}{i_b^{(\ell)}} \left(\hat{p}^{(\ell-1)}\right)^{i_b^{(\ell)}}\left(1-\hat{p}^{(\ell-1)}\right)^{n-1-i_b^{(\ell)}} \\
t(n,x)_b^{(\ell)} &= (1-p_+)^{x}(1-p_-)^{n-x}\, p_+^{i_b^{(\ell)}}\, p_-^{n-1-i_b^{(\ell)}}
\end{align}
where $\hat{p}^{(\ell-1)} = \frac{\hat{p}_+^{(\ell-1)}}{\hat{p}_+^{(\ell-1)} + \hat{p}_-^{(\ell-1)}}$ is a function of the previous iteration's estimates $\beta_0^{(\ell-1)}$ and $\beta_1^{(\ell-1)}$.
Then the estimated likelihood contribution at iteration $\ell$, summed over all $i_+$ and accounting for the importance weights, is
\begin{align}
\bar{\mathcal{L}}(\beta_0^{(\ell)}, \beta_1^{(\ell)}) &= \sum_{b=1}^B \frac{t_b^{(\ell)}}{w_b^{(\ell)}}, \\
\left(\beta_0^{(\ell)}, \beta_1^{(\ell)}\right) &= \arg \max \bar{\mathcal{L}}(\beta_0^{(\ell)}, \beta_1^{(\ell)}).
\end{align}
The initial estimates of the $\beta$s can be taken from the uniform-sampling estimate or some other guess. We iterate until convergence or until we get too tired.
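A sketch of one iteration for a single $(n, x)$ cell, assuming previous-iteration coefficient estimates `beta_prev` (`is_like()` is a hypothetical helper, not the package's `iter_est_ws()`):

```r
## Sketch: importance-sampled likelihood contribution for one (n, x) cell
## at one iteration, given previous-iteration estimates beta_prev = (beta_0, beta_1)
is_like <- function(beta, beta_prev, n, x, B = 1000) {
  to_p <- function(b) {
    b <- unname(b)
    c(plus = 1 / (1 + exp(-(b[1] + b[2]))),
      minus = 1 / (1 + exp(-b[1])))
  }
  p_prev <- to_p(beta_prev)
  p_hat  <- unname(p_prev["plus"] / (p_prev["plus"] + p_prev["minus"]))
  i_b <- rbinom(B, size = n - 1, prob = p_hat)    # proposal draws of i_+
  w_b <- dbinom(i_b, size = n - 1, prob = p_hat)  # proposal (importance) weights
  p   <- to_p(beta)
  t_b <- (1 - p["plus"])^x * (1 - p["minus"])^(n - x) *
    p["plus"]^i_b * p["minus"]^(n - 1 - i_b)
  sum(t_b / w_b)                                  # weighted likelihood contribution
}
is_like(beta = c(-2, 1.3), beta_prev = c(-1.8, 1.3), n = 8, x = 4)
```

In practice this contribution would be computed for every observed $(n, x)$ combination, combined with the cluster frequencies, and passed to an optimizer to obtain $(\beta_0^{(\ell)}, \beta_1^{(\ell)})$ for the next iteration.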
Here is an example of this not working.
```r
set.seed(89325795)
K <- 1000
inf_params <- c("beta_0" = -2, "beta_1" = 1.5)
max_size <- 26
smear_pos_prob <- .5
forest <- simulate_flip_til_failure(K = K, inf_params = inf_params,
                                    smear_pos_prob = smear_pos_prob,
                                    max_size = max_size)
bark <- summarize_clusters(forest)

init_par <- c(-1.82, 1.3)
freq <- bark$freq
n_vec <- bark$n
x_vec <- bark$n_pos
B <- 1000
iters <- 5
out <- iter_est_ws(init_par = init_par, freq = freq, n_vec = n_vec,
                   x_vec = x_vec, B = B, iters = iters)
out$par
sqrt(diag(solve(out$hessian)))
```