knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
devtools::load_all() library(kableExtra) library(knitr)
## devtools::install_github("skgallagher/TBornotTB") library(TBornotTB) library(kableExtra) library(knitr)
We analyze the incidence of Pulmonary Tuberculosis (TB) cases in the state of Maryland from 2003-2009.
The summary data can be loaded with the following commands.
data("clusts_tb") kable(head(clusts_tb))
This data corresponds to summary information of clusters of infected individuals (one row = one cluster). The individuals were clustered by analyzing the genetic similarity of the TB strains.
The important variables are size, the total number of individuals in the cluster and n_pos, the total number of smear positives in the cluster.
The remaining variable descriptions can be found in the following table.
| Variable | Description
-------|----------------|--------------------------------------
| PCR.Cluster
| ID of the cluster
| size
| # of individuals in the cluster
| n_pos
| # of positive smears in the cluster
| n_neg
| # of negative smears in the cluster
| n_unk
| # of unknown smears in the cluster
| n_hiv_pos
| # of HIV positive in the cluster
| n_hiv_neg
| # of HIV negative in the cluster
| n_hiv_unk
| # of HIV unknown in the cluster
| inf_range
| Days between first and last detection date in the cluster
| inf_iqr
| Interquartile range of detection dates (relative to first date)
-------------------------------------------------------------
There are a total of r nrow(clusts_tb)
unique clusters which encompass r sum(clusts_tb$size)
individuals. There are r as.numeric(table(clusts_tb$size)[1])
clusters with only one individual. The largest cluster is of size r max(clusts_tb$size)
.
We assume that each person received the infection from one other individual within the cluster or is the root node, or the first infection in the cluster. As such, each cluster can be described as a graph, specifically a spanning tree with a single root node and directed edges. The nodes of the tree are the individuals and an edge between individuals as infections from one individual to another. All nodes except for the root node have exactly one edge pointing into the node, and the root node has zero edges pointing in toward the node. The generation of a node is the length of the path from the root node to it plus 1, and the root node is generation 1.
We assume that the probability of an agent infecting another is dependent on their smear status. A positive smear individual successfully infects another individual with probability $p_+$ and a negative smear individual successfully infects another individual with probability of $p_-$.
We assume the number of infections produced by a single individual in generation $g$, infection $i$ indexed by $(g,i)$ is a geometric draw, $$Y_{gi} \sim Geometric(p_{gi})$$ and $p_{gi}$ is either $p_+$ or $p_-$, depending on individual $g,i$'s smear status.
Then the total size of cluster $k$ is $n_k = \sum_{g,i} Y_{gi}^{(k)}$, but we typically will suppress $k$. If $p_{gi} = p$ for all $g,i$, then it is known that $P(n = \infty) > 0$ when $p>= 0.5$ and $P(n < \infty) = 1$ when $p < 0.5$. [CITE]
We, instead, assume $p$ is dependent on the individual's smear status $S_{gi}$. Specifically, $$ p_{gi} = logit \left ( \beta_0 + \beta_1 S_{gi} \right ) $$
We interested in whether being smear positive increases the probability of transmission or equivalently $$ H_0:\; \beta_1 > 0\ H_A: \; \beta_1 \le 0. $$
We can write the likelihood of $\beta_1, \beta_0$ given the cluster data. Let $i$ be the number of transmissions from positive smear status individuals.
The likelihood for a single cluster of size $n$ and $x$ positive smear indviduals is then $$ \mathcal{L}\left (\beta_0, \beta_1; n,x;i \right ) = \sum_{i=1}^{n-1}(1-p_+)^{x}(1-p_-)^{n-x}p_+^ip_-^{n-1-i}. $$ Since the cluster is finite, then we know each individual in the cluster failed to pass on the infection after some prior number of successes (possibly zero). The number of successful infections from smear positives is $i$ and so the likelihood contribution is $p_+^i$ and similarly the number of infections from smear negatives is $n-1-1$ (as the root node infector status is unknown) and so the likelihood contribution is $p_-^{n-1-i}$.
Let $f$ be the frequency of the clusters of size $n$ with $x$. Then the total likelihood is then $$ \mathcal{L}\left (\beta_0, \beta_1 \right ) = \prod_{all\; n,x} \mathcal{L}(\beta_0, \beta_1; n,x)\ \mathcal{L}\left (\beta_0, \beta_1 \right ) = \prod_{unique \;n,x} \mathcal{L}(\beta_0, \beta_1; n,x)^f, $$ and the log likelihood is
$$ \mathcal{\ell}\left (\beta_0, \beta_1 \right ) = \sum_{unique \;n,x} f \log \left (\mathcal{L}(\beta_0, \beta_1; n,x) \right ). $$
However, we do not know $i$, the number of positive smear transmissions, nor a closed form distribution as the likelihood unconditional on $I$ for a single cluster is $$ \mathcal{L}\left (\beta_0, \beta_1; n,x\right ) = \sum_{i=1}^{n-1}(1-p_+)^{x}(1-p_-)^{n-x}p_+^ip_-^{n-1-i}P(I=i|n,x). $$
As such, we need to estimate $P(I=i; n,x)$. We know that
$$
P(I=i;n,x) = \frac{# \mathrm{trees\; with\;} i}{\mathrm{#total\; trees}}.
$$
If each tree occurs with equal probability, then we can uniformly sample from all realizations of trees and can approximate the likelihood and hence maximize it. We can do this quickly in R
in our package for clusters up to size 60.
library(dplyr) ## Put data in correct format bark <- clusts_tb %>% group_by(size, n_pos) %>% summarize(freq = dplyr::n()) %>% rename(n = size) bark %>% head() %>% kable()
## Sample from uniform trees set.seed(2020) n_trials <- 1000 sampled_tree_summary <- sample_uniform_trees(n_vec = bark$n, x_vec = bark$n_pos, B = n_trials)
## Maximize likelihood par <- c(0,0) best_pars <- optim(par, fn = loglike_fast, data = bark, sampled_trees_summary = sampled_tree_summary, return_neg = TRUE, hessian = TRUE) est <- best_pars$par se <- sqrt(diag(solve(best_pars$hessian))) mat <- matrix(round(c(est, se), 2), nrow = 2) rownames(mat) <- c("$\\beta_0$", "$\\beta_1$") colnames(mat) <- c("Est.", "SE") mat %>% kable()
From this, we can conclude that smear status does not have a significant effect on the probability of transmission.
Since we only care about $\beta_1$, we can use likelihood profiles in contrast to the above approximation of the Wald Confidence Interval.
## Draw the profile beta_0 <- best_pars$par[1] beta_1 <- seq(-1, 1, length.out = 201) loglike <- numeric(length(beta_1)) loglike2 <- loglike for(ii in 1:length(loglike)){ loglike[ii] <- loglike_fast(par = c(beta_0, beta_1[ii]), data = bark, sampled_trees_summary = sampled_tree_summary, return_neg = FALSE) } df <- data.frame(beta_1 = beta_1, loglike = loglike) alpha <- .05 lp <- -best_pars$value - 1.92 like_wrapper <- function(x){ return(loglike_fast(par = c(beta_0, x), data = bark, sampled_trees_summary = sampled_tree_summary, return_neg = FALSE) -lp)} lower <- uniroot(f = like_wrapper, c(-1, best_pars$par[2])) upper <- uniroot(f = like_wrapper, c(best_pars$par[2], 1)) library(ggplot2) ggplot(data = df, aes(x = beta_1, y = loglike)) + geom_vline(xintercept = best_pars$par[2], col = "blue", linetype = "dashed") + geom_hline(yintercept = lp, col = "darkgreen", linetype = "dashed") + geom_line(col = "red", size = 2) + theme_bw()
Using the likelihood profile, we get a 95% CI for $\hat{\beta}_1$ of [r round(lower$root, 2)
, r round(upper$root, 2)
], which is considerably smaller than the Wald CI but our conclusion remains the same: there is no evidence that being smear positive increases the probability of transmission.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.