```{r}
knitr::opts_chunk$set(
  echo = FALSE,
  fig.align = "center",
  fig.height = 3,
  fig.pos = "ht",
  fig.width = 6,
  message = FALSE,
  warning = FALSE
)
options(stringsAsFactors = FALSE)
knitr::opts_knit$set(root.dir = '../..')
```
```{r}
library(ggplot2)
library(gridExtra)
library(data.table)
library(tidyverse)
library(tidybayes)
library(kableExtra)
library(doParallel)
library(latex2exp)
theme_set(theme_bw(base_size = 9))
# source("R/binary_two_arm_functions.R")
library(optimum)
```
The endpoint is food allergy at 18 months of age. For simplicity, assume babies are enrolled and randomised at 0 months of age, so no follow-up data are available until 18 months after the first infant is enrolled.
Suppose accrual is 20 infants per week; then, by the time we have follow-up on the first individual, we will have enrolled 1,560 infants (78 weeks $\times$ 20 per week). Assuming the first analysis occurs at $n=500$ with follow-up, about 2,000 individuals would be enrolled, leaving about 1,500 with missing information at the time of the first interim. Full follow-up would occur at about week 228 (Figure \@ref(fig:unnamed-chunk-1)).

Suppose instead accrual is 10 infants per week; then, by the time we have follow-up on the first individual, we will have enrolled 780 infants. Assuming the first analysis occurs at $n=500$, about 1,300 would be enrolled, leaving about 800 with missing information at the first interim. Full follow-up would occur at about week 378.
The minimum accrual rate needed to enrol 3,000 infants over 5 years is about 11.5 per week.
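This figure is a one-line calculation:

```{r}
# Minimum weekly accrual to enrol 3,000 infants within 5 years (260 weeks)
3000 / (5 * 52)  # about 11.5 per week
```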
```{r}
H <- function(x) as.numeric(x > 0)  # Heaviside step function

weeks_followup <- 78
per_week_accrual <- 20
week <- seq(0, 52 * 8, 1)
d <- data.frame(
  week = week,
  enrolled = pmin(3000, week * per_week_accrual),
  followup = H(week - weeks_followup) * (week - weeks_followup) * per_week_accrual
)
d <- d[which(d$followup <= 3000), ]
d$interim <- as.numeric(d$followup %in% seq(500, 3000, 500))
d$anyfollowup <- as.numeric(d$followup > 0)
p1 <- ggplot(d, aes(x = week)) +
  geom_line(aes(y = enrolled, colour = "Enrolled")) +
  geom_line(aes(y = followup, colour = "Follow-up"), linetype = 2) +
  geom_vline(data = d[d$interim == 1, ], aes(xintercept = week)) +
  geom_segment(data = d[min(which(d$anyfollowup == 1)), ], linetype = 3,
               aes(x = week, xend = week, y = 0, yend = enrolled)) +
  geom_segment(data = d[min(which(d$anyfollowup == 1)), ], linetype = 3,
               aes(y = enrolled, yend = enrolled, x = 0, xend = week)) +
  scale_y_continuous(breaks = seq(0, 2500, 500)) +
  xlim(0, 400) +
  scale_color_discrete("") +
  labs(x = "Week", y = "Count") +
  theme(legend.position = c(0.8, 0.6),
        panel.grid = element_blank(),
        legend.background = element_blank())

per_week_accrual <- 10
d <- data.frame(
  week = week,
  enrolled = pmin(3000, week * per_week_accrual),
  followup = H(week - weeks_followup) * (week - weeks_followup) * per_week_accrual
)
d <- d[which(d$followup <= 3000), ]
d$interim <- as.numeric(d$followup %in% seq(500, 3000, 500))
d$anyfollowup <- as.numeric(d$followup > 0)
p2 <- ggplot(d, aes(x = week)) +
  geom_line(aes(y = enrolled, colour = "Enrolled")) +
  geom_line(aes(y = followup, colour = "Follow-up"), linetype = 2) +
  geom_vline(data = d[d$interim == 1, ], aes(xintercept = week)) +
  geom_segment(data = d[min(which(d$anyfollowup == 1)), ], linetype = 3,
               aes(x = week, xend = week, y = 0, yend = enrolled)) +
  geom_segment(data = d[min(which(d$anyfollowup == 1)), ], linetype = 3,
               aes(y = enrolled, yend = enrolled, x = 0, xend = week)) +
  scale_y_continuous(breaks = seq(0, 2500, 500)) +
  xlim(0, 400) +
  scale_color_discrete("", guide = "none") +
  labs(x = "Week", y = "Count") +
  theme(legend.position = c(0.2, 0.8), panel.grid = element_blank())

grid.arrange(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "first"))
```
\clearpage
Let $\theta_a$ be the probability of food allergy at 18 months among infants who receive the acellular pertussis vaccine at the first dose, and $\theta_w$ the probability of food allergy at 18 months among infants who receive the whole-cell pertussis vaccine at the first dose. We are interested in estimating $\delta = \theta_w - \theta_a$, the difference in probability (or $\delta = \ln(\theta_w/(1-\theta_w)) - \ln(\theta_a/(1-\theta_a))$, the difference in log-odds), and investigating the statistical hypotheses
$$
\begin{aligned}
H_0: \delta &\geq 0 \\
H_1: \delta &< 0
\end{aligned}
$$
That is, $\theta_w$ is no lower than $\theta_a$, versus $\theta_w$ is lower than $\theta_a$.
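For example, under the illustrative true values $\theta_a = 0.10$ and $\theta_w = 0.07$ (one of the scenarios simulated below), the two scales give:

```{r}
theta_a <- 0.10
theta_w <- 0.07
theta_w - theta_a                  # risk difference: -0.03
qlogis(theta_w) - qlogis(theta_a)  # log-odds difference (qlogis is the logit)
```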
We could approach this in two ways:

- model each arm's response probability directly with independent Beta-Binomial models, or
- model the response via logistic (or logistic mixed) regression with a treatment indicator.

Given the lengthy delay in information, we will likely utilise posterior predictive probabilities at any interim analysis to impute the as-yet-unobserved data.
For the purpose of simplifying the simulations, the independent Beta-Binomial models make sense. However, for the actual analyses, if covariates are to be included, logistic or logistic mixed models would be used. In any case, the Beta-Binomial model should reasonably approximate a main-effects-only logistic regression.
Suppose that at each analysis $k=1,...,K$ we have data on $n_k^i$ individuals with $y_k^i$ responses, for $i\in\{a,w\}$. We also assume that $m_k^i\geq n_k^i$ individuals have been enrolled in total, not all of whom have data. The number without data is $\tilde n_k^i = m_k^i - n_k^i$. At an interim analysis we wish to impute the data for individuals enrolled but without follow-up; we denote the number of missing responses by $\tilde y_k^i$.
In addition to enrolled individuals with missing data, there are the yet-to-be-enrolled individuals making up the maximum sample size. At stage $K$ we have $n_K^i$ individuals with $y_K^i$ responses, so for this endpoint we have $\tilde n_k^i = n_K^i - n_k^i$ data points missing. In either case the posterior predictive distribution has the same form, differing only in its sample size parameter, so in what follows we do not distinguish between the two. However, it is standard to use $\tilde n_k^i = m_k^i - n_k^i$ in assessing expected success and $\tilde n_k^i = n_K^i - n_k^i$ in assessing futility.
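In code, the two predictive horizons at an interim are simply (per-arm counts are illustrative):

```{r}
n_k <- 250   # individuals with follow-up at interim k (per arm)
m_k <- 1000  # individuals enrolled at interim k (per arm)
n_K <- 1500  # maximum sample size (per arm)
n_tilde_success <- m_k - n_k   # impute to currently enrolled (expected success)
n_tilde_futility <- n_K - n_k  # impute to maximum sample size (futility)
```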
We specify the following model, for $i\in\{a,w\}$ and $k\in\{1,...,K\}$:
$$
\begin{aligned}
\pi_0^i(\theta^i) &= \text{Beta}(\theta^i|a^i,b^i) \\
f_k^i(y_k^i|\theta^i) &= \text{Binomial}(y_k^i|n_k^i, \theta^i) \\
\pi_k^i(\theta^i|y_k^i) &= \text{Beta}(\theta^i|a^i + y_k^i,b^i + n_k^i - y_k^i) \\
P_k &= \mathbb P_{\Theta^a,\Theta^w|Y_k^a,Y_k^w}(\theta^w<\theta^a) \\
&= \int_0^1 \pi_k^a(\theta^a|y_k^a) \left[\int_0^{\theta^a} \pi_k^w(\theta^w|y_k^w)\,d\theta^w\right] d\theta^a \\
\tilde f_k^i(\tilde y_k^i|y_k^i) &= \text{Beta-Binomial}(\tilde y_k^i|\tilde n_k^i, a^i+y_k^i, b^i+n_k^i-y_k^i) \\
\tilde \pi_k^i(\theta^i|y_k^i + \tilde y_k^i) &= \text{Beta}(\theta^i|a^i+y_k^i+\tilde y_k^i, b^i+n_k^i+\tilde n_k^i - y_k^i-\tilde y_k^i) \\
\tilde P_k &= \mathbb P_{\Theta^a,\Theta^w|Y_k^a+\tilde Y_k^a,Y_k^w+\tilde Y_k^w}(\theta^w<\theta^a) \\
&= \int_0^1 \tilde \pi_k^a(\theta^a|y_k^a + \tilde y_k^a) \left[\int_0^{\theta^a} \tilde \pi_k^w(\theta^w|y_k^w + \tilde y_k^w)\,d\theta^w\right] d\theta^a \\
\text{PPoS}_k(q) &= \mathbb E_{\tilde Y_k^a, \tilde Y_k^w|Y_k^a,Y_k^w}\left[\mathbb I\left\{\tilde P_k > q\right\}\right] \\
&= \sum_{i=0}^{\tilde n_k^a} \sum_{j=0}^{\tilde n_k^w} \mathbb I\left\{\tilde P_k>q\right\} \tilde f_k^w(j|y_k^w)\,\tilde f_k^a(i|y_k^a)
\end{aligned}
$$
where, in the double sum, $\tilde P_k$ is evaluated at $\tilde y_k^a = i$ and $\tilde y_k^w = j$.
The quantity $P_k$ cannot be calculated analytically, but it can be evaluated numerically or estimated using Monte Carlo methods. Although $\text{PPoS}_k$ can be computed exactly (assuming we have calculated the relevant values of $\tilde P_k$), Monte Carlo estimation may still be more efficient for large sample sizes.
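As a minimal sketch, $P_k$ can be estimated by drawing from the two independent Beta posteriors; the interim counts below are hypothetical and the priors are uniform:

```{r}
# Monte Carlo estimate of P_k = P(theta_w < theta_a | y) under Beta(1, 1) priors
set.seed(1)
y_a <- 45; n_a <- 500  # acellular arm: responses, sample size (hypothetical)
y_w <- 30; n_w <- 500  # whole-cell arm: responses, sample size (hypothetical)
draw_a <- rbeta(1e5, 1 + y_a, 1 + n_a - y_a)
draw_w <- rbeta(1e5, 1 + y_w, 1 + n_w - y_w)
mean(draw_w < draw_a)  # estimate of P_k
```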
Actual analyses would likely use logistic regression to allow adjustment for any confounders. Ideally, simulations would use the same model as the analyses; however, this would be computationally intensive unless we resort to density approximations (e.g. Laplace, variational). The Beta-Binomial model used for simulations should reasonably approximate the results of a logistic regression with treatment as the only covariate.
Let $X$ denote population-level covariates and $Z$ the group-specific covariates. Then, for individuals $i=1,...,n_k$ and stages $k=1,...,K$, we model:
$$
\begin{aligned}
\pi_0(\beta) &= N(0,\Sigma) \\
\pi_0(\gamma) &= N(0,\Omega) \\
\eta_i &= x_i^\top\beta + z_i^\top\gamma \\
p_i &= \text{logit}^{-1}(\eta_i) \\
f_k(y_{k,i}|\beta,\gamma) &= \text{Bernoulli}(p_i) \\
\pi_k(\beta|y_k) &\approx N^{-1}\sum_{j=1}^N \delta_{\beta^j}(d\beta),\quad \beta^{1:N}\sim\pi_k(\beta|y_k) \\
P_k &= \mathbb P_{\beta_1|Y_k}(\beta_1 < 0) \\
&= \int_{\{\beta:\beta_1<0\}} \pi_k(\beta|y_k)\,d\beta \\
&\approx N^{-1}\sum_{j=1}^N \mathbb I\left\{\beta_1^j<0\right\} \\
\tilde f_k(\tilde y_k|\beta,\gamma) &\approx N^{-1}\sum_{j=1}^N \delta_{\tilde y_k^j}(d\tilde y_k),\quad \tilde y_k^{1:N}\sim \tilde f_k(\tilde y_k|\beta,\gamma) \\
\tilde \pi_k(\beta|y_k,\tilde y_k^j) &\approx M^{-1}\sum_{l=1}^M \delta_{\beta^l}(d\beta),\quad \beta^{1:M}\sim\tilde\pi_k(\beta|y_k,\tilde y_k^j),\quad j=1,...,N \\
\tilde P_k^j &= \mathbb P_{\beta_1|Y_k,\tilde Y_k^j}(\beta_1 < 0) \approx M^{-1} \sum_{l=1}^M \mathbb I\left\{\beta^l_1 < 0\right\} \\
\text{PPoS}_k(q) &\approx N^{-1}\sum_{j=1}^N \mathbb I\left\{\tilde P_k^j > q\right\}
\end{aligned}
$$
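A minimal sketch of the sampling-based calculation of $P_k$ under the fixed-effects part of this model, using brms (the same package and prior scale, 1.75, as the exploratory code later in this document); the data are hypothetical and the chunk is not evaluated when knitting:

```{r, eval=FALSE}
library(brms)
# Hypothetical arm-level interim data: x is the treatment indicator
dat <- data.frame(x = c(0, 1), y = c(45, 30), n = c(500, 500))
m <- brm(y | trials(n) ~ x, data = dat, family = binomial,
         prior = c(prior(normal(0, 1.75), class = "b"),
                   prior(normal(0, 1.75), class = "Intercept")))
beta1 <- posterior_samples(m)[, "b_x"]  # draws of the treatment log-odds effect
mean(beta1 < 0)                         # Monte Carlo estimate of P_k
```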
At the final analysis (full follow-up on all individuals), a terminal decision is made regarding the difference in response between the two vaccines. This decision rule declares $\delta < 0$, $\delta \geq 0$, or that the study was inconclusive:
$$
\delta_K(y_K) = \begin{cases}
a_0 \text{ if } P_K \leq \underline c_K &\implies \text{accept } H_0 \\
a_1 \text{ if } P_K \geq \overline c_K &\implies \text{accept } H_1 \\
a_2 \text{ otherwise } &\implies \text{inconclusive}
\end{cases}
$$
At each interim analysis, a decision is made whether the study should be stopped for futility, stopped for expected success, or continue enrolment. This decision is based on $\text{PPoS}_k(q)$, which depends on the chosen $q$. Setting $q=\overline c_K$ perhaps makes the most sense, as this is the criterion used in assessing success at the final analysis. The interim decision rule is then
$$
\delta_k(y_k) = \begin{cases}
a_3 \text{ if } \text{PPoS}_k(\overline c_K) < \underline \kappa_k &\implies \text{futile to continue} \\
a_4 \text{ if } \text{PPoS}_k(\overline c_K) > \overline \kappa_k &\implies \text{expect success at interim} \\
a_5 \text{ otherwise } &\implies \text{continue to enrol to } k+1.
\end{cases}
$$
If the trial stops for futility, then once the remaining enrolled individuals are followed up, another predictive probability analysis of futility with respect to the original full sample size is undertaken.

If the trial stops for expected success, the terminal decision rule is assessed once complete follow-up on the remaining enrolled individuals is available.
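The decision logic can be written compactly. This is an illustrative encoding only, not the `optimum` package's interface; the default thresholds are those considered in the simulations below:

```{r}
# Terminal rule: applied at the final analysis with full follow-up
terminal_decision <- function(P_K, c_lo = 0.05, c_hi = 0.95) {
  if (P_K <= c_lo) "accept H0"
  else if (P_K >= c_hi) "accept H1"
  else "inconclusive"
}

# Interim rule: applied to PPoS_k(q), with q fixed at the success threshold
interim_decision <- function(ppos, k_lo = 0.1, k_hi = 0.9) {
  if (ppos < k_lo) "futile to continue"
  else if (ppos > k_hi) "expect success"
  else "continue"
}
```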
\clearpage
We want to investigate the operating characteristics of the trial for varying $\theta^a$ and $\theta^w$, and determine appropriate values of the following trial parameters:

- the terminal decision thresholds $(\underline c_K, \overline c_K)$,
- the interim decision thresholds $(\underline\kappa_k, \overline\kappa_k)$,
- the PPoS success criterion $q$.
Assuming a maximum sample size of 3,000 (1,500 in each arm) and a fixed final analysis, we estimate the following probabilities of success.
```{r}
# Probability of declaring success at a fixed final analysis of 1,500 per arm:
# average the success indicator over the joint sampling distribution of the data,
# ignoring outcomes with negligible probability.
p_success <- function(p_a, p_w, n = 1500, thres = 0.95) {
  P <- outer(0:n, 0:n, function(x, y) dbinom(x, n, p_a) * dbinom(y, n, p_w))
  sigPid <- which(P > 1e-6, arr.ind = TRUE)
  sum(apply(sigPid, 1, function(x)
    P[x[1], x[2]] *
      (beta_ineq_approx(x[1], n - x[1], x[2], n - x[2]) > thres)))
}
p1 <- p_success(0.10, 0.07)
p2 <- p_success(0.03, 0.015)
p3 <- p_success(0.28, 0.21)
data.frame(p1tru = c(0.1, 0.03, 0.28),
           p2tru = c(0.07, 0.015, 0.21),
           p_1 = c(p1, p2, p3)) %>%
  kable(booktabs = TRUE, digits = 3, escape = FALSE,
        col.names = c("$\\theta_a^\\star$", "$\\theta_w^\\star$",
                      "$\\mathbb P(\\Theta_1|y_K)$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position")
```
For assessing interim analyses, we assume two accrual scenarios: 20 per week and 10 per week. The difference between the two (apart from the total study length required) corresponds to varying the delay in information from follow-up. As previously stated, in the 20 per week case we expect the number of individuals with follow-up to be about 1,500 behind the number enrolled at the time of the first interim, whereas for 10 per week we expect it to be about 800 behind. This affects the value of $\text{PPoS}_k$, which will generally be closer to $P_k$ the larger the dimension of the posterior predictive distribution considered.
We also investigate the effect of setting $(\underline\kappa_k,\overline\kappa_k) \in \{(0.1,0.9), (0.05,0.95)\}$ while keeping $(\underline c_K, \overline c_K) = (0.05, 0.95)$ fixed. Additionally, we fix $q = \overline c_K = 0.95$.
We run 1,000 simulations for each scenario, assuming interim analyses at every additional 250 individuals with follow-up, giving 10 interims and the final analysis at stage 11.
As an example, we simulate two trials with $\theta_a = \theta_w = 0.1$. In this case the correct conclusion is $\delta\geq 0$. The progression of posteriors is shown in Figure \ref{fig:unnamed-chunk-4}. One of the trials stops early for futility and the other stops early for expected success.
```{r}
set.seed(6547654)
n_foll <- seq(500, 3000, 250) / 2
n_enro <- pmin(3000 / 2, n_foll + 1500 / 2)
n_miss <- n_enro - n_foll
sce <- sim_scenario(100, p1tru = 0.1, p2tru = 0.1,
                    n1int = n_foll, n2int = n_foll)
# Get PPoS at final sample size
sce_ppos <- calc_scenario_ppos(sce, k_ppos = 0.95,
                               post_method = "approx", pp_sim = 1e4,
                               ppos_name = "ppos_final")
# Get PPoS at interim imputing missing follow-up
sce_ppos <- calc_scenario_ppos(sce_ppos, k_ppos = 0.95,
                               m1int = n_miss, m2int = n_miss,
                               post_method = "approx", pp_sim = 1e4,
                               ppos_name = "ppos_interim")
res <- decide_trial(sce_ppos)
```
```{r}
p1 <- plot_trial(sce_ppos[sim_id == res[res == "futile" & stage > 1, sim_id][1]],
                 seq(0, 0.25, 0.001)) +
  geom_text(data = sce_ppos[sim_id == res[res == "futile" & stage > 1, sim_id][1]],
            aes(x = 0.02, y = stage + 0.4,
                label = unlist(lapply(sprintf('$P_k = %.2f$, $PPoS_k^{fin} = %.2f$',
                                              ptail, ppos_final), TeX)),
                height = NULL),
            size = 2.5, parse = TRUE) +
  geom_text(data = res[res == "futile" & stage > 1][1],
            aes(x = 0.2, y = stage + 0.4,
                label = paste("stop here because", res), height = NULL),
            size = 2.5) +
  scale_y_continuous(breaks = 1:11) +
  theme(legend.position = "top", panel.grid = element_blank()) +
  guides(fill = guide_legend(title.position = "top", title.hjust = 0.5))
p2 <- plot_trial(sce_ppos[sim_id == res[res == "expect success" & stage > 1, sim_id][1]],
                 seq(0, 0.25, 0.001)) +
  geom_text(data = sce_ppos[sim_id == res[res == "expect success" & stage > 1, sim_id][1]],
            aes(x = 0.02, y = stage + 0.4,
                label = unlist(lapply(sprintf('$P_k = %.2f$, $PPoS_k^{int} = %.2f$',
                                              ptail, ppos_interim), TeX)),
                height = NULL),
            size = 2.5, parse = TRUE) +
  geom_text(data = res[res == "expect success" & stage > 1][1],
            aes(x = 0.2, y = stage + 0.4,
                label = paste("stop here because", res), height = NULL),
            size = 2.5) +
  scale_y_continuous(breaks = 1:11) +
  guides(fill = "none") +
  theme(panel.grid = element_blank())
grid.arrange(rbind(ggplotGrob(p1), ggplotGrob(p2), size = "last"))
```
\clearpage
```{r}
n_foll <- seq(500, 3000, 250) / 2
n_enro <- pmin(3000 / 2, n_foll + 1500 / 2)
n_miss <- n_enro - n_foll
sims <- 1000
scenarios <- data.frame(scenario = 1:6,
                        p1tru = c(0.1, 0.1, 0.03, 0.03, 0.28, 0.28),
                        p2tru = c(0.1, 0.07, 0.03, 0.015, 0.28, 0.21))
pt <- proc.time()
registerDoParallel(cores = 4)
out <- foreach(i = 1:nrow(scenarios),
               .packages = c("optimum", "data.table")) %dopar% {
  sce <- sim_scenario(sims,
                      p1tru = scenarios[i, "p1tru"],
                      p2tru = scenarios[i, "p2tru"],
                      n1int = n_foll, n2int = n_foll)
  # Get PPoS at final sample size
  sce_ppos <- calc_scenario_ppos(sce, k_ppos = 0.95,
                                 post_method = "approx", pp_sim = 1e4,
                                 ppos_name = "ppos_final")
  # Get PPoS at interim imputing missing follow-up
  sce_ppos <- calc_scenario_ppos(sce_ppos, k_ppos = 0.95,
                                 m1int = n_miss, m2int = n_miss,
                                 post_method = "approx", pp_sim = 1e4,
                                 ppos_name = "ppos_interim")
}
rt <- proc.time() - pt
saveRDS(out, "inst/rmd/sim1_dat.rds")

out_dt <- rbindlist(out, idcol = "scenario")
res1 <- rbindlist(lapply(out, decide_trial), idcol = "scenario")
res1 <- out_dt[res1, on = .(scenario, sim_id, stage)]
res2 <- rbindlist(lapply(out, decide_trial, fut_k = 0.05, suc_k = 0.95),
                  idcol = "scenario")
res2 <- out_dt[res2, on = .(scenario, sim_id, stage)]
```
```{r}
tab <- res1[, .(p_early_success = mean(res == "expect success"),
                p_late_success = mean(res == "superior"),
                p_early_failure = mean(res == "futile"),
                p_late_failure = mean(res == "inferior"),
                p_success = mean(res %in% c("expect success", "superior")),
                p_failure = mean(res %in% c("futile", "inferior")),
                p_inconclusive = mean(res == "inconclusive"),
                p_stop_early = mean(stage < length(n_foll))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 3,
      caption = "Decision probabilities, $(\\underline\\kappa_k = 0.1, \\overline\\kappa_k=0.9)$",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb P(\\text{e.s})$", "$\\mathbb P(\\text{l.s})$",
                    "$\\mathbb P(\\text{e.f})$", "$\\mathbb P(\\text{l.f})$",
                    "$\\mathbb P(\\text{s})$", "$\\mathbb P(\\text{f})$",
                    "$\\mathbb P(\\text{inc})$", "$\\mathbb P(\\text{s.e})$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res2[, .(p_early_success = mean(res == "expect success"),
                p_late_success = mean(res == "superior"),
                p_early_failure = mean(res == "futile"),
                p_late_failure = mean(res == "inferior"),
                p_success = mean(res %in% c("expect success", "superior")),
                p_failure = mean(res %in% c("futile", "inferior")),
                p_inconclusive = mean(res == "inconclusive"),
                p_stop_early = mean(stage < length(n_foll))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 3,
      caption = "Decision probabilities, $(\\underline\\kappa_k = 0.05, \\overline\\kappa_k=0.95)$",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb P(\\text{e.s})$", "$\\mathbb P(\\text{l.s})$",
                    "$\\mathbb P(\\text{e.f})$", "$\\mathbb P(\\text{l.f})$",
                    "$\\mathbb P(\\text{s})$", "$\\mathbb P(\\text{f})$",
                    "$\\mathbb P(\\text{inc})$", "$\\mathbb P(\\text{s.e})$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res1[, .(expected_n = mean(n1 + n2),
                expected_p1 = mean(a1 / (a1 + b1)),
                expected_p2 = mean(a2 / (a2 + b2))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 2,
      caption = "Expected sample size and point estimates, $(\\underline\\kappa_k = 0.1, \\overline\\kappa_k=0.9)$.",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb E(N)$", "$\\mathbb E(\\theta^a)$", "$\\mathbb E(\\theta^w)$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res2[, .(expected_n = mean(n1 + n2),
                expected_p1 = mean(a1 / (a1 + b1)),
                expected_p2 = mean(a2 / (a2 + b2))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 2,
      caption = "Expected sample size and point estimates, $(\\underline\\kappa_k = 0.05, \\overline\\kappa_k=0.95)$.",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb E(N)$", "$\\mathbb E(\\theta^a)$", "$\\mathbb E(\\theta^w)$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
pdat <- res1[, .N, keyby = .(res, n1, p1tru, p2tru)]
pdat[, lab := paste("atop(theta[a]*'='*", p1tru,
                    ", theta[w]*'='*", p2tru, ")", sep = "")]
ggplot(pdat, aes(n1, N / sims)) +
  facet_grid(lab ~ res, labeller = labeller(lab = label_parsed)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(250, 1500, 500)) +
  ylim(0, 1) +
  labs(x = "Sample size in each group", y = "Proportion of trials") +
  theme(panel.grid = element_blank())
```
```{r}
pdat <- res2[, .N, keyby = .(res, n1, p1tru, p2tru)]
pdat[, lab := paste("atop(theta[a]*'='*", p1tru,
                    ", theta[w]*'='*", p2tru, ")", sep = "")]
ggplot(pdat, aes(n1, N / sims)) +
  facet_grid(lab ~ res, labeller = labeller(lab = label_parsed)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(250, 1500, 500)) +
  ylim(0, 1) +
  labs(x = "Sample size in each group", y = "Proportion of trials") +
  theme(panel.grid = element_blank())
```
\clearpage
```{r}
n_foll <- seq(500, 3000, 250) / 2
n_enro <- pmin(3000 / 2, n_foll + 800 / 2)
n_miss <- n_enro - n_foll
sims <- 1000
scenarios <- data.frame(scenario = 1:6,
                        p1tru = c(0.1, 0.1, 0.03, 0.03, 0.28, 0.28),
                        p2tru = c(0.1, 0.07, 0.03, 0.015, 0.28, 0.21))
pt <- proc.time()
registerDoParallel(cores = 4)
out <- foreach(i = 1:nrow(scenarios),
               .packages = c("optimum", "data.table")) %dopar% {
  sce <- sim_scenario(sims,
                      p1tru = scenarios[i, "p1tru"],
                      p2tru = scenarios[i, "p2tru"],
                      n1int = n_foll, n2int = n_foll)
  # Get PPoS at final sample size
  sce_ppos <- calc_scenario_ppos(sce, k_ppos = 0.95,
                                 post_method = "approx", pp_sim = 1e4,
                                 ppos_name = "ppos_final")
  # Get PPoS at interim imputing missing follow-up
  sce_ppos <- calc_scenario_ppos(sce_ppos, k_ppos = 0.95,
                                 m1int = n_miss, m2int = n_miss,
                                 post_method = "approx", pp_sim = 1e4,
                                 ppos_name = "ppos_interim")
}
rt <- proc.time() - pt
saveRDS(out, "inst/rmd/sim2_dat.rds")

out_dt <- rbindlist(out, idcol = "scenario")
res1 <- rbindlist(lapply(out, decide_trial), idcol = "scenario")
res1 <- out_dt[res1, on = .(scenario, sim_id, stage)]
res2 <- rbindlist(lapply(out, decide_trial, fut_k = 0.05, suc_k = 0.95),
                  idcol = "scenario")
res2 <- out_dt[res2, on = .(scenario, sim_id, stage)]
```
```{r}
tab <- res1[, .(p_early_success = mean(res == "expect success"),
                p_late_success = mean(res == "superior"),
                p_early_failure = mean(res == "futile"),
                p_late_failure = mean(res == "inferior"),
                p_success = mean(res %in% c("expect success", "superior")),
                p_failure = mean(res %in% c("futile", "inferior")),
                p_inconclusive = mean(res == "inconclusive"),
                p_stop_early = mean(stage < length(n_foll))),
            keyby = .(scenario, p1tru, p2tru)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 3,
      caption = "Decision probabilities, $(\\underline\\kappa_k = 0.1, \\overline\\kappa_k=0.9)$",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\mathbb P(\\text{e.s})$", "$\\mathbb P(\\text{l.s})$",
                    "$\\mathbb P(\\text{e.f})$", "$\\mathbb P(\\text{l.f})$",
                    "$\\mathbb P(\\text{s})$", "$\\mathbb P(\\text{f})$",
                    "$\\mathbb P(\\text{inc})$", "$\\mathbb P(\\text{s.e})$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res2[, .(p_early_success = mean(res == "expect success"),
                p_late_success = mean(res == "superior"),
                p_early_failure = mean(res == "futile"),
                p_late_failure = mean(res == "inferior"),
                p_success = mean(res %in% c("expect success", "superior")),
                p_failure = mean(res %in% c("futile", "inferior")),
                p_inconclusive = mean(res == "inconclusive"),
                p_stop_early = mean(stage < length(n_foll))),
            keyby = .(scenario, p1tru, p2tru)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 3,
      caption = "Decision probabilities, $(\\underline\\kappa_k = 0.05, \\overline\\kappa_k=0.95)$",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\mathbb P(\\text{e.s})$", "$\\mathbb P(\\text{l.s})$",
                    "$\\mathbb P(\\text{e.f})$", "$\\mathbb P(\\text{l.f})$",
                    "$\\mathbb P(\\text{s})$", "$\\mathbb P(\\text{f})$",
                    "$\\mathbb P(\\text{inc})$", "$\\mathbb P(\\text{s.e})$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res1[, .(expected_n = mean(n1 + n2),
                expected_p1 = mean(a1 / (a1 + b1)),
                expected_p2 = mean(a2 / (a2 + b2))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 2,
      caption = "Expected sample size and point estimates, $(\\underline\\kappa_k = 0.1, \\overline\\kappa_k=0.9)$.",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb E(N)$", "$\\mathbb E(\\theta^a)$", "$\\mathbb E(\\theta^w)$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
tab <- res2[, .(expected_n = mean(n1 + n2),
                expected_p1 = mean(a1 / (a1 + b1)),
                expected_p2 = mean(a2 / (a2 + b2))),
            keyby = .(scenario, p1tru, p2tru, fut_k, suc_k)]
kable(tab, booktabs = TRUE, escape = FALSE,
      linesep = c('', '\\addlinespace'), digits = 2,
      caption = "Expected sample size and point estimates, $(\\underline\\kappa_k = 0.05, \\overline\\kappa_k=0.95)$.",
      col.names = c("Scenario", "$\\theta_a^\\star$", "$\\theta_w^\\star$",
                    "$\\underline\\kappa_k$", "$\\overline\\kappa_k$",
                    "$\\mathbb E(N)$", "$\\mathbb E(\\theta^a)$", "$\\mathbb E(\\theta^w)$")) %>%
  kableExtra::kable_styling(latex_options = "hold_position", font_size = 8)
```
```{r}
pdat <- res1[, .N, keyby = .(res, n1, p1tru, p2tru)]
pdat[, lab := paste("atop(theta[a]*'='*", p1tru,
                    ", theta[w]*'='*", p2tru, ")", sep = "")]
ggplot(pdat, aes(n1, N / sims)) +
  facet_grid(lab ~ res, labeller = labeller(lab = label_parsed)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(250, 1500, 500)) +
  ylim(0, 1) +
  labs(x = "Sample size in each group", y = "Proportion of trials") +
  theme(panel.grid = element_blank())
```
```{r}
pdat <- res2[, .N, keyby = .(res, n1, p1tru, p2tru)]
pdat[, lab := paste("atop(theta[a]*'='*", p1tru,
                    ", theta[w]*'='*", p2tru, ")", sep = "")]
ggplot(pdat, aes(n1, N / sims)) +
  facet_grid(lab ~ res, labeller = labeller(lab = label_parsed)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(breaks = seq(250, 1500, 500)) +
  ylim(0, 1) +
  labs(x = "Sample size in each group", y = "Proportion of trials") +
  theme(panel.grid = element_blank())
```
\clearpage
In the two-arm case we are generally interested in at least one of the following equivalent probabilities:
$$
\begin{aligned}
\mathbb P_{X,Y}(X > Y + \delta) &= \int_\delta^1\int_0^{x-\delta}f_Y(y)\,dy\,f_X(x)\,dx \\
&= \int_\delta^1 f_X(x)F_Y(x-\delta)\,dx \\
&= 1 - \mathbb P_{X,Y}(X < Y + \delta) \\
\mathbb P_{X,Y}(Y < X - \delta) &= \int_0^{1-\delta}\int_{y+\delta}^1 f_X(x)\,dx\,f_Y(y)\,dy \\
&= \int_0^{1-\delta}f_Y(y)(1 - F_X(y+\delta))\,dy \\
&= 1 - \mathbb P_{X,Y}(Y > X -\delta)
\end{aligned}
$$
where $X\sim\text{Beta}(a,b)$ and $Y\sim\text{Beta}(c,d)$ are independent Beta distributions. The probability of the event $X>Y+\delta$ cannot be calculated analytically, but it can be evaluated by numerical integration of the univariate integral (for reasonable values of the parameters).
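A direct numerical implementation of the univariate form, as a sketch (the `optimum` function `beta_ineq` presumably does something similar):

```{r}
# P(X > Y + delta) for independent X ~ Beta(a, b), Y ~ Beta(c, d),
# via the univariate integral of f_X(x) * F_Y(x - delta) over [delta, 1].
beta_ineq_num <- function(a, b, c, d, delta = 0) {
  integrand <- function(x) dbeta(x, a, b) * pbeta(x - delta, c, d)
  integrate(integrand, lower = delta, upper = 1)$value
}
beta_ineq_num(3, 100, 13, 90)
```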
In the interest of speed, we might alternatively approximate the Beta distributions by Normal distributions. The approximation should be satisfactory if $\frac{a+1}{a-1}\approx 1$ and $\frac{b+1}{b-1}\approx 1$, in which case
$$
\text{Beta}(a,b) \approx N\left(\frac{a}{a+b}, \frac{ab}{(a+b)^2(a+b+1)}\right).
$$
```{r}
par(mfrow = c(2, 2), oma = c(1, 1, 1, 1), mar = c(1, 1, 1, 1), cex = 0.5)
plot_beta_norm(2, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(2,100)")
plot_beta_norm(5, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(5,100)")
plot_beta_norm(10, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(10,100)")
plot_beta_norm(15, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(15,100)")
```
Then we estimate the inequality by
$$
\begin{aligned}
m_X &= \frac{a}{a+b} \\
s^2_X &= \frac{ab}{(a+b)^2(a+b+1)} \\
m_Y &= \frac{c}{c+d} \\
s^2_Y &= \frac{cd}{(c+d)^2(c+d+1)} \\
z &= \frac{m_X - m_Y - \delta}{\sqrt{s_X^2+s_Y^2}} \\
\mathbb P_{X,Y}(X>Y+\delta) &\approx \Phi(z)
\end{aligned}
$$
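In code this is just a few lines, a sketch of what `beta_ineq_approx` presumably computes:

```{r}
# Normal approximation to P(X > Y + delta) via Beta moment matching
beta_ineq_norm <- function(a, b, c, d, delta = 0) {
  m_x <- a / (a + b); s2_x <- a * b / ((a + b)^2 * (a + b + 1))
  m_y <- c / (c + d); s2_y <- c * d / ((c + d)^2 * (c + d + 1))
  pnorm((m_x - m_y - delta) / sqrt(s2_x + s2_y))
}
beta_ineq_norm(3, 100, 13, 90)
```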
```{r}
library(bench)
mark(
  beta_ineq(3, 100, 13, 90),
  beta_ineq_approx(3, 100, 13, 90),
  beta_ineq_sim(3, 100, 13, 90, sims = 1000),
  check = FALSE
) %>%
  arrange(median) %>%
  select(expression, min, mean, median, max, `itr/sec`) %>%
  kable(row.names = FALSE, booktabs = TRUE, digits = 2) %>%
  kableExtra::kable_styling(latex_options = "hold_position")
```
The approximation is also reasonably accurate for most parameter settings; in the worst case, its error is no larger than that of a Monte Carlo estimate based on 1,000 draws.
```{r}
P_exact <- outer(0:50, 0:50, function(x, y)
  Vectorize(beta_ineq)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y))
P_approx <- outer(0:50, 0:50, function(x, y)
  Vectorize(beta_ineq_approx)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y))
P_sim <- outer(0:50, 0:50, function(x, y)
  Vectorize(beta_ineq_sim)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y, sims = 1000))
par(mfrow = c(1, 2), mar = c(4, 1, 1, 1), oma = c(0, 1, 1, 1),
    mgp = c(2, 1, 0), cex = 0.7)
matplot(P_approx - P_exact, type = 'l', lty = 1, col = "grey50",
        ylim = c(-0.08, 0.08), main = "Approx", xlab = "a")
matplot(P_sim - P_exact, type = 'l', lty = 1, col = "grey50",
        ylim = c(-0.08, 0.08), main = "Sim (N = 1,000)", xlab = "a")
```
A trade-off may be to use exact or simulation methods for parameter values where the approximation is known to be poor, and use the approximation otherwise.
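One way to realise this trade-off is a simple dispatch rule; the shape-parameter cut-off below is an assumption that would need tuning against the error plots above:

```{r}
# Hybrid rule: exact integration when any Beta shape parameter is small
# (where the normal approximation is poorest), approximation otherwise.
beta_ineq_hybrid <- function(a, b, c, d, shape_min = 5) {
  if (min(a, b, c, d) < shape_min) {
    beta_ineq(a, b, c, d)         # exact numerical integration
  } else {
    beta_ineq_approx(a, b, c, d)  # fast normal approximation
  }
}
```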
To compute the predictive probability of success, we take the expectation of an indicator function with respect to the posterior predictive distribution of the joint outcomes. In the two-arm case this is a double summation over the domain of the two predicted outcomes.

We can:

- enumerate the double summation exactly, or
- estimate the expectation by Monte Carlo, simulating outcomes from the posterior predictive distribution (see the sketch below).

For small sample sizes we should just enumerate over all values, but for larger predicted sample sizes Monte Carlo is preferable.
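A minimal Monte Carlo sketch under the Beta-Binomial model, reusing `beta_ineq_approx` as elsewhere in this document; the counts and priors are illustrative:

```{r}
# PPoS by simulation: draw the unobserved outcomes from their Beta-Binomial
# posterior predictive (rbeta then rbinom), update each arm's posterior, and
# check the updated tail probability against the success threshold q.
ppos_mc <- function(y_a, n_a, y_w, n_w, n_tilde, q = 0.95, sims = 1e4,
                    a0 = 1, b0 = 1) {
  th_a <- rbeta(sims, a0 + y_a, b0 + n_a - y_a)
  th_w <- rbeta(sims, a0 + y_w, b0 + n_w - y_w)
  yt_a <- rbinom(sims, n_tilde, th_a)  # predicted responses, acellular arm
  yt_w <- rbinom(sims, n_tilde, th_w)  # predicted responses, whole-cell arm
  P_tilde <- mapply(function(i, j)
    beta_ineq_approx(a0 + y_a + i, b0 + n_a + n_tilde - y_a - i,
                     a0 + y_w + j, b0 + n_w + n_tilde - y_w - j),
    yt_a, yt_w)
  mean(P_tilde > q)
}
ppos_mc(y_a = 45, n_a = 500, y_w = 30, n_w = 500, n_tilde = 250)
```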
Some exploratory notes follow in the code below: it compares the variational approximation, MCMC (via brms), and the Beta-Binomial model for the posterior and posterior predictive distributions, and benchmarks their run times.
```{r}
library(brms)     # MCMC logistic regression
library(mvtnorm)  # rmvnorm, for sampling the VB posterior

set.seed(6547654)
n_foll <- seq(500, 3000, 250) / 2
n_enro <- pmin(3000 / 2, n_foll + 1500 / 2)
n_miss <- n_enro - n_foll
sce <- sim_scenario(100, p1tru = 0.1, p2tru = 0.1,
                    n1int = n_foll, n2int = n_foll)
sce_ppos <- calc_scenario_ppos(sce, k_ppos = 0.95,
                               post_method = "approx", pp_sim = 1e4,
                               ppos_name = "ppos_final")
sce_ppos <- calc_scenario_ppos(sce_ppos, k_ppos = 0.95,
                               m1int = n_miss, m2int = n_miss,
                               post_method = "approx", pp_sim = 1e4,
                               ppos_name = "ppos_interim")
res <- decide_trial(sce_ppos)

# VB modelling
for (i in 1:nrow(sce)) {
  vb <- vb_logistic_n(cbind(1, c(0, 1)),
                      c(sce[i, y1], sce[i, y2]),
                      c(sce[i, n1], sce[i, n2]),
                      mu0 = c(0, 0), Sigma0 = diag(1.75^2, 2), alg = "sj")
  sce[i, `:=`(mu = vb$mu[2],
              s = sqrt(vb$Sigma[2, 2]),
              p1_vb = plogis(vb$mu[1]),
              p2_vb = plogis(vb$mu[1] + vb$mu[2]),
              ptail_vb = pnorm(0, vb$mu[2], sqrt(vb$Sigma[2, 2])))]
}

# MCMC fit to the first simulated interim via brms
x <- c(0, 1)
y <- sce[1, c(y1, y2)]
n <- sce[1, c(n1, n2)]
dat <- data.frame(x, y, n)
m <- brm(y | trials(n) ~ x, data = dat, family = binomial,
         prior = c(prior(normal(0, 1.75), class = "b"),
                   prior(normal(0, 1.75), class = "Intercept")),
         iter = 10000)
Y_brm <- posterior_predict(m, newdata = data.frame(x = c(0, 1), n = c(25, 25)))

# Predictive draws from the VB posterior
B <- rmvnorm(1e5, vb$mu, vb$Sigma)
P1 <- plogis(B[, 1])
P2 <- plogis(rowSums(B))
Y1 <- rbinom(1e5, 25, P1)
Y2 <- rbinom(1e5, 25, P2)
Y_vb <- cbind(Y1, Y2)

# Compare predictive distributions: MCMC, VB, Beta-Binomial
# (dbetabinom: Beta-Binomial pmf, assumed available)
par(mfrow = c(1, 3), mar = c(3, 2, 1, 1))
barplot(prop.table(table(factor(Y_brm[, 1], levels = 0:11))), ylim = c(0, 0.3))
barplot(prop.table(table(factor(Y_vb[, 1], levels = 0:11))), ylim = c(0, 0.3))
barplot(dbetabinom(0:11, 25, sce[1, a1], sce[1, b1]), ylim = c(0, 0.3))

par(mfrow = c(1, 3), mar = c(3, 2, 1, 1))
barplot(prop.table(table(factor(Y_vb[, 1], levels = 0:11))), ylim = c(0, 0.3))
barplot(setNames(dbinom(0:11, 25,
                        plogis(vb$mu[1, 1] / sqrt(1 + pi * vb$Sigma[1, 1] / 8))),
                 0:11), ylim = c(0, 0.3))
barplot(setNames(dbetabinom(0:11, 25, sce[1, a1], sce[1, b1]), 0:11),
        ylim = c(0, 0.3))

# Posterior of the log-odds difference: Beta posteriors vs MCMC vs VB normal
y1 <- rbeta(1e5, sce[1, a1], sce[1, b1])
y2 <- rbeta(1e5, sce[1, a2], sce[1, b2])
hist(log(y2 / (1 - y2)) - log(y1 / (1 - y1)), freq = FALSE, breaks = 100)
lines(density(posterior_samples(m)[, 2]), col = "blue")
curve(dnorm(x, sce$mu[1], sce$s[1]), add = TRUE, col = "red")

# Repeat at the final stage (row 11)
x <- c(0, 1)
y <- sce[11, c(y1, y2)]
n <- sce[11, c(n1, n2)]
dat <- data.frame(x, y, n)
m <- brm(y | trials(n) ~ x, data = dat, family = binomial,
         prior = c(prior(normal(0, 1.75), class = "b"),
                   prior(normal(0, 1.75), class = "Intercept")),
         iter = 10000, file = "inst/stan/brmlogreg")
y1 <- rbeta(1e5, sce[11, a1], sce[11, b1])
y2 <- rbeta(1e5, sce[11, a2], sce[11, b2])
hist(log(y2 / (1 - y2)) - log(y1 / (1 - y1)), freq = FALSE, breaks = 100)
lines(density(posterior_samples(m)[, 2]), col = "blue")
curve(dnorm(x, sce$mu[11], sce$s[11]), add = TRUE, col = "red")

# Timing: VB versus one chain of MCMC
library(bench)
mark(
  vb_logistic_n(cbind(1, c(0, 1)),
                c(sce[1, y1], sce[1, y2]),
                c(sce[1, n1], sce[1, n2]),
                mu0 = c(0, 0), Sigma0 = diag(1.75^2, 2), alg = "sj"),
  m <- brm(y | trials(n) ~ x, data = dat, family = binomial,
           prior = c(prior(normal(0, 1.75), class = "b"),
                     prior(normal(0, 1.75), class = "Intercept")),
           chains = 1, iter = 2000, file = "inst/stan/brmlogreg"),
  check = FALSE, iterations = 50
)
```
```{r}
# out <- sce_ppos[res, on = .(sim_id, stage)][1, ]
# y1pred <- rbinom(10000, out$ppos_final_m1, out$p1tru)
# y2pred <- rbinom(10000, out$ppos_final_m2, out$p2tru)
# y1pred <- rbetabinom(10000, out$ppos_final_m1, out$a1, out$b1)
# y2pred <- rbetabinom(10000, out$ppos_final_m2, out$a2, out$b2)
# P <- beta_ineq_approx(out$a1 + y1pred, out$b1 + out$ppos_final_m1 - y1pred,
#                       out$a2 + y2pred, out$b2 + out$ppos_final_m2 - y2pred)
# mean(P > 0.95)

n_foll <- seq(100, 1000, 100) / 2
n_enro <- pmin(1000 / 2, n_foll + 50 / 2)
n_miss <- n_enro - n_foll
sce <- sim_scenario(100, p1tru = 0.1, p2tru = 0.1,
                    n1int = n_foll, n2int = n_foll)
# sce_ppos <- calc_scenario_ppos(sce, k_ppos = 0.95,
#                                post_method = "approx", pp_sim = 1e4,
#                                ppos_name = "ppos_final")
# sce_ppos <- calc_scenario_ppos(sce_ppos, k_ppos = 0.95,
#                                m1int = n_miss, m2int = n_miss,
#                                post_method = "approx", pp_sim = 1e4,
#                                ppos_name = "ppos_interim")
# Note: input is `sce` here, not `sce_ppos`, since the calls above that
# would have created `sce_ppos` are commented out.
sce_ppos <- calc_scenario_ppos(
  sce, k_ppos = 0.95,
  m1int = seq(4500, 0, length.out = length(n_foll)),
  m2int = seq(4500, 0, length.out = length(n_foll)),
  post_method = "approx", pp_sim = 1e4, ppos_name = "ppos_test")

# This makes sense: from n = 750 onwards the interim and final analyses have
# the same prospective sample size; for n = 250, 500 the final analysis has a
# larger prospective sample size, so should have a higher probability of success.
# ggplot(sce_ppos, aes(ptail, ppos_interim)) +
#   facet_wrap(~ n1) +
#   geom_point() +
#   geom_abline() +
#   labs(x = expression(P[k]), y = expression(PPoS[k]))
#
# ggplot(sce_ppos, aes(ptail, ppos_final)) +
#   facet_wrap(~ n1) +
#   geom_point() +
#   geom_abline() +
#   labs(x = expression(P[k]), y = expression(PPoS[k]))

ggplot(sce_ppos, aes(ptail, ppos_test)) +
  facet_wrap(~ ppos_test_m1) +
  geom_point() +
  geom_abline() +
  labs(x = expression(P[k]), y = expression(PPoS[k]))
```