knitr::opts_chunk$set(
    echo = FALSE,
    message = FALSE,
    warning = FALSE,
    collapse = TRUE,
    comment = "#>"
)
library(knitr)
library(dplyr)
library(optimum)
library(bench)

In the two-arm case we are generally interested in at least one of the following equivalent probabilities
$$
\begin{aligned}
\mathbb P_{X,Y}(X > Y + \delta) &= \int_\delta^1\int_0^{x-\delta}f_Y(y)\,dy\,f_X(x)\,dx \\
&= \int_\delta^1 f_X(x)F_Y(x-\delta)\,dx \\
&= 1 - \mathbb P_{X,Y}(X < Y + \delta) \\
\mathbb P_{X,Y}(Y < X - \delta) &= \int_0^{1-\delta}\int_{y+\delta}^1f_X(x)\,dx\,f_Y(y)\,dy \\
&= \int_0^{1-\delta}f_Y(y)(1 - F_X(y+\delta))\,dy \\
&= 1 - \mathbb P_{X,Y}(Y > X -\delta)
\end{aligned}
$$
where $X\sim\text{Beta}(a,b)$ and $Y\sim\text{Beta}(c,d)$ are independent Beta random variables. The probability of the event $X>Y+\delta$ cannot be calculated analytically, but it can be computed by numerical integration of the univariate integral above (for reasonable values of the parameters).
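The second identity reduces the problem to a one-dimensional integral of standard Beta density and distribution functions, which base R can evaluate directly. The sketch below is a minimal version of this computation; the name `beta_ineq_numeric` is ours, and the package's `beta_ineq()` presumably works along similar lines.

# P(X > Y + delta) by univariate quadrature, using the identity
# P(X > Y + delta) = int_delta^1 f_X(x) F_Y(x - delta) dx.
# A minimal sketch; optimum::beta_ineq() presumably does something similar.
beta_ineq_numeric <- function(a, b, c, d, delta = 0) {
  integrand <- function(x) dbeta(x, a, b) * pbeta(x - delta, c, d)
  integrate(integrand, lower = delta, upper = 1)$value
}
beta_ineq_numeric(5, 100, 10, 100)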

In fact, the exact PDF of $\Theta = X - Y$ in the case $\delta=0$ has been shown to be
$$
f_\Theta(\theta) = \begin{cases}
\frac{B(c,b)\theta^{b+d-1}(1 - \theta)^{c+b-1}}{B(a,b)B(c,d)}F_1(b,a+b+c+d-2,1-a;b+c;1-\theta,1-\theta^2) &\text{ if } 0<\theta\leq 1 \\
\frac{B(a+c-1,b+d-1)}{B(a,b)B(c,d)} &\text{ if } \theta=0 \\
\frac{B(a,d)(-\theta)^{b+d-1}(1+\theta)^{a+d-1}}{B(a,b)B(c,d)}F_1(d,1-c,a+b+c+d-2;a+d;1-\theta^2,1+\theta) &\text{ if } -1\leq \theta < 0
\end{cases}
$$
where
$$
F_1(a,b_1,b_2;c;x_1,x_2) = \sum_{m=0}^\infty\sum_{n=0}^\infty \frac{(a)_{m+n}(b_1)_m(b_2)_n}{(c)_{m+n}\,m!\,n!}x_1^m x_2^n
$$
is the first Appell hypergeometric function, with $(q)_k$ denoting the rising factorial. However, available implementations of the Appell function may be numerically unstable.

In the interest of speed, we might instead approximate the Beta distributions by Normal distributions with matching mean and variance. The approximation should be satisfactory if $\frac{a+1}{a-1}\approx 1$ and $\frac{b+1}{b-1}\approx 1$, in which case
$$
\text{Beta}(a,b)\approx N\left(\frac{a}{a+b}, \frac{ab}{(a+b)^2(a+b+1)}\right).
$$

# Compare Beta(a, 100) densities with their Normal approximations;
# the approximation improves as the first parameter grows
par(mfrow = c(2, 2), oma = c(1, 1, 1, 1), mar = c(1, 1, 1, 1), cex = 0.5)
plot_beta_norm(2, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(2,100)")
plot_beta_norm(5, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(5,100)")
plot_beta_norm(10, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(10,100)")
plot_beta_norm(15, 100, xlim = c(0, 0.2), xaxt = 'n', yaxt = 'n', main = "Beta(15,100)")

We then approximate the probability by moment matching:
$$
\begin{aligned}
m_X &= \frac{a}{a+b}, \quad s^2_X = \frac{ab}{(a+b)^2(a+b+1)} \\
m_Y &= \frac{c}{c+d}, \quad s^2_Y = \frac{cd}{(c+d)^2(c+d+1)} \\
z &= \frac{m_X - m_Y - \delta}{\sqrt{s_X^2+s_Y^2}} \\
\mathbb P_{X,Y}(X>Y+\delta) &\approx \Phi(z).
\end{aligned}
$$
The approximation should be adequate for reasonable values of $(a,b,c,d)$.
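These steps translate directly into a few lines of R. The sketch below uses our own name `beta_ineq_normal`; the package's `beta_ineq_approx()` presumably implements the same idea.

# Normal approximation to P(X > Y + delta) for independent
# X ~ Beta(a, b) and Y ~ Beta(c, d), by matching means and variances.
# A sketch only; optimum::beta_ineq_approx() presumably does likewise.
beta_ineq_normal <- function(a, b, c, d, delta = 0) {
  m_x  <- a / (a + b)
  m_y  <- c / (c + d)
  s2_x <- a * b / ((a + b)^2 * (a + b + 1))
  s2_y <- c * d / ((c + d)^2 * (c + d + 1))
  pnorm((m_x - m_y - delta) / sqrt(s2_x + s2_y))
}
beta_ineq_normal(5, 100, 10, 100)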

# Simulate X - Y for a = 5, b = 100, c = 10, d = 100 and overlay
# the Normal approximation N(m1 - m2, v1 + v2)
a <- 5
b <- 100
c <- 10
d <- 100
m1 <- a / (a + b)
m2 <- c / (c + d)
v1 <- a * b / ((a + b)^2 * (a + b + 1))
v2 <- c * d / ((c + d)^2 * (c + d + 1))
y1 <- rbeta(1e5, a, b)
y2 <- rbeta(1e5, c, d)

par(oma = c(2, 2, 1, 1), mar = c(1, 1, 1, 1), cex = 0.75)
hist(y1 - y2, freq = FALSE, breaks = 100, main = "X - Y; a = 5, b = 100, c = 10, d = 100")
x <- seq(min(y1 - y2), max(y1 - y2), length.out = 1e3)
lines(x, dnorm(x, m1 - m2, sqrt(v1 + v2)))
legend("topleft", legend = "Normal approx", lty = 1, bty = 'n')

# Repeat with a more skewed first arm: a = 2, b = 50, c = 10, d = 50
a <- 2
b <- 50
c <- 10
d <- 50
m1 <- a / (a + b)
m2 <- c / (c + d)
v1 <- a * b / ((a + b)^2 * (a + b + 1))
v2 <- c * d / ((c + d)^2 * (c + d + 1))
y1 <- rbeta(1e5, a, b)
y2 <- rbeta(1e5, c, d)

par(oma = c(2, 2, 1, 1), mar = c(1, 1, 1, 1), cex = 0.75)
hist(y1 - y2, freq = FALSE, breaks = 100, main = "X - Y; a = 2, b = 50, c = 10, d = 50")
x <- seq(min(y1 - y2), max(y1 - y2), length.out = 1e3)
lines(x, dnorm(x, m1 - m2, sqrt(v1 + v2)))
legend("topleft", legend = "Normal approx", lty = 1, bty = 'n')

The gain in computational speed from the Normal approximation is large.

# Benchmark the exact (numerical integration), Normal approximation,
# and Monte Carlo implementations on the same parameter values
mark(
  beta_ineq(3, 100, 13, 90),
  beta_ineq_approx(3, 100, 13, 90),
  beta_ineq_sim(3, 100, 13, 90, sims = 1000),
  check = FALSE,
  iterations = 1000
) %>%
  arrange(median) %>%
  select(expression, min, mean, median, max, `itr/sec`) %>%
  kable("html", row.names = FALSE, booktabs = TRUE, digits = 2) %>%
  kableExtra::kable_styling(latex_options = "hold_position")

The approximation is also reasonably accurate for most parameter settings; in the worst case, its error is no larger than that of a Monte Carlo estimate using 1,000 particles.
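For reference, a Monte Carlo estimate with a given number of particles amounts to comparing paired draws from the two Beta distributions. The name `beta_ineq_mc` below is ours, standing in for the package's `beta_ineq_sim()`.

# Monte Carlo estimate of P(X > Y + delta) from paired Beta draws.
# A minimal sketch; optimum::beta_ineq_sim() presumably works this way.
beta_ineq_mc <- function(a, b, c, d, delta = 0, sims = 1000) {
  mean(rbeta(sims, a, b) > rbeta(sims, c, d) + delta)
}
beta_ineq_mc(5, 100, 10, 100, sims = 1000)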

# Evaluate all three methods over the grid of posterior parameters that
# arise from 50 Bernoulli observations per arm under a Beta(1, 1) prior
P_exact <- outer(0:50, 0:50, function(x, y) Vectorize(beta_ineq)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y))
P_approx <- outer(0:50, 0:50, function(x, y) Vectorize(beta_ineq_approx)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y))
P_sim <- outer(0:50, 0:50, function(x, y) Vectorize(beta_ineq_sim)(1 + x, 1 + 50 - x, 1 + y, 1 + 50 - y, sims = 1000))

par(mfrow = c(1, 2), mar = c(4,1,1,1), oma = c(0,1,1,1), mgp = c(2, 1, 0), cex = 0.7)
matplot(P_approx - P_exact, type = 'l', lty = 1, col = "grey50", ylim = c(-0.08, 0.08), main = "Approx", xlab = "a")
matplot(P_sim - P_exact, type = 'l', lty = 1, col = "grey50", ylim = c(-0.08, 0.08), main = "Sim (N = 1,000)", xlab = "a")

A practical trade-off may be to use the exact or simulation methods for parameter values where the approximation is known to be poor, and to use the faster approximation otherwise, as sketched below.
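A simple dispatching wrapper could realise this. The cut-off `thresh = 5` below is illustrative only, and the calls assume the package signatures used earlier in this document.

# Hypothetical dispatcher: fall back to exact numerical integration when
# any Beta parameter is small (where the Normal approximation is poorest),
# and use the fast approximation otherwise. thresh = 5 is illustrative.
beta_ineq_auto <- function(a, b, c, d, thresh = 5) {
  if (min(a, b, c, d) < thresh) {
    beta_ineq(a, b, c, d)         # exact, via numerical integration
  } else {
    beta_ineq_approx(a, b, c, d)  # fast Normal approximation
  }
}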


