options(digits = 3)
library(hrbrthemes)
ggplot2::theme_set(theme_ipsum())
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
library(BCDA)
library(tibble)
library(purrr)
library(ggplot2)
library(gt)
library(ggdist)

The following table is from a report on the relationship between aspirin use and heart attacks by the Physicians' Health Study Research Group at Harvard Medical School. It was a 5-year randomized study of whether regular aspirin intake reduces mortality from cardiovascular disease. Of the 11,034 physicians taking a placebo, 18 and 171 suffered fatal and non-fatal heart attacks, respectively. Of the 11,037 taking aspirin, 5 and 99 suffered fatal and non-fatal heart attacks, respectively.

aspirin <- matrix(c(18 + 171, 10845, 5 + 99, 10933), nrow = 2, byrow = TRUE)
rownames(aspirin) <- c("Placebo", "Aspirin")
colnames(aspirin) <- c("Myocardial Infarction", "No Attack")
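# Reorder rows and columns so that Aspirin and No Attack come first: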
aspirin <- aspirin[2:1, 2:1]
knitr::kable(addmargins(aspirin))

Frequentist Approach

Using the frequentist approach, the individual cell probabilities $\pi_{ij}$ are estimated via $p_{ij} = n_{ij}/n_{++}$:

aspirin %>%
  prop.table %>%
  addmargins %>%
  knitr::kable(digits = 3)

We can perform a null hypothesis significance test of association using the chi-squared test or Fisher's exact test, both of which yield p-values < 0.001.

stats::chisq.test(aspirin) # usually done for large sample tables
stats::fisher.test(aspirin) # usually done for small sample tables

The two major problems with the frequentist approach are:

  1. The rarity of the disease, which leaves the heart attack cells with very small counts relative to the group totals.
  2. The large sample size yields a tiny p-value even for a small effect, so we call the result highly statistically significant.

As Andrew Gelman said, "In general: small n, unlikely to get small p-values. Large n, likely to find something. Huge n, almost certain to find lots of small p-values."

Alternatively, we can abandon the null hypothesis significance testing (NHST) approach that yields a p-value and instead employ Bayesian methods (which have their own quirks).

Bayesian Approach

In Bayesian statistics, we are concerned with the posterior distribution of the parameter(s) of interest $\theta$ given the data. Using Bayes' theorem, we can express the posterior distribution $p(\theta~|~\text{data})$ as proportional to the product of the likelihood $p(\text{data}~|~\theta)$ and the prior $p(\theta)$: $$p(\theta~|~\text{data}) \propto p(\text{data}~|~\theta)~p(\theta).$$

In essence, while the classical (frequentist) approach regards $\theta$ as a fixed quantity whose value is to be estimated, the Bayesian perspective views $\theta$ as a random variable whose posterior probability distribution is to be estimated.
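
For instance, in the conjugate Beta-Binomial setting used later in this vignette, the posterior has a closed form: a $\text{Beta}(\alpha, \beta)$ prior combined with $x$ successes in $n$ trials yields a $\text{Beta}(\alpha + x, \beta + n - x)$ posterior. A minimal sketch with hypothetical counts:

# Hypothetical data: x successes out of n trials
x <- 7
n <- 20
# Uniform Beta(1, 1) prior:
a <- 1
b <- 1
# Conjugate update yields Beta(a + x, b + n - x):
c(shape1 = a + x, shape2 = b + n - x)
# Posterior mean:
(a + x) / (a + b + n)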

We can use the est_multinom function to estimate the multinomial cell probabilities ($\pi_{11}, \pi_{12}, \pi_{21}, \pi_{22}$) using methods developed by Fienberg and Holland (1973). When we don't provide hyperparameters $\gamma_{ij}$, they are calculated using a simple model ($\gamma_{ij} = p_{i+}p_{+j}$).

est_multinom(aspirin)
aspirin %>%
  est_multinom %>%
  addmargins %>%
  knitr::kable(digits = 3)
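
To see where the default hyperparameters come from, here is a sketch of the simple model $\gamma_{ij} = p_{i+}p_{+j}$ computed by hand from the marginal proportions (est_multinom performs an equivalent calculation internally when no prior is supplied):

p <- prop.table(aspirin)
# Hyperparameters as products of row and column marginal proportions:
gamma <- outer(rowSums(p), colSums(p))
knitr::kable(gamma, digits = 3)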

We can also employ the knowledge that we do have about the prevalence of heart attacks in the U.S., via the American Heart Association.

prevalence <- data.frame(
  Incidence = c(
    0.8, 2.2, 0.2, 1.0,
    2.0, 3.6, 1.0, 2.3,
    3.8, 5.7, 2.0, 3.7,
    6.6, 8.1, 3.6, 7.2,
    9.1, 12.9, 7.8, 10.2
  ),
  Age = c(
    rep('35-44', 4),
    rep('45-54', 4),
    rep('55-64', 4),
    rep('65-74', 4),
    rep('75-84', 4)
  ),
  Sex = c('Men', 'Women')[rep(c(1, 1, 2, 2), 5)],
  Race = c('White', 'Black')[rep(rep(c(1, 2), 2), 5)]
)
prevalence <- transform(prevalence, Group = paste(Race, Sex))
ggplot(prevalence) +
  geom_bar(
    aes(y = Incidence, x = Age, fill = Group),
    stat = "identity", position = "dodge"
  ) +
  scale_y_continuous(name = "Prevalence per 1,000 persons") +
  ggtitle("Incidence of myocardial infraction", "Source: American Heart Association")

The prevalence (averaged across age, sex, and race) is 0.469%, which we can use as very naive prior information in our model.

prior <- matrix(
  c(1 - mean(prevalence$Incidence / 1000), mean(prevalence$Incidence / 1000)),
  ncol = 2, byrow = TRUE
)
colnames(prior) <- c('No Attack', 'Avg Prevalence of M.I.')
knitr::kable(prior, digits = 3)
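# Split the prior mass evenly between the two groups so the four cell probabilities sum to 1: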
prior <- prior[c(1, 1), ]/2
rownames(prior) <- c('Aspirin', 'Placebo')
knitr::kable(prior, digits = 3)
est_multinom(aspirin, prior = prior)
aspirin %>%
  est_multinom(prior = prior) %>%
  addmargins %>%
  knitr::kable(digits = 3)

Strength and Direction of Association

We assume an independent Binomial model with independent Beta priors on the probability parameter. For each group $i = 1, 2$ (aspirin and placebo): $$X_i \sim \text{Binom}(N_i, \pi_i),~\text{and}\ \pi_i \sim \text{Beta}(\alpha_i, \beta_i).$$

However, we still need to provide the $\alpha$ and $\beta$ hyperparameters. We can set them to $\alpha=\beta=1$, which corresponds to a non-informative prior on $\pi$, or we can specify them such that the shape of the Beta prior better reflects our knowledge about the probability of (not) having a heart attack while remaining non-informative with respect to the groups. After all, we don't know which group has the better outcome, but if we did, we could specify the parameters to reflect that knowledge.
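
For illustration, using the prior format shown later in this vignette (a named vector of Beta shape parameters: a and b for group 1, c and d for group 2), a prior that places most of its mass near 1 for both groups might look like the following. The specific values are assumptions for illustration only, and we proceed with the default prior below:

# Same shapes for both groups, so the prior favors neither group:
informative_prior <- c(a = 99, b = 1, c = 99, d = 1)
# Implied prior mean for each group's probability of no attack:
99 / (99 + 1)
# fit_informative <- beta_binom(aspirin, prior = informative_prior)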

set.seed(0)
fit <- beta_binom(aspirin)

We can also provide success counts and totals:

# fit <- beta_binom(x = aspirin[, 1], n = aspirin[, 1] + aspirin[, 2])

What the model and Bayesian inference do is update the distributions of the parameters (p1 and p2) from the prior (before the data were observed) to the posterior (based on the observed data), shifting our certainty about which values of p1 and p2 are more or less likely than others, roughly speaking:
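
With the default Beta(0.5, 0.5) prior, this update has a closed form: each group's posterior is Beta(0.5 + successes, 0.5 + failures). Assuming beta_binom() performs this conjugate update, the posterior parameters can be reproduced by hand:

x <- aspirin[, "No Attack"]  # "successes" per group
n <- rowSums(aspirin)        # totals per group
# Posterior shape parameters under a Beta(0.5, 0.5) prior:
rbind(shape1 = 0.5 + x, shape2 = 0.5 + (n - x))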

prior_posterior <- list(
  prior = tribble(
    ~value, ~parameter, ~hyperparameter,
    0.5, "p1", "shape1",
    0.5, "p1", "shape2",
    0.5, "p2", "shape1",
    0.5, "p2", "shape2"
  ),
  posterior = tibble(
    value = fit$posterior_parameters,
    parameter = c(rep("p1", 2), rep("p2", 2)),
    hyperparameter = rep(c("shape1", "shape2"), 2)
  )
) %>%
  dplyr::bind_rows(.id = "distribution") %>%
  dplyr::mutate(
    distribution = factor(distribution, c("prior", "posterior")),
    parameter = factor(parameter, c("p1", "p2"))
  ) %>%
  tidyr::pivot_wider(values_from = "value", names_from = "hyperparameter")

ggplot(prior_posterior, aes(y = 0)) +
  stat_dist_slab(
    aes(dist = "beta", arg1 = shape1, arg2 = shape2),
    slab_type = "cdf", limits = c(0.95, 1.0), color = "black"
  ) +
  facet_grid(parameter ~ distribution) +
  scale_x_continuous(name = NULL) +
  scale_y_continuous(name = "Cumulative probability", labels = scales::percent_format()) +
  ggtitle(
    "Comparison of prior and posterior cumulative distribution functions",
    "Default Beta(0.5, 0.5) prior compared to posterior obtained via beta_binom()"
  ) +
  coord_cartesian(xlim = c(0.95, 1.0)) +
  ggdist::theme_ggdist()
plot(fit, interval_type = "HPD") # HPD intervals require {coda}
# plot(fit) will use credible intervals computed using the quantile method
summary(fit, interval_type = "HPD", digits = 2) # HPD intervals require {coda}
# summary(fit) will give credible intervals using the quantile method
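
To see what the HPD computation involves: {coda} finds the narrowest interval containing 95% of the posterior mass from a sample of draws. A minimal sketch for p1, using the conjugate posterior parameters derived above:

# Draws from the posterior of p1 (aspirin group, probability of no attack):
draws <- rbeta(1e4, shape1 = 0.5 + 10933, shape2 = 0.5 + 104)
# Narrowest interval containing 95% of the sampled mass:
coda::HPDinterval(coda::as.mcmc(draws))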

To produce a LaTeX or Markdown version of the table, use the underlying present_bbfit function that summary calls for beta_binomial_fit objects:

present_bbfit(fit, interval_type = "HPD", digits = 2) # HPD intervals require {coda}
# present_bbfit(fit) will give credible intervals using the quantile method

Roughly 0.8% more of the aspirin takers don't experience a heart attack. The 95% credible interval (computed as the highest posterior density interval) for this difference of proportions ($p_{\text{aspirin, no attack}} - p_{\text{placebo, no attack}}$) is (0.48%, 1.09%). Aspirin takers were 1.005-1.01 times as likely to have no heart attack. The odds of not having a heart attack among aspirin takers were 1.42-2.303 times the odds in the placebo group.
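
These summaries can be sanity-checked against point estimates computed directly from the table:

p1 <- aspirin["Aspirin", "No Attack"] / sum(aspirin["Aspirin", ])
p2 <- aspirin["Placebo", "No Attack"] / sum(aspirin["Placebo", ])
p1 - p2                           # difference of proportions, approx. 0.0077
p1 / p2                           # relative risk, approx. 1.008
(p1 / (1 - p1)) / (p2 / (1 - p2)) # odds ratio, approx. 1.83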

Tidy Summaries and Functional Programming

Because BCDA implements the tidy() verb, it's easy to compare the results of different priors using purrr's map family of functions and obtain a single tidy data frame, which we can then present in table form with gt or in chart form with ggplot2.

priors <- list(
  Jeffreys = c(a = 0.5, b = 0.5, c = 0.5, d = 0.5),
  Uniform = c(a = 1, b = 1, c = 1, d = 1),
  "Highly Subjective" = c(a = 40, b = 1.1, c = 20, d = 1.2)
)

# What if we only had 1/100th of subjects in each cell?
aspirin_subset <- round(aspirin / 100)

# We can use functional programming with tidy data philosophy:
fits <- priors %>%
  map(~ beta_binom(aspirin_subset, prior = .x)) %>%
  map_df(tidy, .id = "prior") %>%
  dplyr::filter(term %in% c("p1", "p2"))
fits %>%
  gt(groupname_col = "prior", rowname_col = "term") %>%
  fmt_markdown(vars(term)) %>%
  fmt_number(vars(estimate, std.error, conf.low, conf.high), decimals = 3) %>%
  cols_merge(
    vars(conf.low, conf.high),
    pattern = "{1}&mdash;{2}"
  ) %>%
  cols_merge(
    vars(estimate, std.error),
    pattern = "{1} ({2})"
  ) %>%
  cols_label(
    estimate = "Estimate (Standard Error)",
    conf.low = "95% Credible Interval"
  ) %>%
  tab_header(
    md("Posterior estimates from `beta_binom()`"),
    "By different choices of prior"
  )
# And also visualize the estimates with ggplot2:
ggplot(fits) +
  geom_pointrange(
    aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high, color = prior),
    position = position_dodge(width = 0.5)
  ) +
  scale_x_discrete("Group", labels = c("Aspirin", "Placebo"), limits = c("p1", "p2")) +
  scale_y_continuous("Probability of No Attack", labels = scales::percent_format()) +
  coord_flip() +
  ggtitle("Comparison of prior choice impact on posterior")

Updating

BCDA also supports Bayesian updating via update(), with the posterior from a previous fit serving as the prior for newly observed data. To demonstrate, we simulate two groups (true proportions 0.45 and 0.35) and update the fit each day for two weeks:
set.seed(42)
group_1 <- sample.int(2, 20, prob = c(0.55, 0.45), replace = TRUE) - 1
group_2 <- sample.int(2, 20, prob = c(0.65, 0.35), replace = TRUE) - 1
fit <- beta_binom(c(sum(group_1), sum(group_2)), c(length(group_1), length(group_2)))
posterior_summaries <- cbind(day = 1, tidy(fit))
for (day in 2:14) {
  group_1 <- sample.int(2, 10, prob = c(0.55, 0.45), replace = TRUE) - 1
  group_2 <- sample.int(2, 10, prob = c(0.65, 0.35), replace = TRUE) - 1
  fit <- fit %>%
    update(c(sum(group_1), sum(group_2)), c(length(group_1), length(group_2)))
  posterior_summaries <- posterior_summaries %>%
    rbind(cbind(day = day, tidy(fit)))
}
posterior_summaries$term <- factor(
  posterior_summaries$term,
  levels = c("p1", "p2", "prop_diff", "relative_risk", "odds_ratio"),
  labels = c("Prop 1", "Prop 2", "Prop 1 - Prop 2", "Relative Risk", "Odds Ratio")
)
posterior_summaries %>%
  dplyr::filter(day == 1 | day %% 2 == 0) %>%
  ggplot() +
  facet_wrap(
    ~ day, nrow = 5,
    labeller = function(days) { days$day <- paste("Day", days$day); days }
  ) +
  geom_segment(
    aes(x = conf.low, xend = conf.high, y = term, yend = term),
    color = "#e41a1c", size = 0.75
  ) +
  geom_point(aes(x = estimate, y = term), size = 2) +
  scale_y_discrete(name = NULL, limits = rev(levels(posterior_summaries$term))) +
  geom_segment(
    aes(
      x = 0.45, xend = 0.45,
      y = 4.75, yend = 5.25
    ),
    color = "#377eb8", size = 1.1
  ) +
  geom_segment(
    aes(
      x = 0.35, xend = 0.35,
      y = 3.75, yend = 4.25
    ),
    color = "#377eb8", size = 1.1
  ) +
  geom_segment(
    aes(
      x = 0.10, xend = 0.10,
      y = 2.75, yend = 3.25
    ),
    color = "#377eb8", size = 1.1
  ) +
  geom_segment(
    aes(
      x = 0.45/0.35, xend = 0.45/0.35,
      y = 1.75, yend = 2.25
    ),
    color = "#377eb8", size = 1.1
  ) +
  geom_segment(
    aes(
      x = (0.45/0.55)/(0.35/0.65), xend = (0.45/0.55)/(0.35/0.65),
      y = 0.75, yend = 1.25
    ),
    color = "#377eb8", size = 1.1
  ) +
  scale_x_continuous(name = NULL, limits = c(0, 3), oob = scales::squish) +
  ggtitle("Updating posterior", "with 10 observations/day for 2 weeks") +
  theme(panel.grid = element_line(color = "black"))
gg <- posterior_summaries %>%
  ggplot(data = ., aes(x = estimate, y = term, frame = day)) +
  geom_segment(aes(x = conf.low, xend = conf.high,
                   y = term, yend = term),
               color = "#e41a1c", size = 0.75) +
  geom_point(size = 2) +
  scale_y_discrete(limits = rev(levels(posterior_summaries$term))) +
  labs(title = "Estimates and HPD Intervals after day", y = NULL, x = NULL) +
  geom_segment(aes(x = 0.45, xend = 0.45, y = 4.75, yend = 5.25),
               lty = "dotted", color = "#377eb8") +
  geom_segment(aes(x = 0.35, xend = 0.35, y = 3.75, yend = 4.25),
               lty = "dotted", color = "#377eb8") +
  geom_segment(aes(x = 0.10, xend = 0.10, y = 2.75, yend = 3.25),
               lty = "dotted", color = "#377eb8") +
  geom_segment(aes(x = 0.45/0.35, xend = 0.45/0.35, y = 1.75, yend = 2.25),
               lty = "dotted", color = "#377eb8") +
  geom_segment(aes(x = (0.45/0.55)/(0.35/0.65), xend = (0.45/0.55)/(0.35/0.65),
                   y = 0.75, yend = 1.25), lty = "dotted", color = "#377eb8") +
  theme(panel.grid = element_line(color = "black"))
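# Note: gg_animate() is from the old (pre-2019) gganimate API, which animates
# the frame aesthetic used above.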
gganimate::gg_animate(gg, "updating.gif", interval = 0.5, ani.width = 600, ani.height = 400)

References

Physicians' Health Study Research Group. (1988). New England Journal of Medicine, 318, 262-264.

Agresti, A. (2013). Categorical Data Analysis. John Wiley & Sons.

Agresti, A., & Hitchcock, D. B. (2005). Bayesian inference for categorical data analysis. Statistical Methods and Applications.

Fienberg, S. E., & Holland, P. W. (1973). Simultaneous estimation of multinomial cell probabilities. Journal of the American Statistical Association.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association.


