knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE
)

This document shows simulations to investigate the frequentist properties of the net benefit (NB) estimates generated by bayesDCA.

Diagnostic tests

Data generation process

We simulate data for binary diagnostic tests according to the following mechanism. Among $N$ patients, we observe $D$ diseased persons, sampled from a binomial distribution with probability $p$, the prevalence (or outcome proportion) of the studied disease. Then, among the $D$ diseased persons, we observe $TP$ true positives; similarly, among the $N - D$ non-diseased persons, we observe $TN$ true negatives. The success probabilities of these two binomial variables are the sensitivity $Se$ and the specificity $Sp$ of the diagnostic test, respectively.

$$
\begin{aligned}
D &\sim Binomial(N,\ p) \\
TP &\sim Binomial(D,\ Se) \\
TN &\sim Binomial(N - D,\ Sp)
\end{aligned}
$$

Parameter estimation

From the above model, it should be noted that the number of diseased patients $D$ is a random variable. Nonetheless, the most common way diagnostic tests are evaluated ignores this uncertainty and estimates $p$, $Se$, and $Sp$ independently --- pretending $d$ is just a number, as opposed to a realization of $D$.

This "independent" formulation can be modeled using the beta-binomial conjugate model, so that the posterior distribution of all parameters can be computed in closed form. Once we sample from their joint posterior distribution, however, we can compute the NB as

$$ NB = Se \cdot p - (1-Sp) \cdot (1-p) \cdot w $$

for each posterior draw, where $w = \frac{t}{1-t}$ is the odds of the decision threshold $t$, and naturally generate credible intervals for NB.

As we are using Stan either way, there is not much to be gained in terms of sampling performance by using the analytical solutions.
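
For intuition, here is a minimal sketch of what that closed-form route would look like, assuming independent uniform Beta(1, 1) priors on $p$, $Se$, and $Sp$ (the counts and threshold below are hypothetical; bayesDCA itself obtains the draws via Stan):

# Hypothetical counts for a single data set (illustration only)
N <- 500; d <- 100; tp <- 80; tn <- 350
w <- 0.2 / (1 - 0.2)  # odds of a 20% decision threshold

# With Beta(1, 1) priors, the posteriors are available in closed form
n_draws <- 4000
p_draws  <- rbeta(n_draws, 1 + d, 1 + N - d)
Se_draws <- rbeta(n_draws, 1 + tp, 1 + d - tp)
Sp_draws <- rbeta(n_draws, 1 + tn, 1 + (N - d) - tn)

# NB for each joint posterior draw, summarised by a 95% credible interval
nb_draws <- Se_draws * p_draws - (1 - Sp_draws) * (1 - p_draws) * w
quantile(nb_draws, c(0.025, 0.5, 0.975))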

Simulating data

We will simulate and analyze $B = 10000$ data sets for 8 different simulation settings. We assume the following values for the true parameters: $p \in \{0.01, 0.2\}$, $Se \in \{0.6, 0.95\}$, and $Sp \in \{0.6, 0.95\}$. We set the number of expected events to 100 (i.e., $N = 500$ for $p = 0.2$ and $N = 10000$ for $p = 0.01$).

SEED <- 123

simulation_settings <- expand.grid(B = 10000,
                                   expected_events = 100,
                                   true_p = c(0.01, 0.2),
                                   true_Se = c(0.6, 0.95),
                                   true_Sp = c(0.6, 0.95))

# Simulate B data sets for each simulation setting
simulation_list <- apply(simulation_settings, 1, function(x) {

  true_p <- x["true_p"]
  true_Se <- x["true_Se"]
  true_Sp <- x["true_Sp"]

  # Sample size implied by the expected number of events
  N <- ceiling(x["expected_events"]/true_p)
  B <- x["B"]

  set.seed(SEED)
  d <- rbinom(n = B, size = N, prob = true_p)        # diseased patients
  tp <- rbinom(n = B, size = d, prob = true_Se)      # true positives
  tn <- rbinom(n = B, size = N - d, prob = true_Sp)  # true negatives

  simulated_data <- data.frame(N, d, tp, tn, true_p, true_Se, true_Sp)

  return(simulated_data)

})

head(simulation_list[[1]])

We import the necessary libraries, including bayesDCA itself. The furrr package allows us to parallelize the analyses in a tidyverse-friendly way.

library(bayesDCA)
library(tidyverse)
library(furrr)

For each simulation setting, we run the analysis for the $B=10000$ data sets. I use 7 cores because that's what I have on my personal computer.

NUMBER_OF_CORES <- 7

if (!file.exists('analysis_list.rds')) {
  analysis_list <- map(simulation_list, function(x) {

    # One data.frame per simulated data set
    x <- group_split(rowwise(x))

    plan(multisession, workers = NUMBER_OF_CORES)
    simulation_results <- future_map(x, function(.dataset) {
      # Fit the Bayesian DCA model and keep the NB estimate and 95% credible interval
      .fit <- dca_diagnostic_test(N = .dataset[["N"]], 
                                  d = .dataset[["d"]], 
                                  tp = .dataset[["tp"]], 
                                  tn = .dataset[["tn"]])
      .result <- .fit$net_benefit %>% 
        select(thr, estimate, `2.5%`, `97.5%`)
      return(.result)
    }, .options = furrr_options(seed = TRUE))
    plan(sequential); closeAllConnections()

    # Stack results across the B data sets
    analysis <- bind_rows(simulation_results, .id = "simulation_id")

    return(analysis)
  })

  saveRDS(analysis_list, "analysis_list.rds")
} else {
  analysis_list <- readRDS('analysis_list.rds')
}


head(analysis_list[[1]])

We use the settings_id column to map the analysis rows to their respective simulation settings.

df <- left_join(
  bind_rows(analysis_list, .id = "settings_id"),
  rownames_to_column(simulation_settings, "settings_id"),
  by = "settings_id"
)

We can now compute the true NB.

df <- df %>% 
  mutate(
    NB = true_Se*true_p - (1-true_p)*(1-true_Sp)*(thr/(1-thr))
  )

Estimator's empirical distribution

Our NB estimator is the posterior mean, stored in the estimate column. For each simulation setting, we compute its average across the 10000 data sets, as well as the 2.5% and 97.5% quantiles. The true NB is plotted as a dashed red line and should match the average line in gray.

df <- df %>% 
  mutate(
    Prevalence = paste0(round(true_p*100), "%"),
    Sensitivity = paste0("Sensitivity ", round(true_Se*100), "%"),
    Specificity = paste0("Specificity ", round(true_Sp*100), "%")
  )

df %>% 
  group_by(settings_id, thr, NB, Prevalence, Sensitivity, Specificity) %>% 
  summarise(
    avg = mean(estimate),
    l = quantile(estimate, .025),
    u = quantile(estimate, .975)
  ) %>% 
  ggplot(aes(x = thr, color = Prevalence, fill = Prevalence)) +
  geom_ribbon(aes(ymin = l, ymax = u, color = NULL), alpha = .5) +
  geom_line(aes(y = avg), color = "gray20") +
  geom_line(aes(y = NB), color = "red", 
            linetype = "dashed") +
  facet_grid(rows = vars(Sensitivity),
             cols = vars(Specificity)) +
  scale_x_continuous(labels = scales::percent_format(1)) +
  theme_bw() +
  labs(x = "Threshold", y = "Net benefit",
       subtitle = "True NB in red (10000 data sets, 100 expected events)")

As we can see, the average posterior mean accurately matches the true NB, with no indication of bias. In the case of 20% prevalence, there is substantial variability, represented by the green ribbon. That is not seen when the prevalence is 1%: with 100 expected events, the total sample size is $N = 10000$, which makes the posterior mean rather stable, and we can barely see any ribbon around the average line.

Empirical coverage of credible intervals

In addition to the posterior mean distribution, we can check how often the 95% credible intervals contain the true NB across the 10000 datasets.

df %>% 
    group_by(settings_id, thr, NB, Prevalence, Sensitivity, Specificity) %>% 
    summarise(
        coverage = mean(NB >= `2.5%` & NB <= `97.5%`)
    ) %>% 
    ggplot(aes(thr, coverage,color = Prevalence)) +
    geom_line() +
    geom_hline(yintercept = .95, 
               linetype = "longdash",
               color = "gray40") +
    facet_grid(rows = vars(Sensitivity),
               cols = vars(Specificity)) +
    scale_x_continuous(labels = scales::percent_format(1)) +
    scale_y_continuous(labels = scales::percent_format(1),
                       limits = c(.8, 1)) +
    theme_bw() +
    labs(x = "Threshold", y = "Empirical coverage",
         subtitle = "Coverage of 95% credible intervals")

The empirical coverage matches the nominal credibility level. One important point is that we are using the default uniform priors for all parameters, which makes the intervals closely resemble their frequentist counterparts (95% confidence intervals), at least numerically.
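
As a quick numeric illustration (hypothetical counts, assuming the conjugate Beta(1, 1) posterior for a single sensitivity), the 95% credible interval lies close to the exact binomial confidence interval:

# Hypothetical example: 80 true positives among 100 diseased patients
tp <- 80; d <- 100

# 95% credible interval under a uniform Beta(1, 1) prior
qbeta(c(0.025, 0.975), 1 + tp, 1 + d - tp)

# Exact (Clopper-Pearson) 95% confidence interval for comparison
binom.test(tp, d)$conf.int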

Prediction models

Simulating prediction model data with a known NB is not as simple, because we need a (sensitivity, specificity) pair for each probability threshold. Instead, we will show that bayesDCA matches the NB estimates from rmda, a well-known package for decision curve analysis. We will use publicly available datasets as examples and highlight some key differences between the two approaches.
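
Conceptually, each threshold $t$ turns the model into a binary test: a patient is "test positive" when the predicted risk is at least $t$, and the corresponding counts are tallied per threshold. A rough sketch of that idea (a hypothetical helper, not bayesDCA's internal code):

# Hypothetical helper: per-threshold counts for a prediction model,
# where "test positive" means predicted risk >= t
threshold_counts <- function(outcome, risk, thresholds) {
  data.frame(
    thr = thresholds,
    N   = length(outcome),
    d   = sum(outcome == 1),
    tp  = vapply(thresholds, function(t) sum(outcome == 1 & risk >= t), numeric(1)),
    tn  = vapply(thresholds, function(t) sum(outcome == 0 & risk < t), numeric(1))
  )
}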

rmda::dcaData

As in the package tutorial, we will use a previously-built logistic model to predict cancer status in a dataset with 500 individuals and 60 cases (12% prevalence).

Let's first see what rmda's results look like:

library(rmda)

# source: http://mdbrown.github.io/rmda/
expit <- function(xx) exp(xx) / (1 + exp(xx))
thr_list <- seq(0, 1, by = .02)

dcaData <- dcaData %>% 
  mutate(
    logits = -10.5 + 0.22*Age - 0.01*Female + 0.91*Smokes + 2.03*Marker1 - 1.56*Marker2,
    probs = expit(logits)
  )

rmda <- decision_curve(Cancer ~ probs,
                       data = dcaData,
                       fitted.risk = TRUE, 
                       thresholds = thr_list,
                       bootstraps = 500) 

plot_decision_curve(rmda, standardize = FALSE)

With bayesDCA, we use the provided probabilities to generate a threshold-oriented data.frame: the performance data for each threshold considered.

bdca_threshold_data <- get_thr_data(dcaData, 
                                    outcome = "Cancer",
                                    prediction = "probs", 
                                    thresholds = thr_list)
head(bdca_threshold_data)

The same model with bayesDCA:

bdca <- dca_predictive_model(bdca_threshold_data)
plot(bdca)

To get a clearer sense, we can plot them together and see to what degree the results match, and where they don't. I will focus on the model curves.

df <- left_join(
  bdca$net_benefit %>% 
    dplyr::select(thr,
                  `Estimate - Bayes` := estimate,
                  `Lower - Bayes`:= `2.5%`, 
                  `Upper - Bayes` := `97.5%`),
  rmda$derived.data %>% 
    dplyr::filter(model == "Cancer ~ probs") %>% 
    dplyr::select(thr := thresholds,
                  `Estimate - rmda` := NB,
                  `Lower - rmda` := NB_lower,
                  `Upper - rmda` := NB_upper),
  by = "thr"
)
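
One possible way to overlay the two sets of estimates from the joined data frame above (a plotting sketch using the column names created in the join):

df %>% 
  ggplot(aes(x = thr)) +
  geom_ribbon(aes(ymin = `Lower - Bayes`, ymax = `Upper - Bayes`), fill = "steelblue", alpha = .3) +
  geom_line(aes(y = `Estimate - Bayes`, color = "bayesDCA")) +
  geom_ribbon(aes(ymin = `Lower - rmda`, ymax = `Upper - rmda`), fill = "firebrick", alpha = .3) +
  geom_line(aes(y = `Estimate - rmda`, color = "rmda"), linetype = "dashed") +
  scale_x_continuous(labels = scales::percent_format(1)) +
  theme_bw() +
  labs(x = "Threshold", y = "Net benefit", color = NULL)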

The results are essentially the same, except when the threshold exceeds the maximum predicted probability, in which case the bootstrap-based estimate collapses to zero because there are no positives (true or false). In the Bayesian case, the sensitivity and specificity are never estimated to be exactly 0% or 100%, so the NB does not collapse to zero and the uncertainty can still be estimated. Nonetheless, we are mostly interested in the range of thresholds below 50%, where the results are quite consistent.
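
To see why the Bayesian estimate does not collapse, consider a threshold above the largest predicted probability, so that none of the 60 cases tests positive: under a uniform Beta(1, 1) prior (assumed here for illustration), the posterior for sensitivity still places some mass away from zero.

# Posterior for sensitivity given 0 true positives out of 60 cases, Beta(1, 1) prior
post_mean <- (1 + 0) / (2 + 60)                    # about 1.6%, not exactly 0%
post_ci   <- qbeta(c(0.025, 0.975), 1 + 0, 1 + 60)
c(mean = post_mean, lower = post_ci[1], upper = post_ci[2])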


