knitr::opts_chunk$set(collapse = TRUE, comment = "#>", eval = TRUE)

library(BOSSS)
library(ggplot2)
Every application of BOSSS follows the same set of steps:

- specify the problem by creating a BOSSS_problem object;
- create a BOSSS_solution object and initialise it;
- iteratively improve the BOSSS_solution object.

The first ingredient of a BOSSS problem is the simulation function. The arguments of this function must first specify the design variables which we want to vary in our problem, followed by the model parameters. In this example we have two design variables, representing the number of clusters in each arm and the number of participants in each arm. The model parameters are the mean difference in outcome between the control and experimental arms (beta_1), the within-cluster standard deviation (sig_w), and the between-cluster standard deviation (sig_b). Note that all of these inputs require some defaults to be provided.

The other element of the interface the simulation function must conform to is its return value, which should be a named vector of the quantities whose mean values we want to estimate using the Monte Carlo method. Here, we have one such quantity: a boolean indicator of a negative decision, s.
sim_cRCT <- function(k = 10, n = 200, beta_1 = 0.3, sig_w = 0.975, sig_b = 0.224){
  k <- floor(k); n <- floor(n)
  # Control-arm cluster random effects
  u_c <- rnorm(k, sd = sig_b)
  # Allocate the n participants as evenly as possible across the k clusters
  cluster_size <- rep(n %/% k, k) + c(rep(1, n %% k), rep(0, k - n %% k))
  # Control-arm outcomes: cluster effect plus within-cluster noise
  y_c <- rep(u_c, times = cluster_size) + rnorm(n, 0, sig_w)
  # Intervention-arm outcomes: treatment effect beta_1 plus the same error structure
  u_i <- rnorm(k, sd = sig_b)
  y_i <- beta_1 + rep(u_i, times = cluster_size) + rnorm(n, 0, sig_w)
  # Negative result: one-sided t-test fails to reject at the 5% level
  s <- t.test(y_i, y_c, alternative = "greater")$p.value > 0.05
  return(c(s = s))
}

# For example,
sim_cRCT()
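Because s is binary, its expected value is the probability of a negative result. As a rough illustration (outside the BOSSS workflow itself, which automates this kind of estimation), we could estimate it at the default design with a crude Monte Carlo average:

# Crude Monte Carlo estimate of the probability of a negative result
# at the default design (k = 10, n = 200)
mean(replicate(1000, sim_cRCT()["s"]))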
In our problem we are not just interested in the expected value of s; we also want to minimise the number of clusters and the number of participants, and to put an upper limit on the number of participants per cluster. Since these are fixed quantities given any design, we evaluate them in a separate deterministic function. This should conform to the same principles as the simulation function, with the same inputs, but allowing for different named outputs. Here, the outputs are just k, n, and m = n/k:
det_cRCT <- function(k = 10, n = 200, beta_1 = 0.3, sig_w = 0.975, sig_b = 0.224){
  return(c(k = k, n = n, m = n/k))
}
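For example, calling it at the defaults shows the named outputs that BOSSS will treat as fixed for a given design:

# At the default design (k = 10, n = 200) the mean cluster size is m = 20
det_cRCT()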
Next, we need to note the ranges of the design variables which we plan to search over. We use the design_space() function for this, specifying the lower and upper limits of the design variables in the order they appear as simulation function arguments:
design_space <- design_space(lower = c(10, 100), upper = c(100, 500), sim = sim_cRCT)
design_space
Note that the function automatically retrieves the names of the design variables based on the order they take in the simulation function.
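We can confirm where these names come from by inspecting the simulation function's arguments directly; the first two, matching the two pairs of limits supplied, are taken as the design variables:

# Design variable names are read from the simulation function's arguments
names(formals(sim_cRCT))[1:2]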
We also need to specify the hypotheses which we're planning to simulate under, using the hypotheses() function, again giving values in the order that the parameters appear as simulation function arguments. We only need one hypothesis here, the alternative, since we will be estimating the type II error rate.
hypotheses <- hypotheses(values = matrix(c(0.3, sqrt(0.95), sqrt(0.05)), ncol = 1),
                         hyp_names = c("alt"),
                         sim = sim_cRCT)
hypotheses
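If we also wanted to simulate under a null hypothesis (for example, to estimate a type I error rate as well), the idea would be to supply one column per hypothesis. The following is only a sketch of that, assuming hypotheses() accepts a multi-column values matrix; it is not needed for this example:

# Hypothetical two-hypothesis setup (not used here): one column per hypothesis,
# with rows ordered as beta_1, sig_w, sig_b
hypotheses_2 <- hypotheses(values = matrix(c(0,   sqrt(0.95), sqrt(0.05),
                                             0.3, sqrt(0.95), sqrt(0.05)), ncol = 2),
                           hyp_names = c("null", "alt"),
                           sim = sim_cRCT)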
Constraints should be specified using the constraints() function. Each constraint should be named, and should be defined with respect to a specific simulation output and a specific hypothesis. It should have a nominal maximum value, and a probability delta used to judge if it is satisfied. Here, our first constraint is that the mean of the simulation output (i.e. the probability of a negative result) under the alt hypothesis should be less than or equal to 0.2 with a probability of at least 0.95. The second constraint is that the cluster size m under the alt hypothesis must be less than or equal to 10; note that we provide a value for delta, but this isn't used since the constraint is deterministic. Finally, we note whether each constraint's output is binary or otherwise.
constraints <- constraints(name = c("con_tII", "con_m"),
                           out = c("s", "m"),
                           hyp = c("alt", "alt"),
                           nom = c(0.2, 10),
                           delta = c(0.95, 1),
                           binary = c(TRUE, FALSE))
constraints
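To see what the deterministic constraint implies, note that m = n/k <= 10 is equivalent to n <= 10k. For example, evaluating the deterministic function defined above at two corners of the design space:

# (k = 10, n = 100) just satisfies the cap of 10 participants per cluster,
# while (k = 10, n = 500) clearly violates it
det_cRCT(k = 10, n = 100)["m"]
det_cRCT(k = 10, n = 500)["m"]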
The final ingredient of the problem is the set of objectives we want to minimise. As with constraints, objectives are tied to a specific output and hypothesis. We also specify weights for each objective, which help guide the internal optimisation process, and note whether the output for each objective is binary or continuous. For example, here we want to minimise both the number of patients n and the number of clusters k, with the latter carrying a weight of 10 times that of the former.
objectives <- objectives(name = c("n", "k"),
                         out = c("n", "k"),
                         hyp = c("alt", "alt"),
                         weight = c(1, 10),
                         binary = c(FALSE, FALSE))
objectives
We now put the simulation function, the deterministic function, and this set of data frames together to create an object of class BOSSS_problem.
prob <- BOSSS_problem(sim_cRCT, design_space, hypotheses, objectives, constraints, det_func = det_cRCT)
Having set up the problem, we now need to create an initial solution to it. This involves setting up a space-filling set of designs spanning the design space (where size is the number of designs), computing the Monte Carlo estimates of all the expectations we are interested in at each of these designs (using N samples for each evaluation), fitting Gaussian process (GP) models to these estimates, and then using those models to estimate the Pareto set and front:
size <- 20
N <- 500
sol <- BOSSS_solution(size, N, prob)
print(sol)
plot(sol)
The print() function will give a table of the Pareto set with associated objective function values. The plot() function will plot the Pareto front.
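The estimated Pareto set is also stored in the solution object itself, so individual designs can be pulled out programmatically, as we do when checking the models below:

# The current estimate of the Pareto set, used later to select a design
sol$p_set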
We can now start improving this solution by calling the iterate() function. Each call uses the fitted Gaussian process models to decide on the next design to be evaluated, computes the Monte Carlo estimates at that point, and then updates the estimated Pareto set and front.
N <- 500
for(i in 1:10) {
  sol <- iterate(sol, prob, N)
}
print(sol)
plot(sol)
To check the GP models are giving sensible predictions, we can choose a specific design and then plot the predictions for each model along the range of each design variable.
# Pick a specific design from the Pareto set
design <- sol$p_set[nrow(sol$p_set),]
diag_plots(design, prob, sol)
We can also get the predicted values and 95% credible intervals for each point we have evaluated, contrasting these with the empirical Monte Carlo estimate and interval. This will return a data frame for each of the models, named according to the output-hypothesis combination which defines it. We highlight with a * any points where the two intervals do not overlap.
diag_predictions(prob, sol)
If things aren't looking too good, it may indicate that we need more points in our initial evaluations, or more simulations at each of those points. We can do that via extend_initial(). For example, suppose we want to add an extra 500 simulations to our initial points:
sol <- extend_initial(prob, sol, extra_N = 500)
# Look at the first three designs; the empirical and predicted estimates will have changed
diag_predictions(prob, sol)[[1]][1:3,]
Or, we could add another 10 points:
sol <- extend_initial(prob, sol, extra_points = 10)
# Look at the added designs, which go to the top of the table
diag_predictions(prob, sol)[[1]][1:10,]
Once we are happy with our solution and have chosen a specific design from the Pareto set, we might want to double check that point by running a large number of simulations at it.
design <- sol$p_set[nrow(sol$p_set),]
r <- diag_check_point(design, prob, sol, N = 10^4)
If we decide we want to run more simulations at the same point, we can pass the previous results in so they are built upon.
r <- diag_check_point(design, prob, sol, N = 10^4, r)