Reviewer 1
Overview: The authors present an R package to design early phase clinical trials with binary outcomes leveraging a Bayesian framework. I find the package potentially applicable to many settings. However, the paper includes some statements that are not sufficiently clear, or debatable. Some details should be clarified. I also think that a running toy example that can be implemented without the use of a server would be a good tutorial to become familiar with the package before large-scale implementation. Some comments to follow:
Thank you for this thorough review. We agree that there were parts of the paper that were unclear or lacked sufficient detail, and have strived to address all of your comments in an effort to improve the clarity and accuracy throughout. Please see our detailed responses that follow.
Major comments:
- The package and paper only consider binary outcomes. This should be clearly stated in the abstract/introduction.
Thank you for pointing this out. We have added this important detail to both the Abstract and the last paragraph of the Introduction.
- I am not sure what the point of the second paragraph of the introduction is. In particular, the authors claim that on-the-fly decisions can lead to “large sample size” and “poor statistical properties”. Why? I don’t think that this is necessarily the case. I suggest improving the clarity of this paragraph.
The goal of this paragraph was to introduce readers to the often large sample sizes that have been used in recent expansion cohorts, which is part of the problem addressed by this package. We see how the wording did not emphasize this point properly, though, and have rewritten the entire paragraph in response to this comment. Thank you.
- In the third paragraph of the introduction the author claims that Bayesian models are flexible “while also maintaining the control of type-I error”. I disagree with this statement. Bayesian models do not naturally account for type-I error and power (frequentist power?) constraints, which is the whole point of the calibration exercise that motivates this package.
We agree that the way this sentence was worded was not accurate. Our intention was to introduce the reader to the possibility of implementing Bayesian designs with consideration of frequentist type I error and power. We have reworded this section to improve clarity and accuracy.
- In the introduction and later in the paper, the authors introduce “one-sample” and “two-sample” phase-I design. The meaning should be explained somewhere.
We removed this language from the introduction, and added a definition of one-sample and two-sample to the beginning of the "Predictive probability monitoring" section. In this context, one-sample refers to a trial in which all patients are enrolled onto a single arm and given the experimental treatment, and two-sample refers to a trial in which patients are enrolled onto two different treatment arms, or a treatment and control arm, for comparative purposes.
- Page 3: “The proportion of positive trials under the null represents the type I error and the proportion of positive trials under the alternative represents the power.” I don’t think this is correct. Specifically, I am not sure what “represent” means in this context. The proportion of “positive trials” (maybe better defined as the proportion of simulated trials for which we reject the null hypothesis of inferiority, or something along this line) is a Monte Carlo estimate of the type-I error. Correct? Same goes for power. The number of trial replications should also be discussed somewhere in the paper. I think this is a very important aspect when selecting an optimal strategy, i.e., we don’t want the Monte Carlo error to dominate the difference between alternative strategies. This aspect should also be discussed (see also following point).
Thank you for pointing out the lack of specificity in our wording. We meant to convey that the proportion of trials simulated under the null hypothesis for which we reject the null hypothesis is an estimate of the type I error, and similarly that the proportion of trials simulated under the alternative hypothesis for which we reject the null hypothesis is an estimate of the power. We have revised the sentence accordingly.
And you are correct that the number of trial replications is an important aspect of this. Our default arguments are 1000 datasets with 5000 posterior draws in each, which we believe are reasonable values. We have added a note of caution related to these values in the Package Overview section.
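To make the Monte Carlo interpretation concrete, below is a minimal illustrative sketch (not the package's internal code) of estimating type I error and power by simulation for a simple single-stage posterior-probability rule; all design values shown are hypothetical.

```r
# Minimal illustrative sketch (not the package internals): Monte Carlo
# estimates of type I error and power for a single-stage rule that declares
# the treatment promising when Pr(p > p_null | data) exceeds a threshold.
# All design values here are hypothetical.
set.seed(123)
p_null <- 0.1; p_alt <- 0.3; N <- 25
threshold <- 0.95
prior <- c(0.5, 0.5)  # Beta(a0, b0) prior parameters

reject_h0 <- function(p_true) {
  x <- rbinom(1, N, p_true)
  # Posterior is Beta(a0 + x, b0 + N - x); Pr(p > p_null | x) has a closed form
  post_prob <- 1 - pbeta(p_null, prior[1] + x, prior[2] + N - x)
  post_prob > threshold
}

nsim <- 1000
type1 <- mean(replicate(nsim, reject_h0(p_null)))  # rejections under the null
power <- mean(replicate(nsim, reject_h0(p_alt)))   # rejections under the alternative
```

With larger nsim the Monte Carlo error of these estimates shrinks, which is the point of caution raised above.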
- I find the section “Optimization” that describes the function “optimal_design()” to be misleading. First, I think that the authors should explicitly state that this function does not perform an optimization, but rather selects a trial among the ones that have been simulated. There is no guarantee that the selected design is in fact optimal with respect to any criteria (e.g., if we select a coarse grid of thresholds). Second, I think that with minimal effort this function can be generalized to take as input any user-specified utility that takes as input type-I error and power, and/or average sample sizes. I think this could be helpful for a broader use of the package.
This is a very good point, and we apologize for the poor word choice here. We have renamed this section "Design selection" and have removed the word optimization to avoid any confusion. Since the paper was originally drafted, we have added the option of weighting the various criteria, to allow more user-specified flexibility in terms of the value of type I error versus power, and average sample size under the null versus average sample size under the alternative. We have added a description of this new functionality to the Design selection section.
- In the case study: what is the rational to select the considered posterior thresholds and not, say, an equi-spaced grid. I think this is an important point to discuss, and can have a large impact on the result.
Thank you, we have rewritten the case study a bit to be a two-sample example in order to differentiate from the added basic example, and have also added some information justifying the selected thresholds. The previous thresholds were a holdover from another application, and we apologize for not cleaning this up. The thresholds now represent an equi-spaced grid as appropriate in this setting.
- In the code snippet at page 4 the parameters S and nsim of the calibrate_thresholds() function have not been discussed in the text. I think these are important parameters.
We have added more discussion of the importance of selecting sufficiently large values of S and nsim throughout the paper, thank you.
- Figure 1: I think the x-axis should be rescaled (to, say, 0-0.25 for the left panel and 35-55 for the right panel). The plot would be easier to read.
Thank you for this suggestion, we have rescaled the axes so that the accuracy plot axes are limited to the ranges of considered values, and the efficiency plot is a close-up square.
- Page 7: In the first line of the tibble one_sample_decision_tbl there is NA, NA. If I am not mistaken this is interpreted as “never stop”. I would explicitly indicate this.
Thank you, we have changed this to a basic example that can easily be run, so there is no longer an NA value, but we have still made a note that "If any NA value was returned in this table, it would indicate that at that interim look you would never stop, regardless of the number of responses."
- There are some typos in the bibliography. For example, “r” in place of “R”, “phase i” in place of “phase I”. Please double check.
These corrections have been made, thank you.
- I think that the paper would benefit from an additional toy example that can be run on a personal computer rather than on a server. This would allow users to familiarize themselves with the functions before running large-scale simulations.
We agree, and have added a basic example to the Package demonstration section that can easily be run interactively, which is followed by an altered Case study section.
Minor comments:
- The third sentence in the abstract “Recent successes […]” is very long and convoluted. I suggest rephrasing.
We have taken this suggestion and revised the abstract to improve clarity.
- Page 2: Define the acronym RECIST.
In parentheses we have added the definition of the acronym RECIST (Response Evaluation Criteria in Solid Tumors).
- Page 2: Bayes rule -> Bayes’s theorem.
We have made this correction, thank you.
- Page 2: when naming distributions (e.g., Beta-Binomial) I suggest using $\mbox{Beta-Binomial}()$ or similar, to avoid wrong spacing/font.
Thank you for this suggestion, we have used it for all distributions to achieve the proper font/spacing.
- In the section decision rules, what is a decision point “r”? Please define this.
We have reworded this for clarity, thank you. The trial would stop at a given look if the number of observed responses is less than or equal to $r$, and would otherwise continue enrolling (i.e., if the number of observed responses is greater than $r$). At the end of the trial, when the maximum planned sample size is reached, the treatment would be considered promising if the number of observed responses is greater than $r$.
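As a small illustration, here is a hypothetical helper (not part of the package) that applies this rule at a single look:

```r
# Hypothetical helper illustrating the stated decision rule at one look:
# stop if the observed responses are <= r, otherwise continue; at the final
# look (n == N) the treatment is considered promising if responses > r.
evaluate_look <- function(responses, r, n, N) {
  if (n < N) {
    if (responses <= r) "stop" else "continue"
  } else {
    if (responses > r) "promising" else "not promising"
  }
}

evaluate_look(responses = 2, r = 3, n = 10, N = 25)  # "stop"
evaluate_look(responses = 9, r = 7, n = 25, N = 25)  # "promising"
```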
-Figure 1: in the legend (#fig:unnamed-chunk-7). I suppose this is a compilation error.
This is indeed a compilation error. We are not sure why it is being generated or how to suppress it, so we will work with the editorial team to ensure it isn't displayed when the article is finalized for publication.
Reviewer 5
The article describes the ppseq package for R, which provides several functions to facilitate clinical trial design using sequential predictive probability monitoring. A user looking to design an early phase trial can use Bayesian methods to ask what the probability of early stopping is and what expected sample sizes would be under different assumptions about outcomes in treatment and control groups and with different stopping rules, while maintaining desired type I error rates. The package is small, providing only three user-facing functions reported in the manuscript, along with plot and print methods. However, the functions are user-friendly with obvious options, and the article provides a clear use case and justification for the methods. However, the article itself is quite difficult to follow, which is not helped by the example not being feasible to run for a user with a standard desktop or laptop computer (the example is reportedly run here on a 192 core computing cluster!). I have provided some queries and recommendations below.
We are sorry that the article in its original form was difficult to follow. We hope the changes have made it more understandable, and that all of the needed details are now included. Thank you for taking the time to conduct such a thorough review of our paper; we do believe that it has been improved as a result. Please see our detailed responses below.
- I found the Package Overview section difficult to follow, since it provides only a narrative description of the inputs of the functions and options. It would be useful to link the description to the statistical discussion in the previous section, explaining which option relates to which parameter, and what each output reports in terms of the statistical notation. It may help to integrate the case study into the package description so that the reader can see what the text means in terms of the example. However, I would also strongly recommend providing an example(s) that a user can implement themselves in a reasonable amount of time, as this often helps when using a package for the first time. Indeed, is the package useful if it requires such significant computational resources (see also point 3)?
Thank you for this comment, we agree that much detail was overlooked in the Package Overview section, and we have expanded this section to include much more detail of the underlying algorithm, and to link the notation from the preceding section to the function arguments. We believe this section is much easier to follow now. We have also added a new section called Package demonstration that includes a basic example that can easily be run interactively, which is followed by an altered Case study section.
We do think this package is useful despite the potentially long computational time. This is meant to be used at the design stage of a clinical trial, when there is ample time to allow for computation. Once the design has been selected and the decision table generated with the function calc_decision_rules(), no additional computations are required throughout the conduct of the trial. As a result, the trial can be conducted seamlessly at each interim analysis, and through the end of the trial.
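For illustration, the intended workflow might look like the following sketch; the argument names, design values, and output column names shown are assumptions for this example and should be checked against the package documentation (?calc_decision_rules).

```r
library(ppseq)

# Design stage (run once, possibly on a server): generate the decision table.
# Argument names and values here are assumptions for illustration; see
# ?calc_decision_rules for the exact interface.
decision_tbl <- calc_decision_rules(
  n = seq(5, 25, 5),  # interim looks after every 5 patients
  N = 25,             # maximum sample size
  theta = 0.95,       # posterior probability threshold
  ppp = 0.1,          # predictive probability threshold for early stopping
  p0 = 0.1            # unacceptable response rate
)

# Conduct stage (no further computation): at the look with n patients enrolled,
# simply compare the observed number of responses to the stopping bound for
# that look (assuming the returned table has columns named n and r).
subset(decision_tbl, n == 10)
```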
- I was a little confused by the notation, some more detail would make it clearer I believe. Is the setting a randomized trial with treatment and control groups, or a dose expansion cohort? We are told that $p$ is the probability of response – is this in the dose expansion cohort? And we have $p_0$ and $p_1$ as the probabilities of response in the control and treatment groups, respectively. But are these random variables we intend to estimate, or are they pre-defined thresholds of interest (e.g. “high” and “low” efficacy)?
Thank you for pointing out the lack of clarity here. We have revised the wording to clarify that $p$ represents the true response probability, $p_0$ is a fixed threshold representing an unacceptable response rate, and $p_1$ is a fixed threshold representing an acceptable response rate. We have also clarified that our example focuses on the setting of a single-arm dose-expansion cohort.
- It was not clear to me what each function was doing computationally. In particular, the method implemented by the calibrate_thresholds() function was not explicated. I wanted to understand why the function took so long to run, and why it might be drawing posterior samples. My assumption was that the algorithm was something roughly like the following. For nsim iterations: i. Simulate a realization of $p^r$ from the prior $\mbox{Beta}(a_0, b_0)$ ii. Simulate a realization of $x^r | n, p^r \sim \mbox{Binomial}(n, p^r)$ iii. Calculate the probability of success in that trial $$PP^r=\Pr(p^r>p_0|x^r, N, n)=\int_{p_0}^1 f_p(q|x^r, N, n)\,dq=\sum_{x^*}^{N-n}\int_{p_0}^1\Pr(X^*=x^*|q,N-n)f_p(q|x^r,N,X^*=x^*)\,dq$$ Then the PPP is the proportion of $PP^r>\theta^*$. If this is the case, then the integral term has a closed form and posterior simulations do not seem necessary, in which case the function should run much quicker than it currently does. Running time does seem like a key issue for the functions in this package. I may of course be completely wrong about what the function is doing, but it would help to have some clarification so we know the function is doing what the user expects.
We apologize that the description in the Package overview was not clear. Please see the expanded section, which now includes all of the details of the implemented algorithm. Posterior sampling is required to account for the uncertainty in $p$. In the algorithm you describe above, using a single estimate $p^r$ will result in a too-narrow predictive distribution. In our scheme, using your notation, we instead draw nsim iterations $x^r | n, p^r \sim \mbox{Binomial}(n, p^r)$, then take S posterior samples $p^s|x^r \sim \mbox{Beta}(a_0 + x^r, b_0 + n - x^r)$.
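As a concrete sketch of why the posterior draws matter, the following illustrates a predictive probability calculation for a single hypothetical interim dataset; it mirrors the sampling scheme described above rather than reproducing the package code verbatim, and all values are hypothetical.

```r
# Hypothetical interim data and design values
set.seed(123)
a0 <- 0.5; b0 <- 0.5        # Beta prior parameters
n <- 10; N <- 25; x <- 3    # x responses observed in the first n of N patients
p0 <- 0.1; theta <- 0.95    # unacceptable response rate and posterior threshold
S <- 5000                   # number of posterior draws

# Draw S posterior samples of p given the interim data, then predict the
# responses among the remaining N - n patients for each sampled p. Using a
# single point estimate of p here would understate the predictive spread.
p_post <- rbeta(S, a0 + x, b0 + n - x)
x_future <- rbinom(S, N - n, p_post)

# For each predicted complete dataset, Pr(p > p0 | all N patients) has a
# closed Beta form; the PPP is the proportion of these exceeding theta.
post_prob_final <- 1 - pbeta(p0, a0 + x + x_future, b0 + N - x - x_future)
ppp <- mean(post_prob_final > theta)
ppp
```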
- The section “Optimisation” mentions minimizing Euclidean distance on different plots; however, it isn't clear what this means without notation or until the example is provided.
Thank you for pointing out the lack of clarity in this section. We have added much more detail about the computations being done, with notation, to this section, and hope it is now more clear. Additionally, in response to another reviewer's comment we have renamed this section "Design selection" and removed reference to optimization, since no true mathematical optimization is being conducted.
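As a toy illustration of the distance-based selection now described in that section, consider a hypothetical set of simulated designs; the column names and the exact reference points below are assumptions for this example, not necessarily the package's output names or criteria.

```r
# Hypothetical summary of three candidate designs
designs <- data.frame(
  type1    = c(0.04, 0.08, 0.11),  # estimated type I error
  power    = c(0.78, 0.86, 0.90),  # estimated power
  ess_null = c(18, 21, 24),        # average sample size under the null
  ess_alt  = c(24, 25, 25)         # average sample size under the alternative
)

# "Accuracy" selection: the design closest, in Euclidean distance, to the
# ideal of 0 type I error and 100% power.
acc_dist <- sqrt(designs$type1^2 + (1 - designs$power)^2)
designs[which.min(acc_dist), ]

# "Efficiency" selection: the design closest to the smallest attainable
# average sample sizes under the null and the alternative.
eff_dist <- sqrt((designs$ess_null - min(designs$ess_null))^2 +
                 (designs$ess_alt - min(designs$ess_alt))^2)
designs[which.min(eff_dist), ]
```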