--- title: "Bayesian Reanalysis of the ICT-107 Trial" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Bayesian Reanalysis of the ICT-107 Trial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
author: |
  | Riko Kelter
  | Institute of Medical Statistics and Computational Biology
  | Faculty of Medicine
  | University of Cologne
  | Cologne, Germany
date: "23 December 2025"
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 10,
  fig.height = 6,
  out.width = "100%",
  fig.align = "center",
  dpi = 300,
  warning = FALSE
)
```

# Introduction and Overview

In this vignette, we illustrate the basic functionality of the `bfbin2arm` package and its core functions. The package can be used to design a Bayesian (phase II) clinical trial with two arms and binary endpoints (success or failure) based on Bayes factors. Our main assumption here is that the observed data in the two groups come from two random variables $Y_1,Y_2$, each following a binomial distribution with parameters $n_1,p_1$ and $n_2,p_2$, respectively:

$$Y_1\sim \mathrm{Bin}(n_1,p_1), \hspace{1cm} Y_2\sim \mathrm{Bin}(n_2,p_2)$$

## Hypothesis tests

In its current form, the package implements four different hypothesis tests for the trial:

$$H_0:p_1=p_2 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:p_1\neq p_2$$

Alternatively, a well-known parameterization of this test introduces a difference parameter $\eta=p_2-p_1$ and the grand mean $\zeta=\frac{1}{2}(p_1+p_2)$. Using this parameterization, we have

$$p_1=\zeta-\frac{\eta}{2}, \hspace{1cm} p_2=\zeta+\frac{\eta}{2}$$

and the hypotheses can be rewritten as:

$$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta \neq 0$$

Next to this two-sided test, three directional tests are available in the package:

- $$H_0:\eta \leq 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0$$
- $$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta > 0$$
- $$H_0:\eta = 0 \hspace{1cm} \text{ versus } \hspace{1cm} H_1:\eta < 0$$

For each of the four tests, a separate Bayes factor exists and can be used.
For the two-sided test, we denote the Bayes factor as $BF_{01}$, and for the three directional tests above we denote the Bayes factors as $BF_{+-}$, $BF_{+0}$ and $BF_{-0}$.

## Design and analysis priors

A natural choice for the priors is the beta distribution: the $\mathrm{Beta}(a_0,b_0)$ distribution is a conjugate prior for the binomial likelihood, so when it is chosen as the prior, the posterior $P_{p \mid Y}$ is also Beta-distributed. We assume a Beta design prior under $H_0$ as follows:

$$p_1 =p_2 = p\mid H_0 \sim \mathrm{Beta}(a_0^d,b_0^d)$$

Thus, under $H_0:\eta = 0$, both probabilities are identical, $p_1=p_2$, and take some value $p\in [0,1]$, which has a beta design prior. Likewise, we pick independent Beta design priors under $H_1:\eta \neq 0$:

$$p_1 \mid H_1 \sim \mathrm{Beta}(a_1^d,b_1^d), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^d,b_2^d)$$

For the analysis priors $P_{p_1}^a$, $P_{p_2}^a$ under $H_1$, we also choose independent Beta priors, with possibly different values $a_i^a$ and $b_i^a$ for $i=1,2$, where the superscript signals that the hyperparameters belong to the analysis instead of the design prior:

$$p_1 \mid H_1 \sim \mathrm{Beta}(a_1^a,b_1^a), \hspace{1cm} p_2 \mid H_1 \sim \mathrm{Beta}(a_2^a,b_2^a)$$

Lastly, for the analysis prior $P_{p}^a$ under $H_0:\eta=0$, we choose a Dirac prior that places all probability mass on $\eta=p_2-p_1=0$, conditional on a uniform prior on $\zeta$, that is,

$$p_1=p_2=p\mid H_0 \sim 1_{\{\eta=0\}} \mid \zeta \sim U(0,1)$$

for the analysis with the Bayes factor.

# Using the package

First, we load the package after installation:

```{r}
library(bfbin2arm)
```

Next, we illustrate the key functions of the package by re-analyzing a phase II trial in the context of oncology. While no Bayesian approach was used in the original statistical analysis of the trial, the step-by-step walkthrough below showcases what a structured approach to designing and calibrating a Bayesian phase II trial with the `bfbin2arm` package looks like.
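As a quick numerical check of the binomial-beta conjugacy noted above, the posterior for a single arm can be computed by hand. The counts below are hypothetical example values, not tied to any particular trial:

```{r}
# A Beta(a0, b0) prior combined with y successes in n trials yields
# a Beta(a0 + y, b0 + n - y) posterior (binomial-beta conjugacy).
a0 <- 1; b0 <- 1          # flat Beta(1, 1) prior
y  <- 12; n  <- 43        # hypothetical successes and sample size
a_post <- a0 + y          # = 13
b_post <- b0 + n - y      # = 32
a_post / (a_post + b_post)   # posterior mean of p, approx. 0.289
```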
Importantly, the trial must have two trial arms and binary endpoints, and we assume that one of the four tests detailed above is carried out using Bayes factors as the test criterion.

## ICT-107 Phase II Trial Overview

The ICT-107 trial (Wen et al., 2019) was a randomized phase II study in newly diagnosed glioblastoma patients (n=124, 2:1 randomization). The primary binary endpoint is progression status at 6 months (PFS6), and the secondary binary endpoint is immunologic status. Here, we focus on the secondary endpoint for illustration purposes.

**Reported results** (ITT population):

- ICT-107 (n=82): 49/82 responders = **59.7% response rate**
- Control (n=42): 12/42 responders = **35.7% response rate**

## 1. Bayes Factor Analysis

We start by calculating the Bayes factor(s) for the ICT-107 trial data:

```{r}
## -------------------------------------------------------------
## ICT-107 trial (immunologic response)
## Placebo (control): 12 responders, 31 non-responders
## ICT-107 (treatment): 49 responders, 32 non-responders
## -------------------------------------------------------------
y1_ict <- 12          # control successes
n1_ict <- 12 + 31
y2_ict <- 49          # treatment successes
n2_ict <- 49 + 32

cat("\n=== ICT-107 Trial (n1 =", n1_ict, ", n2 =", n2_ict, ") ===\n")

# BF01
BF01_ict <- twoarmbinbf01(y1_ict, y2_ict, n1_ict, n2_ict,
                          a_0_a = 1, b_0_a = 1,
                          a_1_a = 1, b_1_a = 1,
                          a_2_a = 1, b_2_a = 1)
# BF+1
BFp1_ict <- BFplus1(y1_ict, y2_ict, n1_ict, n2_ict,
                    a_1_d = 1, b_1_d = 1,
                    a_2_d = 1, b_2_d = 1)
# BF-1
BFm1_ict <- BFminus1(y1_ict, y2_ict, n1_ict, n2_ict,
                     a_1_d = 1, b_1_d = 1,
                     a_2_d = 1, b_2_d = 1)

# BF+0
cat("=== ICT-107 Trial === Bayes factor BF+0 results in",
    BFplus0(BFp1_ict, BF01_ict))
# BF+-
cat("=== ICT-107 Trial === Bayes factor BF+- results in",
    BFplusMinus(BFp1_ict, BFm1_ict))
```

The most relevant Bayes factor here is $BF_{+-}$, because it is directional and leaves open the possibility of the placebo group having a larger response rate than the treatment group.
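For intuition, $BF_{01}$ under flat analysis priors also has a textbook closed form as a ratio of beta-binomial marginal likelihoods (the binomial coefficients cancel in the ratio). The sketch below is an independent sanity check under these assumptions, not the package's internal implementation:

```{r}
# log BF01 = log m(y | H0) - log m(y | H1), with a common
# p ~ Beta(a0, b0) under H0 and independent flat Beta(1, 1)
# analysis priors under H1 (for which log B(1, 1) = 0).
log_bf01 <- function(y1, n1, y2, n2, a0 = 1, b0 = 1) {
  log_m0 <- lbeta(a0 + y1 + y2, b0 + n1 + n2 - y1 - y2) - lbeta(a0, b0)
  log_m1 <- lbeta(1 + y1, 1 + n1 - y1) + lbeta(1 + y2, 1 + n2 - y2)
  log_m0 - log_m1
}
# ICT-107 counts: BF01 around 0.01, i.e. strong evidence against H0
exp(log_bf01(12, 43, 49, 81))
```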
Note that the hyperparameters of the beta analysis priors are specified in `twoarmbinbf01` via `a_0_a = 1, b_0_a = 1` et cetera.

## 2. Operating characteristics for actual sample sizes

Now, a key question is which operating characteristics can be expected based on the actual sample sizes used in the trial. The `powertwoarmbinbf01` function can provide the answer:

```{r}
ict_results <- powertwoarmbinbf01(
  n1 = n1_ict, n2 = n2_ict,
  k = 1/3, k_f = 3,
  test = "BF+-",       # H+: p2 > p1 vs H-: p2 <= p1
  a_0_d = 1, b_0_d = 1,
  a_0_a = 1, b_0_a = 1,
  a_1_d = 1, b_1_d = 1,
  a_2_d = 1, b_2_d = 1,
  a_1_a = 1, b_1_a = 1,
  a_2_a = 1, b_2_a = 1,
  output = "numeric",
  compute_freq_t1e = TRUE
)
print(ict_results)
```

We see that based on the actual sample sizes and a moderate evidence threshold $k=1/3$, the Bayesian power is sufficiently large at $87.8\%$. However, the frequentist type-I-error rate is far too high at $28.7\%$, so we increase the evidence threshold to $k=1/10$ (strong evidence) and use the `ntwoarmbinbf01` function to calibrate the design based on our requirements next.

## 3. Power & Sample Size for ICT-107 Design

The core function for designing a Bayesian trial with the package is `ntwoarmbinbf01`. It provides a method to calibrate a Bayesian design in terms of

- the required Bayesian (or frequentist) power,
- the required Bayesian (or frequentist) type-I-error rate, and
- the required Bayesian probability of compelling evidence for the null hypothesis $H_0$ (or $H_-$, in case $BF_{+-}$ is used).

The function makes use of parallelization, and it is recommended to run it on a computer with multiple cores to keep computations fast.
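Conceptually, the Bayesian power targeted by this calibration is the design-prior probability that the Bayes factor crosses the evidence threshold. The following self-contained Monte Carlo sketch illustrates this idea for the two-sided test; it uses a hypothetical helper, the textbook closed form of $BF_{01}$ under flat analysis priors rather than the package internals, and illustrative design priors and sample sizes:

```{r}
set.seed(1)

# Closed-form log BF01 under flat Beta(1, 1) analysis priors.
log_bf01 <- function(y1, n1, y2, n2) {
  lbeta(1 + y1 + y2, 1 + n1 + n2 - y1 - y2) -
    (lbeta(1 + y1, 1 + n1 - y1) + lbeta(1 + y2, 1 + n2 - y2))
}

# Bayesian power: P(BF01 < k) when (p1, p2) follow the H1 design priors.
bayes_power <- function(n1, n2, k = 1/10, nsim = 5000,
                        a1 = 1, b1 = 2, a2 = 2, b2 = 1) {
  p1 <- rbeta(nsim, a1, b1)      # design prior draw for p1
  p2 <- rbeta(nsim, a2, b2)      # design prior draw for p2
  y1 <- rbinom(nsim, n1, p1)     # simulated control arm data
  y2 <- rbinom(nsim, n2, p2)     # simulated treatment arm data
  mean(log_bf01(y1, n1, y2, n2) < log(k))
}

bayes_power(50, 50)   # Monte Carlo estimate, between 0 and 1
```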
First, we perform a sample size search for an ICT-107-type trial (balanced arms) under flat design priors and substantial evidence thresholds, using the directional Bayes factor $BF_{+-}$:

```{r, fig.width = 12, fig.height = 7}
ntwoarmbinbf01(
  k = 1/10, k_f = 10,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 75), n_step = 1,
  progress = FALSE,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"      # returns the recommended n per group
)
```

The function arguments are

- the evidence threshold `k = 1/10` and the threshold for compelling evidence `k_f = 10` under $H_-$ (or $H_0$ for other tests),
- the required power `power`, type-I-error rate `alpha` and probability of compelling evidence `pce_H0`,
- the test used in the trial (either `BF+-`, `BF01`, `BF+0` or `BF-0`),
- the range of values `nrange` over which the design calibration operates (increasing the upper sample size requires more time for the calibration to finish),
- the step size `n_step` (we recommend always using `n_step = 1`, except for quick checks, where `n_step = 5` or `n_step = 10` can decrease computing times significantly),
- the parameter `progress`, which shows a progress bar at the console (we strongly recommend setting it to `TRUE`; it is only set to `FALSE` here to avoid cluttering the output of this vignette),
- `compute_freq_t1e`, which sets whether the frequentist type-I-error rate should be computed, too,
- the parameters `p1_power` and `p2_power`, which must be specified if frequentist power is desired and denote the assumed success probabilities in the control and treatment arm for the frequentist power calculations,
- and `output`, which can be set to `plot` or `numeric`.
The resulting output plots the design and analysis priors in the top row, the power and type-I-error rate curves as functions of the sample size, with markers indicating for which sample sizes the design achieves the required calibration thresholds (middle row), and the probability of compelling evidence for the null hypothesis (in this case, the hypothesis $H_-$) in the bottom row. Note that the oscillations are due to the discrete nature of the binomial distribution; the package algorithm ensures that for the next 10 sample sizes, the power does not drop below the required threshold. Likewise, the package ensures that the type-I-error rate does not increase above the required alpha level, and that the probability of compelling evidence does not drop below its required threshold. It is straightforward to check this visually by means of the provided output plots, too. If no plots are required, use the option `numeric` instead of `plot` for the `output` argument.

The resulting plot shows that while the type I error is calibrated for $n=10$ patients per trial arm, Bayesian power does not reach our desired level of 80\% even for $n=75$ patients in total (in both arms). We could increase the range, or alternatively, use more informative design priors under which the hypotheses under comparison are better separated. Right now, we essentially assume that everything is equally likely under our design priors, although we should have a clear expectation about the probabilities in the treatment and control arm. Thus, we modify our design priors next. Note that the plot also shows that frequentist power is calibrated for $n=50$ patients per arm when assuming $p_1=0.3$ (control arm probability) and $p_2=0.6$ (treatment arm probability).

## 4. Informative design priors

Now, the example above used flat design priors, which might be unrealistic in a variety of settings.
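Before doing so, it can help to quantify how much an informative choice separates the hypotheses. The short check below is not part of the package; it compares prior means and the prior probability of $p_2 > p_1$ for the hypothetical pair $p_1\sim\mathrm{Beta}(1,2)$, $p_2\sim\mathrm{Beta}(2,1)$:

```{r}
# Under flat Beta(1, 1) design priors, P(p2 > p1) is only 0.5.
# The informative pair p1 ~ Beta(1, 2), p2 ~ Beta(2, 1) shifts prior
# mass towards the alternative H+: p2 > p1.
c(p1_mean = 1 / (1 + 2),   # E[p1] = 1/3 under Beta(1, 2)
  p2_mean = 2 / (2 + 1))   # E[p2] = 2/3 under Beta(2, 1)

set.seed(42)
mean(rbeta(1e5, 2, 1) > rbeta(1e5, 1, 2))   # approx. 5/6 = 0.833
```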
Next, we perform a sample size search for a new ICT-107-type trial (balanced arms) under informative design priors with very strong evidence thresholds. Notice the additionally specified parameters `a_1_d = 1, b_1_d = 2` and `a_2_d = 2, b_2_d = 1`, which are the hyperparameters of the Beta design priors for $p_1$ and $p_2$ under $H_+$.

```{r, out.width='100%'}
ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = FALSE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"      # returns the recommended n per group
)
```

We see that now the Bayesian power is calibrated for $n=72$ patients in total, while the frequentist power is calibrated for $n=77$ patients in total. Importantly, the frequentist type-I-error rate is now only $0.041<0.05$, as stated by the console output of the function. Thus, the design is fully calibrated except for the probability of compelling evidence for $H_-$ shown in the bottom plot.

Therefore, we next perform a sample size search for a new ICT-107-type trial (balanced arms) under informative design priors with very strong evidence thresholds, with the design prior under $H_-$ modified to achieve the required probability of compelling evidence $PCE(H_0)$ at even smaller sample sizes.
Note that now, additionally, the hyperparameters of the Beta design priors for $p_1$ and $p_2$ under $H_-$ are specified via `a_1_d_Hminus = 2, b_1_d_Hminus = 1` and `a_2_d_Hminus = 1, b_2_d_Hminus = 2`:

```{r, out.width='80%'}
ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = FALSE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot"      # returns the recommended n per group
)
```

The result is a fully calibrated Bayesian design which meets the Bayesian and frequentist power demands, the Bayesian and frequentist type-I-error rate requirements, and our requirement on the probability of compelling evidence for $H_0$ (that is, $H_-$ in this case). The calibration with the `bfbin2arm` package reveals several aspects. If a balanced design with equal randomization probabilities is desired, then:

- **n=77 patients in total** (about 39 patients per trial arm) are needed for 80% frequentist power at the ICT-107 effect size when evidence threshold $k=1/30$ is used. Here, the assumption is that the true proportions are $p_1=0.3$ and $p_2=0.6$, which can easily be modified if a more optimistic or pessimistic assumption is warranted.
- **n=72 patients in total** (36 patients per trial arm) are needed for 80% Bayesian power at the ICT-107 effect size when evidence threshold $k=1/30$ is used and slightly informative Beta design priors are assumed under $H_+$.
- **Type-I error control** holds both from a frequentist perspective (≤5% across designs when $k=1/30$ is used) and from a Bayesian perspective, where for the latter only **$n=10$ patients in total** (5 patients per trial arm) are required.
- **High P(CE|H-)** guarantees that under $H_-$ there is an 80\% probability to find a Bayes factor of at least $k_f=30$ in favour of $H_-$.
**n=72 patients in total** (36 patients per trial arm) are required to assert this probability of compelling evidence for $H_-$.

## 5. Unequal randomization probabilities

In the original ICT-107 trial, $2/3$ of the patients were randomized into the treatment group, while $1/3$ of the patients were randomized into the control group. We can use the parameters `alloc1` and `alloc2` to specify randomization probabilities for the control and treatment arms and carry out the Bayesian sample size calculations based on these randomization probabilities. As an example, we rerun the last calibration, but use the randomization probabilities of the ICT-107 trial:

```{r, out.width='80%'}
ntwoarmbinbf01(
  k = 1/30, k_f = 30,
  power = 0.8, alpha = 0.05, pce_H0 = 0.8,
  test = "BF+-",
  nrange = c(10, 100), n_step = 1,
  progress = FALSE,
  a_1_d = 1, b_1_d = 2,
  a_2_d = 2, b_2_d = 1,
  a_1_d_Hminus = 2, b_1_d_Hminus = 1,
  a_2_d_Hminus = 1, b_2_d_Hminus = 2,
  compute_freq_t1e = TRUE,
  p1_power = 0.3, p2_power = 0.6,
  output = "plot",     # returns the recommended n per group
  alloc1 = 1/3, alloc2 = 2/3
)
```

Remember that the sample size shown on the x-axis of the power and type-I-error rate plot, as well as of the probability of compelling evidence plot, is the total sample size across both arms. We see that we now need $n=83$ patients in total to reach a Bayesian power of 80\%, while $n=86$ patients in total are required for a frequentist power of 80\%. The probability of compelling evidence reaches 80\% at $n=83$ patients in total. Note, however, that the frequentist type-I-error rate is now exactly at the boundary, which might be too liberal for some. As the frequentist type-I-error rate assumes fixed success probabilities in both trial arms and is independent of the design priors, we must decrease the evidence threshold $k$ slightly to lower the frequentist type-I-error rate accordingly. Just try it out yourself: decrease $k$ from $k=1/30$ to $k=1/40$ and rerun the last code block.

## 6. Design Recommendations Based on the Calibration

If the original 2:1 randomization of the ICT-107 trial is used and two thirds of the patients are randomized into the treatment group, then:

- **n=96 patients in total** (32 patients in the control arm and 64 in the treatment arm) are needed for 80% frequentist power at the ICT-107 effect size when evidence threshold $k=1/40$ is used. Here, the assumption is that the true proportions are $p_1=0.3$ and $p_2=0.6$, which can easily be modified if a more optimistic or pessimistic assumption is warranted.
- **n=92 patients in total** (31 patients in the control arm and 61 in the treatment arm) are needed for 80% Bayesian power at the ICT-107 effect size when evidence threshold $k=1/40$ is used and slightly informative Beta design priors are assumed under $H_+$.
- **Type-I error control** holds both from a frequentist perspective (≤5% across designs when $k=1/40$ is used) and from a Bayesian perspective, where for the latter only **$n=10$ patients in total** (both arms) are required.
- **High P(CE|H-)** guarantees that under $H_-$ there is an 80\% probability to find a Bayes factor of at least $k_f=30$ in favour of $H_-$. **n=83 patients in total** (28 in the control arm and 55 in the treatment arm) are required to assert this probability of compelling evidence for $H_-$.

To fulfill all four requirements, it thus suffices to enroll 32 patients in the control arm and 64 in the treatment arm, and to use the Bayes factor thresholds $k=1/40$ and $k_f=30$ for decision making about the hypotheses $H_+$ and $H_-$ under consideration.

## 7. Predictive Densities

If desired, we can also compare the predictive densities under the different hypotheses directly via:

```{r}
pred_H0    <- predictiveDensityH0(y1_ict, y2_ict, n1_ict, n2_ict)
pred_H1    <- predictiveDensityH1(y1_ict, y2_ict, n1_ict, n2_ict)
pred_Hplus <- predictiveDensityHplus_trunc(y1_ict, y2_ict, n1_ict, n2_ict)

data.frame(
  Hypothesis = c("H0: p1=p2", "H1: p1 != p2", "H+: p2>p1"),
  "Pred. Density" = round(c(pred_H0, pred_H1, pred_Hplus), 4)
)
```

## References

Wen PY, et al. (2019). A Randomized Double-Blind Placebo-Controlled Phase II Trial of Dendritic Cell Vaccine ICT-107. *Clinical Cancer Research*. PMID: 31320597.