Welcome to jumble! This package was written as a companion to academic work by researchers at Brown University on re-randomization for cluster-randomized nursing home trials.
```r
library(tidyverse)
library(ggplot2)
library(jumble)
```
In this vignette we explain how to use covariate balance as a criterion for re-randomization and introduce the companion functions. The basic procedure outlined by @MorganKL2015 is as follows (a minimal sketch of the accept/reject loop appears after the list):
1) Select units for the comparison of treatments, and collect covariate data on all units.
2) Define an explicit criterion for covariate balance.
3) Randomize units to treatment groups.
4) Check covariate balance and return to Step 3 if the allocation is unacceptable according to the criterion specified in Step 2; continue until the balance is acceptable.
5) Conduct the experiment.
6) Perform inference (using a randomization test that follows exactly steps 2-4).
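To make the accept/reject loop of Steps 3-4 concrete, here is a minimal sketch in base R. This is an illustration only, not the package implementation; `balance_stat` is a hypothetical user-supplied function that returns a balance statistic for a proposed assignment.

```r
# Minimal sketch of Steps 3-4: randomize until the balance criterion is met.
# 'balance_stat' is hypothetical: assignment vector -> balance statistic.
rerandomize_sketch <- function(balance_stat, n, cutoff, max_tries = 1e5) {
  for (i in seq_len(max_tries)) {
    assign <- sample(rep(c('a', 'b'), length.out = n))  # Step 3: randomize
    if (balance_stat(assign) <= cutoff) return(assign)  # Step 4: check balance
  }
  stop('No acceptable randomization found within max_tries')
}
```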
The example dataset used in this package is freely available and comes from a clinical HIV therapy trial conducted in the 1990s.[@HammerSM1996] See `?jumble::ACTG175`.
```r
df <- jumble::ACTG175

# Variables for balancing
vars <- c('age', 'race', 'gender', 'symptom', 'wtkg', 'hemo',
          'msm', 'drugs', 'karnof', 'oprior')
```
We will use the Mahalanobis distance, scaled by a sample-size constant, to measure covariate balance and, together with an acceptance probability, identify an acceptable randomization.[@MorganKL2015]
$$ M \equiv \frac{n_t n_c}{n} \left(\bar{X}_t - \bar{X}_c\right)' \, \mathrm{cov}(X)^{-1} \left(\bar{X}_t - \bar{X}_c\right) $$

where $n$ is the total sample size, $n_t$ and $n_c$ are the treated and control group sizes, $\bar{X}_t$ and $\bar{X}_c$ are the covariate mean vectors, and $\mathrm{cov}(X)$ is the sample covariance matrix of the covariates.
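For intuition, here is a hand-rolled version of this formula in base R. It is an illustration only; the vignette below uses the package's `mdis_grps()`, with the sample-size constant applied separately.

```r
# Base-R sketch of the M-distance between two covariate groups
m_dist <- function(x_t, x_c) {
  n_t <- nrow(x_t); n_c <- nrow(x_c); n <- n_t + n_c
  d <- colMeans(x_t) - colMeans(x_c)  # difference in covariate mean vectors
  S <- cov(rbind(x_t, x_c))           # sample covariance of the covariates
  as.numeric((n_t * n_c / n) * t(d) %*% solve(S) %*% d)
}
```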
The ACTG 175 trial was a single randomized trial; here are the observed standardized mean differences from that single randomization (difference in group means divided by the pooled standard deviation).
```r
df_grp_means <- df[, c('arms', vars)] %>%
  group_by(arms) %>%
  summarize_all(mean) %>%
  t()
colnames(df_grp_means) <- c('zido', 'combo')

df_grp_sd <- df[, c('arms', vars)] %>%
  summarize_all(sd) %>%
  t()

# Standardized mean difference: difference in group means / pooled SD
smd <- (df_grp_means[, 1] - df_grp_means[, 2]) / df_grp_sd
smd %>% round(4)
```
```r
df_t <- df[df$arms == 1, vars]
df_c <- df[df$arms == 0, vars]

# Sample size constant: (n_t * n_c) / n
ssc <- (nrow(df_c) * nrow(df_t)) / nrow(bind_rows(df_t, df_c))

cat('The M-distance for this trial, estimated using patient characteristics is: \n',
    mdis_grps(df_t, df_c) * ssc)
```
But is this a good or a bad randomization? To put the value in context, there are two options.
Assuming normality of the covariate means, the M statistic is chi-squared distributed with k degrees of freedom, where k is the number of covariates. The alternative is to conduct a randomization test: compute the M-distance for each of many randomizations and use the empirical distribution of M to identify an acceptable cut-off.
Using guidance from the Morgan & Rubin paper, an acceptance probability of 0.001 is a reasonable starting point.[@MorganKL2015] That is, we would like to find an M-distance cut-off representing the lower 0.1% of the distribution of M (i.e., the most balanced randomization draws).
```r
k <- length(df_t)  # number of covariates = degrees of freedom
Pa <- 0.001        # acceptance probability
chi_val <- qchisq(Pa, k)

cat('M-distance cut-off using 10 covariates is:\n ', round(chi_val, 2))
```
Here we perform a permutation test with 10,000 simulations, randomly assigning individuals to treatment groups. For each simulation we compute the M-statistic and store the value. Assignment is done using the `rnd_allot` function; see `?rnd_allot`.
```r
permute_m <- function(df, vars, seed) {
  set.seed(seed)
  id <- 1:nrow(df)
  assign <- rnd_allot(id)
  df_2 <- df %>% mutate(arms = assign$group)  # fair-coin assignment
  df_t <- df_2[df_2$arms == 'a', vars]
  df_c <- df_2[df_2$arms == 'b', vars]
  # Sample size constant: (n_t * n_c) / n
  ssc <- (nrow(df_c) * nrow(df_t)) / nrow(bind_rows(df_t, df_c))
  df_cov <- cov(df[, vars])
  return(Rfast::mahala(colMeans(df_t), colMeans(df_c), sigma = df_cov) * ssc)
}
```
```r
seed_list <- sample(-10000:10000, 10000)  # random seeds
sims <- sapply(seed_list, function(x) permute_m(df, vars, x)) %>% unlist()
```
```r
label <- paste0('Pa, 0.001 = ', round(quantile(sims, 0.001), 4))

tibble(x = sims) %>%
  ggplot(aes(x = x)) +
  geom_histogram(color = 'darkblue', fill = 'lightblue', bins = 50) +
  geom_vline(aes(xintercept = quantile(sims, 0.001)),
             color = "blue", linetype = "dashed", size = 1) +
  annotate("text", x = quantile(sims, 0.975), y = 500, label = label)
```
We can see from the empirical distribution that M ~ 8 is about average for achieving balance. So the trial had a decent randomization, but it could have been better: the empirical distribution places the acceptance probability of 0.001 at M values of `r round(quantile(sims, 0.001), 4)` or lower.
At this point we have completed steps 1-3: we have our entire cohort, a list of covariates, and a defined criterion for accepting a randomization.
Now we must conduct the re-randomization procedure. The `rnd_rerand` function executes a single restricted randomization given a covariate matrix `x` and a Mahalanobis cut-off.
```r
library(microbenchmark)

runs <- 100
tst_run <- microbenchmark(
  rerandomize = rnd_rerand(x = df[, vars], cutoff = quantile(sims, 0.001)),
  times = runs, unit = 's')

# microbenchmark records times in nanoseconds; convert to seconds and average
avg_run_s <- round(sum(tst_run$time / 10^9) / runs, 3)

cat('Average runtime in seconds to find 1 acceptable randomization:', avg_run_s)
```
A single restricted randomization looks favorable compared with the trial result:
```r
re_rand <- rnd_rerand(x = df[, vars],
                      cutoff = quantile(sims, 0.001),
                      seed = as.integer(as.Date('2019-06-28')))
```
Example Table 1. Actual trial randomization vs. a single restricted re-randomization:
```r
# Trial randomization: group means and standardized mean differences
c1 <- df[df$arms == 0, vars] %>% summarize_all(mean) %>% t()
c2 <- df[df$arms == 1, vars] %>% summarize_all(mean) %>% t()
psd <- lapply(df[, vars], sd) %>% unlist()  # pooled (full-sample) SDs

tab_1_rnd <- tibble('Group a (trial)' = c1[, 1],
                    'Group b (trial)' = c2[, 1],
                    'psd' = psd) %>%
  mutate(`Trial Mean Std. Diff` = (`Group a (trial)` - `Group b (trial)`) / psd) %>%
  select(1, 2, 4)

# Re-randomization: group means and standardized mean differences
# (psd is unchanged because the cohort is the same individuals)
c1 <- re_rand[re_rand$group == 'a', vars] %>% summarize_all(mean) %>% t()
c2 <- re_rand[re_rand$group == 'b', vars] %>% summarize_all(mean) %>% t()

tab_1_re <- bind_cols('Group a (re-rand)' = c1[, 1],
                      'Group b (re-rand)' = c2[, 1],
                      'psd' = psd) %>%
  mutate(`Re-rand. Mean Std. Diff` = (`Group a (re-rand)` - `Group b (re-rand)`) / psd) %>%
  select(1, 2, 4)

# Combine, then append the average and SD of the standardized differences
tab_1 <- bind_cols(tab_1_rnd, tab_1_re) %>% as.data.frame()
nrows_vars <- length(vars)
tab_1[nrows_vars + 1, 3] <- mean(tab_1[1:nrows_vars, 3])
tab_1[nrows_vars + 1, 6] <- mean(tab_1[1:nrows_vars, 6], na.rm = TRUE)
tab_1[nrows_vars + 2, 3] <- sd(tab_1[1:nrows_vars, 3])
tab_1[nrows_vars + 2, 6] <- sd(tab_1[1:nrows_vars, 6], na.rm = TRUE)

var_tri <- sd(tab_1[1:nrows_vars, 3])
var_rernd <- sd(tab_1[1:nrows_vars, 6])
var_pct <- (1 - var_rernd / var_tri) * 100  # percent reduction in variance
rownames(tab_1) <- c(vars, 'Average', 'Std. Dev.')

apply(tab_1, 2, formatC, digits = 2) %>% knitr::kable()
```
The average standardized mean difference between groups is ~0 for both the trial and the re-randomization. This is expected: randomization is designed to achieve a zero mean difference in covariates between assigned groups, i.e. $E[\bar{X}_t - \bar{X}_c] = 0$. What is notable is that re-randomization achieves a substantial reduction in the variance, moving from `r formatC(var_tri, digits = 3)` to `r formatC(var_rernd, digits = 3)`, a reduction of `r formatC(var_pct, digits = 2)`%.
This reduction in variance is the driver behind re-randomization's ability to significantly increase study power to detect a difference. As long as the primary outcome is evaluated using a randomization (permutation) test that applies the same acceptance criterion as the re-randomization, the resulting confidence intervals will be narrower than those from a comparable test under standard parametric assumptions.[@MorganKL2015]
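A sketch of such a randomization test is below. It assumes a hypothetical outcome column `y` (the actual ACTG175 outcomes are documented in `?jumble::ACTG175`) and that `rnd_rerand` returns groups in the same row order as `x`; see the permutation vignette for the supported workflow.

```r
# Randomization test that draws only from acceptably balanced allocations.
# 'y' is a hypothetical outcome column, used here for illustration.
obs_diff <- mean(df$y[df$arms == 1]) - mean(df$y[df$arms == 0])

null_diffs <- sapply(1:1000, function(i) {
  draw <- rnd_rerand(x = df[, vars], cutoff = quantile(sims, 0.001), seed = i)
  mean(df$y[draw$group == 'a']) - mean(df$y[draw$group == 'b'])
})

# Two-sided p-value from the restricted randomization distribution
p_val <- mean(abs(null_diffs) >= abs(obs_diff))
```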
The trade-off is computation time: at Pa = 0.001 you must perform ~1,000 randomizations, on average, to find one acceptable allocation. Even if it takes only 1-3 seconds to find one acceptable randomization, it could take several minutes at least to accumulate enough acceptable permutations for a randomization test. And if each analysis is a complex regression that itself takes seconds to minutes, you could be waiting a long time.
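As a back-of-envelope check on that figure: the number of draws needed to find one acceptable allocation is geometrically distributed, so the expected count is simply 1 / Pa.

```r
# Expected number of unrestricted draws per acceptable allocation
Pa <- 0.001
1 / Pa  # ~1,000 draws on average; multiply by per-draw cost for total time
```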
We can see how much tighter the covariate differences are if we simulate 1,000 randomizations under each strategy.
```r
rndms <- 1000L

# Matrix of standardized mean differences under unrestricted randomization
rnd_mn_diff <- matrix(nrow = rndms, ncol = length(vars))
seeds <- sample(-100000:100000, rndms)

for (i in 1:rndms) {
  rnd_id <- df$pidnum
  set.seed(seeds[i])
  df_iter <- bind_cols(df[, vars], rnd_allot(rnd_id)[, 2]) %>% as.data.frame()
  mn_diff_i <- (colMeans(df_iter[df_iter$group == 'a', vars]) -
                colMeans(df_iter[df_iter$group == 'b', vars])) /
    (lapply(df_iter[, vars], sd) %>% unlist())
  rnd_mn_diff[i, ] <- mn_diff_i
}
```
```r
# Matrix of standardized mean differences under restricted re-randomization;
# the final column stores the seed used for each draw
rernd_mn_diff <- matrix(nrow = rndms, ncol = length(vars) + 1)
sd_iter <- sample(-100000:100000, rndms)

for (i in 1:rndms) {
  df_iter <- rnd_rerand(x = df[, vars],
                        cutoff = quantile(sims, 0.001),
                        seed = sd_iter[i])
  mn_diff_i <- (colMeans(df_iter[df_iter$group == 'a', vars]) -
                colMeans(df_iter[df_iter$group == 'b', vars])) /
    (lapply(df_iter[, vars], sd) %>% unlist())
  rernd_mn_diff[i, 1:length(mn_diff_i)] <- mn_diff_i
  rernd_mn_diff[i, ncol(rernd_mn_diff)] <- attr(df_iter, 'seed')  # store seed
}

# Confirm all randomizations are unique (i.e., no seed was drawn twice)
paste0(rndms, ' unique randomizations obtained')
nrow(unique(rernd_mn_diff)) == rndms
```
```r
rnd_mn_diff <- rnd_mn_diff %>%
  as.data.frame() %>%
  mutate(random = 'unrestricted')

rernd_mn_diff <- rernd_mn_diff %>%
  as.data.frame() %>%
  select(-11) %>%  # drop the stored seed column
  mutate(random = 'restricted')

md_diff <- bind_rows(rnd_mn_diff, rernd_mn_diff) %>%
  mutate(random = factor(random))
colnames(md_diff) <- c(vars, 'random')
```
```r
md_diff_grp <- gather(md_diff, key = 'var', value = 'x', -random)

# Note: axis labels refer to the aesthetics before coord_flip()
ggplot(md_diff_grp, aes(x = factor(var), y = x, fill = random)) +
  geom_boxplot() +
  xlab('Variable') +
  ylab('Standardized Mean Difference Between Groups') +
  coord_flip()
```
You can see how much tighter the distributions are for re-randomization!
Refer to the permutation vignette for implications of this randomization strategy in the analysis stage.
The primary author of the package is Kevin W. McConeghy.