Welcome to jumble! This program was written as a companion to academic work performed by researchers at Brown University to conduct re-randomization for cluster-randomized nursing home trials.

library(tidyverse)
library(ggplot2)
library(jumble)

Introduction

In this vignette we will explain how to use covariate balance as criteria for re-randomization and introduce the companion functions. The basic procedure outlined by @MorganKL2015 is as follows:

1) Select units for the comparison of treatments, and collect covariate data on all units.

2) Define an explicit criterion for covariate balance.

3) Randomize units to treatment groups.

4) Check covariate balance and return to Step 3 if the allocation is unacceptable according to the criterion specified in Step 2; continue until the balance is acceptable.

5) Conduct the experiment.

6) Perform inference (using a randomization test that follows exactly steps 2-4).

Example dataset

The example dataset used in this package is freely available and comes from a clinical HIV therapy trial conducted in the 1990s.[@HammerSM1996] See: ?jumble::ACTG175

Load dataset

df <- jumble::ACTG175

#Variables for balancing  
vars <- c('age', 'race', 'gender', 'symptom', 'wtkg', 'hemo', 
          'msm', 'drugs', 'karnof', 'oprior')

Measure of covariate balance

We will use Mahalanobis distance with sample size constant to estimate covariate balance, acceptance probability and identify an acceptable randomization.[@MorganKL2015]

$$ M \equiv \frac{n_tn_c}n (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) $$ Where n is the sample size, t treated, c controls, and X represents the covariate means.

Trial Covariate Balance

The ACT trial was a single randomized trial, here are the observed standardized mean differences in that single randomization (difference in group means / pooled standard deviation).

df_grp_means <- df[, c('arms', vars)] %>%
  group_by(arms) %>%
  summarize_all(., mean) %>%
  t(.) 
colnames(df_grp_means) <- c('zido', 'combo')

df_grp_sd <- df[, c('arms', vars)] %>%
  summarize_all(sd) %>%
  t(.)

t <- (df_grp_means[, 1] - df_grp_means[, 2]) /  df_grp_sd 

t %>% round(., 4)
df_t <- df[df$arms==1, vars]

df_c <- df[df$arms==0, vars] 

ssc <- (nrow(df_c) * nrow(df_t)) / (nrow(bind_rows(df_t, df_c))) # sample size correction

cat('The M-distance for this trial, estimated using patient characteristics is:  \n', mdis_grps(df_t, df_c) * ssc)

But is this a good or bad randomization? How to get some context:

Distribution of M

Assuming normality of covariate means, the M statistic is Chi-square distributed with k degrees of freedom corresponding to number of covariates. The alternative is to conduct a randomization test, compute M-distance for each randomization and use the empirical distribution of M to identify an acceptable cut-off.

M-distance - Chi-sq.

Using guidance from Morgan & Rubin paper, an acceptance probability of 0.001 is a reasonable starting point.[@MorganKL2015] This is saying that we would like to find a M-distance cut-off which represents the lower 0.1% of the distribution of M (i.e. the most balanced randomization draws).

k <- length(df_t) #number of covariates, degrees of freedom
Pa <- 0.001

chi_val <- qchisq(0.001, k)
cat('M-distance cut-off using 10 covariates is:\n ', round(chi_val,2))

M-distance - empirical distribution

Here we perform a Permutation with 10,000 simulations, where randomly assign individuals to treatment group. We then compute the M-statistic and store the value.

Assignment done using rnd_allot function, ?rnd_allot.

permute_m <- function(df, vars, seed) {
  set.seed(seed)

  id <- 1:nrow(df)

  assign <- rnd_allot(id)

  df_2 <- df %>%
    mutate(arms = assign$group) #fair coin assignment

  df_t <- df_2[df_2$arms=='a', vars] 

  df_c <- df_2[df_2$arms=='b', vars] 

  ssc <- (nrow(df_c) * nrow(df_t)) / (nrow(bind_rows(df_t, df_c))) # sample size correction

  df_cov <- cov(df[, vars]) 

  return(Rfast::mahala(colMeans(df_t), colMeans(df_c), sigma = df_cov) * ssc)
}
seed_list <- sample(-10000:10000, 10000) #random seeds

sims <- sapply(seed_list, function(x) permute_m(df, vars, x)) %>%
  unlist(.)
label <-  paste0('Pa, 0.001 = ', round(quantile(sims, 0.001), 4))  

tibble(x=sims) %>%
  ggplot(., aes(x=x)) +
    geom_histogram(color = 'darkblue', fill = 'lightblue', bins = 50) +
    geom_vline(aes(xintercept=quantile(sims, 0.001)),
            color="blue", linetype="dashed", size=1) +
    annotate("text", x = quantile(sims, 0.975), y = 500, label=label) 

We can see from the empirical distribution that M ~ 8 is about average for achieving balance. So the trial had a decent randomization but it could have been better, the empirical distribution suggests an acceptance probability of 0.001 is seen at M values of r round(quantile(sims, 0.001), 4) or lower.

At this point we have completed steps 1-3, we have our entire cohort, a list of covariates, and a defined criterion for accepting a randomization.
Now we must conduct the re-randomization procedure...

Re-randomization

rnd_rerand Will execute a single randomization given x and Mahalanobis cutoff.

runs <- 100

library(microbenchmark)
tst_run <- microbenchmark(
  rerandomize = rnd_rerand(x = df[, vars],
                            cutoff = quantile(sims, 0.001)), 
  times=runs, unit = 's')

avg_run_s<- round(sum(tst_run$time / 10^9) / runs, 3) 
cat('Average runtime in seconds to find 1 acceptable randomization:', avg_run_s)

A single restricted randomization looks favorable compared with the trial result:

re_rand <- rnd_rerand(x = df[, vars],
                      cutoff = quantile(sims, 0.001),
                      seed = as.integer(as.Date('2019-06-28')))

Example Table 1. Actual trial randomization

c1 <- df[df$arms==0, vars] %>%
  summarize_all(., mean) %>% t(.)

c2 <- df[df$arms==1, vars] %>%
  summarize_all(., mean) %>% t(.)

psd <- lapply(df[, vars], sd) %>% unlist(.)

tab_1_rnd <- tibble('Group a (trial)' = c1[, 1], 
                    'Group b (trial)' = c2[, 1], 
                    'psd' = psd) %>%
  mutate(`Trial Mean Std. Diff` = (`Group a (trial)` - `Group b (trial)`) / `psd`) %>%
  select(1,2,4) 

c1 <- re_rand[re_rand$group=='a', vars] %>%
  summarize_all(., mean) %>% t(.)

c2 <- re_rand[re_rand$group=='b', vars] %>%
  summarize_all(., mean) %>% t(.)

re_psd <- lapply(re_rand[, vars], sd) %>% unlist(.)

tab_1_re <- bind_cols('Group a (re-rand)' = c1[, 1], 
                      'Group b (re-rand)' = c2[, 1], 
                      'psd'=psd) %>%
  mutate(`Re-rand. Mean Std. Diff` = (`Group a (re-rand)` - `Group b (re-rand)`) / `psd`) %>%
  select(1,2,4) 

tab_1 <- bind_cols(tab_1_rnd, tab_1_re) %>%
  as.data.frame(.)

nrows_vars <- length(vars)

tab_1[nrows_vars+1, 3] <- mean(tab_1[1:nrows_vars, 3])
tab_1[nrows_vars+1, 6] <- mean(tab_1[1:nrows_vars,6], na.rm=T)
tab_1[nrows_vars+2, 3] <- sd(tab_1[1:nrows_vars, 3])
tab_1[nrows_vars+2, 6] <- sd(tab_1[1:nrows_vars, 6], na.rm=T)

var_tri <- sd(tab_1[1:nrows_vars, 3])
var_rernd <- sd(tab_1[1:nrows_vars, 6])
var_pct <- (var_rernd / var_tri * 100)

rownames(tab_1) <- c(vars, 'Average', 'Std. Dev.')
apply(tab_1, 2, formatC, digits=2) %>% unlist(.) %>%
knitr::kable(.)

The average standardized mean difference between groups is ~ 0 for the trial and the re-randomizations. This is expected, randomization is designed to achieve a value of zero in the mean difference of covariates between assigned groups, i.e. (T =1 | E[X]) - (T=0 | E[X]) = 0. But what is notable is that re-randomization achieves a significant reduction in the variance! moving from r formatC(var_tri, digits=3) to r formatC(var_rernd, digits=3), a reduction of r formatC(var_pct, digits=2)%.

This reduction in variance is the driver behind re-randomization's ability to significantly increase study power to detect a difference. As long as the primary outcome is evaluated using a randomization or a 'permutation' test which has the same criteria as re-randomization, the test statistic will have more narrow confidence intervals than a comparable test using standard parametric assumptions. @MorganKL2015

The trade is computation time, you have to perform ~1000 randomizations to find an acceptable one at Pa =0.0001. But if it takes 1-3 seconds to find an acceptable randomization. It could take you several minutes at least to find enough permutations of those acceptable draws. If each analysis is a complex regression which itself takes several seconds - minutes, you could be waiting a long time....

We can see the how much tighter covariate differences are if we perform a simulation of 1000 randomizations under either strategy.

rndms <- 1000L
#Matrix of values  
rnd_mn_diff <- matrix(nrow = rndms, ncol = length(vars))  
seeds <- sample(-100000:100000, rndms)

for (i in 1:rndms) {

  rnd_id <- df$pidnum

  set.seed(seeds[i])

  df_iter <- bind_cols(df[, vars], rnd_allot(rnd_id)[, 2]) %>%
    as.data.frame(.)

  mn_diff_i <- (colMeans(df_iter[df_iter$group=='a', vars]) - 
                colMeans(df_iter[df_iter$group=='b', vars])) / 
                (lapply(df_iter[, vars], sd) %>% unlist(.))

  rnd_mn_diff[i, ] <- mn_diff_i
}
rndms <- 1000L
#Matrix of values  
rernd_mn_diff <- matrix(nrow = rndms, ncol = length(vars)+1)  

sd_iter <- sample(-100000:100000, rndms) 

for (i in 1:rndms) {

  #set seeds

  df_iter <- rnd_rerand(x=df[, vars], 
                        cutoff=quantile(sims, 0.001), 
                        seed = sd_iter[i])

  mn_diff_i <- (colMeans(df_iter[df_iter$group=='a', vars]) - 
              colMeans(df_iter[df_iter$group=='b', vars])) / 
              (lapply(df_iter[, vars], sd) %>% unlist(.))

  rernd_mn_diff[i, 1:length(mn_diff_i)] <- mn_diff_i
  rernd_mn_diff[i, ncol(rernd_mn_diff)] <- attr(df_iter, 'seed') #store seed
}

#Confirm all seeds unique (i.e. didn't pull same seed twice)
paste0(rndms, ' unique randomizations obtained')
nrow(unique(rernd_mn_diff)) == rndms
rnd_mn_diff <- rnd_mn_diff %>%
  as.data.frame(.) %>%
  mutate(random = 'unrestricted')

rernd_mn_diff <- rernd_mn_diff %>%
  as.data.frame(.) %>%
  select(-11) %>%
  mutate(random = 'restricted')

md_diff <- bind_rows(rnd_mn_diff, rernd_mn_diff) %>%
  mutate(random = factor(random))

colnames(md_diff) <- c(vars, 'random')
md_diff_grp <- gather(md_diff, key = 'var', value = 'x', -random)

ggplot(md_diff_grp, 
       aes(x=factor(var), y=x, fill = random)) +
  geom_boxplot() +
  xlab('Standardized Mean Difference Between Groups') +
  ylab('Variable') +
  coord_flip()

You can see how much tighter the distributions are for re-randomization!

Refer to the permutation vignette for implications of this randomization strategy in the analysis stage.

Contacting authors

The primary author of the package was Kevin W. McConeghy. See here

References



kmcconeghy/jumble documentation built on Jan. 8, 2020, 8:52 a.m.