Welcome to jumble! This program was written as a companion to academic work performed by researchers at Brown University to conduct randomization procedures for cluster-randomized nursing home trials.

library(tidyverse)
library(ggplot2)
library(jumble, warn.conflicts)

Introduction

In this vignette we will explain how to use Mahalanobis distance as a measure of covariate balance between two groups.

Example dataset

The example dataset used in this package is freely available and comes from a clinical HIV therapy trial conducted in the 1990s.[@HammerSM1996] See: ?jumble::ACTG175

Load dataset

df <- jumble::ACTG175

#Variables for balancing  
var_nms <- c('age', 'race', 'gender', 'symptom', 'wtkg', 'hemo', 
          'msm', 'drugs', 'karnof', 'oprior')

Measure of covariate balance

Imagine a randomized controlled trial of patients, assigned to treatment A or B. The primary goal is to have singular statistic to describe how closely 'balanced' individuals in group A are to group B. Morgan and Rubin describe the qualities of this statistic which are desirable.[@MorganKL2015] Namely it should be 'affinely invariant'. Which means that an affine transformation of the covariates would lead to the same re-randomization acceptance. This kind of metric will reflect the joint distribution of covariates, and provide equal balance for any linear combination of covariates. Mahalanobis distance meets these criteria.

Mahalanobis distance

Mahalanobis or M-distance measures the distance between point P and distribution D. It is generally a measure of many standard deviations P is from the mean D.full description

Mahalanobis distance can be used as a scalar measure of multivariate balance using the following formula.[@MorganKL2015]

$$ M \equiv \frac{n_tn_c}n (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) $$ Where n is the sample size, t treated, c controls, and X represents the covariate means.

Alternatively, for stratified randomization by a distance measure, you can compute each unit's distance from the overall group mean.

$$ M \equiv (x_i - \bar{X}) ' cov(X)^{-1} (\bar{x_i} - \bar{X}) $$

Trial Covariate Balance

The ACT trial was a single randomized trial, here are the observed mean differences +/- standard deviation in that single randomization.

df_grp_means <- df[,c('arms', var_nms)] %>%
  group_by(arms) %>%
  summarize_all(., mean) %>%
  t(.) 

colnames(df_grp_means) <- c('zido', 'combo')

df_grp_sd <- df[, var_nms] %>%
  summarize_all(sd) %>%
  t(.)

t <- (df_grp_means[var_nms, 1] - df_grp_means[var_nms, 2]) /  df_grp_sd 

t %>% round(., 4)

These are the standardized mean differences in selected covariates, with large treatment effects or sample sizes, small differences may not be significant, but if evaluating small effects, could be an important source of confounding.

What is the mahalanobis distance for these groups?

To compute the mahalanobis formula, use must do some matrix algebra.

Reduce dataset to needed covariates, format

1) Only include treatment indicator and k covarates
2) Ensure all covariates are numeric

df_mdist <- df[,c('arms', var_nms)] %>%
  map_dfr(., as.numeric) 

Sample size constant

$$ \frac{n_tn_c}n$$

This is easy to compute

ssc <- (nrow(df_mdist[df_mdist$arms==0, ]) * nrow(df_mdist[df_mdist$arms==1, ])) / (nrow(df_mdist))
cat('Sample size constant: ', round(ssc, 3))

Covariate means

$$ \bar{X_t} - \bar{X_c}$$

X_t <- colMeans(df_mdist[df_mdist$arms==0, var_nms])

X_c <- colMeans(df_mdist[df_mdist$arms==1, var_nms])

X_delta <- X_t - X_c 
cat('Difference in covariate means by treatment: \n ', round(X_delta, 3))

Inverse of covariance of covariate matrix

df_cov <- df_mdist %>%
  select(-arms) %>%
  cov(.) %>%
  solve(.) # inverse

round(df_cov, 1)

Bring formula together

$$ M \equiv \frac{n_tn_c}n (\bar{X_t} - \bar{X_c}) ' cov(X)^{-1} (\bar{X_t} - \bar{X_c}) $$

M_man = ssc * (t(X_delta) %*% (df_cov %*% X_delta)) %>% as.vector(.)
cat('Manually computed M-distance: ', M_man)

jumble package function mdis_grps:

M_jumb <- mdis_grps(df_mdist[df_mdist$arms==0, 2:length(df_mdist)], 
          df_mdist[df_mdist$arms==1, 2:length(df_mdist)]) * ssc 
cat('Jumble M-distance: ', M_jumb)

There is also an R-function which can compute M-distance.

df_cov <- df_mdist %>%
  select(-arms) %>%
  cov(.)
M_base <- mahalanobis(X_t, X_c, cov = df_cov) * ssc

cat('Base R Mahalanobis: ', M_base)

There is also an Rfast function using C++:

M_fast <- Rfast::mahala(X_t, X_c, sigma = df_cov) * ssc
cat('Rfast Mahalanobis: ', M_fast)

Manual calculation same as Base, same as Rfast

all.equal(M_man, M_jumb, M_base, M_fast) # Nearly equal

All functions provide nearly equal answers.

library(microbenchmark)
microbenchmark(
  jumble = mdis_grps(df_mdist[df_mdist$arms==0, var_nms], 
                     df_mdist[df_mdist$arms==1, var_nms]) * ssc,
  rbase = mahalanobis(X_t, X_c, cov = df_cov) * ssc,
  rfast = Rfast::mahala(X_t, X_c, sigma = df_cov) * ssc,
  times = 1000
)

Rfast blows away competition!

Typically you would take the square root of this distance, but that is not necessary for our purposes.

When performing a stratified analysis, M-distance can be used to construct the strata and perform a permuted block randomization.

In this case, you are evaluating each person / unit's distance from the mean of the whole group, then doing a stratified randomization by groups of like distance. Note: M-distance is an absolute measure, so does not pair those with similar covariates, but rather the joint distribution of covariates is similarly different from the mean values. So a distant value could mean covariates in the upper or lower quartile.

The formula is now:

$$ M \equiv ({X_i} - \bar{X}) ' cov(X)^{-1} ({X_i} - \bar{X}) $$

X_delta <- apply(df_mdist, 2, function(x) x - mean(x))[, var_nms] 
df_cov <- df_mdist %>%
  select(-arms) %>%
  cov(.) %>%
  solve(.) # inverse
int_1 <- (X_delta %*% df_cov)
int_2 <- t(X_delta)
M_man <-  sqrt(diag(int_1 %*% int_2))

tst <- df_mdist[, var_nms] %>% as.matrix(.)
M_base <- sqrt(mahalanobis(tst, colMeans(tst), cov=cov(tst)))

M_fast <- Rfast::mahala(tst, colMeans(tst), sigma=cov(tst))

Test all methods give equal answers:

all.equal(M_man, M_base, M_fast) # Nearly equal

All functions provide nearly equal answers.

Benchmarking

Because a limitation of re-randomization is computation time, it is important to have a function which can compute M-distance quickly.

library(microbenchmark)
microbenchmark(
  mine = mdis_chrt(tst),
  rfast = Rfast::mahala(tst, colMeans(tst), sigma=cov(tst)),
  rbase = {mahalanobis(tst, colMeans(tst), cov=cov(tst))},
  times = 1000
)

My handwritten code is very in-efficient, Rfast outperforms the base code.
In the re-randomization procedures, the R-fast computation is used for speed. Unit testing is in place to ensure the Mahalanobis calculations are consistent across Rfast, Base R and the manual computations for version control and reproducibility.

Contacting authors

The primary author of the package was Kevin W. McConeghy. See here

References



kmcconeghy/jumble documentation built on Jan. 8, 2020, 8:52 a.m.