Welcome to jumble! This package was written as a companion to academic work by researchers at Brown University on randomization procedures for cluster-randomized nursing home trials.
```r
library(tidyverse)
library(ggplot2)
library(jumble, warn.conflicts = FALSE)
```
In this vignette we will explain how to use Mahalanobis distance as a measure of covariate balance between two groups.
The example dataset used in this package is freely available and comes from a clinical HIV therapy trial conducted in the 1990s.[@HammerSM1996] See: `?jumble::ACTG175`
```r
df <- jumble::ACTG175

# Variables for balancing
var_nms <- c('age', 'race', 'gender', 'symptom', 'wtkg', 'hemo',
             'msm', 'drugs', 'karnof', 'oprior')
```
Imagine a randomized controlled trial in which patients are assigned to treatment A or B. The primary goal is to have a single statistic that describes how closely 'balanced' individuals in group A are to those in group B. Morgan and Rubin describe the desirable qualities of such a statistic.[@MorganKL2015] Namely, it should be 'affinely invariant', meaning that an affine transformation of the covariates leads to the same re-randomization acceptance. A metric of this kind reflects the joint distribution of the covariates and provides equal balance for any linear combination of them. Mahalanobis distance meets these criteria.
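As a quick numerical illustration (not part of the package), we can check affine invariance with base R's `mahalanobis()`: transforming the covariates by an arbitrary invertible linear map and shift leaves every unit's M-distance unchanged. The matrix `A` and shift `b` below are arbitrary choices for the demonstration.

```r
# Minimal sketch: M-distance is unchanged by an affine transformation
set.seed(42)
X <- matrix(rnorm(200 * 3), ncol = 3)            # 200 units, 3 covariates
A <- matrix(c(2, 0, 1, 1, 3, 0, 0, 1, 1), 3, 3)  # arbitrary invertible map
b <- c(5, -2, 10)                                # arbitrary shift
X_aff <- sweep(X %*% A, 2, b, `+`)               # affine transform: X A + b

d_raw <- mahalanobis(X, colMeans(X), cov(X))
d_aff <- mahalanobis(X_aff, colMeans(X_aff), cov(X_aff))
all.equal(d_raw, d_aff)  # TRUE, up to numerical tolerance
```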
Mahalanobis distance, or M-distance, measures the distance between a point P and a distribution D. Intuitively, it is a measure of how many standard deviations P is from the mean of D.
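For intuition, here is a minimal one-dimensional sketch (the toy data are illustrative): with a single covariate, the squared M-distance is simply the squared z-score.

```r
# Toy illustration: in one dimension, squared M-distance = squared z-score
x <- c(2, 4, 6, 8, 10)
p <- 9
z2 <- ((p - mean(x)) / sd(x))^2
m2 <- mahalanobis(p, mean(x), var(x))
all.equal(z2, m2)  # TRUE
```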
Mahalanobis distance can be used as a scalar measure of multivariate balance using the following formula.[@MorganKL2015]
$$ M \equiv \frac{n_t n_c}{n} (\bar{X}_t - \bar{X}_c)' \mathrm{cov}(X)^{-1} (\bar{X}_t - \bar{X}_c) $$

where $n$ is the total sample size, the subscripts $t$ and $c$ denote the treated and control groups, and $\bar{X}_t$, $\bar{X}_c$ are the vectors of covariate means in each group.
Alternatively, for stratified randomization by a distance measure, you can compute each unit's distance from the overall group mean.
$$ M_i \equiv (x_i - \bar{X})' \mathrm{cov}(X)^{-1} (x_i - \bar{X}) $$
The ACTG 175 trial was a single randomized trial; here are the observed standardized mean differences (difference in group means divided by the overall standard deviation) from that single randomization.
```r
df_grp_means <- df[, c('arms', var_nms)] %>%
  group_by(arms) %>%
  summarize_all(mean) %>%
  t()
colnames(df_grp_means) <- c('zido', 'combo')

df_grp_sd <- df[, var_nms] %>%
  summarize_all(sd) %>%
  t()

# Standardized mean differences: difference in group means divided by the
# overall standard deviation (named 'smd' to avoid masking base::t())
smd <- (df_grp_means[var_nms, 1] - df_grp_means[var_nms, 2]) / df_grp_sd
round(smd, 4)
```
These are the standardized mean differences in the selected covariates. With large treatment effects or sample sizes, small imbalances may not be significant; but when evaluating small effects, they can be an important source of confounding.
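As an illustration (a common rule of thumb, not something the package enforces), covariates with an absolute standardized mean difference above 0.1 are often flagged for closer review:

```r
# Flag covariates whose absolute standardized mean difference exceeds 0.1
# (0.1 is a conventional, arbitrary cutoff)
rownames(smd)[abs(smd) > 0.1]
```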
To compute the Mahalanobis formula manually, you must do some matrix algebra:
1) Only include the treatment indicator and the k covariates
2) Ensure all covariates are numeric
```r
# Keep the treatment indicator plus the balancing covariates, coercing
# every column to numeric
df_mdist <- df[, c('arms', var_nms)] %>%
  map_dfr(as.numeric)
```
$$ \frac{n_t n_c}{n} $$
This sample-size constant is easy to compute:
```r
ssc <- (nrow(df_mdist[df_mdist$arms == 0, ]) *
        nrow(df_mdist[df_mdist$arms == 1, ])) / nrow(df_mdist)
cat('Sample size constant: ', round(ssc, 3))
```
$$ \bar{X}_t - \bar{X}_c $$
```r
X_t <- colMeans(df_mdist[df_mdist$arms == 0, var_nms])
X_c <- colMeans(df_mdist[df_mdist$arms == 1, var_nms])
X_delta <- X_t - X_c
cat('Difference in covariate means by treatment: \n ', round(X_delta, 3))
```
```r
# Invert the covariance matrix to obtain cov(X)^-1
df_cov <- df_mdist %>%
  select(-arms) %>%
  cov() %>%
  solve()
round(df_cov, 1)
```
$$ M \equiv \frac{n_t n_c}{n} (\bar{X}_t - \bar{X}_c)' \mathrm{cov}(X)^{-1} (\bar{X}_t - \bar{X}_c) $$
```r
M_man <- ssc * as.vector(t(X_delta) %*% df_cov %*% X_delta)
cat('Manually computed M-distance: ', M_man)
```
The jumble package function `mdis_grps()` computes the same quantity:
```r
M_jumb <- mdis_grps(df_mdist[df_mdist$arms == 0, 2:length(df_mdist)],
                    df_mdist[df_mdist$arms == 1, 2:length(df_mdist)]) * ssc
cat('Jumble M-distance: ', M_jumb)
```
Base R also has a function, `mahalanobis()`, which can compute M-distance.
```r
# mahalanobis() takes the covariance matrix itself and inverts it
# internally, so df_cov is reassigned to cov(X) rather than its inverse
df_cov <- df_mdist %>%
  select(-arms) %>%
  cov()
M_base <- mahalanobis(X_t, X_c, cov = df_cov) * ssc
cat('Base R Mahalanobis: ', M_base)
```
There is also an Rfast function, implemented in C++:
```r
M_fast <- Rfast::mahala(X_t, X_c, sigma = df_cov) * ssc
cat('Rfast Mahalanobis: ', M_fast)
```
The manual calculation matches jumble, base R, and Rfast:

```r
# all.equal() compares only its first two arguments (a third is treated
# as the tolerance), so check each pair explicitly
all.equal(M_man, M_jumb)
all.equal(M_man, M_base)
all.equal(M_man, M_fast)
```
All functions provide nearly equal answers.
```r
library(microbenchmark)
microbenchmark(
  jumble = mdis_grps(df_mdist[df_mdist$arms == 0, var_nms],
                     df_mdist[df_mdist$arms == 1, var_nms]) * ssc,
  rbase = mahalanobis(X_t, X_c, cov = df_cov) * ssc,
  rfast = Rfast::mahala(X_t, X_c, sigma = df_cov) * ssc,
  times = 1000
)
```
Rfast blows away the competition!
Typically you would take the square root of this distance, but that is not necessary for our purposes.
When performing a stratified analysis, M-distance can be used to construct the strata and perform a permuted block randomization.
In this case, you evaluate each person or unit's distance from the mean of the whole sample, then perform a stratified randomization within groups of similar distance. Note that M-distance is an absolute measure: it does not pair units with similar covariates, but rather units whose joint covariate distributions are similarly far from the mean values. Two units at the same distance could therefore have covariates in the upper or the lower quartile.
The formula is now:
$$ M_i \equiv (x_i - \bar{X})' \mathrm{cov}(X)^{-1} (x_i - \bar{X}) $$
```r
# Center each covariate at its overall mean
X_delta <- apply(df_mdist, 2, function(x) x - mean(x))[, var_nms]

# Inverse covariance matrix
df_cov <- df_mdist %>%
  select(-arms) %>%
  cov() %>%
  solve()

int_1 <- X_delta %*% df_cov
int_2 <- t(X_delta)
# Note: int_1 %*% int_2 forms an n x n matrix just to take its diagonal,
# which is memory-hungry for large n
M_man <- sqrt(diag(int_1 %*% int_2))

tst <- df_mdist[, var_nms] %>% as.matrix()
M_base <- sqrt(mahalanobis(tst, colMeans(tst), cov = cov(tst)))
# Rfast::mahala() also returns squared distances, so take the square root
M_fast <- sqrt(Rfast::mahala(tst, colMeans(tst), sigma = cov(tst)))
```
Test that all methods give equal answers:
```r
# Compare pairwise; a third argument to all.equal() would be treated as
# the tolerance
all.equal(M_man, M_base)  # Nearly equal
all.equal(M_man, M_fast)
```
All functions provide nearly equal answers.
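To illustrate the stratified use described above, here is a minimal sketch of grouping units into quartiles of their distance and performing a permuted-block randomization within each stratum. The quartile cut points and block size of 4 are illustrative choices, and `assign_blocked()` is a hypothetical helper, not a jumble function.

```r
set.seed(123)
# Quartile strata of the per-unit M-distances computed above
dist_strata <- cut(M_base, quantile(M_base, probs = seq(0, 1, 0.25)),
                   include.lowest = TRUE, labels = FALSE)

# Hypothetical helper: permuted-block 1:1 assignment within one stratum
assign_blocked <- function(n, block = 4) {
  n_blocks <- ceiling(n / block)
  trt <- unlist(lapply(seq_len(n_blocks),
                       function(i) sample(rep(0:1, block / 2))))
  trt[seq_len(n)]  # trim the final partial block
}

arm <- ave(seq_along(M_base), dist_strata,
           FUN = function(idx) assign_blocked(length(idx)))
table(dist_strata, arm)  # allocation is balanced within each stratum
```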
Because a limitation of re-randomization is computation time, it is important to have a function which can compute M-distance quickly.
```r
library(microbenchmark)
microbenchmark(
  mine = mdis_chrt(tst),
  rfast = Rfast::mahala(tst, colMeans(tst), sigma = cov(tst)),
  rbase = mahalanobis(tst, colMeans(tst), cov = cov(tst)),
  times = 1000
)
```
My handwritten code is very inefficient; Rfast outperforms the base code as well.
In the re-randomization procedures, the Rfast computation is used for speed. Unit testing is in place to ensure the Mahalanobis calculations remain consistent across Rfast, base R, and the manual computations, supporting version control and reproducibility.
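For context, here is a minimal sketch of what such a re-randomization loop could look like. The function `rerandomize()`, its `threshold`, and `max_iter` are illustrative assumptions, not jumble's implementation; swapping the matrix product for `Rfast::mahala()` is the kind of speed-up the package uses.

```r
# Minimal sketch: draw candidate 1:1 allocations and accept the first
# whose group-balance M-distance falls below a chosen threshold
rerandomize <- function(X, threshold, max_iter = 1000) {
  n <- nrow(X)
  inv_cov <- solve(cov(X))  # invert once, reuse across candidate draws
  for (i in seq_len(max_iter)) {
    arm <- sample(rep(0:1, length.out = n))  # candidate allocation
    ssc <- sum(arm == 1) * sum(arm == 0) / n
    delta <- colMeans(X[arm == 1, , drop = FALSE]) -
      colMeans(X[arm == 0, , drop = FALSE])
    M <- ssc * drop(t(delta) %*% inv_cov %*% delta)
    if (M < threshold) return(list(arm = arm, M = M, draws = i))
  }
  stop('No acceptable allocation found within max_iter draws')
}

# Example call; the threshold of 5 is an arbitrary illustration
res <- rerandomize(as.matrix(df_mdist[, var_nms]), threshold = 5)
```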
The primary author of the package is Kevin W. McConeghy.