In li-xinran/RIQITE: Randomization Inference for Quantiles of Individual Treatment Effects

bibliography: ["permutation.bib"] biblio-style: "apalike" link-citations: true

knitr::opts_chunk$set(echo = TRUE, cache.lazy = FALSE)
setwd("~/Dropbox/Own/Projects/Quantile Effects/code2021Aug/R illustration")
library(dplyr)
library(ggplot2)
library( RIQITE )
library( tidyverse )
set.seed(1)
nperm = 1000

Install the RIQITE package

We first install the package from Github. The github page of the package (https://github.com/li-xinran/RIQITE) contains main functions with detailed explanation in R documentation, as well as a simple illustrating example. Below we provide the code used to analyze the data sets in the paper.

Install via devtools, and then load:

devtools::install_github("li-xinran/RIQITE")
library(RIQITE)

Testing monotonicity of an instrumental variable

We get the data from @BlackEtAl11a, in which Table 1 cross-tabulates month of birth and school entry age (early/on-time/late) in terms of proportions and Table 3 reports the total number of subjects born in January or December (the "discontinuity subsample"): 104,023. We round the total number of subjects to 104,000, and, similar to @FioriniStevens14a, we assume there are equal numbers of subjects born in January and December. Based on these summary statistics, we generate the individual-level data as follows:

x <- matrix(0, 2, 3, dimnames=list(Month=c("Dec", "Jan"), 
                                   Entry=c("Early", "On-Time", "Late")))
n <- 104000
x["Dec", "On-Time"] <- .85
x["Dec", "Late"] <- .15
x["Jan", "Early"] <- .10
x["Jan", "On-Time"] <- .90
df <- data.frame(BirthMonth = rep(c("Dec", "Jan"), each=n/2), 
                 EntryTiming = c(rep("On-Time", n/2 * .85), rep("Late", n/2 * .15), 
                                 rep("Early", n/2 * .10), rep("On-Time", n/2 * .90)))
df <- df %>% mutate(EntryAge = NA,
        EntryAge = ifelse(BirthMonth == "Dec" & EntryTiming == "On-Time", 6.75, EntryAge),
        EntryAge = ifelse(BirthMonth == "Dec" & EntryTiming == "Late", 7.75, EntryAge),
        EntryAge = ifelse(BirthMonth == "Jan" & EntryTiming == "On-Time", 23/3, EntryAge),
        EntryAge = ifelse(BirthMonth == "Jan" & EntryTiming == "Early", 20/3, EntryAge))
data <- df[sample(c(1:n)), ] # randomly permute the ordering of units
table(data$BirthMonth, data$EntryTiming)

The following figure shows the empirical distribution functions of the school-entry age of December-born and January-born subjects. From the figure, most children started school at an older age if they were born in January rather than December.

ggplot(data, aes(EntryAge, color = BirthMonth, linetype=BirthMonth)) +
  stat_ecdf(geom="step", show.legend=FALSE, size=1.3) +
  annotate("text", x=c(7.2, 7.2), y=c(.145, .81), 
           label=c("January", "December")) +
  scale_x_continuous(breaks=seq(6, 8, .2)) + 
  theme_bw() + 
  theme( plot.background = element_blank(), 
         panel.grid.major = element_blank(), 
         panel.grid.minor = element_blank(),
         plot.title=element_text(hjust=.5 ) ) + 
  labs(x="School-Entry Age (Years)", y="Probability")

We first perform the standard regression analysis assessing the relation between the instrument (birthday) and the school-entry age. The first-stage relationship is incredibly strong, with an average effect of $-0.667$ years and an F statistic of 106,256. This would conventionally be considered persuasive evidence of a valid instrument.

lm.fit <- lm(EntryAge ~ (BirthMonth=="Dec"), data)
cat(paste( 
  "esimated average effect is", round( as.numeric(lm.fit$coefficients[2]), digits = 3), 
  "and F statistic is", round( as.numeric( summary(lm.fit)$fstatistic[1] ), digits = 0), 
  "\n" 
  ))

We then perform the randomization test for the bounded null $H_{\le 0}$ on the monotonicity of the instrument. Specifically, we consider three randomization tests using the difference-in-means statistic, Wilcoxon rank sum statistic and Stephenson rank sum statistic with the parameter $s=10$, respectively. We also conduct the classical Student's $t$-test. These produce Table 1 in the paper.

data$Z = as.numeric( data$BirthMonth == "Dec" )
data$Y = data$EntryAge
n <- length(data$Z)
m <- sum(data$Z)
nperm <- 10^4 # number of permutations for Monte Carlo approximation

# Generate a sample of permutations
Z.perm <- assign_CRE(n, m, nperm)

# difference-in-means
pval.dim <- pval_bound(Z=data$Z, Y=data$Y, c=0, method.list=list(name = "DIM"), 
                     alternative = "greater", Z.perm=Z.perm, impute="control")
ci.dim <- ci_bound(Z=data$Z, Y=data$Y, alternative="greater", method.list=list(name="DIM"), 
                        Z.perm=Z.perm, impute="control", alpha=0.10)

# Wilcoxon rank sum
pval.wilc <- pval_quantile(Z=data$Z, Y=data$Y, k=n, c=0, alternative = "greater",
                           method.list=list(name="Stephenson", s=2), Z.perm=Z.perm )
ci.wilc <- ci_quantile(Z=data$Z, Y=data$Y, k.vec = n, alternative = "greater", 
                method.list=list(name="Stephenson", s=2), Z.perm=Z.perm,  alpha=0.10)

# Stephenson rank sum
pval.steph10 <- pval_quantile(Z=data$Z, Y=data$Y, k=n, c=0, alternative = "greater", 
                              method.list=list(name="Stephenson", s=10), Z.perm=Z.perm )
ci.steph10 <- ci_quantile(Z=data$Z, Y=data$Y, k.vec = n, alternative = "greater", 
              method.list=list(name="Stephenson", s=10), Z.perm=Z.perm,  alpha=0.10)

# classical Student's t-test
ttest = t.test(data$Y[data$Z==1], data$Y[data$Z==0], alternative = "greater")
pval.ttest = ttest$p.value
ci.ttest = round( ttest$conf.int[1], digits = 3)

# result for Table 1 in the paper
result = data.frame(test=c("diff-in-means", "Wilcoxon", "Stephenson 10", "t test"),
                    pval=c(pval.dim, pval.wilc, pval.steph10, pval.ttest), 
                    lower = c(ci.dim, ci.wilc$lower, ci.steph10$lower, ci.ttest))
knitr::kable( result )

Finally, we infer the effect range. The result shows that the range of the effect of birth month on school-entry age is at least 1 year, indicating significant individual effect heterogeneity.

tau.max.lower = ci_quantile(Z=data$Z, Y=data$Y, k.vec = n, alternative = "greater", 
              method.list=list(name="Stephenson", s=10), Z.perm=Z.perm,  alpha=0.05)$lower
tau.min.upper = ci_quantile(Z=data$Z, Y=data$Y, k.vec = 1, alternative = "less", 
              method.list=list(name="Stephenson", s=10), Z.perm=Z.perm,  alpha=0.05)$upper
print(paste( "lower confidence limit for effect range is", 
             max(0, tau.max.lower - tau.min.upper) ))
save.image("BirthSchoolAge20221028.RData") # save the analysis result

Evaluating the effectiveness of professional development

We next evaluate a teacher professional development RCT where treated teachers were given a professional development course on electric circuits. We get the data from @HSMHSD10. The cleaned data, as described in the paper, is part of the package. We load it and estimate the Average Treatment Effect (ATE) via linear regression:

data( electric_teachers )

mod <- lm(gain~TxAny, data = electric_teachers)
est_ATE <- round(mod$coefficients[2], digits = 2)
est_ATE

The following figure shows the histograms of the observed gain scores in the treatment and control groups.

hist(electric_teachers$gain[electric_teachers$TxAny==1], 
     xlim = range(electric_teachers$gain), freq = FALSE, ylim = c(0, 0.065), 
     col = "grey", border=FALSE, breaks = 20,
     main = NULL, xlab = "gain score")
hist(electric_teachers$gain[electric_teachers$TxAny==0], 
     freq = FALSE, add = TRUE, breaks = 20, col = NULL)

We can compare the variation of the units in the treatment and control arms, which can be a clue to how much impact heterogeneity there might be. We do this in effect size units, by first standardizing gain by the control-side variation in outcome at baseline:

sd_0 = sd( electric_teachers$Per.1[ electric_teachers$TxAny == 0 ] )
electric_teachers$std_gain = electric_teachers$gain / sd_0
electric_teachers %>% group_by( TxAny ) %>%
    summarise( raw_mean = mean( gain ),
               mean = mean( std_gain ),
               var = var( std_gain ),
               max = max( std_gain ),
               min = min( std_gain ),
               raw_sd = sd( gain ),
               sd = sd( std_gain ) )

If there were no correlation between $Y_i(0)$ and $\tau_i$, the standard deviation of the treatment impacts would be, using $var(Y_i(1)) = var(Y_i(0)) + var(\tau_i)$,

tx_het <- sqrt( 0.852^2 - 0.806^2 )
tx_het

I.e., around $0.28\sigma$, in effect size units. This would be $0.28 \times sd_0$, or r round( tx_het * sd_0,digits=1) in the original scale.

An estimate of the minimum treatment effect possible is the largest control outcome minus the largest treated outcome, which is about $3.16 - -0.63 = r 3.16 + 0.63$. The maximum, by contrast, is $5.06 - -2.22 = r 5.06 + 2.22. These are extremely conservative bounds, in that such magnitudes are unlikely to exist in practice.

Determining an s-value for the test statistic

We plan on using a Stephenson's Rank Sum test, but need to determine an appropriate value of $s$. We first conduct a sensitivity analysis across a range of scenarios to determine which $s$ have the highest amount of power for the top 10 quantiles, the next 20 quantiles, and the next 40 quantiles.

To illustrate this process, we first run a check on what the best $s$ is for a single scenario. We use the empirical distribution of the control outcomes as our distribution, and calculate a treatment effect (all in standardized units). Our first scenario has impacts be normally distributed and not correlated with $Y_i(0)$, in line with our explorations above.

set.seed(303020)
cdat = filter( electric_teachers, TxAny == 0 )
ate = est_ATE / sd_0
R = 1000
nperm = 10^3
s_list = c( 2, 3, 4, 5, 6, 8, 10, 15, 20 )
res <- explore_stephenson_s( s = s_list,
                      n = nrow( electric_teachers ),
                      Y0_distribution = cdat$std_gain,
                      tx_function = "rnorm", ATE = ate, 
                      tx_scale = tx_het, rho = 0,
                      nperm = nperm,
                      R = R, calc_ICC = FALSE,
                      targeted_power = FALSE, k.vec = (233-99):233 )

We get one row per targeted quantile and explored $s$, telling us how easy it was to detect effects at that quantile. Here we see how our performance was on the 170th largest effect (of 233 effects).

filter( res, k == 170 )

In our table, we see that $s$ of 6 or 8 gave the highest bound on the impact for this quantile (see the q_ci column). The power columns shows the power to reject no treatment effect for the 170th largest effect. Power is very high for $s = 8$, corresponding to the high median confidence bound. The $n$ column is for the entire set of quantiles, and shows the average number of units found as significant: For $s = 8$ we are finding around 63 units as significant across the simulations.

We next make a grid of different scenarios to explore to see if our trends of best $s$ are consistent across different possible scenarios:

checks = expand_grid( TxVar = c( 0.25, 0.75 ),
                      tx_dist = c( "rexp", "rnorm" ),
                      rho = c( -0.5, 0, 0.5 ) )

# There is no treatment variation, nor correlation with Y0, when
# the tx impact is a constant.
checks = bind_rows( checks,
                    tibble( TxVar = 0, tx_dist = "constant", rho = 0 ) )

We have r nrow(checks) scenarios. We use pmap() to run explore_stephenson_s across them. The explore_stephenson_s() function will repeatedly generate datasets and analyze them, and then return a summary of the results. It will do this using parallel processing, if we set that flag and if our computer has multiple cores. We specify saving the results for the top 100 quantiles:

if ( !file.exists( here::here( "demo_code/heller_s_check_results.rds" ) ) ) {
    checks$data = pmap( checks, function( TxVar, tx_dist, rho ) {
        cat( glue::glue("Running {TxVar} {tx_dist} rho={rho}\n" ) )
        explore_stephenson_s( s = s_list,
                              n = nrow( electric_teachers ),
                              Y0_distribution = cdat$std_gain,
                              tx_function = tx_dist,
                              nperm = nperm,
                              ATE = ate, tx_scale = TxVar, rho = rho,
                              R = R, calc_ICC = FALSE,
                              parallel = TRUE, n_workers = 5,
                              verbose = TRUE,
                              targeted_power = FALSE, k.vec = (233-99):233 )
    } )
    s_selector = unnest( checks, cols = "data" )

    saveRDS( s_selector, here::here( "demo_code/heller_s_check_results.rds" ) )
} else {
    s_selector = readRDS( here::here( "demo_code/heller_s_check_results.rds" ) )
}

checks$data = pmap( checks, function( TxVar, tx_dist, rho ) {
    cat( glue::glue("Running {TxVar} {tx_dist} rho={rho}\n" ) )
    explore_stephenson_s( s = s_list,
                          n = nrow( dat ),
                          Y0_distribution = cdat$std_gain,
                          tx_function = tx_dist,
                          ATE = ate, tx_scale = TxVar, rho = rho,
                          nperm = nperm, R = R, 
                          calc_ICC = FALSE, 
                          parallel = TRUE, n_workers = 5,
                          targeted_power = FALSE, k.vec = (233-99):233 )
} )
checks
s_selector = unnest( checks, cols = "data" )

We next aggregate our findings and see what $s$ provides the tightest confidence intervals across a range of quantile groups: the top 10, next 20, next 30, and next 40. For each group, we take the median of the median lower confidence bound, to get a sense of how informative our confidence intervals in each group will tend to be. We plot this vs. $s$. High points on the curve indicate those $s$ that are more informative:

n_units = nrow(electric_teachers)
s_selector = mutate( s_selector,
                     group = cut( k, 
                                  breaks = c(0, 233-60, 233-30, 233-10, 233),
                                  labels = c( "v low", "low", "med", "high" ) ) )

s_selector$rho.f = factor( s_selector$rho,
                          levels = c( -0.5, 0, 0.5 ),
                          labels = c( "rho = -0.5", "rho = 0", "rho = 0.5" ) )

avg = s_selector %>% 
    group_by( s, group, rho, rho.f ) %>%
    summarise( q_ci = median( q_ci ),
               n = n(), .groups="drop" ) %>%
    filter( group != "v low" )
avg

s_agg = s_selector %>% 
    group_by( s, group, rho, rho.f, tx_dist ) %>%
    summarise( q_ci = median( q_ci ) )

s_agg %>%
    filter( group != "v low" ) %>%
    ggplot( aes( s, q_ci ) ) +
    facet_grid( rho.f ~ group ) +
    geom_hline( yintercept = 0 ) +
    #geom_smooth( aes( col=tx_dist ), method = "loess", se = FALSE, 
    #             span = 1, lwd= 0.5 ) +
    geom_line( aes( col=tx_dist ), lwd=0.5 ) +
    labs( x = "s", y = "Median lower CI bound", col = "Tx distribution:" ) +
    scale_x_log10( breaks = unique( s_selector$s ) ) +
    theme_minimal() +
    theme( panel.grid.minor = element_blank() ) + 
    geom_point( data = avg, col="black", size=2 ) +
    coord_cartesian( ylim = c( -1.5, 2 ) ) +
    theme( #legend.position="bottom", 
            #                    legend.direction="horizontal", 
        legend.key.width=unit(1,"cm"),
           panel.border = element_rect(colour = "grey", fill=NA, size=1) )

if ( FALSE ) {
    s_selector %>% dplyr::filter( s==2, tx_dist=="constant", TxVar == 0 ) %>%
        pull( group ) %>% table()
}

# Save plot for main paper.
ggsave( filename = "demo_code/s_selector.pdf", width = 8, height = 4 )

We can also just tally up the winning $s$ across each of the 100 quantiles we are exploring:

qis <- s_selector %>% 
    group_by( TxVar, tx_dist, rho, group, k ) %>%
    filter( q_ci == max(q_ci) )
table( group = qis$group, s = qis$s, rho = qis$rho )

Each cell in this table records the number of times a particular $s$ value had the highest lower confidence bound for a given scenario and given quantile, divided by the groups of quantiles. We see that for strong confidence intervals on higher quantiles, lower $s$ is good. To obtain more informative findings on the quantiles lower down, higher $s$ is good.

Finally, we can look at the number of units found to be significant across the different scenarios and see which $s$ values identify the most units as significant, on average:

n_score <- s_selector %>% 
    group_by( rho, rho.f, tx_dist, s ) %>%
    summarize( n = mean( n ), .groups = "drop")
ggplot( n_score, aes( s, n, col=tx_dist ) ) +
    facet_wrap( ~ rho.f, nrow=1 ) +
    geom_point( size=3) + geom_line() +
    labs( col = "Tx distribution:") +
    scale_x_log10( breaks = unique( s_selector$s ) ) +
    theme_minimal() +
    theme( panel.grid.minor = element_blank() ) + 
    theme( legend.key.width=unit(1,"cm"),
           panel.border = element_rect(colour = "grey", fill=NA, size=1) )

# Save plot for main paper.
ggsave( filename = "demo_code/s_selector_n.pdf", width = 8, height = 4 )

Looking across our results, a wide range of $s$ could be argued for. There is a cost for small $s$ of 2 through 4 for the lower quantiles. The very high $s$ of 10 or above have costs as well, across the simulations: note the down-turn of the power curves in the plot above. Given the apparent shift of the treatment vs. control distributions in our data, either treatment variation is not overly large, or there is a negative correlation between $Y_i(0)$ and $\tau_i$ (otherwise the variance of the treated group would be substantially larger than the control group, and they are instead somewhat similar), so we should attend more closely to those simulations that are aligned with these empirical trends. For optimizing the number of units tagged as significant, we typically see $s=6$ through $10$ as performing well across the scenarios. Overall we might adjust our $s$ down a bit to try and optimize multiple goals at once, using, e.g., $s = 6$.

Analyzing the teacher professional development data

With our selected $s=6$ in hand, we first test the bounded null hypothesis that all individual effects are non-positive.

selected_s = 6
nperm = 10^6
n_units = nrow( electric_teachers )
pval.steph <- pval_quantile(Z=electric_teachers$TxAny, Y=electric_teachers$gain, 
                              k=n_units, c=0, alternative = "greater", 
                              method.list=list(name="Stephenson", s=selected_s), nperm=nperm )
pval.steph # final p-value

We then conduct inference for all quantiles of individual treatment effects. The following figures show the 90% lower confidence limits for all quantiles of individual effects using Stephenson rank statistics with s = r selected_s and s = 2 (i.e., Wilcoxon rank), respectively.

ci6 = ci_quantile( Z=electric_teachers$TxAny, Y=electric_teachers$gain, alternative="greater", 
                  method.list=list( name="Stephenson", s=selected_s ), nperm=nperm, alpha=0.10 )

We can summarize our results:

summary( ci6, c = c(-0.0001, 0, 6), k = c( 140, 141, 145, 146, 164, 165, 233 ) )

Comparing to Wilcoxon:

ci2 = ci_quantile( Z=electric_teachers$TxAny, Y=electric_teachers$gain, alternative="greater", 
                  method.list=list( name="Stephenson", s=2 ), nperm=nperm, alpha=0.10 )
summary( ci2, c = c( 0, 6 ) )

We can also visualize the bounds:

par(mfrow = c(1, 2))
s_title = glue::glue( "s={selected_s}" )
plot_quantile_CIs(ci6, k_start = 102, main = s_title )
plot_quantile_CIs(ci2, k_start = 102, main = "s=2" )

pdf(file = "Gain_score_6.pdf" )
pdf(file = "Gain_score_2.pdf" )
dev.off()

Finally, we infer the effect range by conducting two tests, one for greater and one for less. We then can say that there are effects at least as small as some value and at least as large as another.

tau.max.lower = ci_quantile( Z=electric_teachers$TxAny, Y=electric_teachers$gain, k.vec=n_units, 
                             alternative="greater", method.list=list( name="Stephenson", s=selected_s ), 
                             nperm=nperm, alpha=0.05 )$lower
tau.min.upper = ci_quantile( Z=electric_teachers$TxAny, Y=electric_teachers$gain, k.vec=1, 
                             alternative="less", method.list=list( name="Stephenson", s=selected_s ), 
                             nperm=nperm, alpha=0.05 )$upper
max(0, tau.max.lower - tau.min.upper)

In this case, we have no evidence of effect heterogeniety (all the effects could be the same).

save.image("Develop20221028.RData") # save the analysis result

Evaluating the effects of six-month nutrition therapy

We get the data for the homefood study from https://doi.org/10.7910/DVN/38X3LX. We download the SPSS Binary (Original File Format) there. The details of the study can be found at https://clinicaltrials.gov/ct2/show/NCT03995303.

dat = foreign::read.spss(here::here( "demo_code/data_store/Data file, 24.09.2021.sav" ),
                         to.data.frame=TRUE)

We focus on the effect of the treatment on the increase of lean body mass (kg) measured before and after the trial. We exclude two units with missing outcomes, resulting in 52 treated units and 52 control units.

dat$change = dat$lean_mass_kg_1 - dat$lean_mass_kg_0
paste("There are", sum(is.na(dat$change)), "units with missing outcomes.")
dat = dat[ !is.na(dat$change), ] # delete units with missing outcomes
n = nrow(dat)
dat = dat[sample(n), ] # randomly permute the order of units
table(dat$group)

The following figure shows the histograms of the lean body mass changes in treated and control groups, respectively.

hist(dat$change[dat$group=="intervention"], xlim=range(dat$change), freq=FALSE, 
     ylim=c(0,0.20), col="grey", border=FALSE, breaks = 10, main=NULL, xlab="gain score")
hist(dat$change[dat$group=="control"], freq=FALSE, add=TRUE, breaks=10, col=NULL)

We infer the lower confidence limits for all quantiles of individual treatment effects. The following two figures show the 90% lower confidence limits for all quantiles of individual effects using Stephenson rank statistics with s = 6 and s = 2 (i.e., Wilcoxon rank), respectively

nperm = 10^6
ci6 = ci_quantile( Z=as.numeric(dat$group == "intervention"), Y=dat$change, 
                   alternative="greater", method.list=list(name="Stephenson", s=6), 
                   nperm=nperm, alpha=0.10 )
ci2 = ci_quantile( Z=as.numeric(dat$group == "intervention"), Y=dat$change, 
                   alternative="greater", method.list=list(name="Stephenson", s=2), 
                   nperm=nperm, alpha=0.10 )

par(mfrow = c(1, 2))
plot_quantile_CIs(ci6, main="s=6", k_start = n-37)
plot_quantile_CIs(ci2, main="s=2", k_start = n-37)

save.image("Homefood20221028.RData") # save the analysis result

li-xinran/RIQITE documentation built on July 1, 2023, 6:58 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

li-xinran/RIQITE
Randomization Inference for Quantiles of Individual Treatment Effects

In li-xinran/RIQITE: Randomization Inference for Quantiles of Individual Treatment Effects

Install the RIQITE package

Testing monotonicity of an instrumental variable

Evaluating the effectiveness of professional development

Determining an s-value for the test statistic

Analyzing the teacher professional development data

Evaluating the effects of six-month nutrition therapy

R Package Documentation

Browse R Packages

We want your feedback!

li-xinran/RIQITE Randomization Inference for Quantiles of Individual Treatment Effects

In li-xinran/RIQITE: Randomization Inference for Quantiles of Individual Treatment Effects

Install the RIQITE package

Testing monotonicity of an instrumental variable

Evaluating the effectiveness of professional development

Determining an s-value for the test statistic

Analyzing the teacher professional development data

Evaluating the effects of six-month nutrition therapy

R Package Documentation

Browse R Packages

We want your feedback!

li-xinran/RIQITE
Randomization Inference for Quantiles of Individual Treatment Effects