knitr::opts_chunk$set(echo = TRUE, fig.show='hold', out.width="0.5\\textwidth")
This document accompanies the paper "Rank conditional coverage and confidence intervals in high dimensional problems." by Jean Morrison and Noah Simon. Here we walk through the simulations in section 3.2 and reproduce the results shown in the paper.
We make use of the rcc
and rccSims
packages which accompanies the paper and can be found at github.com/jean997/rcc and github.com/jean997/rccSims.
Here we consider data from a clinical trial examining the effect of a a treatment on an outcome $Y$. For each participant we have also collected a biomarker $W$ and suspect that the treatment has a greater effect for patients with larger values of $W$. We are interested in establishing a cut-point in $W$ to guide enrolment of a future trial so we will estimate the treatment effect in subgroups defined by the biomarker.
The relationship between $Y$, the treatment, and the biomarker is given by $$ E[Y | trt, W] = \begin{cases} 0 \qquad & W < 0.5\ \frac{1}{2}\cdot trt & W \geq 0.5 \end{cases}. $$ For participant $i$, the observed value of $Y$ is $Y_i = E[Y | W_i, trt_i] + \epsilon$ where $\epsilon\sim N(0, 1/4)$.
In each simulation, we generate data for 400 participants --- 200 in each treatment arm. The vlaue biomarker is uniformly distributed between 0 and 1. Here we generate data for one simulation:
library(ggplot2) library(rcc) library(rccSims) set.seed(1e7) n <- 2*200 mean_outcome <- function(x, trt){ if(x < 0.5) return(0) return(0.5*trt) } dat <- data.frame("trt"=rep(c(0, 1), each=n/2), "w"=runif(n=n)) dat$Ey <- apply(dat, MARGIN=1, FUN=function(z){mean_outcome(z[2],z[1])}) dat$y <- rnorm(n=n, mean=dat$Ey, sd=0.5) dat$trt <- as.factor(dat$trt) ggplot(dat) + geom_line(aes(x=w, y=Ey, group=trt, col=trt)) + geom_point(aes(x=w, y=y, col=trt, group=trt)) + ylab("E[Y]") + theme_bw() + theme(panel.grid = element_blank())
We will consider 100 cutopoints between 0.1 and 0.9, $w_1, \dots, w_100$. For each cutpoint, $j \in 1, \dots, 100$, we estimate the difference in average treatment effect for individuals above and below the cutpoint: $$ \beta_j = (E[Y | trt=1, W \geq w_j] - E[Y | trt=0, W \geq w_j])- (E[Y | trt=1, W < w_j] - E[Y | trt=0, W < w_j])$$
We can estimate $\beta_j$ by fitting the parameters in the regression
$$ Y = \beta_0 + \alpha_1 trt + \alpha_2 1_{W > w_j} + \beta_j trt 1_{W > w_j} + \epsilon $$ by least squares. Here we calculate $\hat{\beta}_j$ for each cutpoint:
n.cutpoints <- 100 cutpoints <- seq(0.1, 0.9, length.out=n.cutpoints) #Indicator variables ix <- sapply(cutpoints, FUN=function(thresh){as.numeric(dat$w >= thresh)}) #Run the linear regressions stats <- apply(ix, MARGIN=2, FUN=function(ii){ f <- lm(y~trt*ii, data=dat) summary(f)$coefficients[4, 1:3] }) stats <- data.frame(matrix(unlist(stats), byrow=TRUE, ncol=3)) names(stats) <- c("beta", "se", "tstat") stats$cutpoint <- cutpoints j <- order(abs(stats$tstat), decreasing=TRUE) stats$rank <- match(1:nrow(stats), j) head(stats)
The true value of $\beta_j$ is $$0.5*\frac{min(1-w_j, 0.5)}{(1-w_j)} - 0.5\frac{max(0, w_j-0.5)}{w_j}$$
stats$truth <-sapply(cutpoints, FUN=function(wj){ 0.5*(min(1-wj, 0.5)/(1-wj)) - 0.5*max(0, wj-0.5)/wj})
Plotting effect size estimates and true parameter values vs. cutpoint:
library(tidyr) stats_long <- gather(stats[, c("cutpoint", "beta", "truth")], "type", "value", -cutpoint) ggplot(stats_long) + geom_point(aes(x=cutpoint, y=value, group=type, color=type)) + scale_color_discrete(labels=c("Estimate", "Truth")) + ggtitle("Effect Size Estimates") + theme_bw() + theme(panel.grid=element_blank(), axis.title.y=element_blank(), legend.title = element_blank())
Because we use the same individuals to estimate each parameter, the estimates are highly correlated.
The non-parametric bootstrap is the most appropriate choice for these data since the estimates are correlated. Only the non-parametric bootstrap can model this correlation. To use the nonpar_bs_ci
function in the rcc
package to compute the nonparametric boostrap confidence intervals, we must supply an analysis function. We will use a data object that has $trt$, $w$, $E[Y]$, and $Y$ as the first four columns and the indicator variables $1_{W > w_j}$ as the next 100 columns as input.
mydata <- cbind(dat, ix) analysis.func <- function(data){ y <- data[,4] trt <- data[,1] stats <- apply(data[, 5:(n.cutpoints + 4)], MARGIN=2, FUN=function(ii){ f <- lm(y~trt*ii) if(nrow(summary(f)$coefficients) < 4) return(rep(0, 3)) summary(f)$coefficients[4, 1:3] }) stats <- data.frame(matrix(unlist(stats), byrow=TRUE, ncol=3)) names(stats) <- c("estimate", "se", "statistic") return(stats) }
Here we calculate the non-parametrci bootstrap confidence intervals (Algorithm 3 in the paper):
library(parallel) library(rcc) ci.nonpar <- nonpar_bs_ci(data=mydata, analysis.func=analysis.func, n.rep=500, parallel=TRUE) head(ci.nonpar) ci.nonpar <- ci.nonpar[, c("ci.lower", "ci.upper")] sum(ci.nonpar[,1] <= stats$truth & stats$truth <= ci.nonpar[,2])/n.cutpoints
Here are the intervals plotted versus rank. Colored points indicate the true value of the parameter.
library(rccSims) plot_cis(stats$rank, ci.nonpar, stats$truth, plot.truth=TRUE) + ggtitle("Non-Parametric Bootstrap Confidence Intervals")
Here they are plotted versus cutpoint:
plot_cis(cutpoints, ci.nonpar, stats$truth, plot.truth=TRUE) + ggtitle("Non-Parametric Bootstrap Confidence Intervals") + xlab("Cutpoint")
We could instead use the parametric bootstrap (Algorithm 2 and Supplemental Algorithm 2):
ci.par <- rcc::par_bs_ci(beta=stats$beta, se=stats$se, n=500, use.abs=TRUE)[, c("ci.lower", "ci.upper")]
Here are the intervals plotted versus rank. Colored points indicate the true value of the parameter.
plot_cis(stats$rank, ci.par, stats$truth, plot.truth=TRUE) + ggtitle("Parametric Bootstrap Confidence Intervals")
Here they are plotted versus cut-point:
plot_cis(cutpoints, ci.par, stats$truth, plot.truth=TRUE) + ggtitle("Parametric Bootstrap Confidence Intervals") + xlab("Cutpoint")
The parametric bootstrap confidence intervals perform more poorly than the non-parametric confidence intervals because estimates are highly correlated but this isn't modeled in the basic parametric bootstrapping algorithm. These intervals tend to undercover parameters associated with the most and least significant estimates.
For comparison, let's look at the naive confidence intervals:
ci.naive <- cbind(stats$beta-stats$se*qnorm(0.95), stats$beta + stats$se*qnorm(0.95)) mean(ci.naive[,1] <= stats$truth & stats$truth <= ci.naive[,2])
plot_cis(stats$rank, ci.naive, stats$truth, plot.truth=TRUE) + ggtitle("Standard Marginal Confidence Intervals") plot_cis(cutpoints, ci.naive, stats$truth, plot.truth=TRUE) + ggtitle("Standard Marginal Confidence Intervals") + xlab("Cutpoint")
These do pretty well since the estimates are so correlated.
Here are the selection adjusted intervals of the intervals of @Weinstein2013 after selecting the top 10 (10\%) parameters. o make running simulations easier, we have included the code distributed by @Weinstein2013 in the rccSims
package. The Shortest.CI
function used below is part of this code.
#We need to give this method the "cutpoint" or minimum value of the test statistic ct <- abs(stats$tstat[stats$rank==11]) wfb <- lapply(stats$tstat, FUN=function(x){ if(abs(x) < ct) return(c(NA, NA)) ci <- try(rccSims:::Shortest.CI(x, ct=ct, alpha=0.1), silent=TRUE) if(class(ci) == "try-error") return(c(NA, NA)) #Sometimes WFB code produces errors return(ci) }) ci.wfb <- matrix(unlist(wfb), byrow=TRUE, nrow=n.cutpoints) ci.wfb[,1]<- ci.wfb[,1]*stats$se ci.wfb[,2]<- ci.wfb[,2]*stats$se mean(ci.wfb[,1] <= stats$truth & ci.wfb[,2]>= stats$truth, na.rm=TRUE)
Here are the 10 WFB intervals which all cover their respective parameters
plot_cis(stats$rank, ci.wfb, stats$truth, plot.truth=TRUE) + ggtitle("Weinstein, Fithian, and Benjamini Intervals")
Here are the selection adjusted confidence intervals of @Reid2014. These are implemented in the selectiveInference
R package.
library(selectiveInference) M <- manyMeans(y=stats$tstat, k=0.1*n.cutpoints, alpha=0.1, sigma=1) ci.rtt <- matrix(nrow=n.cutpoints, ncol=2) ci.rtt[M$selected.set, ] <- M$ci mean(ci.rtt[,1] <= stats$truth & ci.rtt[,2]>= stats$truth, na.rm=TRUE)
rccSims::plot_cis(stats$rank, ci.rtt, stats$truth, plot.truth=TRUE, prop=0.2) + ggtitle("Reid, Taylor, and Tibshirani Intervals")
This method gives the 10th ranked parameter an very short confidence interval!
Here are the credible intervals we get out of ashr (@stephens2016)
library(ashr) ash.res <- ash(betahat = stats$beta, sebetahat = stats$se, mixcompdist = "normal") ci.ash <- ashci(ash.res, level=0.9, betaindex = 1:n.cutpoints, trace=FALSE) mean(ci.ash[,1]<= stats$truth & ci.ash[,2] >= stats$truth)
Here are the ashr intervals at all ranks
rccSims::plot_cis(stats$rank, ci.ash, stats$truth) + ggtitle("ashr Credible Intervals (Stephens 2016)")
The steps above are all implemented by the biomarker_sim
function in the rccSims
package. This code will generate the results in package:
library(parallel) biomarker_results <- biomarker_sim(n=400, n.rep = 400, n.cutpoints = 100, seed =1e7)
These results are also included as a built-in data set in the rccSims
package.
This generates the plots shown in the paper:
data("biomarker_results", package="rccSims") coverageplot <- rccSims::plot_coverage(biomarker_results, proportion=1, cols=c("black", "deeppink3", "red", "gold4", "forestgreen", "purple"), simnames=c("naive", "par", "nonpar", "ash", "wfb", "selInf1"), ltys= c(2, 1, 3, 6, 4, 2), span=0.5, main="Rank Conditional Coverage", y.range=c(-0.02, 1.02), legend.position = "none") + theme(plot.title=element_text(hjust=0.5)) widthplot <- rccSims::plot_width(biomarker_results, proportion=1, cols=c("black", "deeppink3", "red", "gold4", "forestgreen", "purple"), simnames=c("naive", "par", "nonpar", "ash", "wfb", "selInf1"), ltys= c(2, 1, 3, 6, 4, 2), span=0.5, main="Interval Width", legend.position = "none") + theme(plot.title=element_text(hjust=0.5)) legend <- rccSims::make_sim_legend(legend.names = c("Marginal", "Parametric\nBootstrap", "Non-Parametric\nBootstrap", "ashr", "WFB", "RTT"), cols=c("black", "deeppink3", "red", "gold4", "forestgreen", "purple"), ltys= c(2, 1, 3, 6, 4, 2)) coverageplot widthplot legend$plot
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.