knitr::opts_chunk$set(echo=FALSE,warning=FALSE,results='hide',fig.width=7,fig.height=7,message=FALSE)
library(magrittr)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(pander)
library(parallel)
library(bt88.03.704)
##devtools::install_github('floatofmath/adaperm')
##devtools::install_github('floatofmath/bt88.03.704')
##devtools::install('~/repos/adaperm')
library(adaperm)
options(mc.cores=detectCores()-1)
library(Rcpp)
sourceCpp("../src/combinations.cpp")
library(microbenchmark)
## about 90 fold time improvement
microbenchmark(all_combinations_cpp(16,8),gtools::combinations(16,8))
microbenchmark(all_reassignments_cpp(16,8),gtools::combinations(16,8))
null <- all_combinations(24,12)
## stress test
replicate(1000,{null <- all_combinations(20,10);NULL})

for(n in 1:24){
    for(k in 1:n){
        if(!all(t(all_combinations(n,k)) == gtools::combinations(n,k))){
            stop(paste('error at n =',n,'k =',k))
        }
    }
}

for(n in 1:24){
    for(k in 1:n){
        if(!all(all_reassignments_cpp(n,k) == adaperm:::all_reassignments(n,k))){
            stop(paste('error at n =',n,'k =',k))
        }
    }
}

To evaluate the operating characteristics of the approach we performed a simulation study that compares the conditional error based permutation tests with previously suggested test procedures for adaptive designs, across trials with different sample sizes and observations from different distributions.

Specifically, we compare the performance of conditional error based permutation tests using either the difference in group means or the sum of ranks in the treatment group with inverse-normal combination tests based on stage-wise $t$-tests or Wilcoxon rank-sum tests. As all scenarios use balanced sample sizes, the preplanned test underlying the conditional error based approach using the difference of means is equivalent to a permutation test based on the $t$-statistic, and the one using the rank sum is equivalent to a Wilcoxon rank-sum test.
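For reference, with equal weights (the balanced two-stage case considered here) the inverse-normal combination of the stage-wise one-sided $p$-values $p_1$ and $p_2$ rejects at level $\alpha$ if
$$ 1-\Phi\left(\tfrac{1}{\sqrt{2}}\left[\Phi^{-1}(1-p_1)+\Phi^{-1}(1-p_2)\right]\right) \leq \alpha, $$
where $\Phi$ denotes the standard normal distribution function.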

For the conditional error based approach we also compare different versions of the conditional error rate and of the second-stage test. We compare a procedure using the conditional error of the conservative preplanned permutation test (see \ref{}) with one using the conditional error of the exact randomized permutation test (see \ref{}). Finally, we look at a procedure that, in addition to using the conditional error of the randomized preplanned permutation test, also applies a second-stage permutation test using the mid-$p$ value instead of the conservative permutation $p$-value.

The conditional error based approaches are all based on stage-wise stratified permutations (\ie the per group sample sizes are fixed and balanced within each stage). Permutations that are not stratified by stage do not improve the performance of the procedure (data not shown). This is not surprising, as we simulate data under a block-randomized design that blocks by stage.
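To make the stratification explicit, a minimal sketch of how one stage-wise stratified reassignment could be drawn is given below; stratified_reassignment is a hypothetical helper for illustration only and not part of adaperm.

## one stage-wise stratified reassignment: treatment labels are permuted
## separately within each stage, so per group sample sizes stay fixed
stratified_reassignment <- function(n1,n2){
    c(sample(rep(0:1,each=n1)),sample(rep(0:1,each=n2)))
}
stratified_reassignment(5,5)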

Distributions

In terms of data generating processes we consider the following reference distributions. We compare the performance of test procedures with observations sampled from the

  1. normal distribution
  2. log-gamma distribution
  3. t-distribution with 4 degrees of freedom
  4. contaminated normal distribution, where 10\% of the observations are sampled from a normal distribution with a three-times higher standard deviation than the remaining observations (see the sketch after this list).
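As an illustration of item 4, a minimal sketch of the contamination mechanism is given below; rcont_sketch is only illustrative and not necessarily the exact implementation of rnorm_cont used elsewhere in this document.

## each observation comes, with probability cprop, from a component with
## three times the standard deviation of the remaining observations
rcont_sketch <- function(n,mean=0,sd=1,cprop=.1){
    contaminated <- runif(n) < cprop
    rnorm(n,mean=mean,sd=ifelse(contaminated,3*sd,sd))
}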

The normal distribution is considered the baseline scenario; here we expect the $t$-test based procedures to outperform procedures based on rank-sum statistics. The log-gamma distribution is left-skewed, while the t-distribution and the contaminated normal distribution have heavy tails. See Figure \ref{fig:distributions} for the densities of the distributions under the null and alternative hypotheses.

For all distributions we study scenarios under the null hypothesis as well as under the alternative. Specifically, we investigate a small, a medium and a large sample size scenario, with preplanned per group sample sizes of 10, 30 and 100 observations. For every combination of distribution and sample size we calibrated the parameters such that the inverse-normal combination $t$-test has a power of 60% under the alternative.

\begin{figure} \centering \includegraphics{distributions.eps} \caption{Densities of the distributions from which observations are sampled. Under the null hypothesis samples are drawn from the distributions whose densities are shown by solid lines; under the alternative, samples in the treatment group are drawn from the distributions whose densities are shown by dashed lines.} \label{fig:distributions} \end{figure}

In all cases we apply a sample size reassessment rule based on conditional power, which reestimates the sample size using an unblinded estimate of the variance in order to achieve a target power of 80\% (see Section \ref{example}).
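As a rough sketch of such a rule (this is only the standard normal-approximation sample size formula on which rules like cond_power_rule_norm_ts are modelled, not the exact implementation):

## reassess the per group sample size from first-stage data using an
## unblinded pooled variance estimate and a normal approximation
reassess_n <- function(x1,y1,delta=1,target=.8,alpha=.025,maxN=6*length(x1)){
    s2 <- (var(x1)+var(y1))/2
    n <- 2*(qnorm(1-alpha)+qnorm(target))^2*s2/delta^2
    min(maxN,max(length(x1),ceiling(n)))
}
reassess_n(rnorm(5),rnorm(5,mean=1))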

Test procedure

Tests were performed using functions implemented in adaperm. The code below shows how to perform the different tests for an example with 16 preplanned observations per group that is reassessed to collect an additional 4 observations per group after the end of the preplanned trial.

## two groups of 20 observations each (16 preplanned plus 4 added after reassessment)
x <- rnorm(20)
y <- rnorm(20)

## inverse-normal combination tests based on stage-wise t-tests and Wilcoxon rank-sum tests
adaptive_invnormtest_2s(x,y,n1=8,n=16,ne=20)
adaptive_invnorm_wilcoxtest_2s(x,y,n1=8,n=16,ne=20)
## conditional error based permutation tests (difference of means / rank sum)
adaperm_DR(c(x,y),rep(0:1,each=20),n1=8,n=16,test=tstat)
adaperm_DR(c(x,y),rep(0:1,each=20),n1=8,n=16,test=tstat,stratified=FALSE)
adaperm_DR(c(x,y),rep(0:1,each=20),n1=8,n=16,test=ranksum)
## conditional error of the randomized preplanned test ...
adaperm_DR(c(x,y),rep(0:1,each=20),n1=8,n=16,test=ranksum,cer_type='randomized')
## ... additionally with a mid-p second stage test
adaperm_DR(c(x,y),rep(0:1,each=20),n1=8,n=16,test=ranksum,cer_type='randomized',atest_type='midp')

Calibration of distribution parameters

The parameters of the distributions are calibrated using a manual bisection search to give the desired operating characteristics, that is, 60\% and 80\% power using the inverse-normal $t$-test with a preplanned sample size of 10, 30 or 100 observations per group. The parameters for the 80\% power scenario are used as the planning assumptions; the parameters for the 60\% power scenario are used to sample observations.

Small sample scenarios

x <- rnorm(10)
y <- rnorm(10)
adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)

## Assume m0 = 1 s0=.75 (power=80), true m0=1 s0=.9 (power=60)
power.t.test(10,1,.65)


## calibrate the scale parameter of a distribution such that the
## inverse-normal combination t-test attains the target power;
## bisect performs a bisection search over the parameter
find_param <- function(rfun,n0,delta=1,target=.9,up=5,low=1/2,replication=1000){
    power <- function(param) mean(replicate(replication,{
                                            x <- rfun(n0,param,0)
                                            y <- rfun(n0,param,delta)
                                            adaptive_invnormtest_2s(x,y,n0/2,n0,n0)
    })) - target
    bisect(low,up,power,tol=10^(-3))
}

find_param(function(n0,param,delta) rnorm(n0,m=delta,sd=param),10,target=.70)
find_param(function(n0,param,delta) rnorm_cont(n0,m=delta,sd=param,cshift=0,csd=3*param),10,target=.70)
find_param(function(n0,param,delta) rt(n0,10)*param+delta,10,target=.70)

find_param(function(n0,param,delta) rnorm(n0,m=delta,sd=param),30,target=.70)
find_param(function(n0,param,delta) rnorm_cont(n0,m=delta,sd=param,cshift=0,csd=3*param),30,target=.70)
find_param(function(n0,param,delta) rt(n0,10)*param+delta,30,target=.70)

find_param(function(n0,param,delta) rnorm(n0,m=delta,sd=param),100,target=.70)
find_param(function(n0,param,delta) rnorm_cont(n0,m=delta,sd=param,cshift=0,csd=3*param),100,target=.70)
find_param(function(n0,param,delta) rt(n0,10)*param+delta,100,target=.70,low=.1,up=10)


plot(density(-log(rgamma(1000,4))*.9))
lines(density(-log(rgamma(1000,4))*.9+1))

mean(replicate(10000,{
sd <- .9
x <- rnorm(10,sd=sd)
y <- rnorm(10,m=1,sd=sd)
adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)}))


x <- log(rgamma(10,1))
y <- log(rgamma(10,1))
adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)


mean(replicate(10000,{a <- 1
                      ncp <- 1.35
                      x <- log(rgamma(10,a))
                      y <- log(rgamma(10,a))+ncp
                      adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)}))

x <- rt(10,4)
y <- rt(10,4)
adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)

mean(replicate(10000,{ncp <- 1.2
                      x <- rt(10,4)
                      y <- rt(10,4,ncp)
                      adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)}))

mean(replicate(10000,{sd <- .8
                      x <- rnorm_cont(10,0,sd,cshift=0,csd=3*sd)
                      y <- rnorm_cont(10,1,sd,cshift=1,csd=3*sd)
                      adaptive_invnormtest_2s(x,y,n1=5,n=10,ne=10)}))


r_counts <- function(n,theta=1,df=2,lambda=NULL){
  if (is.null(lambda)) lambda=rchisq(n=n,df = df)
  lambda[lambda<.5]=.5
  lambda[lambda>10000]=10000
  y=rpois(n = n,lambda = lambda*theta)
  y
}

#This concludes the configuration of the small sample setup.

In all situations our planning assumption is normally distributed observations with a mean difference of $\delta_0=1$ and a common standard deviation of $\sigma_0=.6$. Note that for these small sample sizes there is quite a large difference in power between the inverse normal combination test and the fixed sample t-test; the latter has a power of 94% for $\sigma_0=.6$. We simulate data from:

  1. Normal distributions with mean-difference $\delta=1$ and $\sigma=.9$
  2. Log-gamma distributions with shape parameter $a=1$ shifted by $\delta=1.35$ in the treatment group
  3. $t$-distributions with $4$ degrees of freedom and non-centrality parameter $1.2$ in the treatment group

In all cases the inverse normal combination test has a (simulated) power of 60%.
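The 94% power of the fixed sample t-test under the planning assumptions can be checked directly:

## fixed sample t-test power for delta0=1, sigma0=.6, 10 observations per group
power.t.test(10,1,.6)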

Large sample scenarios

x <- rnorm(100)
y <- rnorm(100)
adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)

## Assume m0 = 1 s0=2.5 (power=80), true m0=1 s0=3.2 (power=60)
power.t.test(100,1,2.5)

mean(replicate(10000,{
sd <- 3.2
x <- rnorm(100,sd=sd)
y <- rnorm(100,m=1,sd=sd)
adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)}))


x <- log(rgamma(100,1))
y <- log(rgamma(100,1))
adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)


mean(replicate(10000,{a <- 1
                      ncp <- .4
                      x <- log(rgamma(100,a))
                      y <- log(rgamma(100,a))+ncp
                      adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)}))

x <- rt(100,4)
y <- rt(100,4)
adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)

mean(replicate(10000,{ncp <- .35
                      x <- rt(100,4)
                      y <- rt(100,4,ncp)
                      adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)}))

adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)

mean(replicate(10000,{sd <- 2.8
                      x <- rnorm_cont(100,0,sd,cshift=0,csd=3*sd)
                      y <- rnorm_cont(100,1,sd,cshift=1,csd=3*sd)
                      adaptive_invnormtest_2s(x,y,n1=50,n=100,ne=100)}))

Medium sample scenarios

N <- 30
x <- rnorm(N)
y <- rnorm(N)
adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)

## Assume m0 = 1 s0=1.35 (power=80), true m0=1 s0=1.7 (power=60)
power.t.test(N,1,1.35)

mean(replicate(10000,{
sd <- 1.7
x <- rnorm(N,sd=sd)
y <- rnorm(N,m=1,sd=sd)
adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)}))

mean(replicate(10000,{
sd <- 1.35
x <- rnorm(N,sd=sd)
y <- rnorm(N,m=1,sd=sd)
adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)}))


x <- log(rgamma(N,1))
y <- log(rgamma(N,1))
adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)


mean(replicate(10000,{a <- 1
                      ncp <- .75
                      x <- log(rgamma(N,a))
                      y <- log(rgamma(N,a))+ncp
                      adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)}))

x <- rt(N,4)
y <- rt(N,4)
adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)

mean(replicate(10000,{ncp <- .65
                      x <- rt(N,4)
                      y <- rt(N,4,ncp)
                      adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)}))

adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)

mean(replicate(10000,{sd <- 1.3
                      x <- rnorm_cont(N,0,sd,cshift=0,csd=3*sd)
                      y <- rnorm_cont(N,1,sd,cshift=1,csd=3*sd)
                      adaptive_invnormtest_2s(x,y,n1=N/2,n=N,ne=N)}))
## robust pooled variance: squared normalized IQR of the median-centered
## observations from both groups (1.349 is the IQR of the standard normal)
robust_pooled_variance <- function(x,y){
    (iqr(c(x-median(x),y-median(y)))/1.349)^2
}
mean(replicate(10000,{
                   sd <- 3.2
                   x <- rnorm(100,sd=sd)
                   y <- rnorm(100,m=1,sd=sd)
                   sqrt(robust_pooled_variance(x,y))}))

replicate(10,{
              sd <- 1.6
              x <- rnorm(10,sd=sd)
              y <- rnorm(10,m=1,sd=sd)
              c(nn=cond_power_rule_norm_ts(x,y,target=.8),
                nt=cond_power_rule_t_ts(x,y,target=.8))})

mean(replicate(10000,{
                   sd <- 3.2
                   x <- rnorm(100,sd=sd)
                   y <- rnorm(100,m=1,sd=sd)
                   sqrt(robust_pooled_variance(x,y))}))

mean(replicate(10000,{
                   sd <- 3.2
                   x <- rnorm(100,sd=sd)
                   y <- rnorm(100,m=1,sd=sd)
                   sqrt(pooled_variance(c(x,y),rep(0:1,each=100)))}))

mean(replicate(10000,{ncp <- .35
                      x <- rt(100,4)*2.8
                      y <- rt(100,4,ncp)*2.8
                      sqrt(robust_pooled_variance(x,y))}))

mean(replicate(10000,{a <- 1
                      ncp <- .4
                      x <- log(rgamma(100,a))
                      y <- log(rgamma(100,a))+ncp
                      sqrt(robust_pooled_variance(x,y))}))

mean(replicate(10000,{ncp <- 1
                      x <- rnorm_cont(100,0,3.15,cshift=0,csd=3*3.15)
                      y <- rnorm_cont(100,ncp,3.15,cshift=ncp,csd=3*3.15)
                      sqrt(robust_pooled_variance(x,y))}))


mean(replicate(10000,{ncp <- 1
                      x <- rnorm_cont(100,0,3.15,cshift=0,csd=3*3.15)
                      y <- rnorm_cont(100,ncp,3.15,cshift=ncp,csd=3*3.15)
                      sqrt(pooled_variance(c(x,y),rep(0:1,each=100)))}))


#This concludes the configuration of the large sample setup.

In all situations our planning assumption is normally distributed observations with a mean difference of $\delta_0=1$ and a common standard deviation of $\sigma_0=2.5$. For these larger sample sizes there is almost no difference in power between the inverse normal combination and the fixed sample t-test. We simulate data from:

  1. Normal distributions with mean-difference $\delta=1$ and $\sigma=3.2$
  2. Log-gamma distributions with shape parameter $a=1$ shifted by $\delta=.4$ in the treatment group
  3. $t$-distributions with $4$ degrees of freedom and non-centrality parameter $.35$ in the treatment group

In all cases the inverse normal combination test has a (simulated) power of 60%.

Medium sample size

In all situations our planning assumption is normally distributed observations with a mean difference of $\delta_0=1$ and a common standard deviation of $\sigma_0=1.35$. For the medium sample size there is almost no difference in power between the inverse normal combination and the fixed sample t-test. We simulate data from:

  1. Normal distributions with mean-difference $\delta=1$ and $\sigma=1.7$
  2. Log-gamma distributions with shape parameter $a=1$ shifted by $\delta=.75$ in the treatment group
  3. $t$-distributions with $4$ degrees of freedom and non-centrality parameter $.65$ in the treatment group
  4. Contaminated normal distributions with mean-difference $\delta=1$, where 90% of the observations have $\sigma = 1.3$ and 10% have $\sigma = 3.9$.

In all cases the inverse normal combination test has a (simulated) power of 60%.

distribution <- data_frame('Distribution'='Normal',
                           'Scenario'='Null',
                           'Sample-size'='Small',
                           'Data'=rnorm(10^6,0,.9))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='Normal',
                                     'Scenario'='Alternative',
                                     'Sample-size'='Small',
                                     'Data'=rnorm(10^6,1,.9)))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='Log-gamma',
                                     'Scenario'='Null',
                                     'Sample-size'='Small',
                                     'Data'=log(rgamma(10^6,1))))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='Log-gamma',
                                     'Scenario'='Alternative',
                                     'Sample-size'='Small',
                                     'Data'=log(rgamma(10^6,1))+1.35))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='t',
                                     'Scenario'='Null',
                                     'Sample-size'='Small',
                                     'Data'=rt(10^6,4,0)))    
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='t',
                                     'Scenario'='Alternative',
                                     'Sample-size'='Small',
                                     'Data'=rt(10^6,4,1.2)))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='Contaminated',
                                     'Scenario'='Null',
                                     'Sample-size'='Small',
                                     'Data'=rnorm_cont(10^6,mean=0,sd=.8,cprop=.1,cshift=0,csd=3*.8)))
distribution <- bind_rows(distribution,
                          data_frame('Distribution'='Contaminated',
                                     'Scenario'='Alternative',
                                     'Sample-size'='Small',
                                     'Data'=rnorm_cont(10^6,mean=1,sd=.8,cprop=.1,cshift=0,csd=3*.8)))


rec_relevel <- function(factor,levels){
    if(length(levels) == 1){
        relevel(factor,levels)
    } else {
        rec_relevel(relevel(factor,tail(levels,1)),head(levels,-1))
    }
}


distribution %<>% mutate(Scenario = relevel(factor(Scenario),'Null'))
distribution %<>% mutate(Distribution = rec_relevel(factor(Distribution),c('Normal','Log-gamma','Contaminated')))

distribution %>% group_by(Distribution,Scenario) %>% summarize(sd=sd(Data))

#distribution %<>% group_by(Distribution,Scenario) %>% mutate(Data = Data/sd(Data))

distribution_plot <- ggplot(distribution,aes(x=Data))+geom_line(aes(lty=Scenario),stat='density')+facet_wrap(~Distribution)+xlim(-7,10)+theme_minimal()

cairo_ps('distributions.eps')
print(distribution_plot)
dev.off()

Simulation results

run_scenario <- function(scenario){
    for(i in names(scenario)) assign(i,scenario[[i]])
    xdist = function(N) {
        switch(distribution,
               'normal'=rnorm(N,m=0,sd=parameter),
               'log-gamma'=log(rgamma(N,1.2))*parameter,
               't4'=rt(N,4)*parameter,
               'cont'=rnorm_cont(N,m=0,sd=parameter,cshift=0,csd=3*parameter),
               'count'=NULL)
    }
    ydist = function(N) {
        switch(distribution,
               'normal'=rnorm(N,m=delta,sd=parameter),
               'log-gamma'=log(rgamma(N,1.2))*parameter+delta,
               't4'=rt(N,4)*parameter+delta,
               'cont'=rnorm_cont(N,m=delta,sd=parameter,cshift=0,csd=3*parameter),
               'count'=NULL)
    }
    ## treatment group size!
    m1 <- ceiling(n1*(1-propn)/propn)
    m <- ceiling(n*(1-propn)/propn)
    x1 <- xdist(n1)
    y1 <- ydist(m1)
    ## fixed sample test sample size
    nf <- ceiling(power.w.test(p1=delta2res(delta0,sigma0,T),propn=propn,power=0.8,silent=T))
    mf  <- ceiling(nf*(1-propn)/propn)
    ## deltas <- c('normal'=delta,
    ##            'log-gamma'=delta,
    ##            't4'=delta,
    ##            'cont'=delta)[distribution]
    ##   nE <- max(n,cond_power_rule_norm_ts(x1,y1,delta=1,target=.84,rob_var=T,maxN=6*n1,type='sn'))
    nAdapt <- combination_power_rule_w_ts_f(x1,y1,pp=delta2res(delta=delta0,sigma0,T),lambda=lambda,maxN=6*n1,propn=propn)
    nE <- ifelse(is.null(nAdapt),n,nAdapt)
    mE <- ceiling(nE*(1-propn)/propn)
    n_combs <- choose(2*n1,n1)*choose(2*(n-n1),(n-n1))
    exact  <- n_combs <= 10^5
    maxn <- max(nf,nE)
    maxm <- max(mf,mE)
    x <- c(x1,xdist(maxn-n1))
    y <- c(y1,ydist(maxm-m1))
    ans <- data_frame(
        exact=exact,
        nf=nf,
        fs=is.null(nAdapt),
        nE=ifelse(is.null(nAdapt),n1,nE),
        fixed_w = wilcox.test(y[1:nf],x[1:nf],alternative='greater',correct=F,exact=T)$p.value < 0.025,
        #invnorm_t=adaptive_invnormtest_2s(x[1:nE],y[1:mE],n1=n1,n=n,ne=nE,m1=m1,m=m,me=mE),
        invnorm_w=ifelse(is.null(nAdapt),FALSE,adaptive_invnorm_wilcoxtest_2s(x[1:nE],y[1:mE],n1=n1,n=n,ne=nE,m1=m1,m=m,me=mE)),
        #adaperm_t=adaperm_DR(c(x[1:nE],y[1:mE]),rep(0:1,c(nE,mE)),n1=n1,n=n,m1=m1,m=m,test=tstat,cer_type='randomized',atest_type='non-randomized',permutations=10^5),
        adaperm_rw=ifelse(is.null(nAdapt),FALSE,adaperm_DR(c(x[1:nE],y[1:mE]),rep(0:1,c(nE,mE)),n1=n1,n=n,m1=m1,m=m,test=ranksum,cer_type='randomized',atest_type='non-randomized',permutations=10^5)),
        adaperm_w=ifelse(is.null(nAdapt),FALSE,adaperm_DR(c(x[1:nE],y[1:mE]),rep(0:1,c(nE,mE)),n1=n1,n=n,m1=m1,m=m,test=ranksum,cer_type='non-randomized',atest_type='non-randomized',permutations=10^5))
    )
    return(ans)
}
                                        #             adaperm_m=adaperm_DR(c(x,y),rep(0:1,each=nE),n1=n1,n=n,test=maritzm,atest_type='davison_hinkley'))



simulate_scenario <- function(scenario,B){
    ans <- bind_rows(mclapply2(1:B,function(i) run_scenario(scenario)))
    summarise(ans,B=B,
                  pExact = mean(exact),
                  ASN = mean(nE),
                  sdSN = sd(nE),
                  MSN = max(nE),
              futilS = mean(fs),
              fixedN = mean(nf),
              fixed_w = mean(fixed_w),
#                  fixed_t = mean(fixed_t),
#                  fixed_w = mean(fixed_w),
                  #invnorm_t=mean(invnorm_t),
                  invnorm_w=mean(invnorm_w),
                  #adaperm_t=mean(adaperm_t),
                  adaperm_w=mean(adaperm_w),
                  adaperm_rw=mean(adaperm_rw)
#                  adaperm_rwp=mean(adaperm_rwp)
#                  adaperm_ut=mean(adaperm_ut),
#                  adaperm_uw=mean(adaperm_uw)
                  )
#                  adaperm_m=mean(adaperm_m))
}
scenarios <- data_frame(n1 = rep(c(rep(5,3),rep(50,3),rep(15,3),rep(10,3)),2),
                n =  rep(c(rep(10,3),rep(100,3),rep(30,3),rep(20,3)),2),
                distribution = rep(rep(c('normal','t4','cont'),4),2),
                parameter =rep(c(
                    0.80,0.65,0.70,
                    2.6,2.20,2.3,
                    1.5,1.28,1.32,
                    3,2.5,2.7),2),
                delta = c(rep(0,12),rep(1,9),rep(2,3)),
                delta0 = rep(c(rep(1,9),rep(2,3)),2),
                sigma0 = rep(rep(c(.80,2.6,1.5,3),each=3),2),
                lambda = rep(c(rep(550e-4,3),rep(550e-5,3),rep(140e-4,3),rep(180e-4,3)),2),
                propn = rep(c(rep(1/2,9),rep(1/3,3)),2))


##simulate_scenario(scenarios[24,],100)
##run_scenario(scenarios[15,])
small_index <- which(scenarios$n1 == 5)
medium_index <- which(scenarios$n1 == 15)
large_index <- which(scenarios$n1 == 50)
example_index <- which(scenarios$n1 == 10)

small_scenarios <- list()
medium_scenarios <- list()
large_scenarios <- list()
example_scenarios <- list()

## about 1h for 1e4
for(i in 1:6) small_scenarios[[i]] <- bind_cols(scenarios[small_index[i],],simulate_scenario(scenarios[small_index[i],],B=1e3))
print(bind_rows(small_scenarios),width=Inf)

save(small_scenarios,file=vfile('~/adaperm_simresults/small_simulations_futility'))

## about 3.5h for 1e4
for(i in 1:6) medium_scenarios[[i]] <- bind_cols(scenarios[medium_index[i],],simulate_scenario(scenarios[medium_index[i],],B=1e4))
print(bind_rows(medium_scenarios),width=Inf)
save(medium_scenarios,file=vfile('~/adaperm_simresults/medium_simulations_futility'))

for(i in 1:6) large_scenarios[[i]] <- bind_cols(scenarios[large_index[i],],simulate_scenario(scenarios[large_index[i],],B=1e4))
print(bind_rows(large_scenarios),width=Inf)
save(large_scenarios,file=vfile('~/adaperm_simresults/large_simulations_futility'))

for(i in 1:6) example_scenarios[[i]] <- bind_cols(scenarios[example_index[i],],simulate_scenario(scenarios[example_index[i],],B=1e3))
save(example_scenarios,file=vfile('~/adaperm_simresults/example_simulations_futility'))

## leftover from a full run over all scenarios; `out` is not defined above
## sims <- bind_rows(out)
## for(i in 1:nrow(scenarios)) out[[i]] <- bind_cols(scenarios[i,],simulate_scenario(scenarios[i,],B=20000))
## print(sims,width=Inf)

For each scenario we simulated 50000 trials. For each trial we computed the test decision of each test procedure. Permutation tests used at most $10^5$ permutations to compute the permutation distribution.

Simulations for the small sample scenarios took about a day using 30 cores of our simulation server; simulations for the large sample scenarios took up to one week on the same hardware. This is because for the small sample scenarios we often do not need to evaluate all $10^5$ permutations, since the complete permutation space is often smaller.
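The difference is driven by the number of stage-wise stratified reassignments, which run_scenario computes as choose(2*n1,n1)*choose(2*(n-n1),n-n1); only when this number falls below $10^5$ can the permutation distribution be enumerated exactly.

## number of stage-wise stratified reassignments per scenario
n_reassignments <- function(n1,n) choose(2*n1,n1)*choose(2*(n-n1),n-n1)
n_reassignments(5,10)   ## small:  252^2 = 63504 -> exact enumeration
n_reassignments(15,30)  ## medium: ~2.4e16       -> Monte Carlo with 10^5 permutations
n_reassignments(50,100) ## large:  ~1e58         -> Monte Carlo with 10^5 permutations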

options(dplyr.width=120)
load('~/srv-home/adaperm_simresults/small_simulations_newrules_node6_161018.Rd')
load('~/srv-home/adaperm_simresults/medium_simulations_newrules_node6_161018.Rd')
load('~/srv-home/adaperm_simresults/large_simulations_newrules_node6_161007.Rd')


## small_null_50k <- sims
## load('~/srv-home/adaperm_simresults/alt_simulations_node2_160405.Rd')
## small_alt_50k <- sims
## load('~/srv-home/adaperm_simresults/null_simulations_large_node6_160412.Rd')
## large_null_50k <- sims
## load('~/srv-home/adaperm_simresults/alt_simulations_large_node6_160419.Rd')
## large_alt_50k <- sims
## load('~/srv-home/adaperm_simresults/simulations_medium_node6_160429.Rd')
## medium_null_50k <- sims[1:4,]
## medium_alt_50k <- sims[5:8,]
## ## load('../data/simulations_unstratified_node6_160503.Rd')
## small_null_50k <- sims[1:4,]
## small_alt_50k <- sims[9:12,]
## medium_null_50k <- sims[5:8,]
## medium_alt_50k <- sims[13:16,]

sims <- bind_rows(small_scenarios,
                  medium_scenarios,
                  large_scenarios)
colnames(sims) <- gsub("adaperm","ce",colnames(sims))
colnames(sims) <- gsub("invnorm","ct",colnames(sims))
colnames(sims) <- gsub("fixedN","fsN",colnames(sims))
colnames(sims) <- gsub("fixed_w","fs:W",colnames(sims))
colnames(sims) <- gsub("_t",":T",colnames(sims))
colnames(sims) <- gsub("_w",":W",colnames(sims))
colnames(sims) <- gsub("_rw","$^r$:W",colnames(sims))



## pdf('simresults.pdf',height=4,width=7)
## sims %>% mutate(distribution = relevel(factor(distribution),'normal')) %>%
##     melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>% 
##         ggplot(aes(distribution,Power,fill=Test))+geom_bar(stat='identity',position='dodge')+facet_grid(delta0~n1,scales='free_y')
## dev.off()

sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0==0 & n1==5) %>% dcast(distribution~Test,value.var='Power') -> small_null_tab
sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0>0 & n1==5) %>% dcast(distribution~Test,value.var='Power') -> small_alt_tab
sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0==0 & n1==15) %>% dcast(distribution~Test,value.var='Power') -> medium_null_tab
sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0>0 & n1==15) %>% dcast(distribution~Test,value.var='Power') -> medium_alt_tab
sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0==0 & n1==50) %>% dcast(distribution~Test,value.var='Power') -> large_null_tab
sims %>% melt(.,id.vars=1:11,value.name='Power',variable.name='Test') %>%
    subset(delta0>0 & n1==50) %>% dcast(distribution~Test,value.var='Power') -> large_alt_tab



##devtools::install_github('qrat/plaint')
colnames(sims)
library(plaint)
dists <-  c("normal" = 1,"t4" = 2, "cont" = 3)
simtab <- sims[,-c(1,4,6,9:10,11,13)]
simtab[,c(8,10:13)] <- simtab[,c(8,10:13)] * 100
simtab <- simtab[with(simtab,order(dists[distribution],n,delta)),]
#rownames(simtab) <- paste(simtab$distribution,1:nrow(simtab))
simtab %<>% mutate(distribution=c("cont" = "Cont.","log-gamma" = "log-$\\gamma$","normal" = "Normal","t4" = "t$_4$"," " = " ")[distribution]) %>% as.data.frame
ftab <- form(simtab[,c(3,4,1,2,5,7,6,8:ncol(simtab))],formatcolumn = list(
                                                          "%1i" = "delta",
                                                          "%1.1f" = "sigma0",
                                                          "%3i" = "n",
                                                          "%3i" = "MSN",
                                                          "%3i" = "fsN",
                                                          "%3.1f" = "ASN",
                                                          "%2.1f" = "futilS",
                                                          "%1.3f" = 'lambda',
                                                          "%2.1f" = "c*"
                                                          ))
#             marker=list("v=='1'" = "delta0"),
                                        #             symbol=list("**" = "==")
ftab <- within(ftab,n  <- factor(n,levels=c(' 10',' 30','100'),ordered=T))
ftab %<>%
    group_by(delta,n) %>% do(within(.,{ #distribution[duplicated(distribution)] <- ' '
        delta[duplicated(delta)] <- ' '
        n <- as.character(n)
        n[duplicated(n)] <- ' '
        sigma0[duplicated(sigma0)] <- ' '
        lambda[duplicated(lambda)] <- ' '
#        ASN[duplicated(ASN)] <- ' '
        MSN[duplicated(MSN)] <- ' '
    }))
ftab[10,1] <- "\\\\\\hline\\\\ 1"
ftab <- ftab[,!grepl("T$",colnames(ftab))]
latex(ftab,options=list(rownames=FALSE),design='design.txt',file="~/repos/ovl_adaperm/table.tex")

Type I error

pander(small_null_tab,split.table=3000,digits=3,caption='Simulations of small sample scenarios under the null hypothesis')
pander(medium_null_tab,split.table=3000,digits=3,caption='Simulations of medium sample scenarios under the null hypothesis')
pander(large_null_tab,split.table=3000,digits=3,caption='Simulations of large sample scenarios under the null hypothesis')

We see that the Type I error rate is essentially controlled in all scenarios; small deviations from the nominal level are due to simulation error. Even the test using the mid-$p$ value in the second stage appears to control the Type I error rate.

Power

pander(small_alt_tab,split.table=3000,digits=3,caption='Simulations of small sample scenarios under the alternative hypothesis')

Looking at the power, there are large differences between the small and large sample scenarios. In the small sample scenarios the inverse normal combination Wilcoxon test performs poorly; this is due to the extreme discreteness of the conditional null distribution when the test is applied stage-wise with only 5 observations per group. Consequently, the inverse normal combination t-test outperforms the non-parametric test even for non-normal observations. The conditional error based permutation test is affected by the small sample sizes to a much smaller degree and outperforms the t-test in all but the normal scenario. The inverse normal combination t-test and the conditional error permutation test show similar performance in all scenarios. Using the conditional error rate of a randomized preplanned test adds on average 0.4 percentage points to the power of the test procedure; performing the second-stage test using the mid-$p$ value adds almost another percentage point. Compared to the inverse normal combination t-test we gain between 1 percentage point (normally distributed observations) and 3 percentage points (log-gamma, contaminated normal) in power.
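To illustrate the discreteness argument: with only 5 observations per group the stage-wise rank-sum statistic takes just 26 distinct values, so the attainable one-sided p-values near the nominal 2.5% level are very coarse and each stage-wise test is markedly conservative.

## attainable one-sided p-values of an exact rank-sum test with 5 observations
## per group (Mann-Whitney U = 0,...,25)
p_attain <- sapply(0:25,function(u) 1-pwilcox(u-1,5,5))
round(sort(p_attain[p_attain <= .05]),4)
## 0.0040 0.0079 0.0159 0.0278 0.0476: the largest attainable p-value
## below 0.025 is 0.0159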

pander(medium_alt_tab,split.table=3000,digits=3,caption='Simulations of medium sample scenarios under the alternative hypothesis')
pander(large_alt_tab,split.table=3000,digits=3,caption='Simulations of large sample scenarios under the alternative hypothesis')

In the large sample scenarios the inverse normal combination and the conditional error permutation test procedures perform very similarly under the alternative. Test procedures based on the $t$-statistic are outperformed by Wilcoxon-type tests in all but the normal scenario. The conditional error permutation test based on the Wilcoxon statistic outperforms the combination test by a small margin of $.1$ percentage points; for the $t$-statistic it is the other way around. We conjecture that, due to the asymptotic normality of the $t$-statistic, the inverse-normal combination is closer to the actual conditional error rate of the preplanned test, whereas the conditional error permutation test uses the less efficient "conditional-conditional" error rate. This difference may be reduced using a sub-sampling approach that averages the "conditional-conditional" error rate over many subsamples (each of the size of the preplanned second stage) of an extended second stage, which asymptotically (as the sample size extension goes to infinity) converges to the unconditional conditional error rate.
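A rough sketch of the sub-sampling idea is given below; cer_cc stands in for whatever function returns the "conditional-conditional" error rate given a second-stage subsample and is purely hypothetical, i.e. not part of the current adaperm interface.

## average the 'conditional-conditional' error rate over B random subsamples
## of the extended second stage, each of the preplanned second-stage size n2
subsampled_cer <- function(x2_ext,y2_ext,n2,cer_cc,B=1000){
    mean(replicate(B,cer_cc(sample(x2_ext,n2),sample(y2_ext,n2))))
}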

set.seed(10514)
library(flip)
B <- 1000
n <- 20
X0 <- matrix(rt(n*B,df=2),n)
X <- matrix(rt(n*B,df=2),n)+1


ttests0 <- apply(X0,2,t.test,alternative='greater')
ttests0_pvalues <- sapply(ttests0,`[[`,'p.value')
ttests <- apply(X,2,t.test,alternative='greater')
ttests_pvalues <- sapply(ttests,`[[`,'p.value')


comb10 <- apply(X0[1:(n/2),],2,t.test,alternative='greater')
comb10_pvalues <- sapply(comb10,`[[`,'p.value')
comb20 <- apply(X0[(n/2+1):n,],2,t.test,alternative='greater')
comb20_pvalues <- sapply(comb20,`[[`,'p.value')
comb0_pvalues <- pnorm(sqrt(.5)*(qnorm(comb10_pvalues,lower=F)+qnorm(comb20_pvalues,lower=F)),lower=F)

comb1 <- apply(X[1:(n/2),],2,t.test,alternative='greater')
comb1_pvalues <- sapply(comb1,`[[`,'p.value')
comb2 <- apply(X[(n/2+1):n,],2,t.test,alternative='greater')
comb2_pvalues <- sapply(comb2,`[[`,'p.value')
comb_pvalues <- pnorm(sqrt(.5)*(qnorm(comb1_pvalues,lower=F)+qnorm(comb2_pvalues,lower=F)),lower=F)

permtests0 <- flip(X0,statTest='t',tail=1,perms=10000)
permtests0_pvalues <- permtests0@res$`p-value`
permtests <- flip(X,statTest='t',tail=1,perms=10000)
permtests_pvalues <- permtests@res$`p-value`

wilctests0 <- flip(X0,statTest='Wilcoxon',tail=1,perms=10000)
wilctests0_pvalues <- wilctests0@res$`p-value`
wilctests <- flip(X,statTest='Wilcoxon',tail=1,perms=10000)
wilctests_pvalues <- wilctests@res$`p-value`

mean(ttests0_pvalues<=.05)
mean(permtests0_pvalues<=.05)
mean(comb0_pvalues<=.05)
mean(wilctests0_pvalues<=.05)

mean(ttests_pvalues<=.05)
mean(permtests_pvalues<=.05)
mean(comb_pvalues<=.05)
mean(wilctests_pvalues<=.05)

