Vignette summary

library(permuter)
set.seed(101)

IPAT Data

As an initial example, let us consider a testing problem on the effectiveness of training in the reduction of anxiety in a sample of n = 20 subjects (Pesarin, 2001). At first glance, the subjects of the experiment are presumed to be "homogeneous" with respect to the most important experimental conditions, the so-called covariates, such as sex, age and health.

Suppose that anxiety, the variable $Y$, is measured by means of an Institute for Personality and Ability Testing (IPAT) psychological test, responses to which are quantitative scores corresponding to the sum of sub-responses to a set of different items. Each unit is observed before treatment, occasion 1, also called the baseline observation, and one week after a fixed number of training sessions, occasion 2, which are administered with the aim of stochastically reducing baseline values.

Of course, the bivariate responses within each specific unit are dependent, because they are measured on the same unit on different occasions, whereas the $n$ pairs are assumed to be independent because they relate to different units. Moreover, due to the assumed homogeneity of all individuals with respect to the most common experimental conditions, the set of data pairs $\{(Y_{1i},Y_{2i}),\ i = 1,\dots,n\}$ may be viewed as a random sample of $n$ i.i.d. pairs from the bivariate random variable $Y = (Y_1,Y_2)$. Formally, the data are represented by a matrix of $n$ pairs $(\mathbf{Y}_1,\mathbf{Y}_2) \in \mathcal{X}$, where $\mathcal{X}$ is the sample space of the experiment.

We'd like to test the strong null hypothesis $H_0: Y_1 \,{\buildrel d \over =}\, Y_2$ against the alternative $H_1: Y_1 \,{\buildrel d \over >}\, Y_2$. Under the null, whether the IPAT measurement was taken at the beginning or at the end of the treatment is essentially irrelevant, making $Y_{1i}>Y_{2i}$ just as likely as $Y_{1i}<Y_{2i}$. Conditional on the observed values $\mathbf{Y}_1$ and $\mathbf{Y}_2$ and assuming the null hypothesis, we know the entire null distribution of the data: it consists of all $2^n$ within-pair permutations of the pairs $(Y_{1i}, Y_{2i})$.

We begin by loading the data. The column YA contains the first measurement and the column YB contains the second, and we'd like to look at the change between the two. The test statistic we will use is the mean of the differences YA - YB. Note that testing $H_0$ above is equivalent to testing that the distribution of YA - YB is symmetric about $0$. Using the one_sample function, we can approximate the null distribution of the mean difference with B = 100 random permutations:

data(ipat)
d <- ipat$YA - ipat$YB
n <- nrow(ipat)
B <- 100
observed <- mean(d)
distr <- one_sample(x = d, reps = B)
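
Equivalently, under the null each difference is as likely to be positive as negative, so the permutation distribution can also be built by hand by flipping the sign of each difference at random. The sketch below does this directly (we assume this is essentially what one_sample does internally; the package may use a different scheme):

# Hand-rolled sign-flip null distribution (illustrative sketch only)
distr_manual <- replicate(B, {
    signs <- sample(c(-1, 1), length(d), replace = TRUE)  # random sign for each pair
    mean(signs * d)                                        # mean difference under H0
})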

The vector distr has length B and contains the (approximate) null distribution of the test statistic. To obtain a p-value we use the function t2p, which takes the observed test statistic together with a distribution of permutation values and returns the corresponding p-value (i.e. it compares the observed statistic with the whole permutation distribution). You may specify an alternative ('greater', 'less', or 'two-sided') or leave the alternative argument unspecified to get all three p-values.

t2p(observed, distr)
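
For the 'greater' alternative, the p-value is essentially the proportion of the permutation distribution at least as large as the observed statistic. A rough hand computation (including the observed value in the reference set; t2p may apply a slightly different finite-sample convention) is:

# Approximate one-sided p-value computed directly from the permutation values
mean(c(observed, distr) >= observed)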

The output is the p-value of this analysis. In this case the null hypothesis $H_0 : Y_1 \,{\buildrel d \over =}\, Y_2$ is rejected in favor of the alternative $H_1: Y_1 \,{\buildrel d \over >}\, Y_2$. Let's estimate the conditional power for the 'greater than' alternative: we repeat the test R times, each time drawing a fresh set of B permutations, and record the proportion of repetitions in which $H_0$ is rejected at level $\alpha$.

R <- 100
pval <- rep(0, R)
for (cc in 1:R) {
    d.star <- sample(d)                          # reshuffle the observed differences
    distr <- one_sample(x = d.star, reps = B)    # fresh set of B permutation values
    pval[cc] <- t2p(mean(d.star), distr, alternative = "greater")
}

alpha <- 0.05
pow <- mean(pval <= alpha)
pow

Job Satisfaction

Let us now discuss, as a second example, a problem (Pesarin, 2001) concerning the comparison of the locations of two populations. In a psychological experiment to assess the degree of job satisfaction of two groups of workers, 20 units, assumed to be homogeneous with respect to the most important covariates, such as sex, age, general health and social status, were examined through the response variable $X$, the perceived degree of job satisfaction. $X$ was measured by a suitable psychological index consisting of the sum of a finite number of items, each related to a specific sub-aspect. Before the experiment was carried out, 12 units (group 1) were classified as "extroverted" and the remaining 8 units (group 2) as "introverted". The testing problem is to decide whether the data conform better to the null hypothesis of no difference in distribution, or to the alternative of a difference in favour of the "extroverted" group, i.e. a one-sided (restricted, or dominance) alternative. It is worth noting that subjects were assigned to the symbolic treatment levels (extroverted and introverted) after they were observed, so they were not randomized to treatments; this is a typical observational study in which the treatment is merely a post-hoc classification (see Remark 4, 2.1.1 for some related problems). However, since the null hypothesis assumes that there is no distributional difference between the two treatment levels, instead of permuting subjects we are allowed to permute the observed data (see Section 1.5).

We load the job dataset. The column X measures the degree of job satisfaction and the column Y denotes the group (extroverted (1) or introverted (2)). We'd like to test the null hypothesis $H_0: X_1 \,{\buildrel d \over =}\, X_2$ against the alternative $H_1: X_1 \,{\buildrel d \over >}\, X_2$. Under the null, the group label is essentially arbitrary and has no effect on the value of $X$ for each individual. We'll use the difference in means, $\bar{X}_1 - \bar{X}_2$, as the test statistic. Conditional on the observed data, we know the entire null distribution of this statistic: it consists of the statistic computed for all $\binom{20}{12}$ relabellings of the treatment.

data(job)
head(job)
group1 <- job[job$Y == 1, "X"]
group2 <- job[job$Y == 2, "X"]

observed <- mean(group1) - mean(group2)
distr <- two_sample(group1, group2)
pvalue <- t2p(observed, distr)
pvalue
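
The two_sample call hides the relabelling step. Conceptually, it can be reproduced by pooling the 20 scores and repeatedly choosing at random which 12 of them are labelled "extroverted" (a sketch under the assumption that two_sample permutes group labels in this way; its defaults, e.g. the number of replications, may differ):

# Hand-rolled relabelling of the pooled job-satisfaction scores (illustrative only)
pooled <- c(group1, group2)
n1 <- length(group1)
distr_manual <- replicate(1000, {
    idx <- sample(length(pooled), n1)        # which units receive label 1 in this relabelling
    mean(pooled[idx]) - mean(pooled[-idx])   # difference in means for this relabelling
})
mean(c(observed, distr_manual) >= observed)  # approximate one-sided p-value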

Confidence interval

We invert the test to obtain a confidence interval for the difference in mean job satisfaction between extroverted and introverted workers. Suppose that, instead of the null hypothesis above, we assume that the distributions of $X_1$ and $X_2$ differ by a constant shift $\delta$, and we test $H_0: X_1 \,{\buildrel d \over =}\, X_2 + \delta$ for some $\delta < \infty$. Let $\delta_0 = \mathbb{E}(X_1) - \mathbb{E}(X_2)$. Note that the test above was the special case of this shift model with $\delta = 0$. A $(1-\alpha)100\%$ confidence interval for $\delta_0$ contains all those values of $\delta$ for which the null hypothesis $H_0: X_1 \,{\buildrel d \over =}\, X_2 + \delta$ is not rejected at level $\alpha$.

CI_mean(group1, group2, alpha = 0.05)
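
Conceptually, CI_mean inverts the shift test described above: a candidate shift $\delta$ is retained when testing $H_0: X_1 \,{\buildrel d \over =}\, X_2 + \delta$ does not reject at level $\alpha$. A rough grid-search sketch of the same idea, reusing two_sample and t2p (CI_mean's actual search is presumably more refined than this), is:

# Grid-based inversion of the shift test (illustrative sketch only)
deltas <- seq(-2, 10, by = 0.25)             # candidate shifts; widen the grid if an endpoint is hit
keep <- sapply(deltas, function(delta) {
    shifted <- group1 - delta                # under H0: X1 =d X2 + delta, these are exchangeable with group2
    stat <- mean(shifted) - mean(group2)
    p <- t2p(stat, two_sample(shifted, group2), alternative = "two-sided")
    p > 0.05                                 # keep shifts that are not rejected
})
range(deltas[keep])                          # approximate 95% confidence interval for delta_0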

Conditional Power

We compute the conditional power by randomly generating 100 datasets under the null, conditional on the data we observed. Under the null hypothesis $H_0: X_1 \,{\buildrel d \over =}\, X_2 + \delta_0$, the data $X_{11}-\delta_0,\dots,X_{1n}-\delta_0, X_{21},\dots,X_{2m}$ are exchangeable. Thus we can generate all possible realizations of the samples, conditional on the observed values, by permuting these exchangeable values, splitting them into the two treatment groups, and adding $\delta_0$ back to the $X_{1i}^*$ values. For each permuted dataset, we compute a $95\%$ confidence interval for the difference in means between the two groups. The proportion of these confidence intervals that do not cover $0$ (equivalently, the proportion of tests we would reject at level $5\%$ in favor of a two-sided alternative) is the conditional power.

MB <- 100
CI_perm <- matrix(0, nrow = MB, ncol = 2)
Z <- c(group1 - observed, group2)          # shift group 1 by -delta_0 so all values are exchangeable
n <- length(group1)

for (bb in 1:MB) {
    Z.perm <- sample(Z)                    # permute the exchangeable values
    Z.perm[1:n] <- Z.perm[1:n] + observed  # add the shift back to the new group 1
    CI_perm[bb, ] <- CI_mean(Z.perm[1:n], Z.perm[-(1:n)], reps = 100)
}
pow <- 1 - mean(CI_perm[, 1] < 0 & CI_perm[, 2] > 0)   # proportion of CIs that exclude 0
pow

Worms

data(worms)
Y <- worms[, 1]
X <- worms[, 2]
n <- table(Y)
C <- length(n)
n

This is an example of one-way ANOVA, where the $k=3$ groups are given by Y and the worm lengths are recorded in X. We use the k_sample function to test the null hypothesis of equality of distributions, $H_0: X_1 \,{\buildrel d \over =}\, X_2 \,{\buildrel d \over =}\, X_3$, against the alternative that they are not all the same. Under the null, group assignment is essentially a random labelling of the units and has no influence on a worm's length.

It is easy to show that, conditionally, the usual F test statistic for the one-way ANOVA is permutationally equivalent to $T^{obs} = \sum_{j=1}^3 n_j \bar{X}_j^2$. k_sample approximates the distribution of the test statistic by permuting the group assignments and then computing, at each permutation, the values of $\bar{X}_j^2, j=1,2,3$ and taking the sum weighted by the $n_j$'s.
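
To make the equivalence explicit (a standard argument, written out here for completeness), recall the one-way decomposition into between-group and within-group sums of squares:

$$
F = \frac{SSB/(k-1)}{SSW/(N-k)}, \qquad
SSB = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2 = \sum_{j=1}^{k} n_j\bar{X}_j^2 - N\bar{X}^2, \qquad
SSW = SST - SSB,
$$

where $N = \sum_j n_j$ and $k = 3$ here. Permuting the group labels only reassigns the same pooled observations, so $SST$, $N\bar{X}^2$, $k$ and the $n_j$ are identical at every permutation. Hence $F$ is an increasing function of $SSB$, which in turn is an increasing function of $T = \sum_j n_j\bar{X}_j^2$, so ordering the permutations by $F$ or by $T$ yields the same p-value.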

The p-value of the test is again obtained by applying the t2p function to the observed test statistic and its permutation distribution. Since large values of the test statistic indicate a departure from the null, we use the 'greater than' alternative. In this case we can conclude that there is strong evidence against the null hypothesis.

xbar <- tapply(X, Y, mean)      # group means, in the same order as the counts in table(Y)
observed <- sum(n * xbar^2)     # observed value of T = sum_j n_j * xbar_j^2
distr <- k_sample(X, Y)
t2p(observed, distr, alternative = "greater")

Anderson-Darling

Suppose we observe samples from two distributions $X_1$ and $X_2$ of ordered categorical data. We record the frequencies in f1 and f2. We'd like to test the null hypothesis $H_0: X_1\,{\buildrel d \over =}\, X_2$. We can obtain the empirical CDFs $F_1$ and $F_2$ of groups $X_1$ and $X_2$ by taking the cumulative sums of the frequencies and normalizing. We obtain the overall empirical CDF by summing the frequencies f1 and f2, then applying the same procedure:

x <- 1:5
f1 <- c(8, 9, 6, 8, 9)
f2 <- c(17, 6, 6, 3, 8)
N1 <- cumsum(f1)
N2 <- cumsum(f2)
N <- cumsum(f1 + f2)
n <- sum(f1 + f2)    # total number of observations across both samples

temp <- cbind(f1, f2, N1/sum(f1), N2/sum(f2), N/sum(f1 + f2))
rownames(temp) <- x
library(knitr)
kable(temp, row.names = TRUE, col.names = c("f1", "f2", "F_1", "F_2", "F"))

The test statistic $T_D$ is the sum over the first $k-1 = 4$ categories of the quantities $D_i = N_{2i}/[N_{\cdot i}(n-N_{\cdot i})]^{1/2}$, $i=1,\dots,4$. It is then easiest to create a vector containing the $D_i$'s and sum its elements to obtain the observed value of $T_D$. For the observed data:

B <- 100
T <- array(0, dim = c((B + 1), 1))   # T[1] holds the observed statistic, T[2:(B+1)] the permutation values
D <- N2/sqrt(N * (n - N))
T[1] <- sum(D[1:4])                  # only the first k - 1 = 4 categories enter the sum

A random permutation of the data can be obtained by re-creating the original observations that give the observed frequencies f1 and f2. Indeed, we have to obtain all possible configurations of frequencies $\{f_{i1},f_{i2}\}$ that satisfy the row and column totals. To do that, first create the vectors X1 and X2 containing the category of each observation in each sample (for simplicity we indicate the categories by the numbers one to five). Then concatenate X1 and X2 into the vector X and let X.star be a random permutation of X. Create a binary vector of group labels Y. In this example X1 and X2 are vectors of length 40, and X and Y have length 80. X1 and X2 are such that table(X1) = f1 and table(X2) = f2.

Finally, the frequency table corresponding to a random permutation can be obtained by applying the function table to the elements of X.star belonging to the first and second sample, respectively. The permutation values of the test statistic are then obtained as above. Note that this way of proceeding guarantees that the marginal distributions of Table 1 are fixed; therefore we only need to obtain the frequency distribution in the second sample.

X1 <- rep(seq(1, 5), f1)
X2 <- rep(seq(1, 5), f2)
X <- c(X1, X2)
Y <- rep(c(1, 2), c(sum(f1), sum(f2)))
for (bb in 2:(B + 1)) {
    X.star <- sample(X)
    # use factor() so that categories with zero frequency are kept in the table
    f2.star <- table(factor(X.star[Y == 2], levels = x))
    N2.star <- cumsum(f2.star)
    D.star <- N2.star/sqrt(N * (n - N))
    T[bb] <- sum(D.star[1:4])
}
t2p_old(T)[1]

Blood testosterone levels in 11 women

The purpose was to evaluate whether the level of testosterone in the blood changes during the day. The null hypothesis is that there is no change in testosterone across time. This example has the peculiarity that the observations are dependent, since testosterone levels were recorded for the same woman over the course of a day. Therefore, under $H_0$ we can permute each woman's measurements across the times of day, independently for each woman. This problem can be viewed as a two-way ANOVA experiment without interaction, where the main factors are "time" (factor B) and "individual" (factor A, a blocking factor not of interest).

data(testosterone)
Y <- rep(seq(1, 5), each = 11)
Time <- colnames(testosterone)
boxplot(unlist(testosterone) ~ Y, xlab = "Time", ylab = "Testosterone", names = Time)
lines(seq(1, 5), apply(testosterone, 2, mean), lty = "dotted")

The commands above assign the dataset to the object testosterone and display it with a boxplot, the dotted line linking the sample means at each time. The deviance decomposition can be written as $SST = SSA + SSB + SSR$, where $SST$ is the total deviance, $SSA$ and $SSB$ are the deviances due to the main effects, and $SSR$ is the residual deviance. Note that $SST$ is constant at each permutation, and so is $SSA$, since we permute observations within the rows of the data. On the other hand, $SSB$ and $SSR$ vary at each permutation. The test statistic for the time effect is $F_B = (df_{SSR}/df_{SSB}) \times SSB/SSR$. Leaving out the degrees of freedom $df_{SSB}$ and $df_{SSR}$, which are permutationally invariant, the easiest way to obtain the residual deviance at each permutation is to write it as $SSR^* = SST - SSA - SSB^*$. Therefore, the F statistic can be written as $F^* = SSB^*/(SS - SSB^*)$, where $SS = SST - SSA$ is constant at each permutation. It is easy to see that $F^*$ is a monotone increasing function of $T^* = SSB^*$. Thus, $SSB^*$ is the test statistic we will use.
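
Written out with the permutation-invariant pieces made explicit,

$$
F_B^* = \frac{SSB^*/df_{SSB}}{SSR^*/df_{SSR}} = \frac{df_{SSR}}{df_{SSB}} \cdot \frac{SSB^*}{SST - SSA - SSB^*},
$$

and since the degrees of freedom, $SST$ and $SSA$ do not change across within-row permutations, $F_B^*$ increases exactly when $SSB^*$ does.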

TO DO: make this chunk run; the k_sample call below currently errors.

library(reshape2)

m.col <- apply(testosterone, 2, mean)
m <- mean(m.col)
SSB <- sum((m.col - m)^2)
testosterone$woman <- rownames(testosterone)
testosterone <- melt(testosterone)
head(testosterone)
distr <- k_sample(x = testosterone$value, group = testosterone$variable, group2 = testosterone$woman, 
    stat = "twoway_anova")
t2p(SSB, distr, alternative = "greater")
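
The same within-woman permutation scheme can also be written out by hand on the melted data frame, without going through k_sample (an illustrative sketch: it shuffles each woman's five measurements among the time points and recomputes $SSB^*$ each time):

# Manual within-row permutation test for the time effect (illustrative sketch)
nperm <- 1000
grand_mean <- mean(testosterone$value)                     # invariant under within-woman permutations
ssb <- function(values, times) {
    sum((tapply(values, times, mean) - grand_mean)^2)
}
SSB_obs <- ssb(testosterone$value, testosterone$variable)  # matches the SSB computed above

SSB_star <- replicate(nperm, {
    # shuffle each woman's five measurements across the five time points
    v <- ave(testosterone$value, testosterone$woman, FUN = sample)
    ssb(v, testosterone$variable)
})
mean(c(SSB_obs, SSB_star) >= SSB_obs)                      # approximate permutation p-value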

There is strong evidence to reject the null hypothesis that blood testosterone levels stay constant over time, which confirms the pattern observed in the boxplot above.

Biting flies

In this example we have two samples, one from each of two species of biting flies, with seven variables measured on each fly. We want to test the null hypothesis

$$H_0: \mu_{1h} = \mu_{2h} \text{ for all } h = 1,\dots,7$$

against the alternative

$$H_1: \left\lbrace \left[ \cup_{1\leq h \leq 6} (\mu_{1h}<\mu_{2h}) \right] \cup (\mu_{17}>\mu_{27}) \right\rbrace$$

where $\mu_{ih}$ is the mean of the $h$th variable in group $i$, $i=1,2$, $h=1,\dots,7$ (see p. 177 for a detailed explanation). This can be done by performing one partial test for each variable (each against its corresponding one-sided alternative) and then combining the partial tests into a global test. For each variable $h$ we conduct a two-sample test for the location parameter.

The first column of the dataset contains an indicator variable of the species ($0$ for Leptoconops Carteri, $1$ for Leptoconops Torrens).

data(fly)
group0 <- fly[fly$group == 0, ]
group1 <- fly[fly$group == 1, ]
N <- nrow(fly)
p <- ncol(fly)
B <- 1000
distr <- matrix(NA, nrow = B, ncol = p - 1)
partial_pvalue <- rep(NA, p - 1)
for (h in 1:(p - 2)) {
    observed <- mean(group1[, h + 1]) - mean(group0[, h + 1])
    distr[, h] <- two_sample(group1[, h + 1], group0[, h + 1], reps = B)
    partial_pvalue[h] <- t2p(observed, distr[, h], alternative = "greater")
}
observed_col8 <- mean(group1[, 8]) - mean(group0[, 8])
distr[, 7] <- two_sample(group1[, 8], group0[, 8], reps = B)
partial_pvalue[7] <- t2p(observed_col8, distr[, 7], alternative = "less")
partial_pvalue

The vector partial_pvalue contains the partial p-values of each of the $7$ variables. It appears that the first, third, and fifth variables are strongly significant while the seventh is moderately significant. The matrix distr has dimension $B \times 7$ and contains the null distribution of the test statistic for each variable $h$ in its columns. Using the npc function, we combine the partial tests with Fisher’s, Liptak’s, and Tippett’s combining functions.
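
For reference, applied to the partial p-values $\lambda_1,\dots,\lambda_7$, the three combining functions have the standard forms used in the nonparametric combination literature (npc's internal conventions, e.g. finite-sample corrections, may differ slightly):

$$
\psi_{F} = -2\sum_{h=1}^{7}\log\lambda_h, \qquad
\psi_{L} = \sum_{h=1}^{7}\Phi^{-1}(1-\lambda_h), \qquad
\psi_{T} = \max_{1\le h\le 7}(1-\lambda_h),
$$

where $\Phi^{-1}$ is the standard normal quantile function. Large values of each $\psi$ indicate evidence against the global null, and the combined p-value is obtained by comparing the observed $\psi$ with the same combination applied to the permutation distributions stored in distr.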

combined_p_fisher <- npc(partial_pvalue, distr, combine = "fisher", alternatives = c(rep("greater", 
    6), "less"))
combined_p_liptak <- npc(partial_pvalue, distr, combine = "liptak", alternatives = c(rep("greater", 
    6), "less"))
combined_p_tippett <- npc(partial_pvalue, distr, combine = "tippett", alternatives = c(rep("greater", 
    6), "less"))
print(c(fisher = combined_p_fisher, liptak = combined_p_liptak, tippett = combined_p_tippett))

The global test is highly significant.


