The Page test is not a trend test
In cultevo: Tools, Measures and Statistical Tests for Cultural Evolution

The Page test is a non-parametric test for monotonically ordered differences in ranks. It can be used to assess the statistical evidence for an increase in the ordinal ranks between $k$ 'treatments' (conditions or generations), based on $N$ independent replications for each treatment. The ordering of the treatments $1 \ldots k$ along which we expect the monotonic effect has to be specified a priori.

Definition

Let $m_i$ be the mean ordinal rank of the measure of interest obtained for treatment $i$. The null hypothesis of the test (the same as for many other tests, e.g. Friedman's) is $$ m_1 = m_2 = \ldots = m_k $$ i.e. there is no difference between the expected ranks for the $k$ conditions. For some reason, the original formulation of the alternative hypothesis given in @Page1963 is $$m_1 > m_2 > \ldots > m_k.$$

Later papers and textbook entries correctly point out that the alternative hypothesis is actually $$m_1 \le m_2 \le \ldots \le m_k$$ where at least one of the inequalities has to be strict [@Siegel1988; @Hollander1999, p.284; @VanDeWiel2001, p.143]. This means that strong evidence for just a single step-wise change in the mean rank, e.g. $$m_1 < m_2 = \ldots = m_k$$ can be sufficient for the test to reject the null hypothesis.
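
For reference, the statistic behind the test is Page's $L = \sum_{j=1}^{k} j R_j$, where $R_j$ is the sum, over the $N$ replications, of the within-replication ranks observed for treatment $j$. The following base R sketch computes $L$ together with its textbook large-sample normal approximation. `page_L` is a hypothetical helper written for this illustration and is not part of cultevo; the package's `page.test` may well arrive at its p value differently, and the variance formula assumes untied ranks.

```r
# Hypothetical helper (not part of cultevo): Page's L with the textbook
# normal approximation, for an N x k matrix with one replication per row.
page_L <- function(data) {
  ranks <- t(apply(data, 1, rank))   # rank within each replication (row)
  N <- nrow(ranks)
  k <- ncol(ranks)
  L <- sum(seq_len(k) * colSums(ranks))
  mu <- N * k * (k + 1)^2 / 4                     # E[L] under the null
  sigma2 <- N * k^2 * (k + 1)^2 * (k - 1) / 144   # Var[L] under the null
  z <- (L - mu) / sqrt(sigma2)
  list(L = L, z = z, p.approx = pnorm(z, lower.tail = FALSE))
}

# four replications that all follow the predicted order 1..10 exactly
perfect <- matrix(rep(1:10, each = 4), nrow = 4)
page_L(perfect)$L   # 1540, the largest value possible for N = 4, k = 10
```

Because each rank is weighted by its position in the predicted order, $L$ grows with any positive association between position and rank, whether that association comes from a gradual trend or from a single jump.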

What the Page test is not

As the alternative hypothesis shows, the Page test is not a 'trend test' in any meaningful way, since it does not test for successive or cumulative changes in ranks. (Note how the original paper speaks of $k$ treatments rather than generations, i.e. the Page test was not designed for what are essentially dependent measures.)

It also cannot show whether differences between the conditions/generations are significant, since it is a non-parametric test that only considers ranks, not absolute changes in the underlying measure. These points will be demonstrated using some semi-randomly generated data sets.

Mock dataset demonstration

To test the sensitivity of the test to a single step-wise difference across conditions we can take a typical sample set of $N=4$ replications with $k=10$ levels each and fix the very first position to always be ranked first, with all successive ranks being randomly shuffled. This is equivalent to the first generation doing badly at a task, with all successive generations outperforming the first one, but no cumulative improvement between them.

# make results reproducible
set.seed(1000)

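# shuffle each argument independently and concatenate the results, so the
# ranks given in one argument stay within that block of positions
# (length-1 arguments are passed through, since sample(n) would draw from 1:n)
pseudorandomranks <- function(...)
  unlist(lapply(list(...), function(p) if (length(p) > 1) sample(p) else p))

# rank 1 always comes first, ranks 2-10 follow in random order
lowestthenrandom <- function()
  pseudorandomranks(1, 2:10)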

# example ordering
t(replicate(4, lowestthenrandom()))

Given this semi-random data generation function, we can now create a large number of data sets and compute their expected distribution of significance levels according to the Page test, and see how it varies based on the number of replications $N$.

library(cultevo)

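# run page.test() on nrepetitions datasets produced by datafun and return the
# proportion of datasets falling into each significance band
sampleLs <- function(datafun, nrepetitions) {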
  ps <- list("0.001" = 0, "0.01" = 0, "0.05" = 0, "NS" = 0)
  for (i in seq(nrepetitions)) {
    p <- page.test(datafun(), verbose=FALSE)$p.value
    if (p <= 0.001) {
      p <- "0.001"
    } else if (p <= 0.01) {
      p <- "0.01"
    } else if (p <= 0.05) {
      p <- "0.05"
    } else {
      p <- "NS"
    }
    ps[[p]] <- ps[[p]] + 1
  }
  unlist(ps) / nrepetitions
}

# apply testfun to nrepetitions datasets, each an N x k matrix with one
# replication per row (N = number of replications)
sampleps <- function(testfun, N, datafun, nrepetitions=1000)
  testfun(function() t(replicate(N, datafun())), nrepetitions)

Generating 1000 datasets like the one above -- which really only exhibit a single point change in the distribution of mean ranks -- we get a significant result about half of the time:

sampleps(sampleLs, N=4, lowestthenrandom)

Increasing the number of replications to $N=10$, still only assuming that the first generation performs differently from all the other ones:

sampleps(sampleLs, N=10, lowestthenrandom)
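
A back-of-the-envelope calculation suggests why the proportions come out this way. The sketch below assumes the textbook normal approximation of Page's $L = \sum_j j R_j$ (null mean $Nk(k+1)^2/4$, null variance $Nk^2(k+1)^2(k-1)/144$) and plugs in the expected ranks of the generator above: rank 1 in the first position and a mean rank of 6 in each of the remaining nine. `expected_z` is a hypothetical helper, not a cultevo function, and it ignores that the spread of $L$ under this generator is somewhat smaller than under the null.

```r
# Hypothetical helper: approximate z score of Page's L when position j has
# expected rank mean_ranks[j], using the null mean and variance of L.
expected_z <- function(mean_ranks, N) {
  k <- length(mean_ranks)
  expected_L <- N * sum(seq_len(k) * mean_ranks)
  null_mean <- N * k * (k + 1)^2 / 4
  null_sd <- sqrt(N * k^2 * (k + 1)^2 * (k - 1) / 144)
  (expected_L - null_mean) / null_sd
}

# first position always ranked 1, ranks 2..10 shuffled (mean rank 6)
step_profile <- c(1, rep(6, 9))
expected_z(step_profile, N = 4)    # about 1.64
expected_z(step_profile, N = 10)   # about 2.59
qnorm(0.95)                        # one-sided 5% critical value, about 1.645
```

With $N=4$ the expected z score sits almost exactly on the one-sided 5% threshold, which is consistent with roughly half of the simulated datasets coming out significant. Since the expected z score grows with $\sqrt{N}$, adding replications pushes the same single step well past the threshold.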

The influence is even stronger when the single change point occurs closer to the middle of the sequence of conditions. Based on 1000 randomly generated datasets of just 4 replications each, where ranks 1-2 are always shuffled across the first two positions and ranks 3-10 are shuffled across the remaining positions, we obtain the following distribution of p values:

sampleps(sampleLs, N=4, function() pseudorandomranks(1:2, 3:10))

The test is so sensitive to any (even just single-point) evidence for a change in the a priori suspected direction that it is largely unaffected by evidence for a consistent trend in the opposite direction, as can be seen in the following datasets. In the first one, more than half of the differences between successive ranks indicate a downward trend:

# abrupt upwards jump after the first three observations, but the
# remaining seven observations exhibit a consistent downwards trend
upwardsjumpdownwardstrend <- function() pseudorandomranks(1:3, 10, 9, 8, 7, 6, 5, 4)
sampleps(sampleLs, N=4, upwardsjumpdownwardstrend)

# start around the middle, then sudden downward followed by extreme upwards jump
updownup <- function() pseudorandomranks(3:4, 5:6, 1:2, 7:8)
sampleps(sampleLs, N=4, updownup)
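
One way to see why the downward run carries so little weight: for untied ranks $r_j$, each replication's contribution to Page's $L$ is a linear function of the Spearman correlation between the predicted order and the observed ranks, since $\sum_j j\,r_j = k(k+1)(2k+1)/6 - \frac{1}{2}\sum_j (j - r_j)^2$. The base R sketch below checks this for one possible row of the first generator; the row is picked by hand for illustration and the variable names are arbitrary.

```r
# One possible row: ranks 1-3 first, then 10 down to 4
r <- c(1, 2, 3, 10, 9, 8, 7, 6, 5, 4)
k <- length(r)

# six of the nine successive differences point downwards ...
sum(diff(r) < 0)                                   # 6

# ... yet the rank correlation with the predicted order 1..k is positive
rho <- cor(seq_len(k), r, method = "spearman")

# this row's contribution to L, computed directly and via Spearman's rho
L_row <- sum(seq_len(k) * r)                       # 329
L_from_rho <- k * (k + 1) * (2 * k + 1) / 6 - k * (k^2 - 1) * (1 - rho) / 12
all.equal(L_row, L_from_rho)
```

A row like this contributes 329 to $L$ against a null expectation of $k(k+1)^2/4 = 302.5$: the single large jump between positions 3 and 4 outweighs the six small steps down that follow it.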

Alternatives to the Page test

It's not 1963 anymore, so everybody has a computer, and probably some more concrete expectations about the development of their (presumably continuous) measure of interest. Will its value rise indefinitely across conditions/generations, or is there a ceiling where it will level out? Do you have an idea of the value at which it will level out? Will it rise linearly between conditions until it hits its maximum? Logarithmically? Exponentially? All of these are specific hypotheses corresponding to specific models that can be fit and then compared based on your data [@Winter2016].
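
As a minimal sketch of what such a comparison could look like (this is not the analysis proposed in @Winter2016, and the data are simulated purely for illustration), the following base R code fits three candidate shapes to a measure that jumps once after the first generation and then stays flat, and compares them by AIC. The variable names, the noise level and the choice of candidate models are all assumptions. The sketch also ignores the dependence between generations of the same chain, which a real analysis would need to model.

```r
set.seed(1)

# Hypothetical data: 4 chains x 10 generations, where the measure jumps
# once after the first generation and then stays flat
d <- data.frame(generation = rep(1:10, times = 4))
d$y <- ifelse(d$generation > 1, 1, 0) + rnorm(nrow(d), sd = 0.2)

# three candidate shapes for the development of the measure
models <- list(
  linear = lm(y ~ generation, data = d),
  logarithmic = lm(y ~ log(generation), data = d),
  step = lm(y ~ I(generation > 1), data = d)
)

# lower AIC indicates a better trade-off between fit and complexity
aics <- sapply(models, AIC)
sort(aics)
```

Unlike a single p value from a rank test, this kind of comparison says which shape of development the data support best, here separating a one-off step from a gradual increase.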

If you are simply looking for other non-parametric tests for sequential (or otherwise temporally dependent) data, the seasonal Kendall test [@Hirsch1982;@Gilbert1987;@Gibbons2009] takes seasonal effects on environmental measurements into account by computing the Mann-Kendall test on each of $k$ seasons/months separately, and then combining the individual test results. Since the order of the individual seasons is not actually taken into account (this is only done in a later version of the test, @Hirsch1984), the test is essentially a within-subject version that combines the results of $k$ independent Mann-Kendall tests into one to increase the statistical power [@Gibbons2009, p.211]. The test has in fact already been used to test for trends across different geographic sample locations rather than seasons [@Helsel2006]. The alternative hypothesis of the seasonal Kendall test is "a monotone trend in one or more seasons" [@Hirsch1984, p.728].
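
The combination step can be sketched in a few lines of base R. This is a simplified illustration using the textbook formulas, assuming no tied values and independence between seasons; `mann_kendall_S` and `seasonal_kendall` are hypothetical helpers written for this example rather than functions from any package, and the input data are made up.

```r
# Mann-Kendall S statistic and its null variance (no ties assumed)
mann_kendall_S <- function(x) {
  n <- length(x)
  d <- outer(x, x, "-")            # d[i, j] = x[i] - x[j]
  S <- sum(sign(d[lower.tri(d)]))  # pairs with i > j: later minus earlier
  c(S = S, var = n * (n - 1) * (2 * n + 5) / 18)
}

# Hypothetical helper: x is a years x seasons matrix; seasons are treated
# as independent, so their S statistics and variances are simply summed
seasonal_kendall <- function(x) {
  per_season <- apply(x, 2, mann_kendall_S)
  S <- sum(per_season["S", ])
  v <- sum(per_season["var", ])
  z <- (S - sign(S)) / sqrt(v)     # continuity correction
  c(S = S, var = v, z = z, p.two.sided = 2 * pnorm(-abs(z)))
}

# 5 years x 3 seasons, each season increasing from year to year
x <- cbind(1:5, c(2, 4, 6, 8, 10), c(0.1, 0.2, 0.4, 0.8, 1.6))
seasonal_kendall(x)
```

Each season only ever compares its own observations across years, which is why the order of the seasons relative to one another plays no role in the combined statistic.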

Citation

This tutorial can be cited as:

@techreport{Stadler2017,
author = {Stadler, Kevin},
title = {{The Page test is not a trend test}},
url = {https://kevinstadler.github.io/cultevo/articles/page.test.html},
year = {2017}
}