Rank correlation takes place at the interface between statistics and combinatorics. Calculating rank correlation coefficients such as Kendall's \(\tau\) and Spearman's \(\rho\) is straightforward and, for small lists, fast; and the distributions for long lists rapidly approach their asymptotics. Kendall's \(\tau\) mimics Pearson's \(r\) by ranging from \(-1\) (perfect negative correlation) through \(0\) (perfect null correlation) to \(+1\) (perfect positive correlation) while remaining nonparametric--unlike \(r\), it makes no assumptions about the underlying distribution of values (if such values even exist) that the rankings reflect. Within the \(\tau\) framework, any of several equivalent statistics may be used to describe the (dis)similarity in two rankings of \(n\) objects, and each is easily recovered from the others: \(P\), the number of concordant pairs, or pairs of objects ranked the same relative to each other in both lists; \(Q\), the number of discordant pairs, so that \(P+Q={n\choose 2}\); \(S=P-Q\); and \(\tau=S/{n\choose 2}\). (Ties are ignored in this discussion.)
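As a concrete illustration of how these statistics interlock, here is a small snippet (my own, not part of the package) that computes each of them for a toy pair of rankings and checks the result against the stats implementation:

# Toy pair of rankings of n = 4 objects
x <- c(1, 2, 3, 4); y <- c(2, 1, 3, 4)
pairs <- combn(length(x), 2)
# a pair is concordant when both rankings order it the same way
P <- sum(apply(pairs, 2, function(ij)
    sign(x[ij[1]] - x[ij[2]]) == sign(y[ij[1]] - y[ij[2]])))
Q <- choose(length(x), 2) - P
S <- P - Q
print(c(P = P, Q = Q, S = S, tau = S / choose(length(x), 2)))
print(cor(x, y, method = "kendall"))  # agrees: 0.6666667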

There are at least two extant and two apparently orphaned implementations of \(\tau\) in R: the kendall option in the cor function (stats package) and the eponymous function in the Kendall package; and cor.fk in the pcaPP projection pursuit package and the kendall method for the rpucor function in the rpud NVIDIA package. While it's easy to compute \(\tau\) values and confidence intervals with these packages, i haven't come across a computational resource for critical values, or quantiles, of \(\tau\). That is, given \(n\) objects and a threshold \(p\), what is the level of concordance (or, equivalently, discordance) between two uniformly random rankings that occurs with probability less than \(p\), and what value \(\tau_{n,p}\) corresponds to this concordance? The normal asymptote \(\tau_{n,p}\approx\frac{\sqrt{2(2n+5)}}{3\sqrt{n(n-1)}}z_p\) (where \(z_p\) satisfies \(P(Z>z_p)=p\) for \(Z\sim N(0,1)\)) offers an approximate solution for large values of \(n\), but there will always be a place for exact answers if they're available.
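For later comparison, the asymptotic approximation is a one-liner; this sketch (my own naming, not part of the package) evaluates it at the values used below:

# Normal-asymptote critical value of tau at one-sided level p
tau_crit_normal <- function(n, p) {
    z_p <- qnorm(p, lower.tail = FALSE)
    sqrt(2 * (2 * n + 5)) / (3 * sqrt(n * (n - 1))) * z_p
}
print(tau_crit_normal(30, .01))  # roughly 0.30 for n = 30, p = .01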

library(tautable)

Since these functions do not accept values of \(\tau\), \(P\), etc. directly, finding a specific critical value with any of these implementations requires gaining control of the number of concordant pairs in the input rankings. Taking the identity permutation \((1,2,\ldots,n)\) as the second ranking, this amounts to controlling the number of inversions \(k\) in a permutation \(\pi\) of \(n\):

# Count inversions in a vector
print(vec_inv)
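The body of vec_inv is not reproduced above; a minimal sketch of an inversion counter (quadratic in \(n\), which is fine at these scales) might read:

# Count pairs (i, j) with i < j but p[i] > p[j]
vec_inv_sketch <- function(p) {
    sum(vapply(seq_along(p), function(j) sum(p[seq_len(j - 1)] > p[j]), numeric(1)))
}
print(vec_inv_sketch(c(3, 1, 2)))  # 2 inversions: (3, 1) and (3, 2)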

Since the identity has zero inversions, \(k\) here plays the role of \(Q\). (And, since \(\tau\) is symmetric, it's enough to consider \(k\) values from \(0\) to \(\left\lceil{n\choose 2}/2\right\rceil\).) To find such a permutation, recall (or look it up for yourself, or just take my word for it) that the number of inversions of \(\pi\) equals the minimal length of a word for \(\pi\) in the adjacent transpositions \(s_i\), and one minimal word for the longest permutation \(\pi_0=(n,n-1,\ldots,1)\) is $$\prod_{j=1}^{n-1}\left(\prod_{i=j}^{1}s_i\right)=(s_1)(s_2s_1)\cdots(s_{n-1}s_{n-2}\cdots s_1)\text{.}$$ Therefore, a (highly non-arbitrary) permutation with \(k\) inversions obtains as the product of the first (rightmost) \(k\) generators in this word. Here is an implementation in R (with an option to leverage the symmetry for a bit of efficiency):

# Generate a permutation of 1:n having a given number k of inversions
print(inv_vec)
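Since the source of inv_vec isn't echoed here either, the following sketch implements the reduced-word construction directly, applying the rightmost \(k\) generators of the word above to the identity as adjacent position swaps (the symmetry option is omitted for brevity):

# Generate a permutation of 1:n with exactly k inversions
inv_vec_sketch <- function(n, k) {
    stopifnot(k >= 0, k <= choose(n, 2))
    # the word (s_1)(s_2 s_1)...(s_{n-1} ... s_1) for the longest permutation
    word <- unlist(lapply(seq_len(n - 1), function(j) j:1))
    p <- seq_len(n)
    # a suffix of a reduced word is reduced, so each swap adds one inversion
    if (k > 0) for (i in word[seq(length(word) - k + 1, length(word))]) {
        p[c(i, i + 1)] <- p[c(i + 1, i)]
    }
    p
}
print(inv_vec_sketch(5, 4))
print(vec_inv_sketch(inv_vec_sketch(5, 4)))  # confirms 4 inversions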

From there it is quick work to whittle down the options. For example, if i want to know the "value to beat" for a type-I error (\(p<.01\)) on a one-sided test for positive correlation between two rankings of \(n=30\) objects using the Kendall package, i can run the following:

n <- 30; p <- .01
# range of relevant k (at most half of all possible discordant pairs)
k_ran <- c(0, floor(choose(n, 2) / 2))
# corresponding one-sided p-values
p_ran <- sapply(k_ran,
                function(k) Kendall::Kendall(x = 1:n, y = inv_vec(n, k))$sl / 2)
# interpolate (linearly)
repeat {
    if(p <= p_ran[1]) {
        k <- NA; break
    }
    if(p > p_ran[2]) {
        k <- k_ran[2]; break
    }
    q <- (p - p_ran[1]) / diff(p_ran)
    k <- round(q * diff(k_ran) + k_ran[1])
    if(k == k_ran[1]) k <- k + 1 else if(k == k_ran[2]) k <- k - 1
    k_p <- Kendall::Kendall(x = 1:n, y = inv_vec(n, k))$sl / 2
    if(k_p == p) break else if(k_p < p) {
        k_ran[1] <- k; p_ran[1] <- k_p
    } else if(k_p > p) {
        k_ran[2] <- k; p_ran[2] <- k_p
    }
    if(diff(k_ran) < 2) {
        k <- k_ran[1]; break
    }
}
tau_K  <- round((choose(n, 2) - 2 * k) / choose(n, 2), 4)
print(paste("need tau >", tau_K))
The same search, with stats::cor.test in place of Kendall::Kendall, runs as follows:

n <- 30; p <- .01
# range of relevant k (at most half of all possible discordant pairs)
k_ran <- c(0, floor(choose(n, 2) / 2))
# corresponding one-sided p-values
p_ran <- sapply(k_ran, function(k) cor.test(x = 1:n, y = inv_vec(n, k),
                                            alternative = "t",
                                            method = "kendall")$p.value / 2)
# interpolate (linearly)
repeat {
    if(p <= p_ran[1]) {
        k <- NA; break
    }
    if(p > p_ran[2]) {
        k <- k_ran[2]; break
    }
    q <- (p - p_ran[1]) / diff(p_ran)
    k <- round(q * diff(k_ran) + k_ran[1])
    if(k == k_ran[1]) k <- k + 1 else if(k == k_ran[2]) k <- k - 1
    k_p <- cor.test(x = 1:n, y = inv_vec(n, k),
                    alternative = "t", method = "kendall")$p.value / 2
    if(k_p == p) break else if(k_p < p) {
        k_ran[1] <- k; p_ran[1] <- k_p
    } else if(k_p > p) {
        k_ran[2] <- k; p_ran[2] <- k_p
    }
    if(diff(k_ran) < 2) {
        k <- k_ran[1]; break
    }
}
tau_s  <- round((choose(n, 2) - 2 * k) / choose(n, 2), 4)
print(paste("need tau >", tau_s))

This result disagrees slightly with the value 0.301 in Peter M. Lee's table of critical values, possibly due to error in Kendall, which cites the algorithm of Best and Gipps (1974). The same procedure with Kendall(...)$sl exchanged for cor.test(..., alternative = "t", method = "kendall")$p.value, which is exact for small \(n\) and computed from the asymptotic \(Z\)-score otherwise, yields the critical value tau_s printed above.

All this, however, makes for a very inefficient means to generate a table of critical values of \(\tau\). For that, rather than constructing a permutation of a desired length having a desired number of inversions, then iterating this construction over many lengths and inversion counts, it makes sense, provided it is possible, both

a. to produce the distribution of the inversion counts \((0,1,\ldots,{n\choose 2})\) for each length \(n\); and
b. to compute critical values along the (very probably) recursive process that yields these distributions, rather than producing each distribution from scratch.

Conveniently, Shashank at StackOverflow has already made the key observation that these distributions are given by the Mahonian numbers \(T_{n,k}\), whose generating functions take the elegant form $$\prod_{j=0}^{n-1}\left(\sum_{i=0}^{j}x^i\right)=\sum_{k=0}^{{n\choose 2}}T_{n,k}x^k\text{,}$$ which implies that they satisfy the simple recursive formula \(T_{n,k}=\sum_{i=0}^{\min(k,\,n-1)}T_{n-1,\,k-i}\):

# Compute the Mahonian numbers
print(mahonians_recursive)
print(paste("Example: the Mahonian numbers for n = 5 are",
            toString(mahonians_recursive(5))))
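The printed source is omitted above; a recursive sketch matching the assumed calling convention (a vector of \(T_{n,0},\ldots,T_{n,{n\choose 2}}\)) could look like this:

# Mahonian numbers by convolving each row with 1 + x + ... + x^(n-1);
# exact in double precision only up to moderate n, since T(n, k) grows like n!
mahonians_sketch <- function(n) {
    if (n == 1) return(1)
    prev <- mahonians_sketch(n - 1)
    sapply(0:choose(n, 2), function(k) {
        i <- max(0, k - n + 1):min(k, choose(n - 1, 2))
        sum(prev[i + 1])
    })
}
print(mahonians_sketch(5))  # 1 4 9 15 20 22 20 15 9 4 1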

This formula underlies the construction of a critical values table for \(\tau\) with whatever maximum length \(n\) and suite of \(p\)-value thresholds one desires; R-level recursion slows the computation down dramatically, though, so the implementation uses an internal loop instead. (The code below produces a table of critical values of \(k\), from which those of \(\tau\) can be recovered by arithmetic.)

# Rapid table of critical values (n a vector, alpha a vector)
print(tau_crit_table)
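Again the printed source is omitted; the following simplified sketch conveys the idea. It carries a normalized distribution of inversion counts through the Mahonian recursion and reads off, at each requested length, the largest \(k\) whose one-sided tail probability stays at or below each threshold. The argument names (n, alpha, stat, incl.len) mirror the calls below, but the body is my own reconstruction:

# Sketch: table of critical inversion counts (stat = "inv") or tau values
tau_crit_sketch <- function(n, alpha, stat = "tau", incl.len = FALSE) {
    dist <- 1  # P(K = 0) = 1 when there is a single object
    rows <- list()
    for (j in 2:max(n)) {
        # Mahonian step: convolve with the uniform distribution on 0:(j - 1),
        # normalized to probabilities so the entries never overflow
        new <- numeric(length(dist) + j - 1)
        for (i in 0:(j - 1)) {
            idx <- seq_along(dist) + i
            new[idx] <- new[idx] + dist / j
        }
        dist <- new
        if (j %in% n) {
            cum <- cumsum(dist)
            ks <- sapply(alpha, function(a) {
                k <- sum(cum <= a) - 1  # largest k with P(K <= k) <= a
                if (k < 0) NA_integer_ else k
            })
            vals <- if (stat == "inv") ks else
                round((choose(j, 2) - 2 * ks) / choose(j, 2), 4)
            rows[[as.character(j)]] <- if (incl.len) c(j, vals) else vals
        }
    }
    tab <- do.call(rbind, rows)
    colnames(tab) <- c(if (incl.len) "n", paste0("alpha=", alpha))
    tab
}
print(tau_crit_sketch(c(10, 30), alpha = c(.05, .01)))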

As an example, here are the critical values (of \(k\) and of \(\tau\)) for lengths at intervals of 10 up to 100 and type-I error thresholds at negative powers of 10:

# Ranges of n and of alpha
example_n <- seq(10, 100, 10); example_alpha <- 10 ^ (-1:-4)
# Critical value table for k
inv_crit_vals <- tau_crit_table(n = example_n,
                                alpha = example_alpha,
                                incl.len = TRUE, stat = "inv")
print(inv_crit_vals)
# Critical value table for tau
tau_crit_vals <- tau_crit_table(n = example_n,
                                alpha = example_alpha,
                                incl.len = TRUE)
print(tau_crit_vals)

The agreement of several rows with their analogs in Lee's table validates the procedure; other tables exist for the other equivalent statistics i mentioned above, e.g. the number of concordant pairs and the numerator \(S\), which the interested reader can check for themselves. Most of these functions, and a few more, are available in this package, in accordance with the Leek Group's advice.

As a final test, here are the runtimes for tables up to maximum \(n\) values of 50 through 200 at intervals of 50, which hint at the slow combinatorial explosion that eventually undermines the usefulness of the method, though far beyond the point at which the normal approximation becomes adequate:

mbm <- microbenchmark::microbenchmark(
    `50` = tau_crit_table(50L, alpha = 10^(-1:-3)),
    `100` = tau_crit_table(100L, alpha = 10^(-1:-3)),
    `150` = tau_crit_table(150L, alpha = 10^(-1:-3)),
    `200` = tau_crit_table(200L, alpha = 10^(-1:-3)),
    times = 5L
)
plot(x = mbm$expr, xlab = "Maximum number of objects",
     y = log(mbm$time / 1e6), ylab = "Log-runtime (ms)")

