In slowkow/mhg: Non-parametric rank enrichment test for binary data.

Abstract

The mhg package implements an enrichment test for a ranked binary list. Given a ranked binary list of ones and zeros, this package enables you to test if the ones are enriched at the beginning of the list. The method uses a dynamic programming algorithm to obtain an exact permutation p-value.

Simulation

Suppose we have a set of $N = 5000$ genes and $K = 100$ of them are annotated with a Gene Ontology (GO) term. Further, suppose that we find some subset of these genes to be significantly differentially expressed (DE) between two conditions. Within the DE genes, we notice that $k = 15$ of the DE genes are annotated with the Gene Ontology term. At this point, we would like to know if the GO term is enriched for DE genes.

A common strategy is to use the hypergeometric distribution to compute a probability that we would observe a given number of DE genes annotated with a GO term. Suppose we only look at the top $n$ genes with the greatest fold-change. Then, the probability to observe exactly $k$ of the $n$ genes to be annotated with a GO term is:

$$\text{Prob}(X = k) = \dfrac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}}$$

The intuition is commonly described in terms of an urn with marbles. Suppose there are $N$ marbles, some of which are green and the rest are red. We know that $K$ of them are green and $N - K$ are red. We draw $n$ marbles from the urn at random and observe that $k$ of them are green and $n - k$ are red. The probability to observe exactly $k$ green marbles when we draw $n$ marbles is the hypergeometric probability. This is also equivalent to the one-tailed version of Fisher's exact test [@Rivals2007-vx].

Algorithm

The minimum hypergeometric (mHG) algorithm consists of three steps:

Compute a hypergeometric probability at each rank in the list.
Choose the minimum hypergeometric probability (mHG) as the test statistic.
Use dynamic programming to compute the exact permutation p-value for observing a test statistic at least as extreme by chance.

Eden et al. described the original mHG algorithm in the context of transcription factor binding motif enrichments [@Eden2007-af]. The interested reader should go to the methods section for more details and mathematical proofs.

Wagner extended the algorithm to be more robust by using two extra parameters $X$ and $L$ that limit the number of tests [@Wagner2015-ub]. The $X$ paramter requires at least $X$ ones to be present in the beginning of the list. The $L$ parameter limits testing to the first $L$ ranks $(L < N)$. The interested reader should go to the methods section for more details.

This R package implements the extended algorithm, which Wagner calls "XL-mHG". A Cython implementation of the extended algorithm is described in [@Wagner2015-ub] and is available here.

Example

Run the enrichment test like this:

library(mhg)

# Size of the population.
N <- 5000
# Successes in the population.
K <- 100
# Only consider enrichments in the first L observations.
L <- N / 4
# Require at least X successes in the first L observations.
X <- 5
# Define 15 items in the population as successes.
set.seed(42)
x <- rep(0, N)
x[sample(100, 5)] <- 1
x[sample(200, 10)] <- 1

# Test for enrichment.
res <- mhg_test(x, N, K, L, X)

The results are stored in a list called res with three elements:

names(res)

res$threshold

res$mhg[1:30]

res$pvalue

Plot the results like this:

# Simulate a fold-change for the plot.
fc <- sort(rnorm(N, 0, 1))

# This is how you can plot the results.
plot_mhg(
  values = fc,
  x = x,
  res = res,
  n = L,
  main = "GO:123",
  value = bquote("log"[2] ~ "FC")
)

The top panel shows the enrichment score. The y-value is the negative log10 of the hypergeometric probability to observe $k$ successes after $n$ trials. $n$ is increased by one each time we step along the x-axis and $k$ is increased when we encounter a new success. The red dotted line shows where we find the minimum hypergeometric probability (mHG).

The middle panel shows which of the ranked items in the tested list are successes (blue) or failures (white). In this case, a gene is a success if it is annotated with a GO term and significantly differentially expressed.

The bottom panel shows the values used to rank the items in the tested list in decreasing order. In this case, we found that some of the most down-regulated genes are enriched for the GO term. If we wish to test the most up-regulated genes instead, then we can reverse the list of genes by sorting in decreasing order, and then repeat the test. If we wish to test genes with large fold-change in either direction, then we can take the absolute value of log fold-change, rank the genes in decreasing order, and then repeat the test.