mhg_test: Test for enrichment in a ranked binary list.
In slowkow/mhg: Non-parametric rank enrichment test for binary data.


View source: R/RcppExports.R

Given a ranked binary list of ones and zeros, test if the ones are enriched at the beginning of the list.

mhg_test(x, N, K, L, X, upper_bound = FALSE, tol = 1e-16)

`x`	Binary vector of ones and zeros.
`N`	Size of the population.
`K`	Number of successes in the population.
`L`	Only consider scores for the first L observations.
`X`	Require at least X ones to get a score less than 1.
`upper_bound`	Instead of running a dynamic programming algorithm, return the upper bound for the p-value.
`tol`	The tolerance for testing equality of two numbers.

Suppose we have a set of N = 5000 genes and K = 100 of them are annotated with a Gene Ontology (GO) term. Further, suppose that we find some subset of these genes to be significantly differentially expressed (DE) between two conditions. We notice that k = 15 of the DE genes are annotated with the GO term. At this point, we would like to know whether the GO term is enriched among the DE genes.

We use the hypergeometric distribution to compute a probability that we would observe a given number of DE genes annotated with a GO term. You can find more details in the documentation for dhyper.
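
As a rough illustration of the paragraph above, the tail probability can be computed with phyper from base R. The example does not state how many genes are DE, so n_de = 300 below is an assumed, hypothetical value chosen only to make the sketch runnable.

```r
# Values taken from the example above.
N <- 5000L   # genes in the population
K <- 100L    # genes annotated with the GO term
k <- 15L     # DE genes annotated with the GO term

# ASSUMPTION: the number of DE genes is not given in the text.
n_de <- 300L

# P(at least k annotated genes among n_de genes drawn without replacement).
# phyper(q, m, n, k) takes: q = quantile, m = successes in the population,
# n = failures in the population, k = number of draws.
p_enrich <- phyper(k - 1L, K, N - K, n_de, lower.tail = FALSE)
p_enrich
```

Under these assumed numbers, about 300 * 100 / 5000 = 6 annotated genes would be expected by chance, so observing 15 gives a small tail probability.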

The method consists of three steps:

1. Compute a hypergeometric probability at each rank in the list.
2. Choose the minimum hypergeometric probability (mHG) as the test statistic.
3. Use dynamic programming to compute the exact permutation p-value for observing a test statistic at least as extreme by chance.
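
The first two steps can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: mhg_stat_sketch is a hypothetical helper, it ignores the L and X arguments, and it does not perform the dynamic programming step, so it returns only the statistic and the rank where the statistic is attained.

```r
# Hypothetical helper (not part of the mhg package): steps 1 and 2 only.
mhg_stat_sketch <- function(x) {
  N <- length(x)      # size of the population
  K <- sum(x)         # number of ones in the whole list
  n <- seq_len(N)     # candidate cutoffs, one per rank
  k <- cumsum(x)      # ones seen within the first n positions

  # Step 1: hypergeometric tail probability P(at least k ones in the top n).
  p <- phyper(k - 1, K, N - K, n, lower.tail = FALSE)

  # Step 2: the minimum over all cutoffs is the test statistic.
  i <- which.min(p)
  list(threshold = i, mHG = p[i])
}

# Toy list with all three ones at the top.
toy <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
res_sketch <- mhg_stat_sketch(toy)
```

Because the minimum is taken over many cutoffs, the mHG statistic is not itself a p-value. That is why the third step is needed to obtain the p-value of the statistic.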

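A list with items "threshold" (the rank cutoff at which the minimum is attained), "mHG" (the test statistic), and "pvalue" (the p-value of the test statistic).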

Eden, E., Lipson, D., Yogev, S. & Yakhini, Z. Discovering motifs in ranked lists of DNA sequences. PLoS Comput. Biol. 3, e39 (2007). http://dx.doi.org/10.1371/journal.pcbi.0030039

Wagner, F. GO-PCA: An Unsupervised Method to Explore Biological Heterogeneity Based on Gene Expression and Prior Knowledge. bioRxiv (2015). http://dx.doi.org/10.1101/018705

plot_mhg

# Size of the population.
N <- 5000L
# Successes in the population.
K <- 100L
# Only consider enrichments in the first L observations.
L <- N / 4L
# Require at least X successes in the first L observations.
X <- 5L

set.seed(42)

# Binary vector of successes and failures.
x <- rep(0, N)
x[sample(100, 5)] <- 1
x[sample(200, 10)] <- 1

res <- mhg_test(x, N, K, L, X)

abs(res$pvalue - 1.810658e-05) < 1e-6 # TRUE

# Plot the result.
plot_mhg(sort(rnorm(N)), x, res, L)