laet: A Length-Aware Enrichment Test
In dmanescu/enrichment-test:


This function implements the symmetric length-aware enrichment test, a generalisation of Fisher's exact test. Fisher's exact test uses the hypergeometric distribution to describe the size of the overlap between two subsets of N objects: the m objects that have property X and the k objects that have property Y, where every object is equally likely to have each property. In the symmetric, length-aware generalisation, the N objects have independent but not identical probabilities of having properties X and Y. This function also implements an asymmetric generalisation of Fisher's exact test, which may be more appropriate in some circumstances. It assumes that it is known which m of the N objects have property X, while each object has its own probability of having property Y.
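For orientation, the classical special case can be reproduced with base R alone. The sketch below is not part of this package; it only illustrates the hypergeometric overlap distribution that Fisher's exact test relies on, using illustrative values N = 8, m = 3 and k = 5.

```r
# Classical Fisher setting: N objects, m with property X, k with property Y,
# and every object equally likely to carry each property.
N <- 8
m <- 3
k <- 5

overlap <- 0:min(m, k)

# Probability of each possible overlap size (hypergeometric pmf).
pmf <- dhyper(overlap, m, N - m, k)

# One-sided p-values: probability of an overlap at least as large as each value.
p_gt <- phyper(overlap - 1, m, N - m, k, lower.tail = FALSE)
```

The length-aware tests replace this single shared probability with per-object probabilities, so the hypergeometric formula no longer applies directly.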

laet(observed = NULL, m = NULL, k = NULL, x_probs = NULL,
  y_probs = NULL, test = c("symmetric", "x-cond", "y-cond"),
  side = c("two-sided", "gt", "lt"), kind = c("cdf", "pmf"),
  method = c("exact", "MC", "fast_normal", "normal", "binom", "saddlepoint"),
  MC.iterations = NULL)

`observed`	observed value or vector of observed values
`m`	number of the N objects which have property X
`k`	number of the N objects which have property Y
`x_probs`	vector of length N, holding the probabilities that each object has property X
`y_probs`	vector of length N, holding the probabilities that each object has property Y
`test`	one of "symmetric", "x-cond" (in which case the only valid values in x_probs are 1 and 0) and "y-cond" (in which case the only valid values in y_probs are 1 and 0)
`side`	one of "gt", "lt" and "two-sided" (in which case the p-value is defined as the probability of an outcome at least as unlikely as the observed one)
`kind`	one of "cdf" and "pmf" (parameter side is optional if "pmf" is given)
`method`	method(s)/approximation(s) to use. Valid methods are MC, fast_normal, normal, binom, exact and saddlepoint. Only exact and saddlepoint are available for asymmetric tests.
`MC.iterations`	(optional) number of iterations to use if Monte Carlo simulation is selected

The binomial method (method="binom") calculates the distribution of the overlap given only the marginal probabilities x_probs and y_probs. In this situation the probability that each of the N objects has both properties X and Y is given by x_probs * y_probs, resulting in a simple binomial distribution. The exact method (method="exact") uses dynamic programming to calculate the distribution of the overlap exactly, conditioned on the totals m and k, and is fairly computationally intensive. The normal method (method="normal") calculates the first and second moments of the same distribution and uses them to fit a normal distribution. The moments are derived using FFTs, and the method offers neither a significant speedup nor particularly good accuracy. The faster normal method (method="fast_normal") approximates the moments instead, giving a more meaningful speedup in exchange for only a small loss of accuracy relative to the normal method, which derives the moments exactly. A Monte Carlo method (method="MC") is available and provides a good balance between running time and accuracy, but it cannot be used if full precision is required or if the p-values in question are too small. Finally, a saddlepoint approximation (method="saddlepoint") is available. Its implementation is considerably more involved than the normal approximation, but it gives much greater accuracy, even down to the lowest p-values, and its running time is excellent.
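To make the idea behind the exact method concrete, here is a minimal sketch of a dynamic program over the N objects that tracks how many objects so far have X, have Y, and have both. It is an illustration under stated assumptions, not the package's implementation: the function name `overlap_pmf_symmetric` is hypothetical, and the package may organise the recursion quite differently.

```r
# Hypothetical helper (not part of the package): exact pmf of the overlap,
# conditioned on exactly m objects having X and exactly k having Y.
overlap_pmf_symmetric <- function(m, k, x_probs, y_probs) {
  stopifnot(length(x_probs) == length(y_probs))
  cmax <- min(m, k)
  # state[a + 1, b + 1, c + 1] = P(a objects with X, b with Y, c with both)
  # after processing the objects seen so far; larger counts are discarded
  # because they can never return to the targets m and k.
  state <- array(0, dim = c(m + 1, k + 1, cmax + 1))
  state[1, 1, 1] <- 1
  a <- seq_len(m)
  b <- seq_len(k)
  cc <- seq_len(cmax)
  for (i in seq_along(x_probs)) {
    x <- x_probs[i]
    y <- y_probs[i]
    nxt <- state * (1 - x) * (1 - y)                                  # neither
    nxt[a + 1, , ] <- nxt[a + 1, , ] + state[a, , ] * x * (1 - y)     # X only
    nxt[, b + 1, ] <- nxt[, b + 1, ] + state[, b, ] * (1 - x) * y     # Y only
    nxt[a + 1, b + 1, cc + 1] <- nxt[a + 1, b + 1, cc + 1] +
      state[a, b, cc] * x * y                                         # both
    state <- nxt
  }
  joint <- state[m + 1, k + 1, ]   # joint probability of (m, k, overlap)
  joint / sum(joint)               # condition on the totals m and k
}

# With identical probabilities the result should agree with the
# hypergeometric distribution used by Fisher's exact test.
pmf_sym <- overlap_pmf_symmetric(3, 5, rep(0.2, 8), rep(0.3, 8))
agrees <- isTRUE(all.equal(pmf_sym, dhyper(0:3, 3, 5, 5)))
```

The state array has (m + 1)(k + 1)(min(m, k) + 1) entries and is updated once per object, which gives a sense of why an exact calculation of this kind becomes expensive as m, k and N grow.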

For asymmetric tests (test="x-cond" and test="y-cond") we assume that the objects with one property are known and that the other property is described by per-object probabilities. With test="x-cond", the objects with property X are known and the probabilities that each object has property Y are given by y_probs; with test="y-cond", the roles of X and Y (and of x_probs and y_probs) are swapped. Accordingly, the vector x_probs (resp. y_probs) is assumed to consist only of the values 1 and 0, marking the objects that do and do not have property X (resp. Y), and therefore sums to m (resp. k) by construction. The exact method is faster for the asymmetric tests than for the symmetric test, but a saddlepoint approximation is also provided.
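The x-cond setting can be sketched as follows. This is one plausible reading rather than the package's code: it assumes the overlap distribution is conditioned on the total k of objects with property Y, and the helper names `pois_binom_pmf` and `overlap_pmf_xcond` are hypothetical.

```r
# Hypothetical helper: pmf of a sum of independent Bernoulli(p_i) variables,
# built up by convolving in one object at a time.
pois_binom_pmf <- function(p) {
  pmf <- 1
  for (q in p) pmf <- c(pmf * (1 - q), 0) + c(0, pmf * q)
  pmf
}

# Hypothetical helper: overlap pmf when the objects with X are known
# (has_x is a 0/1 vector) and exactly k objects are assumed to have Y.
overlap_pmf_xcond <- function(has_x, y_probs, k) {
  inside  <- pois_binom_pmf(y_probs[has_x == 1])  # Y counts among X objects
  outside <- pois_binom_pmf(y_probs[has_x == 0])  # Y counts among the rest
  j <- 0:min(sum(has_x), k)                       # candidate overlap sizes
  rest <- k - j                                   # Y objects needed outside X
  feasible <- rest <= length(outside) - 1
  p_rest <- ifelse(feasible, outside[pmin(rest, length(outside) - 1) + 1], 0)
  w <- inside[j + 1] * p_rest
  w / sum(w)                                      # condition on the total k
}

# With identical y_probs this should reduce to the hypergeometric case.
has_x <- c(1, 1, 1, 0, 0, 0, 0, 0)
pmf_xcond <- overlap_pmf_xcond(has_x, rep(0.3, 8), k = 5)
```

Because only two one-dimensional distributions have to be built, a calculation of this shape is much cheaper than a joint recursion over both properties, which is consistent with the exact method being faster in the asymmetric case.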

The results of the test come in a variety of forms. Given kind="pmf", the test returns a vector giving the probability that each observed value occurs, i.e. the probability mass function evaluated at the observed values. With kind="cdf", the output is the probability that a result "at least as surprising" as each observed value occurs. The definition of "at least as surprising" depends on the side given. For side="lt" and side="gt", it means "at least as small" and "at least as large" respectively. With side="two-sided", the p-value is defined as the probability that an outcome occurs whose probability mass is less than or equal to that of the observed value.
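The three definitions of "at least as surprising" can be written down directly once a pmf over the possible overlap sizes is available. The sketch below is illustrative only: `p_values` is a hypothetical helper, the pmf is a hypergeometric stand-in, and the small relative tolerance in the two-sided branch is an added assumption to guard against floating-point ties.

```r
# Hypothetical helper: turn a pmf over `support` into p-values for `observed`.
p_values <- function(observed, support, pmf, side = c("two-sided", "gt", "lt")) {
  side <- match.arg(side)
  sapply(observed, function(o) {
    p_obs <- pmf[support == o]
    switch(side,
      lt = sum(pmf[support <= o]),                      # at least as small
      gt = sum(pmf[support >= o]),                      # at least as large
      "two-sided" = sum(pmf[pmf <= p_obs * (1 + 1e-7)]) # mass no larger than observed
    )
  })
}

# Stand-in distribution: hypergeometric overlap for N = 8, m = 3, k = 5.
support <- 0:3
pmf <- dhyper(support, 3, 5, 5)

p_lt  <- p_values(support, support, pmf, side = "lt")
p_gt  <- p_values(support, support, pmf, side = "gt")
p_two <- p_values(support, support, pmf, side = "two-sided")
```

Note that the two-sided value is not simply twice a one-sided value: it sums the mass of every outcome that is no more probable than the observed one, wherever in the support those outcomes lie.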

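# Lower-tail p-values for overlaps of 0 to 3 among N = 8 objects,
# computed with both the exact method and the saddlepoint approximation.
laet_out <- laet(0:3, m = 3, k = 5,
                 x_probs = rep(0.2, 8), y_probs = rep(0.3, 8),
                 test = "symmetric", side = "lt", kind = "cdf",
                 method = c("exact", "saddlepoint"))
print(laet_out$results$saddlepoint)
print(laet_out$results$exact)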