# GeneOverlap: Test overlap between two gene lists using Fisher's exact... In shenlab-sinai/geneoverlap-old: Test and visualize gene overlaps

## Description

Given two gene lists, tests the significance of their overlap against a genomic background. The null hypothesis is that the odds ratio is no larger than 1; the alternative is that the odds ratio is larger than 1. It returns the p-value, the estimated odds ratio, and the intersection.

## Usage

```r
## S4 method for signature 'GeneOverlap'
show(object)

## S4 method for signature 'GeneOverlap'
print(x, ...)
```

## Arguments

| Argument | Description |
|----------|-------------|
| `object` | A GeneOverlap object. |
| `x` | A GeneOverlap object. |
| `...` | Not used. |

## Details

The problem of gene overlap testing can be described by a hypergeometric distribution, where one gene list A defines the number of white balls in the urn and the other gene list B defines the number of balls drawn. Assume the total number of genes is n, the number of genes in A is a, and the number of genes in B is b. If the intersection between A and B has size t, the probability of seeing exactly t can be calculated as:

`dhyper(t, a, n - a, b)`

Without loss of generality, we can assume b <= a, so the largest possible value of t is b. Therefore, the p-value of seeing an intersection of at least t is:

`sum(dhyper(t:b, a, n - a, b))`
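The tail sum above can also be obtained from the cumulative distribution function. A minimal sketch, with values for `n`, `a`, `b`, and `t` chosen here purely for illustration (they are not taken from the package data):

```r
# Illustrative values (assumptions, not from the package data)
n <- 1000  # total number of genes
a <- 120   # size of gene list A
b <- 80    # size of gene list B (b <= a)
t <- 15    # size of the intersection

# Upper tail as an explicit sum of probabilities
p_sum <- sum(dhyper(t:b, a, n - a, b))

# Same upper tail via the CDF: P(X >= t) = P(X > t - 1)
p_cdf <- phyper(t - 1, a, n - a, b, lower.tail = FALSE)

# Compare the two results up to floating-point tolerance
all.equal(p_sum, p_cdf)
```

The `phyper` form avoids building the vector `t:b`, which may matter when the gene lists are large.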

Fisher's exact test frames this problem slightly differently, but its calculation is also based on the hypergeometric distribution. It starts by constructing a contingency table:

```matrix(c(n - length(union(A, B)), length(setdiff(A, B)), length(setdiff(B, A)), length(intersect(A, B))), nrow=2)```

It therefore tests the independence between A and B and is conceptually more straightforward. The GeneOverlap class is implemented using Fisher's exact test.
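As a sketch of how such a table can be built from actual gene sets, the following uses made-up gene identifiers and an assumed background size. None of these names or numbers come from the package.

```r
# Hypothetical gene lists and background size, for illustration only
A <- c("g1", "g2", "g3", "g4", "g5", "g6")
B <- c("g4", "g5", "g6", "g7")
n <- 20  # assumed number of genes in the genomic background

# 2 x 2 contingency table of counts, filled column by column
contbl <- matrix(c(n - length(union(A, B)),   # in neither list
                   length(setdiff(A, B)),     # in A only
                   length(setdiff(B, A)),     # in B only
                   length(intersect(A, B))),  # in both lists
                 nrow = 2,
                 dimnames = list(inA = c("no", "yes"), inB = c("no", "yes")))

# One-sided test: is the overlap larger than expected by chance?
res <- fisher.test(contbl, alternative = "greater")
res$p.value
```

The four cells always sum to the background size `n`, which is a quick sanity check on the table.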

An example makes this concrete. Assume we have a genome of size 200 and two gene lists with 70 and 30 genes, respectively. If the intersection between the two is 10, the hypergeometric way to calculate the p-value is:

`sum(dhyper(10:30, 70, 130, 30))`

which gives a p-value of 0.6561562. Using Fisher's exact test instead, we run:

```fisher.test(matrix(c(110, 20, 60, 10), nrow=2), alternative="greater")```

which gives exactly the same p-value. In addition, `fisher.test` provides an estimated odds ratio, a confidence interval, etc.
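The equivalence can be checked numerically. The sketch below reuses the numbers from the example and reads the extra quantities from the list returned by `fisher.test`:

```r
# Hypergeometric upper tail for the example: n = 200, a = 70, b = 30, t = 10
p_hyper <- sum(dhyper(10:30, 70, 130, 30))

# Fisher's exact test on the corresponding contingency table
ft <- fisher.test(matrix(c(110, 20, 60, 10), nrow = 2),
                  alternative = "greater")

# The two p-values should agree up to floating-point error
all.equal(p_hyper, ft$p.value)

ft$estimate  # conditional maximum likelihood estimate of the odds ratio
ft$conf.int  # confidence interval for the odds ratio
```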

The Jaccard index is a measure of similarity between two sets. It is defined as the size of the intersection divided by the size of the union.
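For two gene lists stored as character vectors, the index follows directly from base R set operations. The gene identifiers below are made up for illustration:

```r
# Hypothetical gene lists, for illustration only
A <- c("g1", "g2", "g3", "g4", "g5", "g6")
B <- c("g4", "g5", "g6", "g7")

# Jaccard index: |A intersect B| / |A union B|
jaccard <- length(intersect(A, B)) / length(union(A, B))
jaccard  # 3 shared genes out of 7 distinct genes
```

For the earlier example with lists of 70 and 30 genes sharing 10, the index is 10 / (70 + 30 - 10) = 10 / 90, or about 0.11.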

## Note

Although Fisher's exact test is chosen for implementation, it should be noted that the R implementation of Fisher's exact test is slower than using `dhyper` directly. As an example, running:

`system.time(sum(dhyper(10e3:30e3, 70e3, 130e3, 30e3)))`

takes around 0.016s, while running:

```system.time(fisher.test(matrix(c(110e3, 20e3, 60e3, 10e3), nrow=2), alternative="greater"))```

takes around 0.072s. In practice, this time difference can often be ignored.

## Author(s)

Li Shen <li.shen@mssm.edu>

Mount Sinai profile: http://www.mountsinai.org/profiles/li-shen

## See Also

`GeneOverlapMatrix-class`

## Examples

```r
library(GeneOverlap)  # provides the classes, functions, and example data used below
data(GeneOverlap)
go.obj <- newGeneOverlap(hESC.ChIPSeq.list$H3K4me3,
                         hESC.ChIPSeq.list$H3K9me3,
                         gs.RNASeq)
go.obj <- testGeneOverlap(go.obj)
go.obj  # show.
print(go.obj)  # more details.
getContbl(go.obj)  # contingency table.
```