GeneOverlap: Test overlap between two gene lists using Fisher's exact...

Description Usage Arguments Details Note Author(s) References See Also Examples

Description

Given two gene lists, tests the significance of their overlap in comparison with a genomic background. The null hypothesis is that the odds ratio is no larger than 1. The alternative is that the odds ratio is larger than 1.0. It returns the p-value, estimated odds ratio and intersection.

Usage

1
2
3
4
## S4 method for signature 'GeneOverlap'
show(object)
## S4 method for signature 'GeneOverlap'
print(x, ...)

Arguments

object

A GeneOverlap object.

x

A GeneOverlap object.

...

They are not used.

Details

The problem of gene overlap testing can be described by a hypergeometric distribution where one gene list A defines the number of white balls in the urn and the other gene list B defines the number of white balls in the draw. Assume the total number of genes is n, the number of genes in A is a and the number of genes in B is b. If the intersection between A and B is t, the probability density of seeing t can be calculated as:

dhyper(t, a, n - a, b)

without loss of generality, we can assume b <= a. So the largest possible value for t is b. Therefore, the p-value of seeing intersection t is:

sum(dhyper(t:b, a, n - a, b))

The Fisher's exact test forms this problem slightly different but its calculation is also based on the hypergeometric distribution. It starts by constructing a contingency table:

matrix(c(n - union(A,B), setdiff(A,B), setdiff(B,A), intersect(A,B)), nrow=2)

It therefore tests the independence between A and B and is conceptually more straightforward. The GeneOverlap class is implemented using Fisher's exact test.

It is better to illustrate a concept using some example. Let's assume we have a genome of size 200 and two gene lists with 70 and 30 genes each. If the intersection between the two is 10, the hypergeometric way to calculate the p-value is:

sum(dhyper(10:30, 70, 130, 30))

which gives us p-value 0.6561562. If we use Fisher's exact test, we should do:

fisher.test(matrix(c(110, 20, 60, 10), nrow=2), alternative="greater")

which gives exactly the same p-value. In addition, the Fisher's test function also provides an estimated odds ratio, confidence interval, etc.

The Jaccard index is a measurement of similarity between two sets. It is defined as the number of intersections over the number of unions.

Note

Although Fisher's exact test is chosen for implementation, it should be noted that the R implementation of Fisher's exact test is slower than using dhyper directly. As an example, run:

system.time(sum(dhyper(10e3:30e3, 70e3, 130e3, 30e3)))

uses around 0.016s to finish. While run:

system.time(fisher.test(matrix(c(110e3, 20e3, 60e3, 10e3), nrow=2), alternative="greater"))

uses around 0.072s. In practice, this time difference can often be ignored.

Author(s)

Li Shen <li.shen@mssm.edu>

Mount Sinai profile:http://www.mountsinai.org/profiles/li-shen

Personal:http://www.linkedin.com/in/lshen/

References

http://en.wikipedia.org/wiki/Fisher's_exact_test

http://en.wikipedia.org/wiki/Jaccard_index

See Also

GeneOverlapMatrix-class

Examples

1
2
3
4
5
6
7
8
data(GeneOverlap)
go.obj <- newGeneOverlap(hESC.ChIPSeq.list$H3K4me3, 
                         hESC.ChIPSeq.list$H3K9me3, 
                         gs.RNASeq)
go.obj <- testGeneOverlap(go.obj)
go.obj  # show.
print(go.obj)  # more details.
getContbl(go.obj)  # contingency table.

shenlab-sinai/geneoverlap-old documentation built on May 29, 2019, 9:22 p.m.