Given two gene lists, tests the significance of their overlap in comparison with a genomic background. The null hypothesis is that the odds ratio is no larger than 1. The alternative is that the odds ratio is larger than 1. It returns the p-value, the estimated odds ratio and the intersection.
object: A GeneOverlap object.
x: A GeneOverlap object.
...: Not used.
The problem of gene overlap testing can be described by a hypergeometric distribution where one gene list A defines the number of white balls in the urn and the other gene list B defines the number of balls drawn. Assume the total number of genes is n, the number of genes in A is a and the number of genes in B is b. If the intersection between A and B has size t, the probability of seeing exactly t can be calculated as:
dhyper(t, a, n - a, b)
Without loss of generality, we can assume b <= a, so the largest possible value for t is b. Therefore, the p-value of seeing an intersection of size t or larger is:
sum(dhyper(t:b, a, n - a, b))
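The upper-tail sum above can also be written with the cumulative function `phyper` from base R. The following is a minimal sketch, not part of the package; the helper name `overlap_pvalue` and the numbers are made up for illustration.

```r
# Hypothetical helper (not part of GeneOverlap): upper-tail hypergeometric
# p-value for an overlap of size t, using P(T >= t) = 1 - P(T <= t - 1).
overlap_pvalue <- function(t, a, b, n) {
  phyper(t - 1, a, n - a, b, lower.tail = FALSE)
}

# Illustrative numbers only (not taken from a real dataset).
n <- 1000  # total number of genes
a <- 120   # size of gene list A
b <- 50    # size of gene list B
t <- 12    # size of the intersection

p_sum  <- sum(dhyper(t:b, a, n - a, b))  # explicit sum over t, t + 1, ..., b
p_tail <- overlap_pvalue(t, a, b, n)     # same tail probability via phyper

all.equal(p_sum, p_tail)
```

Both forms describe the same tail probability; `phyper` simply avoids building the vector `t:b`.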
Fisher's exact test frames this problem slightly differently, but its calculation is also based on the hypergeometric distribution. It starts by constructing a contingency table:
matrix(c(n - length(union(A,B)), length(setdiff(B,A)),
         length(setdiff(A,B)), length(intersect(A,B))),
       nrow=2)
It therefore tests the independence between A and B and is conceptually more straightforward. The GeneOverlap class is implemented using Fisher's exact test.
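As a rough illustration of the contingency table, the sketch below builds it from two toy gene vectors using base R set functions and passes it to `fisher.test`. The gene identifiers and the background are made up. The cells are ordered to match the notA/inA and notB/inB labels that `getContbl` prints; this is not the package's internal code.

```r
# Toy background and gene lists (made-up identifiers).
genome <- paste0("gene", 1:20)
A <- paste0("gene", 1:8)
B <- paste0("gene", 5:10)

# matrix() fills column by column: first the notA column, then the inA column.
contbl <- matrix(c(length(genome) - length(union(A, B)),  # in neither list
                   length(setdiff(B, A)),                 # in B only
                   length(setdiff(A, B)),                 # in A only
                   length(intersect(A, B))),              # in both lists
                 nrow = 2,
                 dimnames = list(c("notB", "inB"), c("notA", "inA")))
contbl

# One-sided test: is the overlap larger than expected under independence?
res <- fisher.test(contbl, alternative = "greater")
res$p.value
```

Swapping the two off-diagonal cells transposes the table, which does not change the result of Fisher's exact test.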
An example makes this concrete. Let's assume we have a genome of size 200 and two gene lists with 70 and 30 genes, respectively. If the intersection between the two is 10, the hypergeometric way to calculate the p-value is:
sum(dhyper(10:30, 70, 130, 30))
which gives a p-value of 0.6561562. Using Fisher's exact test, we would instead run:
fisher.test(matrix(c(110, 20, 60, 10), nrow=2),
alternative="greater")
which gives exactly the same p-value. In addition, fisher.test also provides an estimated odds ratio, a confidence interval, etc.
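The two calculations in this example can be compared side by side. The sketch below uses only base R and the numbers given above, and shows where the extra quantities live in the object returned by `fisher.test`.

```r
# Genome of 200 genes, lists of 70 and 30 genes, intersection of 10.
p_hyper <- sum(dhyper(10:30, 70, 130, 30))

tbl <- matrix(c(110, 20, 60, 10), nrow = 2)
ft  <- fisher.test(tbl, alternative = "greater")

ft$p.value   # p-value of the one-sided test
ft$estimate  # estimated odds ratio (conditional maximum likelihood estimate)
ft$conf.int  # one-sided confidence interval; the upper bound is Inf

all.equal(ft$p.value, p_hyper)
```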
The Jaccard index is a measure of similarity between two sets. It is defined as the size of the intersection divided by the size of the union.
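The definition translates directly into base R set functions. This is a minimal sketch with made-up gene identifiers; the function name `jaccard` is hypothetical and not part of the package.

```r
# Hypothetical helper: Jaccard index of two character vectors treated as sets.
jaccard <- function(A, B) {
  length(intersect(A, B)) / length(union(A, B))
}

A <- c("g1", "g2", "g3", "g4")
B <- c("g3", "g4", "g5")

jaccard(A, B)  # 2 shared genes out of 5 distinct genes
```

Because `intersect` and `union` drop duplicates, repeated identifiers within a list do not affect the result.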
Although Fisher's exact test is chosen for implementation, it should be noted that the R implementation of Fisher's exact test is slower than using dhyper directly. As an example, running:
system.time(sum(dhyper(10e3:30e3, 70e3, 130e3, 30e3)))
takes around 0.016s to finish, while running:
system.time(fisher.test(matrix(c(110e3, 20e3, 60e3, 10e3), nrow=2),
                        alternative="greater"))
takes around 0.072s. In practice, this time difference can often be ignored.
Li Shen <shenli.sam@gmail.com>
Lab: http://shenlab-sinai.github.io/shenlab-sinai/
Personal: http://www.linkedin.com/in/lshen/
http://en.wikipedia.org/wiki/Fisher's_exact_test
http://en.wikipedia.org/wiki/Jaccard_index
library(GeneOverlap)
data(GeneOverlap)
go.obj <- newGeneOverlap(hESC.ChIPSeq.list$H3K4me3,
                         hESC.ChIPSeq.list$H3K9me3,
                         gs.RNASeq)
go.obj <- testGeneOverlap(go.obj)
go.obj  # show.
print(go.obj)  # more details.
getContbl(go.obj)  # contingency table.
GeneOverlap object:
listA size=13448
listB size=297
Intersection size=253
Overlapping p-value=4.6e-17
Jaccard Index=0.0
Detailed information about this GeneOverlap object:
listA size=13448, e.g. ENSG00000187634 ENSG00000188976 ENSG00000187961
listB size=297, e.g. ENSG00000215912 ENSG00000232423 ENSG00000204501
Intersection size=253, e.g. ENSG00000215912 ENSG00000121903 ENSG00000157184
Union size=13492, e.g. ENSG00000187634 ENSG00000188976 ENSG00000187961
Genome size=21196
# Contingency Table:
     notA   inA
notB 7704 13195
inB    44   253
Overlapping p-value=4.6e-17
Odds ratio=3.4
Overlap tested using Fisher's exact test (alternative=greater)
Jaccard Index=0.0
     notA   inA
notB 7704 13195
inB    44   253
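The numbers in this output can be cross-checked in base R from the contingency table alone. The sketch below retypes the table by hand rather than calling the package, so it is an independent check and not a description of how GeneOverlap computes its results internally.

```r
# Contingency table copied from the output above (filled column by column).
contbl <- matrix(c(7704, 44, 13195, 253), nrow = 2,
                 dimnames = list(c("notB", "inB"), c("notA", "inA")))

sum(contbl)  # genome size: 21196

ft <- fisher.test(contbl, alternative = "greater")
ft$p.value   # compare with the reported overlapping p-value
ft$estimate  # compare with the reported odds ratio

# Jaccard index: intersection size over union size (253 / 13492).
jaccard <- contbl["inB", "inA"] / (sum(contbl) - contbl["notB", "notA"])
jaccard
```

The Jaccard index here is about 0.019, which is consistent with the 0.0 shown in the output when rounded to one decimal place.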