Given two gene lists, tests the significance of their overlap against a genomic background. The null hypothesis is that the odds ratio is no larger than 1. The alternative is that the odds ratio is larger than 1. It returns the p-value, the estimated odds ratio and the intersection.
A GeneOverlap object.
A GeneOverlap object.
They are not used.
The problem of gene overlap testing can be described by a hypergeometric distribution where one gene list A defines the white balls in the urn and the other gene list B defines the balls that are drawn. Assume the total number of genes is n, the number of genes in A is a and the number of genes in B is b. If the intersection between A and B is t (the number of white balls in the draw), the probability of seeing exactly t can be calculated as:
dhyper(t, a, n - a, b)
Without loss of generality, we can assume b <= a, so the largest possible value for t is b. Therefore, the p-value of seeing an intersection of at least t is:
sum(dhyper(t:b, a, n - a, b))
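The same tail probability can also be read off the cumulative distribution function. The following is a minimal sketch using base R's phyper; the values of n, a, b and t are illustrative assumptions, not taken from any real gene list.

```r
# Illustrative values (assumptions): genome size n, list sizes a and b, overlap t
n <- 200; a <- 70; b <- 30; t <- 10

# Tail sum of hypergeometric probabilities, as written in the text
p_sum <- sum(dhyper(t:b, a, n - a, b))

# Equivalent upper tail: P(X >= t) = P(X > t - 1)
p_tail <- phyper(t - 1, a, n - a, b, lower.tail = FALSE)

all.equal(p_sum, p_tail)
```

Using phyper avoids building the vector t:b, which may matter when b is large.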
Fisher's exact test frames this problem slightly differently, but its calculation is also based on the hypergeometric distribution. It starts by constructing a contingency table:
matrix(c(n - union(A,B), setdiff(A,B),
It therefore tests the independence between A and B and is conceptually more straightforward. The GeneOverlap class is implemented using Fisher's exact test.
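As a rough sketch of how such a table could be built from two character vectors in base R: the gene identifiers and genome size below are hypothetical, and the cell order (rows notB/inB, columns notA/inA) is an assumption chosen to match the layout of the printed example output. Transposing the table does not change the test result.

```r
# Hypothetical gene lists and genome size (assumptions for illustration only)
A <- c("g1", "g2", "g3", "g4", "g5")
B <- c("g4", "g5", "g6")
n <- 20  # total number of genes in the background

# Each cell is a set size, so every set operation is wrapped in length()
tab <- matrix(c(n - length(union(A, B)),   # in neither list
                length(setdiff(B, A)),     # in B only
                length(setdiff(A, B)),     # in A only
                length(intersect(A, B))),  # in both lists
              nrow = 2,
              dimnames = list(c("notB", "inB"), c("notA", "inA")))
tab
sum(tab) == n  # the four cells partition the background
```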
An example helps to illustrate the concept. Let's assume we have a genome of size 200 and two gene lists with 70 and 30 genes respectively. If the intersection between the two is 10, the hypergeometric way to calculate the p-value is:
sum(dhyper(10:30, 70, 130, 30))
which gives a p-value of 0.6561562. Using Fisher's exact test instead, we run:
fisher.test(matrix(c(110, 20, 60, 10), nrow=2),
which gives exactly the same p-value. In addition, fisher.test also reports an estimated odds ratio, a confidence interval, and other statistics.
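The comparison can be run end to end as follows. This is a sketch under one assumption: the fisher.test call is made one-sided with alternative = "greater", chosen here to match the stated alternative hypothesis (odds ratio larger than 1).

```r
# Hypergeometric tail: genome 200, lists of 70 and 30 genes, overlap 10
p_hyper <- sum(dhyper(10:30, 70, 130, 30))

# One-sided Fisher's exact test on the same counts.
# alternative = "greater" is an assumption made for this sketch.
ft <- fisher.test(matrix(c(110, 20, 60, 10), nrow = 2),
                  alternative = "greater")

all.equal(p_hyper, ft$p.value)  # the two p-values agree
ft$estimate                     # estimated odds ratio
ft$conf.int                     # confidence interval for the odds ratio
```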
The Jaccard index is a measure of similarity between two sets. It is defined as the size of the intersection divided by the size of the union.
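A minimal base R sketch of this definition is shown here. The function name jaccard and the gene identifiers are hypothetical and are not part of the package interface.

```r
# Jaccard index of two sets: |intersection| / |union|
# (jaccard is a hypothetical helper name used only in this sketch)
jaccard <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# Hypothetical gene identifiers, for illustration only
A <- c("g1", "g2", "g3")
B <- c("g2", "g3", "g4")
jaccard(A, B)  # 2 shared genes out of 4 distinct genes
```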
Although Fisher's exact test is chosen for implementation, it should be
noted that the R implementation of Fisher's exact test is slower than using
dhyper directly. As an example, run:
system.time(sum(dhyper(10e3:30e3, 70e3, 130e3, 30e3)))
takes around 0.016s to finish, whereas running:
system.time(fisher.test(matrix(c(110e3, 20e3, 60e3, 10e3), nrow=2),
takes around 0.072s. In practice, this time difference can often be ignored.
Li Shen
GeneOverlap object:
listA size=13448
listB size=297
Intersection size=253
Overlapping p-value=4.6e-17
Jaccard Index=0.0
Detailed information about this GeneOverlap object:
listA size=13448, e.g. ENSG00000187634 ENSG00000188976 ENSG00000187961
listB size=297, e.g. ENSG00000215912 ENSG00000232423 ENSG00000204501
Intersection size=253, e.g. ENSG00000215912 ENSG00000121903 ENSG00000157184
Union size=13492, e.g. ENSG00000187634 ENSG00000188976 ENSG00000187961
Genome size=21196
# Contingency Table:
     notA   inA
notB 7704 13195
inB    44   253
Overlapping p-value=4.6e-17
Odds ratio=3.4
Overlap tested using Fisher's exact test (alternative=greater)
Jaccard Index=0.0
     notA   inA
notB 7704 13195
inB    44   253
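The printed statistics can be cross-checked in base R from the contingency table alone. This is a sketch that calls fisher.test directly and does not use the GeneOverlap class; the one-sided alternative is taken from the output line reporting alternative=greater.

```r
# Counts copied from the printed contingency table
tab <- matrix(c(7704, 44, 13195, 253), nrow = 2,
              dimnames = list(c("notB", "inB"), c("notA", "inA")))

sum(tab)  # should equal the printed genome size, 21196

# One-sided test, as the output reports alternative=greater
ft <- fisher.test(tab, alternative = "greater")
ft$p.value   # compare with the printed overlapping p-value
ft$estimate  # compare with the printed odds ratio

# Jaccard index from the printed intersection and union sizes
253 / 13492
```

The Jaccard index here is below 0.02, which is consistent with the value being displayed as 0.0 after rounding to one decimal place.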