knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message=FALSE, warning=FALSE )
In a recent paper (@SanchezPla2019) we extended tyhe goProfiles method (@AlexSanchez2017) to allow it to be used for the equivalence analysis of multiple gene lists.
One of the reviewers of the paper suggested that we should compare our equivalence test with an alternative approach.
This, so-called from now on, standard test
between two gene lists at a given GO level of a given ontology, consitss of checking if the percentage of enriched categories is similar in both lists using a Fisher test with unilateral alternative equaling "greater".
The rationale for this tests seems to be that rejecting the null hypothesis of independence leads to establishing that both lists show some degree of relation -which is distinct but related to saying that bost lists are equivalent.
This test has been illustrated in the Supplementary materials of the paper (file SupplementaryMaterials_2-StandardTest.Rmd
).
In order to further investigate its properties it has been implemented in the current, equivStandardTest
, package.
The rationale for the standard test is as follows:
gl1
and gl2
stdTest4GeneLists
that takes two gene lists, an ontology and a GO level and returns the result of the test has been implemented.stdTest4GOIDs
has been implemented whose inputs are GOTerm lists instead of Gene lists.crossTabGOIds
performs GOIDs crosstabulation directly from two lists of enriched GO TermscrossTabGOIds4GeneLists
performs crosstabulation of two lists of enriched GO Terms obtained by an enrichment analysis of two gene lists.enrichGO
function from clusterProfiler
package (@Yu2012).Lists of enriched GO Terms can be obtained from Gene Lists through enrichment analysis. Notice that there are many methods in the literature. We have implemented a few of them and it is interesting to see how the results differ (or ressemble depending on what part of the glass you prefer).
For the sake of velocity in compiling the vignette enrichment analysis is not performed but, instead, precomputed lists of enriched GOTerms (see appoendix 2) are used.
library(equivStandardTest) data(kidneyGeneLists) data(humanEntrezIDs) data(kidneyEnrichedGOIDs) gl1 <-kidneyGeneLists[[1]] gl2 <-kidneyGeneLists[[2]] geneUniverse<- humanEntrezIDs anOnto<- "BP" pvalCutoff <- 0.01 # enriched1 <- enrichOnto(geneList=gl1, geneUniverse=geneUniverse, # orgPackage="org.Hs.eg.db", # ont=anOnto) # GOIDs1 <- as.character(as.data.frame(enriched1)$ID) GOIDs1<- kidneyEnrichedGOIDs[[1]] # enriched2 <- enrichOnto(geneList=gl2, geneUniverse=geneUniverse, orgPackage="org.Hs.eg.db", # ont=anOnto) # GOIDs2 <- as.character(as.data.frame(enriched2)$ID) GOIDs2<- kidneyEnrichedGOIDs[[2]] GOLev<- 3 names4lists<- names(kidneyGeneLists)[1:2] crossTabbedGOIds <- crossTabGOIDs (GO1 = GOIDs1, GO2 = GOIDs2, onto = anOnto, GOLevel =GOLev, listNames=names4lists) show(crossTabbedGOIds)
When considering cross-tabulation as in the previous paragraph an interesting issue appears: It is reasonable to expect that many GO Terms in the selected level don't appear in the annotations of neither genelist 1 nor genelist 2. THis means that it will be impossible that they are enriched, and, in consequence, they will "inflate" the FALSE-FALSE cell in the cross tabulation.
This has been solved introducing a boolean variable, called restricted
to decide how tabulation is performed.
- If it is FALSE "unrestricted" cross-tabulation is performed as above, crossing all GO Terms located at the level indicated by GOLev
with the two GOIDs lists.
- If it is TRUE "restricted" cross-tabulation is performed and only terms from the selected GO level that are common to ancestor terms of either list are crossed. That is, if one term in the selected GO level is not an ancestor of at least one of the gene list most specific GO terms it is excluded from the GO Level's terms because it is impossible that it appears as being enriched.
data(kidneyGeneLists) data(humanEntrezIDs) gl1 <-kidneyGeneLists[[1]] gl2 <-kidneyGeneLists[[2]] anOnto <- 'BP' GOLev<- 3 restricted <- FALSE adjMeth<- 'BH' pValCut <- 0.05 qValCut <- 0.01 crossTabUnrestricted <- crossTabGOIDs4GeneLists (genelist1=gl1, genelist2=gl2, geneUniverse=humanEntrezIDs, orgPackg="org.Hs.eg.db", onto=anOnto, GOLevel=GOLev, restricted=restricted) show(crossTabUnrestricted) restricted <- TRUE crossTabRestricted <- crossTabGOIDs4GeneLists (genelist1=gl1, genelist2=gl2, geneUniverse=humanEntrezIDs, orgPackg="org.Hs.eg.db", onto=anOnto, GOLevel=GOLev, restricted=restricted) show(crossTabRestricted)
The simplest version of the test is comparing directly two GO Terms lists.
This can be done using function standardTest4GOIDs
library(equivStandardTest) data(kidneyEnrichedGOIDs) GOIDs1 <-kidneyEnrichedGOIDs[[1]] GOIDs2 <-kidneyEnrichedGOIDs[[2]] anOnto <- 'BP' GOLev<- 3 testedFromGOIds <- stdTest4GOIDs (GO1 = GOIDs1, GO2 = GOIDs2, onto = anOnto, GOLevel =GOLev) show(testedFromGOIds)
Conceptually, what we want to do is compare two gene lists.
As it has been indicated above this is a three step process. - First, an enrichment analysis is performed on each gene list - Then crosstabulation applied on the two enriched GO Terms lists vs the GO Terms in the selected level in an unrestricted or restricted approach. - Final a Fisher test is applied on the resulting table
data(kidneyGeneLists) data(humanEntrezIDs) gl1 <-kidneyGeneLists[[1]] gl2 <-kidneyGeneLists[[2]] anOnto <- 'BP' GOLev<- 3 adjMeth<- 'BH' pValCut <- 0.05 qValCut <- 0.01 testedFromGeneLists <- stdTest4GeneLists (genelist1=gl1, genelist2=gl2, geneUniverse=humanEntrezIDs, orgPackg="org.Hs.eg.db", onto=anOnto, GOLevel=GOLev) show(testedFromGeneLists)
By default the test is applied doing "unrestricted" tabulation, but it can also be done on the unrestricted tabulation simply setting the "restricted" parameter to TRUE.
testedFromGeneListsRest <- stdTest4GeneLists (genelist1=gl1, genelist2=gl2, geneUniverse=humanEntrezIDs, orgPackg="org.Hs.eg.db", onto=anOnto, GOLevel=GOLev, restricted = TRUE) show(testedFromGeneListsRest)
The results differ although the conclusion is almost the same.
People who are not familiar with GO Enrichment Analysis and the Structure of the Gene Ontology may experiment some confusion with terms such as "GOIDs (=GOTerms)", "enriched GOTerms" or "GOLevels". This section shows a very simple example to illustrate these concepts. More information can be found in @Dessimoz2017.
A gene list is a vector of characters that contains the IDs of genes that have been selected according some criteria, for instance genes expressing differently between two biological conditions or genes known to be associated with a certain disease.
The package contains several example gen lists.
library(equivStandardTest) data("kidneyGeneLists") geneList1 <-kidneyGeneLists[[1]] head(geneList1) sapply(kidneyGeneLists, length)
People working with genes needs to have information about what they do. This is known as "annotations". The standard annotations database (not the only one, but the most well known, and most used) is the Gene Ontology. Its name comes from the fact that it is not a standard relational database but an ontology which means that it is hierarchichaly organized from most general to most specific annotations. Strictly speaking the Gene Ontology is formed by three (sub)ontologies: Yhe BP (Biological Process) ontology, the CC (Cellular component) ontology and the MF (Molecular Function) ontology.
Annotating a gene means associating it with a series of records in this ontology (known as "GOTerms" whose identifiers are known as "GOIDs"). Every time a gene is associated with a GOTerm it is automatically associated with its "ancestors" i.e. more general terms in the same ontology connected with that GO Term.
While it is not straightforward to obtain GO annotations for each gene in a gene list, many packages have functions to facilitate this. For instance the GOTerms
function in the goProfiles
package. This can also be done very easily by directly querying the annotation package.
Notice that there are two type of annotations for a gene. The "most specific" are the direct annotations by which the gene is described. For example the name and the associated GOTerms for the first gene in the first gene list can be retrieved as follows:
library(org.Hs.eg.db) library(GO.db) shortList <- geneList1[1:3] (select(org.Hs.eg.db, shortList,"GENENAME")) (listOfGOTerms1<- goProfiles::GOTermsList(shortList, "BP", orgPkg = "org.Hs.eg.db")) (select(GO.db, unlist(listOfGOTerms1), "TERM"))
It can be seen, even with a low knowledge of Biology, how some of the GOTerms description match the name of the gene.
Alternatively, the GOTerms can be extracted using standard database queries
dfOfGOTerms<- select(org.Hs.eg.db, shortList,"GO") listOfGOTerms2 <-dfOfGOTerms [dfOfGOTerms$ONTOLOGY=="BP","GO"] # # A tidy approach # library (dplyr) # (listOfGOTerms2 <- dfOfGOTerms %>% # dplyr::filter(dfOfGOTerms$ONTOLOGY=="BP") %>% # dplyr::select ("GO")) #
The GOTerms retrieved for this gene are the most specific terms. Due to the hierarchichal structur of the GO the gene is also annotated by terms in higher positions of the ontology related with these "most specific" terms.
More general terms associated with specific annotations are called "ancestors". Again these can be obtained using a function in goProfiles
or by directly querying the annotation package with the select
command.
Using goProfiles GOTermsList and getAncestorsLst
functions yields the following results
ancestorsList1 <- unique(unlist(getAncestorsLst(listOfGOTerms1, onto="BP"))) length(ancestorsList1)
Using a direct select
command on the annotations package to a different table ("GOALL" instead of "GO")
# Making a direct query dfOfGOTerms2<- select(org.Hs.eg.db, shortList,"GOALL") listOfGOTerms2 <-dfOfGOTerms2 [dfOfGOTerms2$ONTOLOGY=="BP","GOALL"] # A tidy approach # listOfGOTerms2 <- dfOfGOTerms2 %>% # dplyr::filter(dfOfGOTerms2$ONTOLOGYALL=="BP") %>% # dplyr::select ("GOALL") length(unique(listOfGOTerms2))
Notice that the results are similar though not the same
length(ancestorsList1) length(unique(listOfGOTerms2)) setdiff(ancestorsList1, listOfGOTerms2)
It is clear that, even for a short gene list, the number of associated GOTerms is very high especially if one considers general and specific terms.
This can be seen by considering the "induced GO graphs"
library(GOstats) graphBP1<-GOstats::makeGOGraph(shortList,"BP", chip="hgu133plus2.db") length(nodes(graphBP1))
plot(graphBP1, main=paste(paste(shortList, collapse = ", "), "BP Ontology"))
![induced GO Graph]("vignettes/graphBP1.png")
Notice that this graph has been produced using only three genes but contains 212 nodes, precisely all that have been obtained querying the database for all GO Terms associated with the short gene list.
Because the GO is organized as a directed graph we can introduce the concept of GO level. GO levels represent the set of nodes found at the same distance (in term of graph paths) of the root node. That is, we say that a GO term (a node in the graph) is at the i-th GO level if the shortest path from this node to the top node has $i$ nodes including that top node.
The graph figure of the previous example highlights this concept very well because, due to the layout used, all nodes at the same level appear aligned.
The concept of GO level is central when using the goProfiles
approach although there is no universal agreement about if it can be considered equivalent to specialization. This is so because for certain terms, or paths level 5 may be the most specific one whilst for others there may still be many deeper levels.
Nodes at a given level of a given ontology can be recovered using function goProfiles::getGOLevel
.
getGOLevel(2, onto="CC") getGOLevel(1, onto="CC") length(getGOLevel(3, onto="CC"))
This simple example illustrates the difference between tabulating enriched gene lists vs all terms at a given GO level ("unrestricted" mode in function crossTabGOIDs
) vs doing it with only terms that are associated with genes in at least one of the lists ("Restricted" mode in function crossTabGOIDs
).
Assume the selected GO level (level "XX" of onto "YY") is formed by all capital letters
levelXXontoYYAll <- LETTERS length(levelXXontoYYAll)
And assume that, given two lists "l1" and "l2" (not shown) their corresponding GO annotations in that ontologhy are:
annots1inYY <- union(LETTERS[1:10], letters[11:20]) annots2inYY <- union(LETTERS[6:15], letters[16:25]) annots1AND2 <- union(annots1inYY,annots2inYY)
Notice that some annotations are common to levelXXontoYYAll
but others are not because they belong to different levels.
levelXXcommonTo1or2 <- levelXXontoYYAll[levelXXontoYYAll %in% annots1AND2] length(levelXXcommonTo1or2)
Now assume we do some enrichment analysis on lists 1 and 2 and obtain towo lists of enriched GO Terms
enriched1 <- annots1inYY[c(7:10,16:20)] enriched2 <- annots2inYY[c(3:7,16:20)]
We can now do two types of cross tabulations:
Unrestricted
Check which of the enriched terms appears in both lists, only one or none in level XX
levelXXannotsIn1 <- levelXXontoYYAll %in% enriched1 levelXXannotsIn2 <- levelXXontoYYAll %in% enriched2 table(levelXXannotsIn1, levelXXannotsIn2)
Restricted
levelXXcommonAnnotsIn1 <- levelXXcommonTo1or2 %in% enriched1 levelXXcommonAnnotsIn2 <- levelXXcommonTo1or2 %in% enriched2 table(levelXXcommonAnnotsIn1, levelXXcommonAnnotsIn2)
Notice that this correspond to the two ways of using function crossTabGOIDs
(type ? crossTabGOIDs
for more information).
Enrichment analysis is a very active field. There are many methods and different criteria on how to do it. It has been observed that some analysis components such as
may affect the final results. Because there is no clear winner between these comparisons it is good that we are aware of this problem.
If we change, for instance the gene Universe
from all human genes (as in the example above) to "human genes available in arrays hgu133plus2" the results show a slight change
data(kidneyGeneLists) data(kidneyEnrichedGOIDs) gl1 <-kidneyGeneLists[[1]] gl2 <-kidneyGeneLists[[2]] anOnto<- "BP" pvalCutoff <- 0.01 data("hgu133plus2EntrezIDs") geneUniverse2<- hgu133plus2EntrezIDs enriched1 <- enrichOnto(geneList=gl1, geneUniverse=geneUniverse2, orgPackage="org.Hs.eg.db", ont=anOnto) GOIDs1.2 <- as.character(as.data.frame(enriched1)$ID) enriched2 <- enrichOnto(geneList=gl2, geneUniverse=geneUniverse2, orgPackage="org.Hs.eg.db", ont=anOnto) GOIDs2.2 <- as.character(as.data.frame(enriched2)$ID) crossTabbedGOIds.2 <- crossTabGOIDs (GO1 = GOIDs1.2, GO2 = GOIDs2.2, onto = anOnto, GOLevel =GOLev, listNames=names4lists) show(crossTabbedGOIds.2) # Enriched_in_IRITD3 # Enriched_in_ENDAT FALSE TRUE # FALSE 544 1 # TRUE 7 6
data
for the examplesThe package includes three datasets that can be used to illustrate the usage of the functions implemented. This section illustrates how these datasets have been obtained.
Gene lists in the kidneyGeneLists
datasets have been directly obtained from the goProfiles
package examples. Type ? kidneyGeneLists
for more information.
Enrichment analysis requires a list of selected genes and a list of "all genes" from where the former has been selected. Many researchers claim that the ideal gene universe is the set of genes that where tested for differential expression and from where the gene list was selected.
In many situations however this is unknown so that some standard set of "all genes" is used. In this dataset we have included two distinct gene universe:
hgu133plus2
). org.Hs.eg.db
Interestingly the first datset contains around 22000 genes while the second contains more than 62.000. The results of doing an enrichment analysis using one or other as gene universe are not very different.
data(kidneyGeneLists) sapply(kidneyGeneLists, length) library(org.Hs.eg.db) entrezs <-keys(org.Hs.eg.db) geneUniverseTable <- select(org.Hs.eg.db, keys=entrezs, columns=c("ENTREZID", "SYMBOL")) humanEntrezIDs <- as.character(unique(geneUniverseTable$ENTREZID)) library(hgu133plus2.db) probes<-keys(hgu133plus2.db) geneUniverseTable2 <- select(hgu133plus2.db, keys=probes, columns=c("PROBEID", "ENTREZID", "SYMBOL")) hgu133plus2EntrezIDs <- as.character(unique(geneUniverseTable2$ENTREZID)) length(intersect(hgu133plus2.EntrezIDs, org.Hs.eg.EntrezIDs)) save(hgu133plus2EntrezIDs, file="data/hgu133plus2EntrezIDs.Rda", version=2) save(humanEntrezIDs, file="data/humanEntrezIDs.Rda", version=2)
The standard test compares two lists of GO Terms corresponding to two gene lists. An enrichment analysis has been performed on each gene list to obtain these GO Terms lists with predefined thresholds.
Enrichment analysis an important procedure in bioinformatics and, for the sake of brevity, it is not described here. A thorough explanation of both, concept and usage, can be found in clusterProfiler
manual.
#' CODE TO GENERATE THE DATA USED IN THE EXAMPLES data(humanEntrezIDs) geneUniverse <- humanEntrezIDs anOnto <- 'BP' adjMeth<- 'BH' pValCut <- 0.05 qValCut <- 0.01 kidneyEnriched <- sapply(kidneyGeneLists, function (g) clusterProfiler::enrichGO(gene=g,universe=geneUniverse, OrgDb='org.Hs.eg.db', ont=anOnto, pvalueCutoff=pValCut)) kidneyEnrichedGOIDs <- sapply (kidneyEnriched, function (enrich) as.character(as.data.frame(enrich)$ID)) sapply (kidneyEnrichedGOIDs, length) save(kidneyEnrichedGOIDs, file="data/kidneyEnrichedGOIDs.Rda", version=2)
It is important to remember that the concept of "Gene Universe" is not well defined. Changing from all human genes, available in package org.Hs.eg.db
to all genes in the used array hgu133plus2.db
yield a different number of enriched GO categories.
data(hgu133plus2EntrezIDS) kidneyEnriched2 <- sapply(kidneyGeneLists, function (g) clusterProfiler::enrichGO(gene=g, universe=geneUniverse2, OrgDb='hgu133plus2.db', ont=anOnto, pvalueCutoff=pValCut)) kidneyEnrichedGOIDs2 <- sapply (kidneyEnriched2, function (enrich) as.character(as.data.frame(enrich)$ID)) sapply (kidneyEnrichedGOIDs2, length) for (i in 1:5 ) cat (names(kidneyEnrichedGOIDs2)[i],", ", length(intersect(kidneyEnrichedGOIDs[[i]], kidneyEnrichedGOIDs2[[i]])), "\n")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.