This package simplifies the enrichment analysis for annotation for features (e.g. GO terms, protein domains, pathways, .. ) against a background. The annotations are called terms.
All (or selected) terms will be tested for overrepresentation. By default, the package uses right-sided fisher exact test and tests for overrepresentation, however also two-sided or left-sided test can be performed. Multiple hypothesis testing correction is applied and significant enriched terms are marked.
For fast computation, the package provides parallisation using the r Biocpkg("BiocParallel")
package. If you have this package installed you can choose the parallel = TRUE parameter to parallise the tests.
The results of the overrepresentation analysis can be visualized with a plotting function provided by the package. The log~2~ odds ratios are displayed of the significant terms. The signficance is displayed by the colours of the bars.
The inputs for the analysis are:
ID | term ---------|------ GeneA | term1 GeneB | term1 GeneB | term2 GeneC | term3 GeneD | term1 GeneD | term4 .. | ..
term | name | .. ------|--------------------|---- term1 | Descriptive Name | .. term2 | Other Name | .. term3 | Wanted | .. .. | .. | ..
The calculation of the enrichment for all terms in the termTable can be performed with a single command:
X <- termEnrichment(termTable, foregroundIDs, universeIDs, annotation)
Alternatively, if you want to use all features from the termTable as background, and do not have any annotation you can use this call:
X <- termEnrichment(termTable, foregroundIDs)
If you have BiocParallel installed, you can use parallel = TRUE for using multiple cores of a machine.
X <- termEnrichment(termTable, foregroundIDs, universeIDs, annotation, parallel = TRUE)
The output X is a tibble with the fisher exact test p.values and odd Ratios which can be plottet using the plotEnrichment function
plotEnrichment(X)
The package makes use of r CRANpkg("tibble")
, r CRANpkg("dplyr")
and r CRANpkg("ggplot2")
. Here we load all required packages.
require(termEnrichment) require(dplyr) require(ggplot2) require(tibble) require(IHW)
In the next step we load the term tables, yeastGO and yeastInterPro are tibbles (or data.frames) with two columns each. The first columns are the ID column which uniquely identify a feature. The second column is the term column which annotates a features with a term (GO term or InterPro Domain in our case).
First for GO terms
data("yeastGO", package = "termEnrichment") head(yeastGO)
Then for InterPro Domains
data("yeastInterPro", package = "termEnrichment") head(yeastInterPro)
Next we get a list (vector) of identifiers for the foreground and the background IDs which correspond to the ids in the annotation table.
data("foregroundIDs", package = "termEnrichment") data("universeIDs", package = "termEnrichment")
The foregroundIDs could be significantly enriched genes in a study.
foregroundIDs
The universeIDs would be all identified genes in the study
head(universeIDs)
Please note, that all foregroundIDs must be contained in the universeID list!
Here we load optional tables, yeastGOdesc and yeastInterProDesc which describe the "terms". The first column of those tables are the term identifier and the other columns are annotations. Please note, that this table is not needed for the analysis, if your terms are descriptive enough, you can skip this step.
data("yeastGOdesc", package = "termEnrichment") head(yeastGOdesc)
For the InterPro domain data we will not provide any additional annotation.
Please also note that the annotation table can only have one term per row.
The analysis is very simple. Here we test for the overrepresentation of all the terms in the annotation table yeastGO.
# debugging purpose only termTable <- yeastGO annotation <- yeastGOdesc permutations = 10 terms = NULL padj.cutoff = 0.05 padj.method = "BH" alternative = "greater" parallel = FALSE quiet = T dropIDs = F correction = F removeDuplicatedTerms = T
# debugging purpose only termTable <- yeastInterPro annotation <- NULL permutations = 10 terms = NULL padj.cutoff = 0.05 padj.method = "BH" alternative = "greater" parallel = FALSE quiet = T dropIDs = F correction = F removeDuplicatedTerms = T
GOenrichment <- termEnrichment(yeastGO, foregroundIDs, universeIDs, yeastGOdesc) InterProEnrichment <- termEnrichment(yeastInterPro, foregroundIDs, universeIDs)
By default, the function performs right-sided fisher exact tests for all the terms in the data sets. That means it tests for overrepresentation rather than underrepresentation. If you want to test for both sides, you can use the parameter 'alternative = "two.sides". Use ?fisher.test for more info.
Then it does multiple hypothesis testing correction, by default Benjamini-Hochberg. You can change the method to your liking with the parameter padj.method
.
GOenrichment is a tibble consisting of the results of the GOenrichment
head(GOenrichment)
The Interpro anlaysis did show just one significant results
head(InterProEnrichment)
For plotting the results you can use the plotting function
plotEnrichment(GOenrichment %>% mutate( name = gsub(pattern=", ", replacement = "\n", name)), nameColumn = "name")
# If you are oldschool, you can use the parameter *gg = FALSE* to get a standard barplot instead of a ggplot2 barplot. plotEnrichment(GOenrichment %>% mutate( name = gsub(pattern=", ", replacement = "\n", name)), gg = F)
If there are no significant results, a message would be displayed. The minimum terms displayed is 1.
plotEnrichment(InterProEnrichment)
By default, only significant results will be displayed (which is highly recommended). For displaying non-significant results, you can set the parameter onlySignificant to false.
plotEnrichment(InterProEnrichment, onlySignificant = F)
If you do not have a background list, or the background is the whole genome you can run the command without the background list. Then the script assumes that all the genes of the annotation table are background.
GOenrichment.no.background <- termEnrichment(yeastGO, foregroundIDs, annotation = yeastGOdesc)
Please note that using a background list is very important for correct testing.
The sessionInfo() information.
sessionInfo()
# #calculate# # ## # create list of randomly selected genes with the length of foreground list # SAMPLED.LIST <- LAPPLY(1:permutations, function(x) sample_n(BACKGROUND, nrow(FOREGROUND), replace = FALSE)) #
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.