knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, error = FALSE)
Functional interpretation of the biological data typically involves identifying key genes, molecules, reactions or pathways by finding non-random changes between two or more conditions or phenotypes, and it is often followed by enrichment analysis on set of molecules selected from differential -omics analyses. Among many packages that can be applied for this task, only few provide a support for multiple species, ontology types or include statistical tests beyond simple overrepresentation analysis.
MulEA is addressing this gap by allowing enrichment analysis not only within the most popular gene and pathway ontologies (e.g. GO, KEGG, Reactome), but also in gene expression, protein domain, miRNA and transcription factors data bases created from publicly available resources, presented in standardized manner. Beyond genes or proteins, MulEA allows the user to work with with any kind of data types, i.e. small molecules, chromosome regions, enhancers, molecular interactions or any other information defined by the, provided it is submitted in the GMT format (see below), and contains reasonably large amount of categories to test. To analyse the data MulEA, provides multiple types of statistical tests in one tool, including the hypergeometric test for count based analysis (in contingency table), and analyses of ranked input by modified Kolmogorov-Smirnov test.
In addition, MulEA features improved way to calculate correction for multiple testing that assume partial dependence between ontology terms. By calculating permutation based, empirical false discovery rate correction of the p-values it limits number of incorrectly picked categories falsely scored as significant (false positives) or insignificant (false negatives).
MuEA supports the following organisms (where appropriate data is available):
knitr::kable(data.frame(Species=c("Bos taurus","Caenorhabditis elegans","Danio rerio","Daphnia pulex","Drosophila melanogaster","Drosophila simulans","Gallus gallus","Homo sapiens","Macaca mulatta","Mus musculus","Pan troglodytes","Rattus norvegicus","Xenopus laevis","Xenopus tropicalis","Arabidopsis thaliana","Zea mays","Neurospora crassa","Saccharomyces cerevisae","Schizosaccharomyces pombe","Chlamydomonas reinhardtii","Dictyostelium discoideum","Tetrahymena thermophila","Bacillus subtilis","Bacteroides thetaiotaomicron","Bifidobacterium longum","Escherichia coli","Mycobacterium tuberculosis","Salmonella enterica"), TaxID=c("9913","6239","7955","6669","7227","7240","9031","9606","9544","10090","9598","10116","8355","8364","3702","4577","5141","4932","284812","3055","44689","5911","1423","818","216816","83333","1773","99287")), caption = "Supported species and corresponding taxonomy IDs")
For the above species MulEA provides .gmt files for the following biological databases (where data is available):
knitr::kable(data.frame(Categories=c("Ontologies","Pathways","Gene expression","Transcription factor","miRNA","Protein domain","Genomic location"), Resources=c("Gene Ontology KEGG Reactome","Wikipathways Pathway Commons SignaLink","modENCODE FlyAtlas","SignaLink HTRI TRRUST ATRM YEASTRACT DBTBS RegulonDB SalmoNet","miRTarBase","PFAM","Ensembl: 5-10-20 consecutive genes Chromosome Bands Operons")), caption = "Categories and resources for the built-in .gmt files")
Implemented MulEA features include:
As a first step load the package:
library(package="MulEA")
library(devtools) load_all()
There are two input types required to run the analysis: 1) knowledge base that defines space of categories and 2) a ranked list of elements to be tested. MulEA supports reading the knowledge base directly from GMT files or with properly formatted data frames.
MulEA expects knowledge base to be in Gene Matrix Transposed (GMT) file format (*.gmt). This is a three column tabular format used in one of the first implementations of ranked based test for gene set enrichment analysis, and due to its simplicity it makes generation of knowledge bases relatively straightforward (format explanation). Example of the GMT file is included in MulEA installation directory.
In order to create data frame containing the knowledge base read the GMT file with MulEA::read_gmt()
method. The method requires one parameter file
to provide path to the file.
file
- path to the file. Example: "R/MulEA/extdata/model.gmt"
# Get path to the example file pathToModelGmtFile <- system.file(package="MulEA", "extdata", "model.gmt") # Read GMT KnowledgeBaseDf <- MulEA::read_gmt(file = pathToModelGmtFile)
knitr::kable(KnowledgeBaseDf, caption = "Model Data Frame")
This example data frame meets the criteria required by the package. Please follow the same structure and style to avoid pitfalls, such as incorrect recognition of knowledge base categories that might impact results.
str(KnowledgeBaseDf)
If you would like to save the knowledge base as a GMT file, use: MulEA::write_gmt()
with two arguments gmt
, file
.
gmt
- ontology data frame which represents the GMT file.
file
- path to a new file in which the ontology will be saved. Example: "R/MulEA/extdata/savedModel.gmt"
MulEA::write_gmt(gmt = modelDfFromFile, file = pathToModelGmtFile)
Set-based tests are the most commonly applied tests in enrichment analysis. Counts of genes or other entities are collected in form of 2x2 contingency table with rows representing 1) specific knowledge base category and 2) genes in remaining categories and columns corresponding to: 1) the data set under and 2) remaining part of the background set. This class allows adjusting the test's results (p-values) for multiple testing by Benjamini-Hochberg and permutation tests.
We start performing the set based enrichment test by submitting the data in the required form.
modelDfFromFile <- MulEA::read_gmt(file = system.file(package="MulEA", "extdata", "model.gmt")) dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742", "FBgn0029709", "FBgn0030341", "FBgn0037044", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831") dataFromExperimentPool <- unique(c(c("FBgn0033690", "FBgn0261618", "FBgn0004407", "FBgn0010438", "FBgn0032154", "FBgn0039930", "FBgn0040268", "FBgn0013674", "FBgn0037008", "FBgn0003116", "FBgn0037743", "FBgn0035401", "FBgn0037044", "FBgn0051005", "FBgn0026737", "FBgn0026751", "FBgn0038704", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831", "FBgn0000579"), c("FBgn0066666", "FBgn0000000", "FBgn0099999", "FBgn0011111", "FBgn0022222", "FBgn0777777", "FBgn0333333", "FBgn0003742", "FBgn0029709", "FBgn0030341")))
MulEA implements hypergeometric test as highly configurable ora
class.
setBasedTest <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, number_of_cpu_threads = 2) setBasedTestRes <- MulEA::run_test(setBasedTest)
In the set-based test, MulEA will inform the user when the tested data are not covered by selected knowledge base:
knitr::kable(setBasedTestRes, caption = "Set Based Test Result Data Frame")
Usage with definition of background_element_names data, which can be different that in presented model is presented below:
setBasedTestWithPool <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, number_of_cpu_threads = 2) setBasedTestWithPoolRes <- MulEA::run_test(setBasedTestWithPool)
knitr::kable(setBasedTestWithPoolRes, caption = "Set Based Test Result Data Frame")
Multiple Comparisons Problem: once can adjust the p-values of the tested ontologies by adding the adjustMethod
argument to SetBasedTest
class. The value of this argument can be newly presented by MulEA method to adjust p-values besed on permutation test. To run this method, please use "eFDR"
- permutation test as the argument for adjustMethod
.
Other available arguments for adjustMethod
are: "holm"
, "hochberg"
, "hommel"
, "bonferroni"
, "BH"
, "BY"
, "fdr"
. References to these methods can be found here: link.
setBasedTestWithPoolAndAdjust <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, p_value_adjustment_method = "eFDR", number_of_cpu_threads = 2) setBasedTestWithPoolAndAdjustRes <- MulEA::run_test(setBasedTestWithPoolAndAdjust)
knitr::kable(setBasedTestWithPoolAndAdjustRes, caption = "Set Based Test Result With Permutation Test Adjustment Data Frame")
setBasedTestWithPoolAndAdjust <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, p_value_adjustment_method = "BH", number_of_cpu_threads = 2) setBasedTestWithPoolAndAdjustRes <- MulEA::run_test(setBasedTestWithPoolAndAdjust)
knitr::kable(setBasedTestWithPoolAndAdjustRes, caption = "Set Based Test Result Data Frame")
Data frames with adjusted p-values contain one extra column, which include q-values.
ora
class constructor accepts list of arguments, such:
gmt
- required* - a data.frame, which represents the model. Read it from file or load from DBs.element_names
- required* - Vector of your experimental data. Example: dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742")
.background_element_names = character()
- default: character() - It is vector of background - background_element_names data to experiment data. Example: dataFromExperimentPool <- c("FBgn0004407", "FBgn0010438", "FBgn0003742", "FBgn0003444", "FBgn0003333"")
p_value_adjustment_method = NA
- default: NA - You can specify an algorithm which helps you with Multiple Comparisons Problem.Column names presented in results data.frame are:
ontologyId
- input copy - Column copies from the input data frame. It include onlology ids. Could be for example ids from GO.ontologyName
- input copy - Column copies from the input data frame. It include onlology names. Could be for example name from GO as "mitochondrion inheritance".listOfValues
- input copy - Column copies from the input data frame. It include all symbols undet presented ontology id. Example from GO: FBgn0004407, FBgn0010438.overlappingData
- output - Column includes set of intersection of list of values from model and provided by user experiment vector.contingencyTable
- output - This colum presents contingeny tables used to count test.p.value
- output - Cells of this column include counted p-value for provided data.q.value
- output - Cells of this column include adjusted p-value according to the model. The result of adjustment is q-value.The ranked list based enrichment analysis needs an ordered list of genes (e.g. transcripts or proteins) as input. Ranking can be an ordered vector or any vector with a vector of element_scores, both of them having to be the same length. For now MulEA provides you two ranked based tests, the Kolmogorov-Smirnov test and the Subramanian test. Both of them are enclosed in the gsea
class, which provides you a method to set input data and configure other parameters including the used statistic methods.
As mentioned previously, before running any tests you have to prepare proper input data. An example is presented below:
modelDfFromFile <- MulEA::read_gmt(file = system.file(package="MulEA", "extdata", "model.gmt")) dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742", "FBgn0029709", "FBgn0030341", "FBgn0037044", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831") dataFromExperimentScores <- c(-0.35, -0.22, -0.09, 0.11, 0.15, 0.20, 0.24, 0.28, 0.45, 0.50)
Kolmogorov-Smirnov test is achieved by setting method
argument to "KS"
. It is also required to provide the argument for element_names
.
#GSEAKs <- gsea(method = "KS", gmt = modelDfFromFile, element_names = dataFromExperiment) #GSEAKsRes <- MulEA::run_test(GSEAKs)
During the execution of this chunk you will see many warnings from ks.test function from stats package. Warnings like the following that:
## Warning in ks.test(matchedFromModelDist, randomFromExperimentDist): cannot ## compute exact p-value with ties
The reason is that data which we are using in the vignette are artificial. They are constructed to show you how MulEA is working. When you see any error or warnings when you are working with real data, it should be investigated. MulEA is not stopping warnings and error propagation from packages which are used. The interpretation of those messages belongs to user.
To use the Subramanian set the method
argument to "Subramanian"
, and input element_names
and element_scores
. It is important that the latter two vectors have to be of the same length.
GSEASubramanian <- gsea(method = "Subramanian", gmt = modelDfFromFile, element_names = dataFromExperiment, element_scores = dataFromExperimentScores) GSEASubramanianRes <- MulEA::run_test(GSEASubramanian)
Below is the list of arguments accepted by the constructor of the gsea
:
method
- required* - It allows user to choose the method, which will be used to count probabilities, one of "KS" or "Subramanian".gmt
- required* - The data.frame, which represents model. Read it from file or load from DBs.element_names
- required* - Vector of your experimental data. Example: dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742")
. In case of KS test it ranks them. When applying the Subramanian method it creates a ranking using the element_scores argument.element_scores
- if method="Subramanian": required*, default: numeric() - This argument is a vector of numbers. It creates a ranking with element_names
argument for Subramanian approach.number_of_permutations
- default: 1000 - This set number of permutations used to count p-value. You can speed up process of counting by set it to small value, but remember that it may impact the accuracy of your test.Returned data frame from any ranked based test look like the following (column specification):
ontologyId
- input copy - Column copies from the input data frame. It include onlology ids. Could be for example ids from GO.ontologyName
- input copy - Column copies from the input data frame. It include onlology names. Could be for example name from GO as "mitochondrion inheritance".listOfValues
- input copy - Column copies from the input data frame. It include all symbols undet presented ontology id. Example from GO: FBgn0004407, FBgn0010438.p.value
- output - Cells of this column include counted p-value for provided data.#knitr::kable(GSEAKsRes, caption = "Ranked Based Test Result Data Frame")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.