knitr::opts_chunk$set(echo = TRUE, warning = TRUE, message = TRUE, error = FALSE)

Introduction

Functional interpretation of the biological data typically involves identifying key genes, molecules, reactions or pathways by finding non-random changes between two or more conditions or phenotypes, and it is often followed by enrichment analysis on set of molecules selected from differential -omics analyses. Among many packages that can be applied for this task, only few provide a support for multiple species, ontology types or include statistical tests beyond simple overrepresentation analysis.

MulEA is addressing this gap by allowing enrichment analysis not only within the most popular gene and pathway ontologies (e.g. GO, KEGG, Reactome), but also in gene expression, protein domain, miRNA and transcription factors data bases created from publicly available resources, presented in standardized manner. Beyond genes or proteins, MulEA allows the user to work with with any kind of data types, i.e. small molecules, chromosome regions, enhancers, molecular interactions or any other information defined by the, provided it is submitted in the GMT format (see below), and contains reasonably large amount of categories to test. To analyse the data MulEA, provides multiple types of statistical tests in one tool, including the hypergeometric test for count based analysis (in contingency table), and analyses of ranked input by modified Kolmogorov-Smirnov test.

In addition, MulEA features improved way to calculate correction for multiple testing that assume partial dependence between ontology terms. By calculating permutation based, empirical false discovery rate correction of the p-values it limits number of incorrectly picked categories falsely scored as significant (false positives) or insignificant (false negatives).

Supported organisms and knowledge bases

MuEA supports the following organisms (where appropriate data is available):

knitr::kable(data.frame(Species=c("Bos taurus","Caenorhabditis elegans","Danio rerio","Daphnia pulex","Drosophila melanogaster","Drosophila simulans","Gallus gallus","Homo sapiens","Macaca mulatta","Mus musculus","Pan troglodytes","Rattus norvegicus","Xenopus laevis","Xenopus tropicalis","Arabidopsis thaliana","Zea mays","Neurospora crassa","Saccharomyces cerevisae","Schizosaccharomyces pombe","Chlamydomonas reinhardtii","Dictyostelium discoideum","Tetrahymena thermophila","Bacillus subtilis","Bacteroides thetaiotaomicron","Bifidobacterium longum","Escherichia coli","Mycobacterium tuberculosis","Salmonella enterica"),
                 TaxID=c("9913","6239","7955","6669","7227","7240","9031","9606","9544","10090","9598","10116","8355","8364","3702","4577","5141","4932","284812","3055","44689","5911","1423","818","216816","83333","1773","99287")), caption = "Supported species and corresponding taxonomy IDs")

For the above species MulEA provides .gmt files for the following biological databases (where data is available):

knitr::kable(data.frame(Categories=c("Ontologies","Pathways","Gene expression","Transcription factor","miRNA","Protein domain","Genomic location"),
                 Resources=c("Gene Ontology KEGG Reactome","Wikipathways Pathway Commons SignaLink","modENCODE FlyAtlas","SignaLink HTRI TRRUST ATRM YEASTRACT DBTBS RegulonDB SalmoNet","miRTarBase","PFAM","Ensembl: 5-10-20 consecutive genes Chromosome Bands Operons")), caption = "Categories and resources for the built-in .gmt files")

Implemented MulEA features include:

Mulea input and output data formats

As a first step load the package:

library(package="MulEA")
library(devtools)
load_all()

There are two input types required to run the analysis: 1) knowledge base that defines space of categories and 2) a ranked list of elements to be tested. MulEA supports reading the knowledge base directly from GMT files or with properly formatted data frames.

MulEA expects knowledge base to be in Gene Matrix Transposed (GMT) file format (*.gmt). This is a three column tabular format used in one of the first implementations of ranked based test for gene set enrichment analysis, and due to its simplicity it makes generation of knowledge bases relatively straightforward (format explanation). Example of the GMT file is included in MulEA installation directory.

In order to create data frame containing the knowledge base read the GMT file with MulEA::read_gmt() method. The method requires one parameter file to provide path to the file.

file - path to the file. Example: "R/MulEA/extdata/model.gmt"

# Get path to the example file
pathToModelGmtFile <- system.file(package="MulEA", "extdata", "model.gmt")

# Read GMT
KnowledgeBaseDf <- MulEA::read_gmt(file = pathToModelGmtFile)
knitr::kable(KnowledgeBaseDf, caption = "Model Data Frame")

This example data frame meets the criteria required by the package. Please follow the same structure and style to avoid pitfalls, such as incorrect recognition of knowledge base categories that might impact results.

str(KnowledgeBaseDf)

If you would like to save the knowledge base as a GMT file, use: MulEA::write_gmt() with two arguments gmt, file.

gmt - ontology data frame which represents the GMT file. file - path to a new file in which the ontology will be saved. Example: "R/MulEA/extdata/savedModel.gmt"

   MulEA::write_gmt(gmt = modelDfFromFile, file = pathToModelGmtFile)  

Set Based Test

Set-based tests are the most commonly applied tests in enrichment analysis. Counts of genes or other entities are collected in form of 2x2 contingency table with rows representing 1) specific knowledge base category and 2) genes in remaining categories and columns corresponding to: 1) the data set under and 2) remaining part of the background set. This class allows adjusting the test's results (p-values) for multiple testing by Benjamini-Hochberg and permutation tests.

We start performing the set based enrichment test by submitting the data in the required form.

modelDfFromFile <- MulEA::read_gmt(file = system.file(package="MulEA", "extdata", "model.gmt"))
dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742", "FBgn0029709", "FBgn0030341", "FBgn0037044", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831")
dataFromExperimentPool <- unique(c(c("FBgn0033690", "FBgn0261618", "FBgn0004407", "FBgn0010438", "FBgn0032154", "FBgn0039930", "FBgn0040268", "FBgn0013674",
                                   "FBgn0037008", "FBgn0003116", "FBgn0037743", "FBgn0035401", "FBgn0037044", "FBgn0051005", "FBgn0026737", "FBgn0026751",
                                   "FBgn0038704", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831", "FBgn0000579"),
                                 c("FBgn0066666", "FBgn0000000", "FBgn0099999", "FBgn0011111", "FBgn0022222", "FBgn0777777", "FBgn0333333", "FBgn0003742",
                                   "FBgn0029709", "FBgn0030341")))

MulEA implements hypergeometric test as highly configurable ora class.

setBasedTest <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, number_of_cpu_threads = 2)
setBasedTestRes <- MulEA::run_test(setBasedTest)

In the set-based test, MulEA will inform the user when the tested data are not covered by selected knowledge base:

knitr::kable(setBasedTestRes, caption = "Set Based Test Result Data Frame")

Usage with definition of background_element_names data, which can be different that in presented model is presented below:

setBasedTestWithPool <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, number_of_cpu_threads = 2)
setBasedTestWithPoolRes <- MulEA::run_test(setBasedTestWithPool)
knitr::kable(setBasedTestWithPoolRes, caption = "Set Based Test Result Data Frame")

Permutation based correction for multiple testing

Multiple Comparisons Problem: once can adjust the p-values of the tested ontologies by adding the adjustMethod argument to SetBasedTest class. The value of this argument can be newly presented by MulEA method to adjust p-values besed on permutation test. To run this method, please use "eFDR" - permutation test as the argument for adjustMethod. Other available arguments for adjustMethod are: "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr". References to these methods can be found here: link.

setBasedTestWithPoolAndAdjust <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, p_value_adjustment_method = "eFDR", number_of_cpu_threads = 2)
setBasedTestWithPoolAndAdjustRes <- MulEA::run_test(setBasedTestWithPoolAndAdjust)
knitr::kable(setBasedTestWithPoolAndAdjustRes, caption = "Set Based Test Result With Permutation Test Adjustment Data Frame")
setBasedTestWithPoolAndAdjust <- ora(gmt = modelDfFromFile, element_names = dataFromExperiment, background_element_names = dataFromExperimentPool, p_value_adjustment_method = "BH", number_of_cpu_threads = 2)
setBasedTestWithPoolAndAdjustRes <- MulEA::run_test(setBasedTestWithPoolAndAdjust)
knitr::kable(setBasedTestWithPoolAndAdjustRes, caption = "Set Based Test Result Data Frame")

Data frames with adjusted p-values contain one extra column, which include q-values.

ora class constructor accepts list of arguments, such:

Column names presented in results data.frame are:

Ranking Based Tests

The ranked list based enrichment analysis needs an ordered list of genes (e.g. transcripts or proteins) as input. Ranking can be an ordered vector or any vector with a vector of element_scores, both of them having to be the same length. For now MulEA provides you two ranked based tests, the Kolmogorov-Smirnov test and the Subramanian test. Both of them are enclosed in the gsea class, which provides you a method to set input data and configure other parameters including the used statistic methods.

As mentioned previously, before running any tests you have to prepare proper input data. An example is presented below:

modelDfFromFile <- MulEA::read_gmt(file = system.file(package="MulEA", "extdata", "model.gmt"))
dataFromExperiment <- c("FBgn0004407", "FBgn0010438", "FBgn0003742", "FBgn0029709", "FBgn0030341", "FBgn0037044", "FBgn0002887", "FBgn0028434", "FBgn0030170", "FBgn0263831")
dataFromExperimentScores <- c(-0.35, -0.22, -0.09, 0.11, 0.15, 0.20, 0.24, 0.28, 0.45, 0.50)

Kolmogorov Smirnov Test

Kolmogorov-Smirnov test is achieved by setting method argument to "KS". It is also required to provide the argument for element_names.

#GSEAKs <- gsea(method = "KS", gmt = modelDfFromFile, element_names = dataFromExperiment)
#GSEAKsRes <- MulEA::run_test(GSEAKs)

During the execution of this chunk you will see many warnings from ks.test function from stats package. Warnings like the following that:

## Warning in ks.test(matchedFromModelDist, randomFromExperimentDist): cannot
## compute exact p-value with ties

The reason is that data which we are using in the vignette are artificial. They are constructed to show you how MulEA is working. When you see any error or warnings when you are working with real data, it should be investigated. MulEA is not stopping warnings and error propagation from packages which are used. The interpretation of those messages belongs to user.

Subramanian Test

To use the Subramanian set the method argument to "Subramanian", and input element_names and element_scores. It is important that the latter two vectors have to be of the same length.

GSEASubramanian <- gsea(method = "Subramanian", gmt = modelDfFromFile, element_names = dataFromExperiment, element_scores = dataFromExperimentScores)
GSEASubramanianRes <- MulEA::run_test(GSEASubramanian)

Below is the list of arguments accepted by the constructor of the gsea:

Returned data frame from any ranked based test look like the following (column specification):

#knitr::kable(GSEAKsRes, caption = "Ranked Based Test Result Data Frame")


koralgooll/MulEA documentation built on Nov. 23, 2023, 3:27 p.m.