ADAM: Activity and Diversity Analysis Module

Overview

ADAM is a GSEA (Gene Set Enrichment Analysis) R package that groups the genes of two compared samples (control versus experiment) by function (Gene Ontology terms and KEGG pathways by default) and assesses the significance of each group with p-values for gene diversity and activity (@Castro2009). Each group of genes is called a GFAG (Group of Functionally Associated Genes). The package supports many species through their gene and function annotations.

In the package's analysis, all genes present in the expression data are grouped by function according to the domains given in the AnalysisDomain argument. The relationship between genes and functions is taken from the species annotation package. If there is no annotation package, a three-column file (gene, function and function description) must be provided. For each GFAG, gene diversity and activity are calculated in each sample. As the package always compares two samples (control versus experiment), relative gene diversity and activity are then calculated for each GFAG. Using the bootstrap method, two p-values are calculated for each GFAG, one for relative gene diversity and one for relative gene activity. The p-values are then corrected with the method defined by the PCorrectionMethod argument, generating q-values (@molan2018). The significant GFAGs are those whose q-value stays under the cutoff set by the PCorrection argument. Optionally, the Wilcoxon test and/or Fisher's exact test can also be run (@fontoura2016). These tests provide corrected p-values as well, and significant groups can be identified through them.
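The diversity and activity measures described above can be illustrated with a small base-R sketch. It assumes diversity is a normalized Shannon entropy of the expression shares within a group, activity is the total expression of the group, and each relative value is the experiment share of (experiment + control). These definitions, the function names and the toy numbers are assumptions made for illustration; the package's internal implementation may differ.

```r
# Toy expression values for one hypothetical group of four genes
control    <- c(geneA = 10, geneB = 10, geneC = 10, geneD = 10)
experiment <- c(geneA = 37, geneB = 1,  geneC = 1,  geneD = 1)

# Diversity: Shannon entropy of the expression shares, scaled to [0, 1]
# (assumed definition)
diversity <- function(x) {
  p <- x / sum(x)
  p <- p[p > 0]                      # 0 * log(0) is treated as 0
  -sum(p * log(p)) / log(length(x))
}

# Activity: total expression of the group (assumed definition)
activity <- function(x) sum(x)

# Relative measure: experiment share of (experiment + control)
relative <- function(exp_value, ctl_value) exp_value / (exp_value + ctl_value)

H_control    <- diversity(control)
H_experiment <- diversity(experiment)
h <- relative(H_experiment, H_control)
n <- relative(activity(experiment), activity(control))
```

In this toy group the total expression is the same in both samples, so n is 0.5, while the expression in the experiment is concentrated in a single gene, so h falls below 0.5.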

GFAGAnalysis

The GFAGAnalysis function provides a complete analysis, using all available arguments. As an example, let's consider a gene expression set of Aedes aegypti:

suppressMessages(library(ADAM))

data("ExpressionAedes")
head(ExpressionAedes)

The first column holds the gene names, while the others hold the expression values obtained in a specific experiment (in this case, RNA-seq). ADAM always needs two samples (control versus experiment) per comparison, so each comparison is written as a pair of sample columns from the expression data, separated by a comma. Here we define two comparisons:

Comparison <- c("control1,experiment1","control2,experiment2")
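As a quick sanity check before running the analysis, the comparison strings can be split and matched against the columns of the expression table. The sketch below uses a small hypothetical table (`expr`) in place of ExpressionAedes and assumes each string has the form "control,experiment"; it is not part of the ADAM API.

```r
# Hypothetical expression table: gene names first, then sample columns
expr <- data.frame(gene = c("g1", "g2", "g3"),
                   control1 = c(5, 0, 2), experiment1 = c(7, 1, 0),
                   control2 = c(4, 1, 3), experiment2 = c(9, 0, 1),
                   stringsAsFactors = FALSE)

Comparison <- c("control1,experiment1", "control2,experiment2")

# Split each "control,experiment" string into its two column names
pairs <- strsplit(Comparison, ",", fixed = TRUE)

# Every name must match a column of the expression table
missing_cols <- setdiff(unlist(pairs), colnames(expr))
stopifnot(length(missing_cols) == 0)

# One subset per comparison: gene names plus the two selected samples
subsets <- lapply(pairs, function(p) expr[, c("gene", p)])
```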

Each GFAG has a number of genes associated with it, so the analysis can consider all GFAGs or only those whose gene count falls within a given range:

Minimum <- 3
Maximum <- 20
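The effect of these two limits can be pictured with a short base-R sketch that counts genes per group in a hypothetical annotation table and keeps only the groups within range. The table, its column names and the inclusive bounds are assumptions for illustration; how ADAM applies MinGene and MaxGene internally may differ.

```r
# Hypothetical gene-to-function table (gene, function, description)
annotation <- data.frame(
  gene = paste0("g", 1:9),
  group = c("P1", "P1", "P1", "P1", "P2", "P2", "P3", "P3", "P3"),
  description = c(rep("pathway one", 4), rep("pathway two", 2),
                  rep("pathway three", 3)),
  stringsAsFactors = FALSE
)

Minimum <- 3
Maximum <- 20

# Number of genes annotated to each group
group_sizes <- table(annotation$group)

# Keep groups whose size lies within [Minimum, Maximum] (assumed inclusive)
kept <- names(group_sizes)[group_sizes >= Minimum & group_sizes <= Maximum]
```

Here P2 has only two genes and is dropped, while P1 and P3 are kept.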

The p-values for each GFAG are calculated with the bootstrap method, which requires a seed for random number generation and a number of bootstrap steps (the number of steps should be large enough to ensure the desired p-value precision):

SeedBootstrap <- 1049
StepsBootstrap <- 1000
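To see why the seed and the number of steps matter, here is a minimal resampling sketch in base R. It assumes the null distribution is built from random gene sets of the same size as the group and that the statistic is a relative activity; both are assumptions for illustration and not necessarily the procedure ADAM implements. All names and data are hypothetical.

```r
set.seed(1049)                       # plays the role of SeedBootstrap
n_steps <- 1000                      # plays the role of StepsBootstrap

# Hypothetical expression values for 200 genes in two samples
control    <- rexp(200, rate = 1 / 10)
experiment <- rexp(200, rate = 1 / 10)

# Relative activity of a gene set: experiment share of total expression
rel_activity <- function(idx) {
  sum(experiment[idx]) / (sum(experiment[idx]) + sum(control[idx]))
}

group <- 1:10                        # indices of the genes in one toy group
observed <- rel_activity(group)

# Null distribution: random gene sets of the same size as the group
null_values <- replicate(n_steps, rel_activity(sample(200, length(group))))

# Two-sided p-value around the null mean; the +1 terms keep it above zero
center  <- mean(null_values)
extreme <- abs(null_values - center) >= abs(observed - center)
p_value <- (sum(extreme) + 1) / (n_steps + 1)
```

With 1000 steps the smallest p-value this sketch can report is 1/1001, which is one way to read the remark about precision: more steps allow finer p-values. Fixing the seed makes the result reproducible.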

The p-values are corrected with the chosen method, and the resulting q-values are compared against a cutoff value:

CutoffValue <- 0.05
MethodCorrection <- "fdr"
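The correction step can be reproduced on toy numbers with `p.adjust` from base R, which accepts "fdr" as a method name. Whether ADAM calls `p.adjust` internally is an assumption here; the p-values are invented for illustration.

```r
CutoffValue <- 0.05
MethodCorrection <- "fdr"

# Hypothetical raw p-values for five groups
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.60)

# Corrected p-values (q-values)
q_values <- p.adjust(p_values, method = MethodCorrection)

# Label each group by comparing its q-value against the cutoff
significance <- ifelse(q_values < CutoffValue,
                       "significative", "not-significative")
```

In this example the third and fourth groups have raw p-values below 0.05 but q-values above it, so only the first two groups are labeled "significative" after correction.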

Grouping the genes by function requires either an annotation package or a file relating genes to functions. Aedes aegypti doesn't have an annotation package, so we use our own file:

data("KeggPathwaysAedes")
head(KeggPathwaysAedes)
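For a species without an annotation package, a table of this kind can be read from a tab-delimited file. The sketch below parses an inline string instead of a file and checks the three-column layout (gene, function, function description). The column names and gene identifiers are placeholders chosen for illustration, not the actual contents of KeggPathwaysAedes.

```r
# Tab-delimited text standing in for a user-supplied annotation file;
# gene IDs are placeholders, pathway IDs are illustrative
raw <- "gene\tpathway\tdescription
AAEL000001\taag00010\tGlycolysis / Gluconeogenesis
AAEL000002\taag00010\tGlycolysis / Gluconeogenesis
AAEL000003\taag00020\tCitrate cycle (TCA cycle)"

annotation <- read.delim(text = raw, stringsAsFactors = FALSE)

# Basic layout checks: exactly three columns and no missing entries
stopifnot(ncol(annotation) == 3, !any(is.na(annotation)))

# Genes listed per function, as a named list
genes_by_function <- split(annotation$gene, annotation$pathway)
```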

The function domain and gene nomenclature in use must also be specified. As Aedes aegypti doesn't have an annotation package, the domain will be "own" and the nomenclature "geneStableID":

Domain <- "own"
Nomenclature <- "geneStableID"

Both the Wilcoxon test and Fisher's exact test will be run:

Wilcoxon <- TRUE
Fisher <- TRUE

With all arguments defined, we can run the GFAGAnalysis function:

ResultAnalysis <- suppressMessages(GFAGAnalysis(ComparisonID = Comparison, 
                            ExpressionData = ExpressionAedes,
                            MinGene = Minimum,
                            MaxGene = Maximum,
                            SeedNumber = SeedBootstrap, 
                            BootstrapNumber = StepsBootstrap,
                            PCorrection = CutoffValue,
                            DBSpecies = KeggPathwaysAedes, 
                            PCorrectionMethod = MethodCorrection,
                            WilcoxonTest = Wilcoxon,
                            FisherTest = Fisher,
                            AnalysisDomain = Domain, 
                            GeneIdentifier = Nomenclature))

In the example above, we used the function suppressMessages only to hide the messages printed while GFAGAnalysis runs. The output of GFAGAnalysis is a list with two elements. The first is a data frame showing the genes and their respective functions:

head(ResultAnalysis[[1]])

The second element of the output list holds one data frame for each comparison given in the ComparisonID argument:

DT::datatable(as.data.frame(ResultAnalysis[[2]][1]), width = 800,
            options = list(scrollX = TRUE))
DT::datatable(as.data.frame(ResultAnalysis[[2]][2]), width = 800, 
            options = list(scrollX = TRUE))

The data frames corresponding to the second element of the list have the following columns:

ID - A code identifying the GFAG (GO term, KEGG pathway or an identifier from the user's annotations).
Description - Description of the GFAG.
Raw_Number_Genes -
Sample_Number_Genes -
H_ - Two columns. GFAG gene diversity of each sample (control versus experiment).
N_ - Two columns. GFAG gene activity of each sample (control versus experiment).
h - Relative gene diversity.
n - Relative gene activity.
pValue_h - GFAG p-value related to gene diversity.
pValue_n - GFAG p-value related to gene activity.
qValue_h - GFAG corrected p-value related to gene diversity.
qValue_n - GFAG corrected p-value related to gene activity.
Significance_h - GFAG significance related to gene diversity. "significative" means the q-value is lower than the cutoff set by PCorrection argument, while "not-significative" means the opposite.
Significance_n - GFAG significance related to gene activity. "significative" means the q-value is lower than the cutoff set by PCorrection argument, while "not-significative" means the opposite.
Wilcox_pvalue - GFAG p-value generated by Wilcoxon test.
Wilcox_qvalue - Wilcoxon GFAG corrected p-value.
Wilcox_significance - GFAG significance related to the Wilcoxon test. "significative" means the q-value is lower than the cutoff set by PCorrection argument, while "not-significative" means the opposite.
Fisher_pvalue - GFAG p-value generated by Fisher's exact test.
Fisher_qvalue - Fisher GFAG corrected p-value.
Fisher_significance - GFAG significance related to Fisher's exact test. "significative" means the q-value is lower than the cutoff set by PCorrection argument, while "not-significative" means the opposite.
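Once the columns are known, the significant GFAGs can be pulled out with ordinary data frame subsetting. The sketch below works on a small mock data frame that reuses a few of the column names listed above; the values are invented, and a real result would come from ResultAnalysis[[2]].

```r
# Mock result with a subset of the columns described above
result <- data.frame(
  ID = c("P1", "P2", "P3"),
  Description = c("pathway one", "pathway two", "pathway three"),
  qValue_h = c(0.01, 0.20, 0.70),
  qValue_n = c(0.03, 0.04, 0.90),
  Significance_h = c("significative", "not-significative", "not-significative"),
  Significance_n = c("significative", "significative", "not-significative"),
  stringsAsFactors = FALSE
)

# GFAGs significant for diversity, activity, or both
either <- result[result$Significance_h == "significative" |
                 result$Significance_n == "significative", ]

# Order by the smaller of the two q-values
either <- either[order(pmin(either$qValue_h, either$qValue_n)), ]
```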

ADAnalysis

The ADAnalysis function provides a partial analysis: only the gene diversity and activity of each GFAG are calculated, with no significance from bootstrap, Wilcoxon or Fisher tests. As an example, let's consider the same gene expression set of Aedes aegypti used in the GFAGAnalysis function example:

suppressMessages(library(ADAM))
data("ExpressionAedes")
data("KeggPathwaysAedes")

As ADAM always needs two samples (control versus experiment), let's select two sample columns from the expression data and define the minimum and maximum number of genes per GFAG:

Comparison <- c("control1,experiment1")
Minimum <- 3
Maximum <- 100

Aedes aegypti doesn't have an annotation package, so we use our own file:

SpeciesID <- "KeggPathwaysAedes"

The function domain and gene nomenclature in use must also be specified. As Aedes aegypti doesn't have an annotation package, the domain will be "own" and the nomenclature "geneStableID":

Domain <- "own"
Nomenclature <- "geneStableID"

With all arguments defined, we can run the ADAnalysis function:

ResultAnalysis <- suppressMessages(ADAnalysis(ComparisonID = Comparison, 
                            ExpressionData = ExpressionAedes,
                            MinGene = Minimum,
                            MaxGene = Maximum,
                            DBSpecies = KeggPathwaysAedes, 
                            AnalysisDomain = Domain, 
                            GeneIdentifier = Nomenclature))

In the example above, we used the function suppressMessages only to hide the messages printed while ADAnalysis runs. The output of ADAnalysis is a list with two elements. The first is a data frame showing the genes and their respective functions:

head(ResultAnalysis[[1]])

The second element of the output list holds one data frame for each comparison given in the ComparisonID argument:

DT::datatable(as.data.frame(ResultAnalysis[[2]][1]), width = 800, 
            options = list(scrollX = TRUE))

The data frames corresponding to the second element of the list have the following columns:

ID - A code identifying the GFAG (GO term, KEGG pathway or an identifier from the user's annotations).
Description - Description of the GFAG.
Raw_Number_Genes -
Sample_Number_Genes -
H_ - Two columns. GFAG gene diversity of each sample (control versus experiment).
N_ - Two columns. GFAG gene activity of each sample (control versus experiment).
h - Relative gene diversity.
n - Relative gene activity.
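Since ADAnalysis reports no significance, one simple way to explore its output is to rank GFAGs by how far their relative activity lies from an even split between the two samples. The sketch below assumes relative values range from 0 to 1, with 0.5 meaning equal activity in control and experiment. That scale, the mock data and the helper column `shift_n` are assumptions for illustration.

```r
# Mock ADAnalysis-style result with the relative measures only
ad_result <- data.frame(
  ID = c("P1", "P2", "P3"),
  h = c(0.50, 0.42, 0.55),
  n = c(0.52, 0.80, 0.31),
  stringsAsFactors = FALSE
)

# Distance of relative activity from 0.5 (hypothetical helper column)
ad_result$shift_n <- abs(ad_result$n - 0.5)

# Largest shifts first
ranked <- ad_result[order(-ad_result$shift_n), ]
```

Such a ranking is only descriptive; it does not replace the bootstrap-based significance provided by GFAGAnalysis.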