Applies over-representation analysis and gene set enrichment analysis to a supplied gene list.
wrapClusterProfiler(
genes,
newheader = NULL,
backgroundlist = NULL,
newheaderBackground = NULL,
projectfolder = "clusterProfiler",
projectname = "",
analysis_type = c("over-representation", "GSEA"),
enrichmentCat = c("GO", "KEGG", "Reactome", "DO"),
maxInputGenes = 100,
id.type = "ENTREZID",
id.column = "ENTREZID",
sortcolumn = "adj.P.Val",
highValueHighPriority = FALSE,
sortcolumn.threshold = 0.05,
fun.transform = function(x) { identity(x) },
FCcolumn = "logFC",
threshold_FC = log2(1.5),
returnSymbolsInResultObject = TRUE,
org = "human",
pAdjustMethod = "BH",
enrich.p.valueCutoff = 0.05,
enrich.q.valueCutoff = 0.05,
nPerm = 1000,
minGSSize = 10,
maxGSSize = 500,
figure.res = 300
)
genes |
the input data object can be given in several formats.
|
newheader |
optional character vector with new header information for |
backgroundlist |
the background list of gene names/IDs for enrichment analysis can be given in several formats.
|
newheaderBackground |
optional character vector with new header information for |
projectfolder |
character with directory for output files (will be generated if not existing). |
projectname |
optional character prefix for output file names. |
analysis_type |
character vector giving the type of analysis to be performed. Either "over-representation", "GSEA" or both. |
enrichmentCat |
character vector with categories to be enriched ( |
maxInputGenes |
(numeric) max number of top diff regulated elements used for over-representation analysis (or NULL). |
id.type |
character with identifier type from annotation package ( |
id.column |
character with column name for identifier variable in |
sortcolumn |
character with column name of quantitative data in |
highValueHighPriority |
(logical) priority order of values in |
sortcolumn.threshold |
numeric threshold for |
fun.transform |
GSEA needs an input gene list with priority in decreasing order (high values have highest priority).
Since quantitative values given in |
FCcolumn |
(character) optional column name of foldchanges in |
threshold_FC |
(numeric) Fold change threshold for filtering (threshold interpreted for log2 transformed foldchange values!)
Only relevant for over-representation analysis if an unfiltered gene list is given in |
returnSymbolsInResultObject |
logical. if |
org |
character with name of organism ("human", "mouse", "rat"). |
pAdjustMethod |
method for adjusting for multiple testing. One of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" |
enrich.p.valueCutoff |
numeric p-value threshold for returned enrichment terms. |
enrich.q.valueCutoff |
numeric q-value threshold for returned enrichment terms. |
nPerm |
number of permutations for gene set enrichment analysis. |
minGSSize |
minimum number of genes annotated to an ontology term for the term to be tested. |
maxGSSize |
maximum number of genes annotated to a term for the term to be tested. |
figure.res |
numeric resolution for output png. |
This function uses a gene list and, optionally, a background gene list as input to perform enrichment analysis using the clusterProfiler and DOSE packages. By default, the function performs two kinds of analysis for all categories given in enrichmentCat.
over-representation analysis: DOSE implements a hypergeometric model to assess whether the number of selected genes annotated with a term or pathway is larger than expected by chance.
For this approach, a cutoff has usually been applied to select genes of interest, e.g. from differential expression analysis. All selected genes are treated with equal priority.
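The hypergeometric model described above can be illustrated with base R alone. This is a minimal sketch with made-up counts, not the code path used by DOSE or by wrapClusterProfiler; the variable names N, M, n and k are assumptions introduced for the example.

```r
# Toy numbers (all hypothetical).
N <- 20000  # genes in the background
M <- 150    # background genes annotated to the term
n <- 100    # selected genes (e.g. differentially expressed)
k <- 6      # selected genes that are annotated to the term

# Overlap expected by chance.
expected <- n * M / N

# P(X >= k) under the hypergeometric model (upper tail, hence k - 1).
p_value <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test written as a one-sided Fisher's exact test on the 2x2 table
# (rows: annotated / not annotated, columns: selected / not selected).
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Here the observed overlap (6 genes) is well above the expected overlap (0.75 genes), so the term would be reported as over-represented at the default cutoff of 0.05, before any multiple testing adjustment.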
gene set enrichment analysis (GSEA): for GSEA, all genes under investigation can be used as input. They are ranked based on a corresponding quantitative value given in sortcolumn. Given an a priori defined set of genes S (e.g., genes sharing the same GO term), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom.
This approach can handle a situation where the extent of differential expression is small, but evidenced in a coordinated way in a set of related genes. GSEA aggregates the per-gene statistics across genes within a gene set, making it possible to detect situations where all genes in a predefined set change in a small but coordinated way. If no quantitative data is provided (sortcolumn = NULL), GSEA is skipped.
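As a rough illustration of the ranked input that GSEA expects, the following sketch builds a named numeric vector in decreasing order using base R. The gene IDs and values are invented, and the -log10 transform is only one plausible choice for a function passed as fun.transform when the sort column holds p-values; it is an assumption, not the function's documented behaviour.

```r
# Hypothetical per-gene statistics keyed by Entrez ID.
stats <- c("1017" = 0.8, "7157" = -2.1, "672" = 1.9, "5290" = 0.1)

# GSEA-style input: named numeric vector, highest priority first.
gene_list <- sort(stats, decreasing = TRUE)

# If the sort column holds p-values (small value = high priority), a
# transform such as -log10 turns them into "high value, high priority".
pvals <- c("1017" = 0.04, "7157" = 0.0001, "672" = 0.003, "5290" = 0.6)
transform_p <- function(x) -log10(x)
ranked_by_p <- sort(transform_p(pvals), decreasing = TRUE)
```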
The names/IDs are converted to ENTREZ IDs (if necessary) prior to enrichment using the annotation package for the species denoted in org. A background list with genes under investigation (e.g. expression array content) can be provided in backgroundlist to be used as background for over-representation analysis. Alternatively, all genes of the designated organism can be obtained from the respective annotation package (for GSEA, all genes of the designated organism are used anyway).
Optionally, quantitative data in sortcolumn can be used for sorting and filtering (using sortcolumn.threshold or maxInputGenes) the input gene list for over-representation analysis, if not filtered prior to that.
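The sorting and filtering step can be pictured with the sketch below. It is written in base R against an invented table and is only an assumption about how such a filter might behave (for example, whether thresholds are inclusive and whether the fold change threshold applies to absolute values is not stated in this page); it is not taken from the function's source.

```r
# Hypothetical unfiltered result table; column names follow the defaults
# of id.column, sortcolumn and FCcolumn in the usage section.
genes <- data.frame(
  ENTREZID  = c("1", "2", "3", "4", "5"),
  adj.P.Val = c(0.20, 0.001, 0.03, 0.04, 0.0005),
  logFC     = c(1.2, -0.9, 0.1, 2.0, 1.5),
  stringsAsFactors = FALSE
)

sortcolumn.threshold  <- 0.05
threshold_FC          <- log2(1.5)
maxInputGenes         <- 2
highValueHighPriority <- FALSE  # small adjusted p-values rank first

# Sort so that the highest-priority genes come first.
sorted <- genes[order(genes$adj.P.Val, decreasing = highValueHighPriority), ]

# Keep genes passing the sort-column threshold (assumed inclusive) ...
keep <- if (highValueHighPriority) {
  sorted$adj.P.Val >= sortcolumn.threshold
} else {
  sorted$adj.P.Val <= sortcolumn.threshold
}
# ... and the log2 fold change threshold (assumed to apply to |logFC|).
keep <- keep & abs(sorted$logFC) >= threshold_FC

# Finally cap the list at maxInputGenes.
selected <- head(sorted[keep, ], maxInputGenes)
selected$ENTREZID
```

With these toy values, genes "5", "2" and "4" pass both thresholds, and the cap of two genes leaves "5" and "2".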
The function generates visualisations for each enrichment result, such as cnetplots, enrichment maps and dotplots.
Enriched KEGG pathway maps (if any) are annotated for input genes using the pathview package.
If the quantitative data in sortcolumn are not fold changes, fold changes may be given additionally in FCcolumn for annotation purposes in cnetplots as well as for KEGG pathway mapping; otherwise the data in sortcolumn are used for this. Be aware that the legend of the cnetplots will read "Fold Change" in either case.
List of enrichment objects defined in the DOSE package (enrichResult objects for over-representation analysis and gseaResult objects for gene set enrichment analysis).
Enrichment tables and plots are stored in the project folder as side effects.
There are three key elements of the GSEA method (taken from https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/GSEA.html):
Calculation of an Enrichment Score: The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when the gene is not in S. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic.
Estimation of Significance Level of ES: The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
Adjustment for Multiple Hypothesis Testing: When all gene sets have been evaluated, DOSE adjusts the estimated significance level to account for multiple hypothesis testing; q-values are also calculated for FDR control.
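The three elements above can be sketched in a few lines of base R. This is a simplified illustration, not the implementation used by DOSE or clusterProfiler: the helper names enrichment_score and perm_p_value are hypothetical, the data are simulated, and the p-value is computed two-sided on the absolute ES for brevity.

```r
# Running-sum enrichment score for gene set `set` on `stats`, a named
# numeric vector sorted in decreasing order (weighted KS-like statistic).
enrichment_score <- function(stats, set, p = 1) {
  in_set <- names(stats) %in% set
  n_miss <- sum(!in_set)
  if (!any(in_set) || n_miss == 0) return(0)
  w <- abs(stats)^p
  # Hits step up in proportion to their statistic, misses step down evenly.
  step <- ifelse(in_set, w / sum(w[in_set]), -1 / n_miss)
  running <- cumsum(step)
  running[which.max(abs(running))]  # signed maximum deviation from zero
}

# Permutation p-value: shuffle the gene labels and recompute the ES.
perm_p_value <- function(stats, set, n_perm = 1000) {
  observed <- enrichment_score(stats, set)
  null <- replicate(n_perm, {
    shuffled <- stats
    names(shuffled) <- sample(names(stats))
    enrichment_score(shuffled, set)
  })
  # The +1 correction keeps the estimate away from exactly zero.
  (sum(abs(null) >= abs(observed)) + 1) / (n_perm + 1)
}

set.seed(1)
stats <- sort(setNames(rnorm(200), paste0("g", 1:200)), decreasing = TRUE)
gene_sets <- list(
  top    = names(stats)[1:15],       # concentrated at the top of the list
  random = sample(names(stats), 15)  # no particular position
)

es    <- sapply(gene_sets, function(s) enrichment_score(stats, s))
p_raw <- sapply(gene_sets, function(s) perm_p_value(stats, s, n_perm = 500))
p_adj <- p.adjust(p_raw, method = "BH")  # multiple testing adjustment
```

The set placed entirely at the top of the ranking reaches the largest possible score, while a randomly drawn set typically stays much closer to zero. The number of permutations plays the role of nPerm, and the adjustment method that of pAdjustMethod.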
Frank Ruehle