wrapClusterProfiler: Wrapper for gene enrichment analysis using clusterProfiler
In frankRuehle/systemsbio: Streamlined Analysis and Integration of Systems Biology Data


Applies over-representation analysis and gene set enrichment analysis (GSEA) to a supplied gene list.

wrapClusterProfiler(
  genes,
  newheader = NULL,
  backgroundlist = NULL,
  newheaderBackground = NULL,
  projectfolder = "clusterProfiler",
  projectname = "",
  analysis_type = c("over-representation", "GSEA"),
  enrichmentCat = c("GO", "KEGG", "Reactome", "DO"),
  maxInputGenes = 100,
  id.type = "ENTREZID",
  id.column = "ENTREZID",
  sortcolumn = "adj.P.Val",
  highValueHighPriority = FALSE,
  sortcolumn.threshold = 0.05,
  fun.transform = function(x) { identity(x) },
  FCcolumn = "logFC",
  threshold_FC = log2(1.5),
  returnSymbolsInResultObject = TRUE,
  org = "human",
  pAdjustMethod = "BH",
  enrich.p.valueCutoff = 0.05,
  enrich.q.valueCutoff = 0.05,
  nPerm = 1000,
  minGSSize = 10,
  maxGSSize = 500,
  figure.res = 300
)

`genes`	the input data object, which can be given in several formats. Vector with gene names/IDs: all genes are used for over-representation analysis; the gene names/IDs must be in the format given in `id.type`. Dataframe with gene names/IDs and quantitative values: genes can be filtered by the quantitative value prior to over-representation analysis (see `sortcolumn.threshold` and `maxInputGenes`), and the unfiltered list can be used as background list; in addition to over-representation analysis, a GSEA is performed using the full gene list ranked by the quantitative value; columns of the dataframe must be named according to the parameters `id.column` and `sortcolumn`. Character with path to dataframe: as above; the dataframe is loaded from the given path. List of items: each item is processed separately as indicated above.
`newheader`	optional character vector with new header information for the `genes` dataframe. Only relevant if `genes` is a dataframe (or a character string with the file path to a table) with a wrong or missing header. NULL otherwise.
`backgroundlist`	the background list of gene names/IDs for enrichment analysis, which can be given in several formats. Vector with gene names/IDs: the gene names/IDs must be in the format given in `id.type`. Dataframe with gene names/IDs: an `id.column` is needed as in `genes`, with the type of IDs given in `id.type`. "genome": all ENTREZ IDs from the annotation package of the respective organism (denoted in `org`) are used as background. NULL: the full name/ID list from `genes` is used as background (i.e. `genes` must contain all genes under investigation without pre-filtering).
`newheaderBackground`	optional character vector with new header information for `backgroundlist`.
`projectfolder`	character with directory for output files (will be generated if not existing).
`projectname`	optional character prefix for output file names.
`analysis_type`	character vector giving the type of analysis to be performed. Either "over-representation", "GSEA" or both.
`enrichmentCat`	character vector with categories to be enriched (`GO`: gene ontology (MF, BP, CC), `KEGG`: KEGG pathways, `Reactome`: Reactome pathways, `DO`: Disease ontology). Disease ontology is for human only.
`maxInputGenes`	(numeric) maximum number of top differentially regulated elements used for over-representation analysis (or NULL).
`id.type`	character with identifier type from the annotation package (`"ENTREZID"` or `"SYMBOL"`). Gene symbols will be converted to Entrez IDs prior to enrichment analysis.
`id.column`	character with column name for identifier variable in `genes`.
`sortcolumn`	character with column name of quantitative data in `genes` used for ordering. If `NULL`, ranking of genes is omitted and GSEA is not possible.
`highValueHighPriority`	(logical) priority order of values in `sortcolumn`. `TRUE`: high values have highest priority (e.g. fold changes). `FALSE`: low values have highest priority (e.g. p-values). If `FALSE`, values in `sortcolumn` should be transformed prior to GSEA (see `fun.transform`).
`sortcolumn.threshold`	numeric threshold for `sortcolumn` that a gene must pass to be included in over-representation analysis. If `highValueHighPriority = FALSE`, genes with value < `sortcolumn.threshold` are included; otherwise genes with value > `sortcolumn.threshold`.
`fun.transform`	GSEA needs an input gene list with priority in decreasing order (high values have highest priority). Since quantitative values given in `sortcolumn` may have priority in increasing order (e.g. p-values), these values must be transformed by a custom function to generate priority in decreasing order prior to GSEA. A suitable function definition can be given in `fun.transform`, e.g. `function(x) {-log10(x)}` for p-values or `abs` for absolute values of fold changes.
`FCcolumn`	(character) optional column name of fold changes in `genes` if `sortcolumn` is used for a different data column. Used only for annotating cnetplots of enrichment results. Omitted if NULL.
`threshold_FC`	(numeric) fold change threshold for filtering (the threshold is interpreted for log2-transformed fold change values). Only relevant for over-representation analysis if an unfiltered gene list is given in `genes` to allow for GSEA in parallel.
`returnSymbolsInResultObject`	logical. If `TRUE`, gene symbols are returned in result objects, Entrez IDs otherwise.
`org`	character with name of organism ("human", "mouse", "rat").
`pAdjustMethod`	method for adjusting for multiple testing. One of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none".
`enrich.p.valueCutoff`	numeric p-value threshold for returned enrichment terms.
`enrich.q.valueCutoff`	numeric q-value threshold for returned enrichment terms.
`nPerm`	number of permutations for gene set enrichment analysis.
`minGSSize`	minimum number of genes annotated to an ontology term for the term to be tested.
`maxGSSize`	maximum number of genes annotated to an ontology term for the term to be tested.
`figure.res`	numeric resolution for output png.
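
The interplay of `id.column`, `sortcolumn` and `fun.transform` can be illustrated in base R. The following is a minimal sketch, not the wrapper's actual code: the toy data frame `toy_genes` (with made-up values) and the helper `make_ranked_list()` are hypothetical and only show how p-values might be turned into a named vector in decreasing order of priority, the kind of ranking GSEA expects.

```r
# Toy input in the layout described for `genes` (values are made up).
toy_genes <- data.frame(
  ENTREZID  = c("1001", "1002", "1003", "1004"),
  adj.P.Val = c(0.20, 0.001, 0.04, 0.50),
  logFC     = c(0.3, -1.8, 1.1, 0.1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: transform the sort column and return a named numeric
# vector sorted in decreasing order (highest priority first).
make_ranked_list <- function(df, id.column, sortcolumn, fun.transform = identity) {
  scores <- fun.transform(df[[sortcolumn]])
  names(scores) <- df[[id.column]]
  sort(scores, decreasing = TRUE)
}

# p-values have priority in increasing order, so they are transformed first.
ranked <- make_ranked_list(toy_genes, "ENTREZID", "adj.P.Val",
                           fun.transform = function(x) -log10(x))
print(ranked)
```

With fold changes in `sortcolumn`, the same helper could be called with `fun.transform = abs` or with the default identity, depending on whether direction should be retained.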

This function takes a gene list and, optionally, a background gene list as input and performs enrichment analysis using the clusterProfiler and DOSE packages. By default, the function performs two kinds of analysis for all categories given in `enrichmentCat`:

over-representation analysis: DOSE implements a hypergeometric model to assess whether the number of selected genes annotated with a term or pathway is larger than expected by chance. For this approach, a cut-off has usually been applied to select genes of interest, e.g. from differential expression analysis. All selected genes are treated with equal priority.
gene set enrichment analysis (GSEA): for GSEA, all genes under investigation can be used as input. They are ranked based on a corresponding quantitative value given in `sortcolumn`. Given an a priori defined set of genes S (e.g., genes sharing the same GO term), the goal of GSEA is to determine whether the members of S are randomly distributed throughout the ranked gene list (L) or primarily found at the top or bottom. This approach can handle a situation where the extent of differential expression is small but evident in a coordinated way across a set of related genes, because GSEA aggregates the per-gene statistics across the genes within a gene set. If no quantitative data is provided (`sortcolumn = NULL`), GSEA is skipped.
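
The hypergeometric model behind over-representation analysis can be reproduced for a single term with `phyper()` from base R. This is only a sketch with made-up counts; the helper `ora_pvalue()` is hypothetical and does not reflect how DOSE is called internally.

```r
# Hypothetical helper: one-sided hypergeometric p-value for over-representation.
#   N: number of background genes
#   M: number of background genes annotated with the term
#   n: number of selected genes
#   k: number of selected genes annotated with the term
ora_pvalue <- function(k, n, M, N) {
  # P(X >= k), where X counts annotated genes among n draws without
  # replacement from M annotated and N - M non-annotated genes.
  phyper(k - 1, M, N - M, n, lower.tail = FALSE)
}

# Made-up counts: 12 of 100 selected genes carry a term that annotates
# 200 of 10000 background genes (2 would be expected by chance).
p <- ora_pvalue(k = 12, n = 100, M = 200, N = 10000)
print(p)
```

The choice of background matters here: N and M are counted within the background list, so a smaller, experiment-specific background generally yields different p-values than the whole genome.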

The names/IDs are converted to ENTREZ IDs (if necessary) prior to enrichment using the annotation package for the species denoted in `org`. A background list with genes under investigation (e.g. expression array content) can be provided in `backgroundlist` to be used as background for over-representation analysis. Alternatively, all genes of the designated organism can be obtained from the respective annotation package (for GSEA, all genes of the designated organism are used anyway). Optionally, quantitative data in `sortcolumn` can be used for sorting and filtering the input gene list for over-representation analysis (using `sortcolumn.threshold` or `maxInputGenes`), if it has not been filtered beforehand. The function generates visualisations for each enrichment result, such as cnetplots, enrichment maps and dotplots. Enriched KEGG pathway maps (if any) are annotated for input genes using the pathview package. If the quantitative data in `sortcolumn` are not fold changes, fold changes may additionally be given in `FCcolumn` for annotation purposes in cnetplots as well as for KEGG pathway mapping; otherwise the data in `sortcolumn` is used for this. Be aware that the legend of the cnetplots will read "Fold Change" in either case.
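
The sorting and filtering step for over-representation analysis can be sketched in base R as follows. The helper `filter_for_ora()` and the toy data are hypothetical, and the order of operations is an assumption (threshold on `sortcolumn`, then threshold on the absolute log2 fold change, then truncation to `maxInputGenes`); the wrapper's actual implementation may differ.

```r
# Hypothetical helper illustrating the filtering parameters; not the wrapper's code.
filter_for_ora <- function(df, sortcolumn = "adj.P.Val", highValueHighPriority = FALSE,
                           sortcolumn.threshold = 0.05, FCcolumn = "logFC",
                           threshold_FC = log2(1.5), maxInputGenes = 100) {
  # Order rows so that the highest-priority genes come first.
  df <- df[order(df[[sortcolumn]], decreasing = highValueHighPriority), , drop = FALSE]
  # Keep genes passing the threshold in the direction given by highValueHighPriority.
  keep <- if (highValueHighPriority) {
    df[[sortcolumn]] > sortcolumn.threshold
  } else {
    df[[sortcolumn]] < sortcolumn.threshold
  }
  # Assumption: the fold change threshold applies to absolute log2 fold changes.
  if (!is.null(FCcolumn) && !is.null(threshold_FC)) {
    keep <- keep & abs(df[[FCcolumn]]) > threshold_FC
  }
  df <- df[keep, , drop = FALSE]
  # Optionally cap the number of input genes.
  if (!is.null(maxInputGenes)) df <- head(df, maxInputGenes)
  df
}

# Toy data with made-up values.
toy <- data.frame(
  ENTREZID  = c("1001", "1002", "1003", "1004", "1005"),
  adj.P.Val = c(0.20, 0.001, 0.04, 0.01, 0.03),
  logFC     = c(0.3, -1.8, 1.1, 0.2, 0.9),
  stringsAsFactors = FALSE
)
selected   <- filter_for_ora(toy, maxInputGenes = 2)
background <- toy$ENTREZID  # the unfiltered list serves as background
print(selected$ENTREZID)
```

Keeping the unfiltered table around is what allows the same input to serve both analyses: the filtered subset feeds over-representation analysis, while the full ranked list feeds GSEA.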

A list of enrichment objects as defined in the DOSE package (enrichResult objects for over-representation analysis and gseaResult objects for gene set enrichment analysis). Enrichment tables and plots are stored in the project folder as side effects.

There are three key elements of the GSEA method (taken from https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/GSEA.html):

Calculation of an Enrichment Score: The enrichment score (ES) represents the degree to which a set S is over-represented at the top or bottom of the ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when the gene is not in S. The magnitude of the increment depends on the gene statistics (e.g., correlation of the gene with phenotype). The ES is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov-Smirnov-like statistic.
Estimation of Significance Level of ES: The p-value of the ES is calculated using a permutation test. Specifically, we permute the gene labels of the gene list L and recompute the ES of the gene set for the permuted data, which generates a null distribution for the ES. The p-value of the observed ES is then calculated relative to this null distribution.
Adjustment for Multiple Hypothesis Testing: After all gene sets have been evaluated, DOSE adjusts the estimated significance level to account for multiple hypothesis testing; q-values are also calculated for FDR control.
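
The three elements above can be sketched in base R. This is a simplified illustration under stated assumptions, not the DOSE implementation: the function names `enrichment_score()` and `es_pvalue()` are hypothetical, the weight exponent is fixed at 1, the permutation p-value compares absolute scores (a two-sided simplification), the data are simulated, and only BH-adjusted p-values are shown (no q-values).

```r
# Running-sum enrichment score for one gene set.
# `stats`: named numeric vector sorted in decreasing order; `set`: gene IDs in S.
enrichment_score <- function(stats, set, exponent = 1) {
  in_set <- names(stats) %in% set
  n_hit  <- sum(in_set)
  n_miss <- length(stats) - n_hit
  if (n_hit == 0 || n_miss == 0) return(0)
  # Steps up are weighted by the gene statistic, steps down are uniform.
  w <- abs(stats)^exponent
  step <- ifelse(in_set, w / sum(w[in_set]), -1 / n_miss)
  running <- cumsum(step)
  # The ES is the maximum deviation from zero, with its sign retained.
  running[which.max(abs(running))]
}

# Permutation p-value: shuffle the gene labels and recompute the ES.
es_pvalue <- function(stats, set, nPerm = 1000) {
  observed <- enrichment_score(stats, set)
  null <- replicate(nPerm, {
    shuffled <- stats
    names(shuffled) <- sample(names(stats))
    enrichment_score(shuffled, set)
  })
  # The add-one correction avoids a p-value of exactly zero.
  (sum(abs(null) >= abs(observed)) + 1) / (nPerm + 1)
}

# Simulated ranked list of 200 genes and two illustrative gene sets.
set.seed(1)
stats <- sort(setNames(rnorm(200), paste0("g", 1:200)), decreasing = TRUE)
top_set    <- names(stats)[1:15]                # concentrated at the top of L
spread_set <- names(stats)[seq(1, 200, by = 13)] # spread across L

es_top <- enrichment_score(stats, top_set)
pvals  <- c(top    = es_pvalue(stats, top_set, nPerm = 500),
            spread = es_pvalue(stats, spread_set, nPerm = 500))

# Adjust for testing several gene sets, here with the BH method.
padj <- p.adjust(pvals, method = "BH")
print(es_top)
print(padj)
```

The number of permutations bounds the smallest attainable p-value at 1 / (nPerm + 1), which is one reason to raise `nPerm` when many gene sets are tested.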

Frank Ruehle

frankRuehle/systemsbio documentation built on Sept. 14, 2020, 1:18 a.m.
