wrapClusterProfiler: Wrapper for gene enrichment analysis using clusterProfiler

Description

Applying over-representation analysis and gene set enrichment analysis to supplied gene list

Usage

wrapClusterProfiler(
  genes,
  newheader = NULL,
  backgroundlist = NULL,
  newheaderBackground = NULL,
  projectfolder = "clusterProfiler",
  projectname = "",
  analysis_type = c("over-representation", "GSEA"),
  enrichmentCat = c("GO", "KEGG", "Reactome", "DO"),
  maxInputGenes = 100,
  id.type = "ENTREZID",
  id.column = "ENTREZID",
  sortcolumn = "adj.P.Val",
  highValueHighPriority = FALSE,
  sortcolumn.threshold = 0.05,
  fun.transform = function(x) { identity(x) },
  FCcolumn = "logFC",
  threshold_FC = log2(1.5),
  returnSymbolsInResultObject = TRUE,
  org = "human",
  pAdjustMethod = "BH",
  enrich.p.valueCutoff = 0.05,
  enrich.q.valueCutoff = 0.05,
  nPerm = 1000,
  minGSSize = 10,
  maxGSSize = 500,
  figure.res = 300
)

Arguments

genes

the input data object can be given in several formats.

  • vector with gene names/IDs: all genes are used for over-representation analysis. The gene names/IDs must be in the format given in id.type.

  • dataframe with gene names/IDs and quantitative values: genes can be filtered by the quantitative value prior to over-representation analysis (see sortcolumn.threshold and maxInputGenes), and the unfiltered list can be used as background list. In addition to over-representation analysis, a GSEA is performed using the full gene list ranked by the quantitative value. Columns of the dataframe must be named according to parameters id.column and sortcolumn.

  • character with path to dataframe: as above. The dataframe is loaded from the given path.

  • list of items: Each item is processed separately as indicated above.
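
As an illustration of these input formats, the following sketch builds a small table with base R. The IDs serve only as placeholders and the statistics are invented; the column names follow the defaults of id.column, sortcolumn and FCcolumn. The list names (comparisonA, comparisonB) are hypothetical.

```r
# Toy input table; the statistics are invented for illustration only.
genes_df <- data.frame(
  ENTREZID  = c("7157", "672", "1956", "4609", "5728"),
  adj.P.Val = c(0.001, 0.20, 0.03, 0.0004, 0.60),
  logFC     = c(1.8, -0.2, -1.1, 2.4, 0.1),
  stringsAsFactors = FALSE
)

# A plain character vector of IDs is the simplest input format.
gene_vector <- genes_df$ENTREZID

# Several inputs can be bundled in a list; each item is processed separately.
gene_sets <- list(comparisonA = genes_df, comparisonB = gene_vector)
```

A vector input carries no quantitative values, so only over-representation analysis applies to it; the data frame input additionally allows ranking for GSEA.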

newheader

optional character vector with new header information for genes dataframe. Only relevant if 'genes' is a dataframe (or character string with filepath to a table) with wrong or missing header. NULL otherwise.

backgroundlist

the background list of gene names/IDs for enrichment analysis can be given in several formats.

  • vector with gene names/IDs: The gene names/IDs must be in the format given in id.type.

  • dataframe with gene names/IDs: an id.column is needed as in genes with type of IDs given in id.type.

  • "genome": all ENTREZ IDs from the annotation package of the respective organism (denoted in org) are used as background.

  • NULL: full name/ID list from genes is used as background (i.e. genes must then contain all genes under investigation without pre-filtering).

newheaderBackground

optional character vector with new header information for backgroundlist.

projectfolder

character with directory for output files (will be generated if not existing).

projectname

optional character prefix for output file names.

analysis_type

character vector giving the type of analysis to be performed. Either "over-representation", "GSEA" or both.

enrichmentCat

character vector with categories to be enriched (GO: gene ontology (MF, BP, CC), KEGG: KEGG pathways, Reactome: Reactome pathways, DO: Disease ontology). Disease ontology is for human only.

maxInputGenes

(numeric) maximum number of top differentially regulated elements used for over-representation analysis (or NULL).

id.type

character with identifier type from annotation package ("ENTREZID" or "SYMBOL"). Gene symbols will be converted to Entrez IDs prior to enrichment analysis.

id.column

character with column name for identifier variable in genes.

sortcolumn

character with column name of quantitative data in genes used for ordering. If NULL, ranking of genes is omitted and GSEA is not possible.

highValueHighPriority

(logical) priority order of values in sortcolumn. TRUE: high values have highest priority (e.g. fold changes). FALSE: low values have highest priority (e.g. p-values). If FALSE, values in sortcolumn should be transformed prior to GSEA (see fun.transform).

sortcolumn.threshold

numeric threshold applied to sortcolumn to select genes for over-representation analysis. If highValueHighPriority = FALSE, genes with value < sortcolumn.threshold are included; otherwise genes with value > sortcolumn.threshold are included.

fun.transform

GSEA needs an input gene list with priority in decreasing order (high values have highest priority). Since quantitative values given in sortcolumn may have priority in increasing order (e.g. p-values), these values must be transformed by a custom function to generate priority in decreasing order prior to GSEA. A suitable function definition can be given in fun.transform, e.g. function(x) {-log10(x)} for p-values or abs for absolute values of fold changes.
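
The effect of such a transformation can be sketched in base R. The p-values and gene names below are invented, and p_to_score is a hypothetical helper name; the sketch only shows how -log10 turns small p-values into large scores so that a decreasing sort puts the most significant genes first.

```r
# Invented adjusted p-values, named by hypothetical gene IDs.
pvals <- c(g1 = 0.001, g2 = 0.2, g3 = 0.03, g4 = 0.0004)

# Transformation suitable for p-values: small p-value -> large score.
p_to_score <- function(x) { -log10(x) }

# Ranked list in decreasing order of priority, as required for GSEA.
ranked <- sort(p_to_score(pvals), decreasing = TRUE)
names(ranked)
```

A function of this shape could be passed as fun.transform when sortcolumn holds p-values.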

FCcolumn

(character) optional column name of fold changes in genes if sortcolumn is used for a different data column. Used only for annotation of the cnetplot of enrichment results. Omitted if NULL.

threshold_FC

(numeric) fold change threshold for filtering (the threshold is interpreted on the log2 scale, i.e. for log2-transformed fold change values). Only relevant for over-representation analysis if an unfiltered gene list is given in genes to allow for GSEA in parallel.
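
The interplay of sortcolumn.threshold, threshold_FC and maxInputGenes can be pictured with base R. This is a sketch under stated assumptions, not the function's actual implementation: it assumes both filters are combined, that the fold change filter compares absolute log2 fold changes against threshold_FC, and that the top maxInputGenes entries are kept afterwards. All data are invented.

```r
df <- data.frame(
  ENTREZID  = c("1", "2", "3", "4", "5"),
  adj.P.Val = c(0.001, 0.20, 0.03, 0.0004, 0.04),
  logFC     = c(1.8, -0.2, -1.1, 2.4, 0.3),
  stringsAsFactors = FALSE
)

sortcolumn.threshold  <- 0.05
threshold_FC          <- log2(1.5)   # about 0.585 on the log2 scale
maxInputGenes         <- 2
highValueHighPriority <- FALSE

# Keep values below the threshold when low values have priority, above it otherwise.
keep_sort <- if (highValueHighPriority) {
  df$adj.P.Val > sortcolumn.threshold
} else {
  df$adj.P.Val < sortcolumn.threshold
}

# Assumption: the fold change filter is applied to absolute log2 fold changes.
keep_fc <- abs(df$logFC) > threshold_FC

filtered <- df[keep_sort & keep_fc, ]

# Order by priority and keep at most maxInputGenes entries.
filtered <- filtered[order(filtered$adj.P.Val, decreasing = highValueHighPriority), ]
filtered <- head(filtered, maxInputGenes)
filtered$ENTREZID
```

In this toy example, gene "5" passes the p-value filter but fails the fold change filter, and gene "3" passes both filters but is dropped by the maxInputGenes cap.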

returnSymbolsInResultObject

logical. If TRUE, gene symbols are returned in result objects; Entrez IDs otherwise.

org

character with name of organism ("human", "mouse", "rat").

pAdjustMethod

method for adjusting for multiple testing. One of "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none".
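
The listed method names match those accepted by p.adjust in base R. As a rough illustration of how the choice of method changes the number of terms passing a cutoff, the following sketch adjusts a set of invented raw p-values; it does not reproduce the internals of the enrichment functions.

```r
# Invented raw p-values for five hypothetical enrichment terms.
p_raw <- c(0.001, 0.008, 0.02, 0.045, 0.5)

# Benjamini-Hochberg adjustment (the default "BH"); "fdr" is an alias in base R.
p_bh <- p.adjust(p_raw, method = "BH")

# Bonferroni is more conservative: each p-value is multiplied by the number of tests.
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Number of terms passing a 0.05 cutoff under each method.
n_bh   <- sum(p_bh < 0.05)
n_bonf <- sum(p_bonf < 0.05)
```

With these toy values, more terms pass the cutoff under "BH" than under "bonferroni".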

enrich.p.valueCutoff

numeric p-value threshold for returned enrichment terms.

enrich.q.valueCutoff

numeric q-value threshold for returned enrichment terms.

nPerm

number of permutations for gene set enrichment analysis.

minGSSize

minimum number of genes annotated to an ontology term for the term to be tested.

maxGSSize

maximum number of genes annotated to an ontology term for the term to be tested.

figure.res

numeric resolution for output png.

Details

This function uses a gene list and optionally a background gene list as input to perform enrichment analysis using the clusterProfiler and DOSE packages. By default, the function performs both kinds of analysis (over-representation analysis and GSEA) for all categories given in enrichmentCat.

The names/IDs are converted to ENTREZ IDs (if necessary) prior to enrichment using the annotation package for the species denoted in org. A background list with genes under investigation (e.g. expression array content) can be provided in backgroundlist to be used as background for over-representation analysis. Alternatively, all genes of the designated organism can be obtained from the respective annotation package (for GSEA, all genes of the designated organism are used anyway). Optionally, quantitative data in sortcolumn can be used for sorting and filtering the input gene list for over-representation analysis (using sortcolumn.threshold or maxInputGenes), if it has not been filtered beforehand. The function generates visualisations for each enrichment result, such as cnetplots, enrichment maps and dotplots. Enriched KEGG pathway maps (if any) are annotated for input genes using the pathview package. If the quantitative data in sortcolumn are not fold changes, fold changes may be given additionally in FCcolumn for annotation purposes in cnetplots as well as for KEGG pathway mapping; otherwise the data in sortcolumn is used for this. Be aware that the legend of the cnetplots will read "Fold Change" in either case.

Value

List of enrichment-objects defined in DOSE-package (enrichResult-object for overrepresentation analysis and gseaResult-objects for gene set enrichment analysis). Enrichment tables and plots are stored in the project folder as side effects.

Note

There are three key elements of the GSEA method (taken from https://bioconductor.org/packages/release/bioc/vignettes/DOSE/inst/doc/GSEA.html):

Author(s)

Frank Ruehle


frankRuehle/systemsbio documentation built on Sept. 14, 2020, 1:18 a.m.