In jmw86069/jamenrich: Analysis and Visualization of Multiple Gene Set Enrichments

knitr::opts_chunk$set(echo=TRUE);

MultiEnrichMap notes

This document is intended to track the workflow steps involved in a typical MultiEnrichMap workflow. It also includes some details about custom column headers and how and where in the workflow they are handled.

Starting requirements

The starting data should include results from gene set enrichment analysis, in the form of a data.frame, an enrichResult object, or another class from which a data.frame can be coerced.

Each enrichment result is provided in its own data.frame or similar class.

Import from data.frame to enrichResult

enrichDF2enrichResult() converts a single data.frame to enrichResult.

Important arguments to enrichDF2enrichResult()

keyColname - unique identifier for each row
geneColname - column with delimited list of genes, e.g. GeneA,GeneB,GeneC
pathGenes - column reporting the number of pathway genes tested, e.g. the number of pathway genes present in the gene universe.
geneHits - column with the number of pathway genes present in the test genes.
pvalueColname - column with the enrichment P-value to use in downstream analysis steps. E.g. if there is a P-value and FDR column, choose the one column to use in subsequent steps.
geneDelim - by default "," is the delimiter between genes, but it can be defined here.

Optional arguments to enrichDF2enrichResult()

geneRatioColname - if the geneHits and pathGenes columns are not available, sometimes there is a column reporting "geneHits/pathGenes" referred to as a "geneRatio" or "BgRatio".
msigdbGmtT - optional arules format for gene set data, when provided the import will choose these genes to represent each pathway. Otherwise, pathway genes are only inferred using the reported gene hits.
pAdjustMethod - optionally apply a P-value adjustment, which may be helpful when subsetting pathways by some pre-defined, independent criteria, e.g. by pathway source.

Notes

This function returns enrichResult object, with close approximation to the data produced in the clusterProfiler package, but not identical. For example, the pathway genes will not be fully correct without supplying the GmtT data upfront, since the GmtT data will contain the full set of pathway genes for all pathways.

Running multiEnrichMap()

To run multiEnrichMap() one must supply a list of enrichment results, either in the form of data.frame or enrichResult objects.

Steps carried out by multiEnrichMap()

enrichLabels are supplied or derived from names(enrichList), these labels are used in subsequent analysis steps.
Colors for each enrichment result are provided in colorV or derived using colorjam::rainbowJam(). These colors are used in enrichment heatmaps.
geneHitList is supplied or derived, representing a list of vectors, where each vector is the set of gene hits tested for enrichment corresponding to enrichList. Note that not all enrichment results will represent every gene hit, so deriving this data will be imperfect.
Optionally run topEnrichBySource() which subsets each enrichment based upon a set of sources, and takes the top N number of pathways from each source, from each enrichment. Once the pathways are defined, then all results for these pathways are retained for downstream analysis.
geneIM incidence matrix of gene hits by enrichment result is derived. This data allows comparison of gene hits across enrichment tests.
geneIMcolors is a color matrix used to create a heatmap showing the geneIM data.
enrichIM incidence matrix of pathway enrichment P-values by enrichment test, to allow comparison of pathway enrichment results.
enrichIMcolors is a color matrix used to create a heatmap showing the enrichIM data. These colors are based upon the enrichment P-value.
Note enrichList2IM() is called during this step.
valueColname uses pvalueColname so the matrix will be filled with the actual P-value.
keyColname uses nameColname so the matrix rownames will use pathway names and not the pathway ID.
Optionally GmtT is used here.
enrichIMgeneCount is similar to enrichIM except that the gene count is reported in the matrix instead of P-value.
valueColname uses geneCountColname
enrichIM is filtered to require at least one enrichment P-value below cutoffRowMinP in each row.
i1use (internal) is defined as the set of pathways meeting the criteria
enrichList data is subsetted to include only pathways present in enrichIM.
enrichIM, enrichIMM, enrichIMcolors are subsetted to include only these pathways.
enrichList is combined into one data.frame using enrichList2df()
keyColname, geneColname, geneCountColname, pvalueColname are used at this step, to define the core components required
GmtT is optional
descriptionColname is optionally used, sent to memAdjustLabel() to clean up some annotations
enrichDF2enrichResult() is called on the merged enrichment data.frame
geneHits, pathGenes, keyColname, geneColname, geneCountColname, geneDelim, pvalueColname are passed to this function
GmtT is optionally used at this step
memIM is an incidence matrix of genes by pathways, using the merged enrichment results after filtering.
enrichMapJam() is called using the merged enrichResults data.
igraph2pieGraph() is called to convert an igraph enrichMap object to use pie node shapes.
rectifyPiegraph() is called to enable coloredrectangle node shapes.
cnetplotJam() is called to create a full cnet igraph object
geneCountColname is used to size pathway nodes by the number of genes
nodeLabel uses the first matching entry from nameColname, descriptionColname, keyColname, or "ID".
"nodeType" is defined in the igraph object, to identify nodes as "Set" or "Gene".
igraph2pieGraph() is called two times, to enable pie node shapes.
enrichIMcolors is used to colorize pathway nodes
geneIMcolors is used to colorize gene nodes
rectifyPiegraph() is called to enable coloredrectangle node shapes.

Output from multiEnrichMap()

Output is returned as a list with the following items:

colorV the named color vector used to apply colors to each enrichment result.
geneHitList the gene hits tested in each enrichment, either supplied or derived.
geneIM incidence matrix of gene hits per enrichment test.
geneIMcolors the color-coded matrix derived from geneIM.
enrichIMgeneCount incidence matrix of pathways, scored using the gene count per pathway.
enrichList list of enrichment results after filtering
enrichIM incidence matrix of pathways, scored by enrichment P-values
enrichIMcolors the color-coded matrix derived from enrichIM.
multiEnrichDF the data.frame of combined enrichment results, after filtering of pathways for at least one significant P-value.
multiEnrichER same as multiEnrichDF except an object of class enrichResult.
memIM is an incidence matrix of genes by pathways, using the merged enrichment results after filtering.
multiEnrichMap the enrichMap results in the form of igraph object
multiEnrichMap2 the enrichMap object encoded to use either pie or coloredrectangle node shape.
multiCnetPlot the cnet plot igraph object
multiCnetPlot1 is the cnet plot with pathway pie node shapes
multiCnetPlot1b is the cnet plot with pathway and gene pie node shapes
multiCnetPlot2 is the cnet plot using coloredrectangle node shapes
colnames vector of actual colnames used, including geneColname, keyColname, nameColname, descriptionColname, pvalueColname.