knitr::opts_chunk$set(
  collapse=TRUE,
  comment="#>"
)
library(multienrichjam);
library(jamba);
library(colorjam);
suppressPackageStartupMessages(library(ComplexHeatmap));
options("stringsAsFactors"=FALSE, "warn"=-1);
knitr::opts_chunk$set(
   fig.height=10,
   fig.width=10,
   fig.align="center"
)
ragg_png = function(..., res = 192) {
  ragg::agg_png(..., res = res, units = "in")
}
knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")

Import and use Ingenuity IPA enrichment data

This document describes steps recommended for using Ingenuity IPA enrichment data.

Ingenuity IPA enrichment data can be exported using a function "Export All" which by default creates one text file, concatenating each enrichment table into one large file.

This workflow demonstrates the import process using two IPA enrichment files used by Reese et al 2019 https://doi.org/10.1016/j.jaci.2018.11.043 to compare enrichment results in newborns to older children.

It therefore requires IPA enrichment results have already been exported in text format from IPA.

Import IPA data from text files

To import an IPA text file, use importIPAenrichment():

newborn_txt <- system.file("extdata",
   "Newborns-IPA.txt",
   package="multienrichjam");
newborn_dfl <- importIPAenrichment(newborn_txt);

The result is a list of data.frame objects, where each data.frame represents one enrichment test. A convenient way to see the dimensions of each data.frame is with the function jamba::sdim():

sdim(newborn_dfl);

For MultiEnrichMap, we typically want to analyze multiple IPA enrichment files, so we can wrap the call in an lapply() function:

newborn_txt <- system.file("extdata",
   "Newborns-IPA.txt",
   package="multienrichjam");
olderchildren_txt <- system.file("extdata",
   "OlderChildren-IPA.txt",
   package="multienrichjam");
ipa_files <- c(Newborns=newborn_txt,
   OlderChildren=olderchildren_txt)

ipa_l <- lapply(ipa_files, importIPAenrichment);

Now we can check the dimensions within each list using jamba::ssdim():

ssdim(ipa_l);

In most cases, each IPA file should contain the same enrichment tests, for example "Canonical Pathways", "Upstream Regulators", "Diseases and Bio Functions", etc. However, it is not always the case, so it is recommended to check and verify each IPA file contains at least the enrichment tests needed for downstream analysis.

Analyze IPA enrichments from one enrichment test

IPA performs multiple enrichment tests, which are done independently and with unique assumptions and caveats. Therefore, I recommend using one enrichment test at a time in MultiEnrichMap.

Extract one data.frame from each result:

library(igraph)
## Take only the Ingenuity Canonical Pathways
enrichList_canonical <- lapply(ipa_l, function(i){
   i[["Canonical Pathways"]];
});
sdim(enrichList_canonical);

## Convert data.frame to enrichResult
## multienrichjam::enrichDF2enrichResult
er_canonical <- lapply(enrichList_canonical, function(i){
   enrichDF2enrichResult(i,
      keyColname="Name",
      pvalueColname="P-value",
      geneColname="geneNames",
      geneRatioColname="Ratio",
      pvalueCutoff=1)
});
sdim(er_canonical);
kable_coloring(
   head(as.data.frame(er_canonical[[1]])),
   caption="Top 10 rows of enrichment data",
   row.names=FALSE) %>%
   kableExtra::column_spec(column=seq_len(ncol(er_canonical[[1]])),
      border_left="1px solid #DDDDDD",
      extra_css="white-space: nowrap;")

run multiEnrichMap()

Now given a list of data.frame results, we can run multiEnrichMap():

mem_canonical <- multiEnrichMap(er_canonical,
   enrichBaseline=1,
   cutoffRowMinP=0.05,
   colorV=c("purple", "orange"),
   topEnrichN=20)

The output mem_canonical is a list containing various results.

kable_coloring(
   sdim(mem_canonical),
   caption="sdim(mem_canonical)") %>%
   kableExtra::column_spec(column=seq_len(4),
      border_left="1px solid #DDDDDD",
      extra_css="white-space: nowrap;")

Create a "Mem Plot Folio" (new Apr2020)

The other analysis steps describe in this vignette are valid options, however in recent work with multienrichjam we have developed a new function to create a small folio of relates plots that we call a "mem plot folio", and the function is mem_plot_folio().

The mem_plot_folio() function creates the following plots by default:

  1. Enrichment P-value heatmap (optionally colored by group)
  2. Gene-pathway incidence matrix, clustered by column and by row
  3. Cnet plot including all collapsed pathway clusters

    • The first plot labels clusters by LETTERS, "A", "B", "C", "D", etc.
    • The second plot labels clusters by the top n pathway names
    • The third plot labels clusters by the top n pathway names, but hides gene labels. This plot is best when there are too many genes to label.
  4. Cnet plot using 1,2,3 exemplars per pathway cluster (configurable)

    • The first plot uses 1 exemplar per cluster.
    • The second plot uses 2 exemplars per cluster.
    • The third plot uses 3 exemplars per cluster.
  5. Cnet plot of each pathway cluster individually

    • The number of plots is the number of heatmap pathway_clusters.
mem_canonical_plots <- multienrichjam::mem_plot_folio(mem_canonical,
   pathway_column_split=4,
   node_factor=1,
   use_shadowText=TRUE,
   label_factor=1.2,
   do_which=c(1:5),
   verbose=TRUE,
   main="Canonical Pathways");

The object returned mem_canonical_plots will contain a list of the graphical objects for each plot requested. For example, each Cnet plot returns an igraph object which can be customized.

Commentary on the "Plot Folio"

In practice, we rarely found much benefit from the multi-enrichment map, which shows pathways connected to pathways based upon Jaccard overlap. This network view was often disorganized, not clearly clustered, and lacked ability to see which genes drive the overlaps between pathways. Perhaps it works best for the highly structured Gene Ontology (GO), but in our hands GO was just not insightful for the broad range of experiments we were analyzing.

Instead, we almost always gravitated toward the Cnet plot, a clever idea by Dr. Guangchang Yu that connects pathways to genes, where genes also connect naturally to other pathways. This plot for many of our collaborators has seemed more intuitive, and answers the next question people often have: "What are the shared genes?" Our subtle addition to this plot is to color-code genes by the enrichment experiment, so you can see which genes are shared or unique by experiment.

Since most enrichment results have far more pathways than can be shown in one Cnet plot, we tried to prioritize which pathways to display. The logic in Dr. Yu's enrichplot is to take the top n pathways by P-value, however we noticed for our data this logic often shows redundant pathways and misses orthologous biological functions, which might be enriched to a lesser degree. So we started clustering pathways by P-value, then by using the pathway-gene incidence matrix.

The pathway-gene incidence matrix worked best in our hands for this clustering, partly because it clusters pathways by the genes involved, and not by the pattern of enrichment across the enrichment tests (which may not be related to the underlying genes.) So we cluster pathways, then choose an arbitrary number of clusters, and took one "exemplar" from each cluster to represent each cluster. This practice is mostly effective, especially when the clusters are well-defined, and when members of each cluster have relatively similar function. We can increase the exemplars per cluster as needed. This strategy is helpful because each pathway is very clearly labeled.

Most recently, we experimented with collapsing each pathway_cluster, so each pathway_cluster includes all genes implicated by pathways contained in the pathway_cluster. The benefit here is that no genes are lost by selecting a subset of "exemplar" pathways -- all genes are retained per cluster. Then instead of plotting pathway-gene network, we plot a pathway_cluster-gene network. Each pathway_cluster is labeled using the top n pathway names -- which is configurable, and is only a visible label. Each pathway_cluster contains all the pathways in the cluster.

Directly create each individual plot

The plots in mem_plot_folio() are also available as individual functions.

In general, it is most convenient to call mem_plot_folio() and use the output list to create custom plots, however the functions are also available to be called independently.

Plot the MultiEnrichMap network

We can view the "Multi Enrichment Map" itself with mem_multienrichplot().

This network connects pathways when they meet a Jaccard overlap coefficient threshold based upon the shared genes between the pathways. The default overlap is stored by multiEnrichMap() in the output object mem_canonical.

g <- mem_multienrichplot(mem_canonical,
   do_plot=FALSE,
   overlap=0.3,
   node_factor=2,
   repulse=3.5)
jam_igraph(g,
   node_factor=2,
   use_shadowText=TRUE)

The overlap threshold can be customized manually, or through the helper function mem_find_overlap(). This function attempts to guess a reasonable overlap threshold based upon the network connectivity, by optimizing the cluster size and total nodes involved.

g <- mem_multienrichplot(mem_canonical,
   overlap=mem_find_overlap(mem_canonical),
   do_plot=FALSE,
   node_factor=3,
   repulse=3.5)
jam_igraph(g,
   node_factor=3,
   use_shadowText=TRUE)

Notice there are distinct subnetworks, called "components", which are not connected to each other. A useful follow-up technique is to graph one component at a time, which we do using the helper function subset_igraph_components(). Components are ordered by size, largest to smallest, so you can keep the largest using argument keep=1, or the second largest with keep=2, and so on.

We also call two other helper functions:

  1. removeIgraphBlanks() removes blank colors from multi-color nodes, such as pie nodes, or colored rectangle nodes. It helps show only the remaining colors without the whitespace.
  2. relayout_with_qfr() applies Fruchterman-Reingold layout to an igraph object. It's convenient to adjust repulse here, which is the recommended method for adjusting the spacing between nodes. This function also updates other useful attributes such as the node label angle, which makes labels appear offset away from the majority of edges. (It helps more with Cnet plots below.)
## You can alternatively pull out any other component
g_sub <- subset_igraph_components(g, keep=1);

## Re-apply network layout, and remove blank colors
g_sub <- relayout_with_qfr(repulse=3.5,
   removeIgraphBlanks(g_sub))

## Plot
jam_igraph(g_sub,
   node_factor=3,
   use_shadowText=TRUE)

Plot the enrichment P-value matrix

A useful overview of the enrichment results, including the overlaps across datasets, is shown in a heatmap of -log10(Pvalue). The helper function mem_enrichment_heatmap() is used here.

The argument p_cutoff is used to set the Pvalue below which cells are colorized -- every P-value above this threshold is not colored, and displayed as white, even when the P-value is less than 1.

mem_enrichment_heatmap(mem_canonical,
   p_cutoff=0.05);

The same data can be plotted as a heatmap.

mem_enrichment_heatmap(mem_canonical,
   style="heatmap",
   p_cutoff=0.05);

The argument color_by_column=TRUE will optionally apply a color gradient to each column, using the same P-value threshold defined by p_cutoff. In this case, cells are labeled to indicate the actual enrichment P-value.

mem_enrichment_heatmap(mem_canonical,
   style="heatmap",
   color_by_column=TRUE);

Plot the pathway-gene incidence matrix

We can view the pathway-gene matrix using the function mem_gene_path_heatmap(). This function uses data in the mem object: "memIM" which contains the gene-pathway incidence matrix, "geneIM" which contains the gene-dataset incidence matrix, and "enrichIM" which contains the pathway-dataset matrix.

This function attempts to "guess" a reasonable number of column and row splits -- which breaks the column and row dendrograms into separate sub-clusters. These breaks are visual only, but can be useful in follow-up analyses shown later. You can override these numbers with arguments column_split and row_split.

Also, the colors used across the top of the heatmap indicate the enrichment P-value, and are intentionally scaled so all P-values above the P-value threshold are not colored, and are therefore white.

This function uses the amazing ComplexHeatmap::Heatmap() function, which enables many awesome customizable features. The ... argument is passed to ComplexHeatmap::Heatmap(), so you can use those features as needed.

hm <- mem_gene_path_heatmap(mem_canonical,
   column_cex=0.5,
   row_cex=0.6);
hm;

As a follow-up analysis, you can pull out each pathway cluster from the heatmap itself, using heatmap_column_order():

hm_sets <- heatmap_column_order(hm);
hm_sets;

Plot the Pathway-Gene Concept network (Cnet plot)

We can view the Cnet plot (concept network plot) which shows the pathway-gene relationship.

The helper function memIM2cnet() creates a Cnet plot from the mem_canonical output. Here, we also pipe the result through other helper functions:

#cnet <- mem_canonical$multiCnetPlot1b;
cnet <- mem_canonical %>% 
   memIM2cnet() %>%
   fixSetLabels() %>%
   removeIgraphBlanks() %>%
   relayout_with_qfr(repulse=4);
#plot(cnet, vertex.label.cex=0.6);
par("mar"=c(5,4,4,2)+0.1)
jam_igraph(cnet,
   use_shadowText=TRUE,
   node_factor=0.5,
   vertex.label.cex=0.6);
mem_legend(mem_canonical);

You can extract the largest connected subnetwork to plot, as before.

g2 <- cnet;
g2_sub <- subset_igraph_components(cnet, keep=1)

#plot(g2_sub);
jam_igraph(g2_sub,
   use_shadowText=TRUE,
   label_factor=0.5,
   node_factor=0.5);

Customizing the Cnet plot

The most useful customization is to subset the pathway nodes.

One useful method to choose a subset of pathways is to view one heatmap cluster at a time. Here we will use hm_sets made previously.

cnet_sub <- subsetCnetIgraph(cnet,
   repulse=3.5,
   includeSets=unlist(hm_sets[c("A")]));
jam_igraph(cnet_sub,
   node_factor=1,
   use_shadowText=TRUE,
   label_dist_factor=3,
   label_factor=1.3);
mem_legend(mem_canonical);

Subset the Pathway-Gene Cnet plot

You can also subset the Pathway-Gene Cnet plot using specific names, or by some network criteria, using the function subsetCnetIgraph().

Pathways are subsetted to require at least 6 genes, using the argument minSetDegree, which can be useful to help simplify the graph structure.

Similarly, the minGeneDegree=2 argument can be used to display only genes present in 2 or more pathways, but for now we will not use that argument.

During the subset operation, some convenient default operations are also performed, controlled by function arguments:

cnet3 <- multienrichjam::subsetCnetIgraph(cnet,
   repulse=5,
   minSetDegree=6,
   minGeneDegree=1);
jam_igraph(cnet3,
   node_factor=0.7,
   use_shadowText=TRUE);
mem_legend(mem_canonical);

Custom igraph plot function jam_igraph()

The jam_igraph() function is a custom version of igraph::plot() intended to enhance the base function in a few ways:

The function below replicates the previous Cnet plot, with these changes:

jam_igraph(cnet3,
   edge_factor=2,
   node_factor=1.2,
   label_factor=1.2, 
   use_shadowText=TRUE,
   label_dist_factor=3);

One more fancy alternative, you can colorize the edges based upon the node colors, a great idea inspired by the Gephi netowrk visualization tool.

jam_igraph(color_edges_by_nodes(cnet3, alpha=0.7),
   edge_bundling="connections",
   edge_factor=2,
   node_factor=1,
   label_factor=1.2, 
   use_shadowText=TRUE,
   label_dist_factor=3)


jmw86069/multienrichjam documentation built on Nov. 3, 2024, 10:29 p.m.