knitr::opts_chunk$set( collapse=TRUE, comment="#>" )
library(multienrichjam); library(jamba); library(colorjam); suppressPackageStartupMessages(library(ComplexHeatmap)); options("stringsAsFactors"=FALSE, "warn"=-1); knitr::opts_chunk$set( fig.height=10, fig.width=10, fig.align="center" ) ragg_png = function(..., res = 192) { ragg::agg_png(..., res = res, units = "in") } knitr::opts_chunk$set(dev = "ragg_png", fig.ext = "png")
This document describes steps recommended for using Ingenuity IPA enrichment data.
Ingenuity IPA enrichment data can be exported using a function
"Export All"
which by default creates one text file,
concatenating each enrichment table into one large file.
This workflow demonstrates the import process using two IPA enrichment files used by Reese et al 2019 https://doi.org/10.1016/j.jaci.2018.11.043 to compare enrichment results in newborns to older children.
It therefore requires IPA enrichment results have already been exported in text format from IPA.
To import an IPA text file, use importIPAenrichment()
:
newborn_txt <- system.file("extdata", "Newborns-IPA.txt", package="multienrichjam"); newborn_dfl <- importIPAenrichment(newborn_txt);
The result is a list of data.frame
objects, where each
data.frame
represents one enrichment test. A convenient
way to see the dimensions of each data.frame
is with
the function jamba::sdim()
:
sdim(newborn_dfl);
For MultiEnrichMap, we typically want to analyze multiple
IPA enrichment files, so we can wrap the call in
an lapply()
function:
newborn_txt <- system.file("extdata", "Newborns-IPA.txt", package="multienrichjam"); olderchildren_txt <- system.file("extdata", "OlderChildren-IPA.txt", package="multienrichjam"); ipa_files <- c(Newborns=newborn_txt, OlderChildren=olderchildren_txt) ipa_l <- lapply(ipa_files, importIPAenrichment);
Now we can check the dimensions within each list using
jamba::ssdim()
:
ssdim(ipa_l);
In most cases, each IPA file should contain the same
enrichment tests, for example "Canonical Pathways"
,
"Upstream Regulators"
, "Diseases and Bio Functions"
,
etc. However, it is not always the case, so it is
recommended to check and verify each IPA file contains
at least the enrichment tests needed for downstream analysis.
IPA performs multiple enrichment tests, which are done independently and with unique assumptions and caveats. Therefore, I recommend using one enrichment test at a time in MultiEnrichMap.
Extract one data.frame
from each result:
library(igraph) ## Take only the Ingenuity Canonical Pathways enrichList_canonical <- lapply(ipa_l, function(i){ i[["Canonical Pathways"]]; }); sdim(enrichList_canonical); ## Convert data.frame to enrichResult ## multienrichjam::enrichDF2enrichResult er_canonical <- lapply(enrichList_canonical, function(i){ enrichDF2enrichResult(i, keyColname="Name", pvalueColname="P-value", geneColname="geneNames", geneRatioColname="Ratio", pvalueCutoff=1) }); sdim(er_canonical); kable_coloring( head(as.data.frame(er_canonical[[1]])), caption="Top 10 rows of enrichment data", row.names=FALSE) %>% kableExtra::column_spec(column=seq_len(ncol(er_canonical[[1]])), border_left="1px solid #DDDDDD", extra_css="white-space: nowrap;")
Now given a list of data.frame
results, we can
run multiEnrichMap()
:
mem_canonical <- multiEnrichMap(er_canonical, enrichBaseline=1, cutoffRowMinP=0.05, colorV=c("purple", "orange"), topEnrichN=20)
The output mem_canonical
is a list containing various
results.
kable_coloring( sdim(mem_canonical), caption="sdim(mem_canonical)") %>% kableExtra::column_spec(column=seq_len(4), border_left="1px solid #DDDDDD", extra_css="white-space: nowrap;")
The other analysis steps describe in this vignette are
valid options, however in recent
work with multienrichjam
we have developed a new function
to create a small folio of relates plots that we call
a "mem plot folio", and the function is mem_plot_folio()
.
The mem_plot_folio()
function creates the following plots by default:
Cnet plot including all collapsed pathway clusters
LETTERS
, "A", "B", "C", "D", etc.n
pathway namesn
pathway names, but hides gene labels.
This plot is best when there are too many genes to label.Cnet plot using 1,2,3 exemplars per pathway cluster (configurable)
Cnet plot of each pathway cluster individually
mem_canonical_plots <- multienrichjam::mem_plot_folio(mem_canonical, pathway_column_split=4, node_factor=1, use_shadowText=TRUE, label_factor=1.2, do_which=c(1:5), verbose=TRUE, main="Canonical Pathways");
The object returned mem_canonical_plots
will contain a list
of the graphical objects for each plot requested. For example,
each Cnet plot returns an igraph
object which can be customized.
In practice, we rarely found much benefit from the multi-enrichment map, which shows pathways connected to pathways based upon Jaccard overlap. This network view was often disorganized, not clearly clustered, and lacked ability to see which genes drive the overlaps between pathways. Perhaps it works best for the highly structured Gene Ontology (GO), but in our hands GO was just not insightful for the broad range of experiments we were analyzing.
Instead, we almost always gravitated toward the Cnet plot, a clever idea by Dr. Guangchang Yu that connects pathways to genes, where genes also connect naturally to other pathways. This plot for many of our collaborators has seemed more intuitive, and answers the next question people often have: "What are the shared genes?" Our subtle addition to this plot is to color-code genes by the enrichment experiment, so you can see which genes are shared or unique by experiment.
Since most enrichment results have far more pathways than can be shown
in one Cnet plot, we tried to prioritize which pathways to display.
The logic in Dr. Yu's enrichplot
is to take the top n
pathways by P-value,
however we noticed for our data this logic often shows redundant pathways and
misses orthologous biological functions, which might be enriched
to a lesser degree. So we started clustering pathways by P-value,
then by using the pathway-gene incidence matrix.
The pathway-gene incidence matrix worked best in our hands for this clustering, partly because it clusters pathways by the genes involved, and not by the pattern of enrichment across the enrichment tests (which may not be related to the underlying genes.) So we cluster pathways, then choose an arbitrary number of clusters, and took one "exemplar" from each cluster to represent each cluster. This practice is mostly effective, especially when the clusters are well-defined, and when members of each cluster have relatively similar function. We can increase the exemplars per cluster as needed. This strategy is helpful because each pathway is very clearly labeled.
Most recently, we experimented with collapsing each pathway_cluster,
so each pathway_cluster includes all genes implicated by pathways
contained in the pathway_cluster. The benefit here is that no genes are lost
by selecting a subset of "exemplar" pathways -- all genes are retained
per cluster. Then instead of plotting pathway-gene network, we plot
a pathway_cluster-gene network. Each pathway_cluster is labeled
using the top n
pathway names -- which is configurable, and is
only a visible label. Each pathway_cluster contains all the
pathways in the cluster.
The plots in mem_plot_folio()
are also available as individual
functions.
In general, it is most convenient to call mem_plot_folio()
and
use the output list
to create custom plots, however the functions
are also available to be called independently.
We can view the "Multi Enrichment Map" itself
with mem_multienrichplot()
.
This network connects pathways when they meet a Jaccard
overlap coefficient threshold based upon the shared genes
between the pathways. The default overlap is stored
by multiEnrichMap()
in the output object mem_canonical
.
g <- mem_multienrichplot(mem_canonical, do_plot=FALSE, overlap=0.3, node_factor=2, repulse=3.5) jam_igraph(g, node_factor=2, use_shadowText=TRUE)
The overlap threshold can be customized manually,
or through the helper function mem_find_overlap()
.
This function attempts to guess a reasonable overlap
threshold based upon the network connectivity,
by optimizing the cluster size and total nodes involved.
g <- mem_multienrichplot(mem_canonical, overlap=mem_find_overlap(mem_canonical), do_plot=FALSE, node_factor=3, repulse=3.5) jam_igraph(g, node_factor=3, use_shadowText=TRUE)
Notice there are distinct subnetworks, called "components",
which are not connected to each other. A useful follow-up
technique is to graph one component at a time, which we do
using the helper function subset_igraph_components()
.
Components are ordered by size, largest to smallest,
so you can keep the largest using argument keep=1
,
or the second largest with keep=2
, and so on.
We also call two other helper functions:
removeIgraphBlanks()
removes blank colors from multi-color
nodes, such as pie nodes, or colored rectangle nodes. It helps
show only the remaining colors without the whitespace.relayout_with_qfr()
applies Fruchterman-Reingold layout
to an igraph
object. It's convenient to adjust repulse
here,
which is the recommended method for adjusting the spacing
between nodes. This function also updates other useful
attributes such as the node label angle, which makes labels
appear offset away from the majority of edges. (It helps
more with Cnet plots below.)## You can alternatively pull out any other component g_sub <- subset_igraph_components(g, keep=1); ## Re-apply network layout, and remove blank colors g_sub <- relayout_with_qfr(repulse=3.5, removeIgraphBlanks(g_sub)) ## Plot jam_igraph(g_sub, node_factor=3, use_shadowText=TRUE)
A useful overview of the enrichment results, including
the overlaps across datasets, is shown in a heatmap
of -log10(Pvalue)
. The helper function
mem_enrichment_heatmap()
is used here.
The argument p_cutoff
is used to set the Pvalue
below which cells are colorized -- every P-value above
this threshold is not colored, and displayed as white,
even when the P-value is less than 1.
mem_enrichment_heatmap(mem_canonical, p_cutoff=0.05);
The same data can be plotted as a heatmap.
mem_enrichment_heatmap(mem_canonical, style="heatmap", p_cutoff=0.05);
The argument color_by_column=TRUE
will optionally apply
a color gradient to each column, using the same P-value
threshold defined by p_cutoff
. In this case, cells are
labeled to indicate the actual enrichment P-value.
mem_enrichment_heatmap(mem_canonical, style="heatmap", color_by_column=TRUE);
We can view the pathway-gene matrix using the function
mem_gene_path_heatmap()
. This function uses data in the
mem
object: "memIM"
which contains the gene-pathway
incidence matrix, "geneIM"
which contains the gene-dataset
incidence matrix, and "enrichIM"
which contains the
pathway-dataset matrix.
This function attempts to "guess" a reasonable number
of column and row splits -- which breaks the column and
row dendrograms into separate sub-clusters. These
breaks are visual only, but can be useful in follow-up
analyses shown later. You can override these numbers
with arguments column_split
and row_split
.
Also, the colors used across the top of the heatmap indicate the enrichment P-value, and are intentionally scaled so all P-values above the P-value threshold are not colored, and are therefore white.
This function uses the amazing ComplexHeatmap::Heatmap()
function, which enables many awesome customizable features.
The ...
argument is passed to ComplexHeatmap::Heatmap()
,
so you can use those features as needed.
hm <- mem_gene_path_heatmap(mem_canonical, column_cex=0.5, row_cex=0.6); hm;
As a follow-up analysis, you can pull out each pathway
cluster from the heatmap itself, using heatmap_column_order()
:
hm_sets <- heatmap_column_order(hm); hm_sets;
We can view the Cnet plot (concept network plot) which shows the pathway-gene relationship.
The helper function memIM2cnet()
creates a Cnet plot
from the mem_canonical
output. Here, we also pipe the
result through other helper functions:
fixSetLabels()
applies pathway label word wraprelayout_with_qfr()
applies network layout and adjusts node labelsremoveIgraphBlanks()
removes blank colors from the igraph
nodes#cnet <- mem_canonical$multiCnetPlot1b; cnet <- mem_canonical %>% memIM2cnet() %>% fixSetLabels() %>% removeIgraphBlanks() %>% relayout_with_qfr(repulse=4); #plot(cnet, vertex.label.cex=0.6); par("mar"=c(5,4,4,2)+0.1) jam_igraph(cnet, use_shadowText=TRUE, node_factor=0.5, vertex.label.cex=0.6); mem_legend(mem_canonical);
You can extract the largest connected subnetwork to plot, as before.
g2 <- cnet; g2_sub <- subset_igraph_components(cnet, keep=1) #plot(g2_sub); jam_igraph(g2_sub, use_shadowText=TRUE, label_factor=0.5, node_factor=0.5);
The most useful customization is to subset the pathway nodes.
One useful method to choose a subset of pathways
is to view one heatmap cluster at a time. Here
we will use hm_sets
made previously.
cnet_sub <- subsetCnetIgraph(cnet, repulse=3.5, includeSets=unlist(hm_sets[c("A")])); jam_igraph(cnet_sub, node_factor=1, use_shadowText=TRUE, label_dist_factor=3, label_factor=1.3); mem_legend(mem_canonical);
You can also subset the Pathway-Gene Cnet plot using
specific names, or by some network criteria, using
the function subsetCnetIgraph()
.
Pathways are subsetted to require at least 6 genes,
using the argument minSetDegree
,
which can be useful to help simplify the graph structure.
Similarly, the minGeneDegree=2
argument can be used to
display only genes present in 2 or more pathways, but for
now we will not use that argument.
During the subset operation, some convenient default operations are also performed, controlled by function arguments:
remove_singlets=TRUE
force_relayout=TRUE
do_reorder=TRUE
spread_labels=TRUE
remove_blanks=FALSE
, but can be during this step by
remove_blanks=TRUE
.cnet3 <- multienrichjam::subsetCnetIgraph(cnet, repulse=5, minSetDegree=6, minGeneDegree=1); jam_igraph(cnet3, node_factor=0.7, use_shadowText=TRUE); mem_legend(mem_canonical);
The jam_igraph()
function is a custom version of igraph::plot()
intended to enhance the base function in a few ways:
edge_bundling="connections"
improves the rendering of edges
by bundling edges from node clusters, so they are drawn with
a bezier curveuse_shadowText=TRUE
will draw labels with a contrasting border
to improve legibility of text labelsrescale=FALSE
keeps the network layout aspect ratio
instead of scaling the coordinates to fit the size
of the plot window. It also properly scales the node and edge sizes.new arguments to help resize existing values:
label_factor
re-scales the label.cex
values by a multiplier
node_factor
, edge_factor
re-scales the node.size
and edge.width
by a multiplierlabel_dist_factor
re-scales the label.dist
values by a multiplierThe function below replicates the previous Cnet plot, with these changes:
node_factor=2
makes the nodes 2x largerlabel_factor=1.2
makes the label.cex
values 1.2x largerlabel_dist_factor=3
makes the label distance from the node center
3x larger, which emphasizes the label placement at an angle away from
the edges of each node.jam_igraph(cnet3, edge_factor=2, node_factor=1.2, label_factor=1.2, use_shadowText=TRUE, label_dist_factor=3);
One more fancy alternative, you can colorize the edges based upon the node colors, a great idea inspired by the Gephi netowrk visualization tool.
jam_igraph(color_edges_by_nodes(cnet3, alpha=0.7), edge_bundling="connections", edge_factor=2, node_factor=1, label_factor=1.2, use_shadowText=TRUE, label_dist_factor=3)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.