knitr::opts_chunk$set(echo = TRUE, tidy.opts = list(width.cutoff = 80), tidy = TRUE) library(anansi) library(ggplot2)
A common challenge in integrative analysis of high-dimensional data sets is subsequent biological interpretation. Anansi addresses this challenge by only considering pairwise associations that are known to occur a priori. In order to achieve this, we need to provide this relational information to anansi in the form of a (bi)adjacency matrix.
This vignette is divided by two sections:
AnansiWeb
object and demonstrates how
to work with itRecall that the citric acid cycle, or Krebs cycle, is a central piece of metabolic machinery involved in oxidative phosphorylation. See figure below for a simplified representation with 8 unique enzymatic reactions and 9 unique metabolites.
#| plot-krebs, echo=FALSE, out.width='100%', out.height='100%', warning=FALSE, #| message=FALSE, #| fig.cap="Simplified representation of the Krebs cycle. Orange boxes mark #| metabolites and blue arrows represent enyzmes. " library(patchwork) library(ggraph) library(igraph) library(tidygraph) library(dplyr) metabs <- c( "isocitrate", "cis-aconitate", "citrate", "oxaloacetate", "L-malate", "fumarate", "succinate", "succinyl-CoA", "ketoglutarate" ) d <- data.frame( from = c( "isocitrate", "ketoglutarate", "succinyl-CoA", "succinate", "fumarate", "L-malate", "oxaloacetate", "citrate", "cis-aconitate" ), to = c( "ketoglutarate", "succinyl-CoA", "succinate", "fumarate", "L-malate", "oxaloacetate", "citrate", "cis-aconitate", "isocitrate" ), vertices = c( "isocitrate\ndehydrogenase", "ketoglutarate\ndehydrogenase", "succinyl-CoA\nsynthetase", "succinate\ndehydrogenase", "\nfumarase", "malate\ndehydrogenase", "citrate\nsynthase", "aconitase\n", "aconitase\n" ) ) krebs_layout <- data.frame( x = (cos(seq(2 * pi, 0, length = 10) + pi / 2) * 1.25)[c(3:9, 1:2)], y = (sin(seq(2 * pi, 0, length = 10) + pi / 2) * 1.25)[c(3:9, 1:2)], circular = FALSE, name = c( "isocitrate", "ketoglutarate", "succinyl-CoA", "succinate", "fumarate", "L-malate", "oxaloacetate", "citrate", "cis-aconitate" ), .ggraph.orig_index = 1:9, .ggraph.index = 1:9 ) graph_from_data_frame(d, directed = TRUE) |> as_tbl_graph() |> mutate(Legend = if_else(name %in% metabs, "Metabolite", "Enzyme")) |> ggraph(layout = krebs_layout) + geom_edge_link( arrow = arrow(length = unit(4, "mm"), type = "closed"), edge_width = 2.5, color = "black", end_cap = circle(11.5, "mm"), start_cap = circle(11.25, "mm") ) + geom_edge_link( aes(label = vertices), arrow = arrow(length = unit(4, "mm"), type = "closed"), edge_width = 2, color = "dodgerblue", end_cap = circle(11.5, "mm"), start_cap = circle(11.5, "mm"), angle_calc = "along", label_dodge = unit(c(rep(13, 5), rep(-13, 4)), "mm"), label_size = 3.15 ) + geom_node_label(aes(label = name), size = 3.5, fill = "orange", show.legend = FALSE, label.padding = unit(0.4, "lines"), nudge_x = c(0, 0.1, 0.05, -0.05, -0.1, 0, -0.1, 0, 0.1) ) + coord_equal( xlim = c(-1.25, 1.25), ylim = c(-1, 1), ratio = 1, clip = "off" ) + theme_graph()
Let's imagine we had (high-dimensional proxy) measurements available for both types. For instance, we could have transcription data for the enzymes and metabolomics data for the metabolites. We could represent our two data sets like two sets of features.
If we were to perform an all-vs-all integrative analysis between our two data sets, comprehensively testing every metabolite-enzyme pair, we'd test $8\times9=72$ hypotheses.
#| all-v-all, echo=FALSE, fig.wide=TRUE, out.width='100%', out.height='60%', #| fig.cap="Graph representation of possible association sctructures in the #| Krebs cycle. **A)** All-vs-all, with a full graph. **B)** Constrained, only #| linking metabolites an enzymes that are known to directly interact." colmat <- data.frame( name = c( "aconitase", "isocitrate\ndehydrogenase", "ketoglutarate\ndehydrogenase", "succinyl-CoA\nsynthetase", "succinate\ndehydrogenase", "fumarase", "malate\ndehydrogenase", "citrate\nsynthase", "citrate", "cis-aconitate", "isocitrate", "ketoglutarate", "succinyl-CoA", "succinate", "fumarate", "L-malate", "oxaloacetate" ), y = c(seq(9, 1, length.out = 8), 9:1), x = rep(c(0.7, 0.3), times = c(8, 9)), fill = rep(c("Enzyme", "Metabolite"), times = c(8, 9)) ) p_0 <- ggplot() + annotate("segment", x = c(0.28, 0.72), y = 0.875, xend = c(0.28, 0.72), yend = 9.125, color = "gray" ) + geom_text( data = colmat, aes(y = y, x = x, label = name), colour = "black", size = 3.25, hjust = c(rep(0, 8), rep(1, 9)), lineheight = 7 / 10, nudge_x = c(rep(0.03, 8), rep(-0.03, 9)) ) + scale_fill_manual( values = c(Metabolite = "orange", Enzyme = "dodgerblue") ) + coord_cartesian(xlim = c(1 / 20, 19 / 20), clip = "off") + annotate("text", label = c("Metabolites", "Enzymes"), x = c(2 / 8, 6 / 8), y = 9.6, size = 4 ) + ylab(NULL) + xlab(NULL) + theme_void() p_a <- p_0 + geom_point( data = colmat, aes(y = y, x = x, fill = fill), color = "black", shape = 21, size = 4, show.legend = FALSE ) p_b <- p_0 + geom_segment( data = expand.grid(seq(9, 1, length.out = 8), 9:1), aes(y = Var2, yend = Var1), x = 3 / 10, xend = 7 / 10, color = "#393D47", inherit.aes = FALSE ) + geom_point( data = colmat, aes(y = y, x = x, fill = fill), color = "black", shape = 21, size = 4, show.legend = FALSE ) p_c <- p_0 + geom_segment( data = data.frame( Var2 = seq(1, 9, length.out = 8)[ c(8, 1, 8, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1) ], Var1 = c(9, 9, 8, 7, 7, 6, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1) ), aes(y = Var1, yend = Var2), x = 3 / 10, xend = 7 / 10, color = "#393D47", inherit.aes = FALSE ) + geom_point( data = colmat, aes(y = y, x = x, fill = fill), color = "black", shape = 21, size = 4, show.legend = FALSE ) (p_b | p_c) + plot_annotation(tag_levels = "A")
Often however, only a subset of these comparisons is scientifically relevant to investigate. In this case for instance, only 17 associations between enzyme-metabolite pairs are likely to make sense. Non-selectively testing all 72 associations actively harms statistical power, as 55 of these tests likely cannot be interpreted, but their p-values will still be tallied for the purpose of FDR and FWER. Moreover, they tend to obscure any biologically interpretable findings.
In order to 'know' which features-pairs in your data set should be considered, anansi requires a biadjacency matrix. The biadjacency matrix corresponding to to the sparser graph on the right can be seen in the figure below.
#| krebs-biadj, echo=FALSE, fig.small=TRUE, #| fig.cap="Matrix representation of the constrained analysis of the Krebs #| cycle, with cells marked with 'X' signifying tested feature-pairs." #| krebs_edge_df <- local({ data("krebs", package = "anansi") return(krebs) }) krebs_edge_df |> mutate(Enzyme = `levels<-`( Enzyme, gsub(x = levels(Enzyme), " ", replacement = "\n") )) |> ggplot() + geom_text(label = "X", x = I(0.5), y = I(0.5)) + facet_grid(Enzyme ~ Metabolite, switch = "y", space = "free_x") + theme_bw() + theme( strip.text.y.left = element_text( angle = 0, size = 10, lineheight = 7 / 10, hjust = 1 ), strip.text.x.top = element_text(angle = 90, size = 10, hjust = 0), strip.background.x = element_rect(fill = "orange", color = "black"), strip.background.y = element_rect(fill = "dodgerblue", color = "black"), panel.spacing = unit(0.1, "lines") ) + coord_equal(ratio = 1) + labs(x = NULL, y = NULL)
In anansi, we work with biadjacency matrices in the AnansiWeb
S4 class, which
consists of a biadjacency matrix, in this case we call it a dictionary
, as
well as two tables of observations that should be analysed jointly.
The anansi package offers some functions that should be sufficient for basic analysis. To enable more advanced users to pursue non-standard applications, our methodology is compatible with several popular interfaces.
weaveWeb()
The recommended default interface to generate an AnansiWeb
object is
weaveWeb()
. Besides a traditional interface, it also accepts R formula
syntax. Once we have an AnansiWeb
object, we can use the $ operator to access
the biadjacency matrix in the 'dictionary' slot.
# Two interfaces form.web <- weaveWeb(ec ~ ko, link = kegg_link()) trad.web <- weaveWeb(y = "ec", x = "ko", link = kegg_link()) # Same output identical(form.web, trad.web) # Get the biadjacency matrix using the $ operator head(dictionary(form.web), c(10, 10))
Note that the majority of this matrix consists of dots, it's a sparse matrix, which is a format that can considerably speed up calculations involving huge, mostly empty matrices. For further reading, see the Matrix package website.
weaveWeb()
input: link
weaveWeb()
takes additional arguments, let's focus on link
, which will take
the information necessary to link features between input data sets. The anansi
package includes such data for the KEGG database,
let's take a look at some of it:
# Extract the data.frame from the list that kegg_link() returns ec2ko <- kegg_link()[["ec2ko"]] head(ec2ko)
link
inputThe format of the data.frame
is important: It should have two named columns.
Each column contains feature IDs of one type (in this case, ec
and ko
, which
is how the KEGG database refers to
Enzyme Commission numbers and kegg orthologes, respectively. If two feature
IDs are on the same row of the data.frame
, it means those features are paired
(hence adjacency).
The column names correspond to the types of feature IDs that should be linked.
These names are used throughout the AnansiWeb
workflow. They can be directly
called from an AnansiWeb
object:
# Note that the formula controls the order names(weaveWeb(ko ~ ec, link = kegg_link())) # names() is short for the above names(weaveWeb(ec ~ ko, link = kegg_link()))
We need to provide weaveWeb()
with a map of which features are linked. We can
use the pre-packaged kegg link map, which consists of two similarly structured
data.frame
s in a list, one of which we just inspected. We can call it directly
using kegg_link()
:
# Only print the first few entries lapply(kegg_link(), head)
data.frame
Note the column names in the two data.frames
: ec
, cpd
and ko
. Two of
these correspond to the formula we used just used: weaveWeb(ec ~ ko)
. We
didn't use the term cpd
. Further, ec
is present in both data.frame
s.
We can use weaveWeb()
to make a biadjacency matrix between any combination of
one or two similarly structured data.frame
s, presented as a list. For the pre-
packaged data set, this means we can link between any two of ec
, cpd
and
ko
, in either order.
# Formula notation head(dictionary(weaveWeb(cpd ~ ko, link = kegg_link())), c(10, 10)) # Character notation head(dictionary(weaveWeb(y = "ko", x = "cpd", link = kegg_link())), c(10, 10))
The most flexible way to make an AnansiWeb
object is through AnansiWeb()
:
# Prep some dummy input tables tX <- matrix(rnorm(30), nrow = 6, dimnames = list(NULL, x = paste("x", seq_len(5), sep = "_")) ) tY <- matrix(rnorm(42), nrow = 6, dimnames = list(NULL, y = paste("y", seq_len(7), sep = "_")) ) # Prep biadjacency matrix # base::matrix is fine too, but will get coerced to Matrix::Matrix. d <- matrix( data = sample(x = c(TRUE, FALSE), size = 35, replace = TRUE), nrow = NCOL(tY), ncol = NCOL(tX), dimnames = list(colnames(tY), colnames(tX)) ) # make the AnansiWeb w <- AnansiWeb( tableY = tY, tableX = tX, dictionary = d ) # Confirm the warning: names(w)
There are numerous ways in which we can define an adjacency matrix. Here, we demonstrate a graph-based and matrix based approach.
igraph
Importantly, (bi)adjacency matrices can be understood as graphs. Two common
packages that deal with graphs are igraph
and graph
. To start off, we need to
have a graph object, so let's make one using igraph
.
library(igraph) # Prepared data.frame for the occasion head(krebs_edge_df) # Convert data.frame to graph g <- graph_from_data_frame(krebs_edge_df, directed = FALSE)
Now that we have constructed a graph, we still need to identify which features,
vertices, belong to which data modality, in this case either enzymes and
metabolites. Using the igraph
package, this should be done through the type
slot. type
can be assigned a boolean vector, where TRUE
values
become columns, corresponding to features in tableX
, whereas FALSE
values
become rows, which correspond to those in tableY
.
V(g)$type <- V(g)$name %in% krebs_edge_df$Metabolite # Now that we've defined type, we can convert to biadjacency matrix: bi_mat1 <- as_biadjacency_matrix(g, sparse = TRUE) head(bi_mat1, n = c(4, 5))
Though biadjacency support in graph
is currently limited, We note that igraph
and graph objects can be converted using the graph_from_graphnel()
and
as_graphnel()
functions in igraph
.
Matrix
We can also define a matrix directly. Conveniently, sparse matrices can be
defined easily from our starting data. The Matrix
library provides fantastic
support for this.
library(Matrix) # For this approach, it's useful to prepare the input as two factors: Enzyme <- factor(krebs_edge_df$Enzyme) Metabolite <- factor(krebs_edge_df$Metabolite) # We can get integers out of factors, corresponding to their level indices bi_mat2 <- sparseMatrix( i = as.integer(Enzyme), j = as.integer(Metabolite), dimnames = list( Enzyme = levels(Enzyme), Metabolite = levels(Metabolite) ) ) head(bi_mat2, n = c(4, 5))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.