graphsim-package | R Documentation |
graphsim is a package to simulate normalised expression data from networks
for biological pathways using ‘igraph
’ objects and multivariate
normal distributions.
This package provides functions to develop simulated continuous data
(e.g., gene expression) from a Sigma (Σ) covariance matrix derived from a
graph structure in ‘igraph
’ objects. Intended to extend
‘mvtnorm
’ to take 'igraph' structures rather than sigma
matrices as input. This allows the use of simulated data that correctly
accounts for pathway relationships and correlations. Here we present
a versatile statistical framework to simulate correlated gene expression
data from biological pathways, by sampling from a multivariate normal
distribution derived from a graph structure. This package allows the
simulation of biologicalpathways from a graph structure based on a
statistical model of gene expression, such as simulation of expression
profiles that of log-transformed and normalised data from microarray
and RNA-Seq data.
This package enables the generation of simulated gene expression datasets containing pathway relationships from a known underlying network. These simulated datasets can be used to evaluate various bioinformatics methodologies, including statistical and network inference procedures.
These are computed by 1) resolving inhibitory states to derive a consistent matrix of positive and negative edges, 2) inferring relationships between nodes from paths in the graph, 3) weighting these in a Sigma (Σ) covariance matrix and 4) using this to sample a multivariate normal distribution.
The generate_expression
function is a wrapper
around all necessary functions to give a final simulated dataset.
Here we set up an example graph object using the
igraph
package.
library("igraph") graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"), c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE)
Then we can call generate_expression
to return
the simulated data based on the relationships defined in the graph
structure. Various options are available to fine-tune this.
expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, sd = 1, comm = FALSE, dist = TRUE, absolute = FALSE, laplacian = FALSE)
Here we can see the final result. The graph
structure defines the covariance matrix used
by rmvnorm
to
generate a multivariate distribution.
dim(expr) library("gplots") heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4)
This dataset consists of 9 rows (one for each vertex or gene) in the graph and 100 columns (one for each sample or observation).
Input with an adjacency matrix is available using the
generate_expression_mat
function.
Graph structures can be passed directly from the
igraph
package.
Using this package, you can create an ‘igraph
’
class object.
> class(graph_structure) [1] "igraph" > graph_structure IGRAPH ba7fa2f DN-- 9 8 -- + attr: name (v/c) + edges from ba7fa2f (vertex names): [1] A->C B->C C->D D->E D->F F->G F->I H->I
This ‘igraph
’ object class can be passed
directly to generate_expression
shown above and internal functions described below:
make_sigma_mat_graph
,
make_sigma_mat_dist_graph
,
make_distance_graph
,
and
make_state_matrix
.
The ‘graphsim
’ package also supports various
matrix formats and has functions to handle these.
The following functions will compute matrices from an
‘igraph
’ object class:
make_adjmatrix_graph
to derive the adjacency matrix for a graph structure.
make_commonlink_graph
to derive the ‘common link’ matrix for a graph structure of
mutually shared neighbours.
make_laplacian_graph
to derive the Laplacian matrix for a graph structure.
The following functions will compute matrices from an
adjacency matrix
:
make_commonlink_adjmat
to derive the ‘common link’ matrix for a graph structure of
mutually shared neighbours.
make_laplacian_adjmat
to derive the Laplacian matrix for a graph structure.
We provide some pre-generate pathways from Reactoem database for testing and demonstrations:
RAF_MAP_graph
for the interactions in the “RAF/MAP kinase” cascade (17 vertices
and 121 edges).
Pi3K_graph
for the phosphoinositide-3-kinase cascade (35 vertices and 251 edges).
Pi3K_AKT_graph
for the phosphoinositide-3-kinase activation of Protein kinase B
pathway “PI3K/AKT activation” (275 vertices and 21106 edges).
TGFBeta_Smad_graph
for the TGF-β receptor signaling activates SMADs
pathway (32 vertices and 173 edges).
Please note that demonstrations on larger graph objects. These can be called directly from the pakage:
> graphsim::Pi3K_graph IGRAPH 21437e3 DN-- 35 251 -- + attr: name (v/c) + edges from 21437e3 (vertex names): [1] AKT1->AKT2 AKT1->AKT3 AKT1->CASP9 AKT1->CASP9 [5] AKT1->CASP9 AKT1->FOXO1 AKT1->FOXO1 AKT1->FOXO1 [9] AKT1->FOXO3 AKT1->FOXO3 AKT1->FOXO3 AKT1->FOXO4 [13] AKT1->FOXO4 AKT1->FOXO4 AKT1->GSK3B AKT1->GSK3B [17] AKT1->GSK3B AKT1->NOS1 AKT1->NOS2 AKT1->NOS3 [21] AKT1->PDPK1 AKT2->AKT3 AKT2->CASP9 AKT2->CASP9 [25] AKT2->CASP9 AKT2->FOXO1 AKT2->FOXO1 AKT2->FOXO1 [29] AKT2->FOXO3 AKT2->FOXO3 AKT2->FOXO3 AKT2->FOXO4 + ... omitted several edges + ... omitted several edges
They can also be imported into R:
data(Pi3K_graph)
You can assign them to your local environment by calling with from the package:
graph_object <- identity(Pi3K_graph)
You can also change the object class directly from the package:
library("igraph") Pi3K_adjmat <- as_adjacency_matrix(Pi3K_graph)
Pi3K_AKT_graph
and
TGFBeta_Smad_graph
contain graph edge attributes for the ‘state’ parameter
described below.
> TGFBeta_Smad_graph IGRAPH f3eac04 DN-- 32 173 -- + attr: name (v/c), state (e/n) + edges from f3eac04 (vertex names): [1] BAMBI ->SMAD7 BAMBI ->TGFB1 BAMBI ->TGFBR1 BAMBI ->TGFBR2 [5] CBL ->NEDD8 CBL ->NEDD8 CBL ->TGFBR2 CBL ->TGFBR2 [9] CBL ->UBE2M CBL ->UBE2M FKBP1A->TGFB1 FKBP1A->TGFBR1 [13] FKBP1A->TGFBR2 FURIN ->TGFB1 FURIN ->TGFB1 MTMR4 ->SMAD2 [17] MTMR4 ->SMAD2 MTMR4 ->SMAD3 MTMR4 ->SMAD3 NEDD4L->RPS27A [21] NEDD4L->SMAD7 NEDD4L->SMURF1 NEDD4L->SMURF2 NEDD4L->TGFB1 [25] NEDD4L->TGFBR1 NEDD4L->TGFBR2 NEDD4L->UBA52 NEDD4L->UBB [29] NEDD4L->UBC NEDD8 ->TGFBR2 NEDD8 ->UBE2M PMEPA1->SMAD2 + ... omitted several edges > E(TGFBeta_Smad_graph)$state [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [32] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [63] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [94] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 [125] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 [156] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 > states <- E(TGFBeta_Smad_graph)$state > table(states) states 1 2 103 70
The following functions are used by
generate_expression
to compute a simulated dataset. They can be called separately
to summarise the steps used to compute the final data matrix
or for troubleshooting.
make_sigma_mat_adjmat
,
make_sigma_mat_comm
,
make_sigma_mat_laplacian
, and
make_sigma_mat_graph
will
compute a Sigma (Σ) covariance matrix from an
adjacency matrix, common link matrix, Laplacian matrix,
or an ‘igraph’ object. There are computed as above
and passed to rmvnorm
.
make_distance_adjmat
,
make_distance_comm
,
make_distance_laplacian
, and
make_distance_graph
will
compute a distance matrix of relationships from an
adjacency matrix, common link matrix, Laplacian matrix,
or an ‘igraph’ object. There are computed as above
and passed to make_sigma
.
make_state_matrix
will compute a “state matrix” resolving positive and
negative correlations from a vector of edge properties. This
is called by make_sigma
and generate_expression
to ensure that the signs of correlations are consistent.
These internal functions can be called to compute steps of the simulation procedure and examine the results.
1. first we create a graph structure and define the input parameters
library("igraph") graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"), c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE) #sample size data.n <- 100 #data distributions data.cor <- 0.75 data.mean <- 3 data.sd <- 1.5 #inhibition states edge_states <- c(1, 1, -1, -1, 1, 1, 1, 1)
2. examine the relationships between the genes.
Here we can see which nodes share an edge:
> adjacency_matrix <- make_adjmatrix_graph(graph_structure) > adjacency_matrix A C B D E F G I H A 0 1 0 0 0 0 0 0 0 C 1 0 1 1 0 0 0 0 0 B 0 1 0 0 0 0 0 0 0 D 0 1 0 0 1 1 0 0 0 E 0 0 0 1 0 0 0 0 0 F 0 0 0 1 0 0 1 1 0 G 0 0 0 0 0 1 0 0 0 I 0 0 0 0 0 1 0 0 1 H 0 0 0 0 0 0 0 1 0
Here we define a geometrically decreasing series of relationships between genes based on distance by paths in the graph:
> relationship_matrix <- make_distance_graph(graph_structure, absolute = FALSE) > relationship_matrix A C B D E F G I H A 1.00000000 0.20000000 0.10000000 0.10000000 0.06666667 0.06666667 0.05000000 0.05000000 0.04000000 C 0.20000000 1.00000000 0.20000000 0.20000000 0.10000000 0.10000000 0.06666667 0.06666667 0.05000000 B 0.10000000 0.20000000 1.00000000 0.10000000 0.06666667 0.06666667 0.05000000 0.05000000 0.04000000 D 0.10000000 0.20000000 0.10000000 1.00000000 0.20000000 0.20000000 0.10000000 0.10000000 0.06666667 E 0.06666667 0.10000000 0.06666667 0.20000000 1.00000000 0.10000000 0.06666667 0.06666667 0.05000000 F 0.06666667 0.10000000 0.06666667 0.20000000 0.10000000 1.00000000 0.20000000 0.20000000 0.10000000 G 0.05000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 1.00000000 0.10000000 0.06666667 I 0.05000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 0.10000000 1.00000000 0.20000000 H 0.04000000 0.05000000 0.04000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 1.00000000
Here we can see the resolved edge states through paths in the adjacency matrix:
> names(edge_states) <- apply(graph_structure_edges, 1, paste, collapse = "-") > edge_states A-C B-C C-D D-E D-F F-G F-I H-I 1 1 -1 -1 1 1 1 1 > state_matrix <- make_state_matrix(graph_structure, state = edge_states) > state_matrix A C B D E F G I H A 1 1 1 -1 1 -1 -1 -1 -1 C 1 1 1 -1 1 -1 -1 -1 -1 B 1 1 1 -1 1 -1 -1 -1 -1 D -1 -1 -1 1 -1 1 1 1 1 E 1 1 1 -1 1 -1 -1 -1 -1 F -1 -1 -1 1 -1 1 1 1 1 G -1 -1 -1 1 -1 1 1 1 1 I -1 -1 -1 1 -1 1 1 1 1 H -1 -1 -1 1 -1 1 1 1 1
3. define a Sigma (Σ) covariance matrix
Here we can see that the signs match the state_matrix
and the covariance is based on the relationship_matrix
weighted by the correlation (cor
) and standard
deviation (sd
) parameters.
Note that where sd = 1
, the diagonals will be 1
and the off-diagonal terms will be correlations.
> sigma_matrix <- make_sigma_mat_dist_graph( + graph_structure, + state = edge_states, + cor = data.cor, + sd = data.sd, + absolute = FALSE + ) > sigma_matrix A C B D E F G I H A 2.250000 1.687500 0.843750 -0.84375 0.562500 -0.56250 -0.421875 -0.421875 -0.337500 C 1.687500 2.250000 1.687500 -1.68750 0.843750 -0.84375 -0.562500 -0.562500 -0.421875 B 0.843750 1.687500 2.250000 -0.84375 0.562500 -0.56250 -0.421875 -0.421875 -0.337500 D -0.843750 -1.687500 -0.843750 2.25000 -1.687500 1.68750 0.843750 0.843750 0.562500 E 0.562500 0.843750 0.562500 -1.68750 2.250000 -0.84375 -0.562500 -0.562500 -0.421875 F -0.562500 -0.843750 -0.562500 1.68750 -0.843750 2.25000 1.687500 1.687500 0.843750 G -0.421875 -0.562500 -0.421875 0.84375 -0.562500 1.68750 2.250000 0.843750 0.562500 I -0.421875 -0.562500 -0.421875 0.84375 -0.562500 1.68750 0.843750 2.250000 1.687500 H -0.337500 -0.421875 -0.337500 0.56250 -0.421875 0.84375 0.562500 1.687500 2.250000
4. generate an expression dataset using this sigma matrix
We use generate_expression
to compute and expression
dataset, simulated using these parameters:
> expression_data <- generate_expression( + n = data.n, + graph_structure, + state = edge_states, + cor = data.cor, + mean = data.mean, + sd = data.sd, + comm = FALSE, + dist = FALSE, + absolute = FALSE, + laplacian = FALSE + ) > dim(expression_data) [1] 9 100
Here we also compute the final observed correlations in the simulated dataset:
> cor_data <- cor(t(expression_data)) > dim(cor_data) [1] 9 9
These functions are demonstrated in more detail in the main vignette.
Heatmaps can be used from the gplots
package to display these simulated datasets.
library("gplots") heatmap.2(adjacency_matrix, scale = "none", trace = "none", col = colorpanel(50, "white", "black"), key = FALSE) heatmap.2(relationship_matrix, scale = "none", trace = "none", col = colorpanel(50, "white", "red")) heatmap.2(state_matrix, scale = "none", trace = "none", col = colorpanel(50, "royalblue", "palevioletred"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) heatmap.2(sigma_matrix, scale = "none", trace = "none", col = colorpanel(50, "royalblue", "white", "palevioletred"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) heatmap.2(expression_data, scale = "none", trace = "none", col = colorpanel(50, "royalblue", "white", "palevioletred"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) heatmap.2(cor_data, scale = "none", trace = "none", col = colorpanel(50, "royalblue", "white", "palevioletred"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure)))
In particular we can see here that the expected correlations
show by the sigma_matrix
are similar to the observed
correlations in the cor_data
.
The ‘graphsim’ package comes with a built-in plotting function to display graph objects.
graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"), c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE) plot_directed(graph_structure, layout = layout.kamada.kawai)
This supports the ‘state’ parameter to display activating relationships (with positive correlations) and inhibiting or repressive relationships (with negative correlations).
edge_states <- c(1, 1, -1, -1, 1, -1, 1, -1) graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE) plot_directed(graph_structure, state = edge_states, col.arrow = c("darkgreen", "red")[edge_states / 2 + 1.5] layout = layout.kamada.kawai)
These states can also be passed from the ‘state’ edge attribute of the graph object.
graph_pathway <- identity(TGFBeta_Smad_graph) edge_properties <- E(graph_pathway)$state plot_directed(graph_pathway, col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[edge_properties], fill.node = c("lightblue"), layout = layout.kamada.kawai)
This plotting function is demonstrated in more detail in the plots_directed.Rmd plotting vignette.
The graphsim package is published in the Journal of Open Source Software. See the paper here for more details: doi: 10.21105/joss.02161
The graphsim GitHub repository is here: TomKellyGenetics/graphsim You can find the development version and submit an issue if you have questions or comments.
To cite package 'graphsim' in publications use:
Kelly, S.T. and Black, M.A. (2020). graphsim: An R package for simulating gene expression data from graph structures of biological pathways. Journal of Open Source Software, 5(51), 2161, doi: 10.21105/joss.02161
A BibTeX entry for LaTeX users is:
@article{Kelly2020joss02161, doi = {10.21105/joss.02161}, year = {2020}, publisher = {The Open Journal}, volume = {5}, number = {51}, pages = {2161}, author = {S. Thomas Kelly and Michael A. Black}, title = {graphsim: An R package for simulating gene expression data from graph structures of biological pathways}, journal = {Journal of Open Source Software} }
Maintainer: Tom Kelly tom.kelly@riken.jp
Authors:
Reviewers:
Editor: Mark Jensen (Frederick National Laboratory for Cancer Research)
Publication at Journal of Open Source Software:
doi: 10.21105/joss.02161
GitHub repository:
Report bugs:
Contributions:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.