#knitr::opts_chunk$set(collapse = TRUE, comment = "#>", width = 68) knitr::opts_chunk$set(fig.cap = "", fig.path = "Plot") options(width = 68, cli.unicode = FALSE, cli.width = 68) #par(mai=c(2.82, 2.82, 0.82, 0.82)-0.82) par(mar=c(7, 10, 4, 2) + 0.1) par(bty="o") #captioner::captioner(prefix = "Fig.")

**Summary**

Guide on how to simulate gene expression data from pathway structures defined by graphs from `igraph`

objects.

**Package**

graphsim 1.0.0

This package is designed to balance user-friendliness (by providing sensible defaults and built-in functions) and flexbility (many options are available to use as needed).

If you have problems or feedback, sumbmitting an issue to the the GitHub repository is encouraged. See the DESCRIPTION and README.md for more details on how to suggest changes to the package.

Pathway and graph structures have a wide array of applications. Here we consider the simulation of (log-normalised) gene expression data from a biological pathway. If you have another use for this software you are welcome to apply it to your problem but please bear in mind that it was designed with this application in mind. In principle, normally-distributed continuous data can be generated based on any defined relationships. This package uses the graph structure to define a ∑ covariance matrix and generate simulated data by sampling from a multivariate normal distribution.

Crucially, this allows the simulation of negative correlations based on inhibitory or repressive relationships, as commonly occur in biology (Barabási and Oltvai 2004). A custom plotting function `plot_directed`

is provided to visualise these relationships using the "state" parameter. This plotting function has a dedicated vignette.

For more details on the background of this package, see the paper included with the package on GitHub. This vignette provides more detail on the code needed to reproduce the figures presented in the manuscript.

The package can be installed as follows. Run the following command to install the current release from CRAN (recommended).

#install packages required (once per computer) install.packages("graphsim")

Run the following command to install the development version from GitHub (advanced users). This will import the latest changes to the package so behaviour may be unstable.

#install stable release remotes::install_github("TomKellyGenetics", ref = "master") #install development version remotes::install_github("TomKellyGenetics", ref = "dev")

Once the required packages are installed, we must load the packages required to use the package functions with `igraph`

objects and to generate plots (Csardi and Nepusz 2006). Here `igraph`

is required to create `igraph`

objects and `gplots`

is required for plotting heatmaps.

library("igraph") library("gplots") library("graphsim") library("scales")

Here we set up a simple graph to demonstrate how connections in the graph structure lead to correlations in the final output. We create a simple pathway of 9 genes with various branches.

graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"), c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I")) graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE) plot_directed(graph_structure, layout = layout.kamada.kawai)

A simulated dataset can be generated with a single command. This is all that is required to get started.

expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, comm = FALSE, dist = TRUE, absolute = FALSE) heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4)

Here we've generated a simulated dataset of 100 samples with gene expression data for the genes in the graph shown above. We will show below how these are used to demonstrate what computations are being performed to generate this data from the graph structure given.

Various arguments are supported to alter how the simulated datasets are computed. The other functions used below are called internally by `generate_expression`

and are not needed to compute the final dataset in the above heatmap plot. See the documentation for details.

Here we show the data generated by this graph structure. This demonstrates how several of the available options are used to compute the necessary steps.

The data can be summarised by an "adjacency matrix" where a one (1) is given between a row `i`

and column `j`

if there is an edge between genes `i`

and `j`

. Otherise it is a zero (0) for genes that are not connected. For an undirected graph, edges are shown in a symmetical matrix.

adj_mat <- make_adjmatrix_graph(graph_structure) heatmap.2(make_adjmatrix_graph(graph_structure), scale = "none", trace = "none", col = colorpanel(3, "grey75", "white", "blue"), colsep = 1:4, rowsep = 1:4)

For a directed graph, the adjacency matrix may be assymetric. A non-zero element `adjmat[i,j]`

represents the presence or weight of the edge from gene `i`

(matrix rows) to gene `j`

(matrix columns).

heatmap.2(make_adjmatrix_graph(graph_structure, directed = TRUE), scale = "none", trace = "none", col = colorpanel(3, "grey75", "white", "blue"), colsep = 1:4, rowsep = 1:4)

We can compute the common links between each pair of genes. This shows how many genes are connected to both genes `i`

and `j`

.

comm_mat <- make_commonlink_graph(graph_structure) heatmap.2(make_commonlink_graph(graph_structure), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

This shows how many edges to a shared neighbour these nodes have between them. The diagonal will therefore reflect vertex degree as all edges are counted.

We define commonlinks between each pair of nodes as how many nodes are mutually connected to both of the nodes in the adjacency matrix (how many paths of length 2 exist between them). Note that this weights towards genes with a higher vertex degree (as does the Laplacian).

The Laplacian matrix has the same dimensions as the adjancency matrix. For undirected graphs it is a symmetric matrix but for directed graphs it is not. It has the same number of rows and columns as the number of nodes.

The Laplacian matrix is defined as `laplacian[i,j] = vertex_degree(i)`

if `i == j`

and `laplacian[i,j] = -1`

if `i != j`

. The wieghted Laplacian matrix is defined as `laplacian[i,j] = -wieghts(graph)[i,j]`

for the off-diagonal terms.

laplacian_mat <- make_laplacian_graph(graph_structure) heatmap.2(make_laplacian_graph(graph_structure), scale = "none", trace = "none", col = bluered(50),colsep = 1:4, rowsep = 1:4)

As expected, the off-diagonal terms of the Laplacian are negative integer values and the diagonals reflect the vertex degree.

To compute the relationships between each gene by "distance" we first compute the shortest paths between each pair of nodes, using Dijkstra's algorithm (Prim 1957; Dijkstra 1959).

shortest.paths(graph_structure) heatmap.2(shortest.paths(graph_structure), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), thincolsep = 1:4, rowsep = 1:4)

Here we plot the number of edges in the shortest paths between each pair of nodes in the graph (as an integrer value). Relative to the "diameter" (length of the longest shortest path between any 2 nodes), we can show which genes are more similar or different based on the graph structure.

diam <- diameter(graph_structure) relative_dist <- (1+diam-shortest.paths(graph_structure))/diam relative_dist heatmap.2(relative_dist, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

These relationships are used to create a distance graph relative to the diameter. A relative geometrically decreasing distance is computed as follows. In this case every connected node is weighted in fractions of the diameter.

dist_mat <- make_distance_graph(graph_structure, absolute = FALSE) dist_mat heatmap.2(dist_mat, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

An arithmetically decreasing distance is computed as follows. In this case every connected node is by the length of their shortest paths relative to the diameter.

dist_mat <- make_distance_graph(graph_structure, absolute = TRUE) dist_mat heatmap.2(dist_mat, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

The Sigma (Σ) covariance matrix defines the relationships between the simulated gene distributions. Where the diagonal is one (1), the covariance terms are correlations between each gene. Where possible these are derived from the distance relationships described above. In cases where this is not compatible, the nearest "positive definite" symmetric matrix is computed.

These can be computed directly from an adjacency matrix.

#sigma from adj mat sigma_mat <- make_sigma_mat_graph(graph_structure, 0.8) sigma_mat heatmap.2(sigma_mat, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

A commonlink matrix can also be used to compute a Σ matrix.

#sigma from comm mat sigma_mat <- make_sigma_mat_graph(graph_structure, 0.8, comm = TRUE) sigma_mat heatmap.2(sigma_mat, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

It is recommended to compute the distance relationships and use these. This is supported with the built-in functions. For instance Σ from the geometrically computed distances.

# sigma from geometric distance matrix make_sigma_mat_dist_graph(graph_structure, 0.8, absolute = FALSE) heatmap.2(make_sigma_mat_dist_graph(graph_structure, 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

Sigma (Σ) can also be computed for arithmetically computed distances.

# sigma from absolute distance matrix sigma_mat <- make_sigma_mat_dist_graph(graph_structure, 0.8, absolute = TRUE) heatmap.2(sigma_mat, scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

Here we generate the final simulated expression dataset. Note that none of the prior steps are required. These are called internalled as needed.

For example, the adjacency matrix is derived to generate the following dataset. Note that the nearest positive definite matrix is required for the Σ matrix in this case.

#simulate expression data #adj mat expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, comm = FALSE) heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4) heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

Here we compute a simulated dataset based on common links shared to other nodes.

#comm mat expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, comm =T) heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4) heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

Here we use relative distance (relationships are geometrically weighted to the diameter).

# relative dist expr <- generate_expression(100, graph_structure, cor = 0.8,mean = 0, comm = FALSE, dist = TRUE, absolute = FALSE) heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4) heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "grey75", "red"), colsep = 1:4, rowsep = 1:4)

Here we use absolute distance (relationships are arithmetically weighted to the diameter).

set.seed(9000)

#absolute dist expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, comm = FALSE, dist = TRUE, absolute = TRUE) heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4) heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = bluered(50), colsep = 1:4, rowsep = 1:4)

In summary, we compute the following expression dataset but on these underlying relationships in the graph structure. Here we use geometrically decreasing correlations between more distant nodes in the network. This code is a complete example to compute the following plots.

set.seed(9000)

# activating graph state <- rep(1, length(E(graph_structure))) plot_directed(graph_structure, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_structure, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix sigma_mat <- make_sigma_mat_dist_graph(graph_structure, cor = 0.8, absolute = FALSE) heatmap.2(sigma_mat, scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_structure, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we simulate the same graph structure with inhibiting edges but passing the `"state"`

parameter. This takes one argument for each each to identify which are inhibitory (as documented).

set.seed(9000)

# activating graph state <- c(1, 1, -1, 1, 1, 1, 1, -1) plot_directed(graph_structure, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_structure, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_structure, state, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_structure, state, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_structure)), rowsep = 1:length(V(graph_structure)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we give some toy examples to show how the simulations behave in simple cases. This serves to understand how modules within a larger graph will translate to correlations in the final simulated datasets.

set.seed(9000)

graph_diverging_edges <- rbind(c("A", "B"), c("B", "C"), c("B", "D")) graph_diverging <- graph.edgelist(graph_diverging_edges, directed = TRUE) # activating graph state <- rep(1, length(E(graph_diverging))) plot_directed(graph_diverging, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_diverging, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_diverging, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_diverging, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none",col = bluered(50), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we simulate the same graph structure with inhibiting edges but passing the `"state"`

parameter. This takes one argument for each each to identify which are inhibitory (as documented).

set.seed(9000)

# activating graph state <- c(1, 1, -1) plot_directed(graph_diverging, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_diverging, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_diverging, state, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_diverging, state, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_diverging)), rowsep = 1:length(V(graph_diverging)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

set.seed(9000)

graph_converging_edges <- rbind(c("C", "E"), c("D", "E"), c("E", "F")) graph_converging <- graph.edgelist(graph_converging_edges, directed = TRUE) # activating graph state <- rep(1, length(E(graph_converging))) plot_directed(graph_converging, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_converging, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_converging, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_converging, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we simulate the same graph structure with inhibiting edges but passing the `"state"`

parameter. This takes one argument for each each to identify which are inhibitory (as documented).

set.seed(9000)

# activating graph state <- c(-1, 1, -1) plot_directed(graph_converging, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_converging, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_converging, state, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_converging, state, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_converging)), rowsep = 1:length(V(graph_converging)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

set.seed(9000)

graph_reconnecting_edges <- rbind(c("A", "B"), c("B", "C"), c("B", "D"), c("C", "E"), c("D", "E"), c("E", "F")) graph_reconnecting <- graph.edgelist(graph_reconnecting_edges, directed = TRUE) # activating graph state <- rep(1, length(E(graph_reconnecting))) plot_directed(graph_reconnecting, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_reconnecting, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_reconnecting, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_reconnecting, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

`"state"`

parameter. This takes one argument for each each to identify which are inhibitory (as documented).

set.seed(9000)

# activating graph state <- c(1, 1, -1, -1, 1, 1, 1, 1) plot_directed(graph_reconnecting, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2) mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(graph_reconnecting, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(graph_reconnecting, state, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, graph_reconnecting, state, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(graph_reconnecting)), rowsep = 1:length(V(graph_reconnecting)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Next we demonstrate the simulation procedure based on real biological pathways from the "Reactome" database (Croft *et al*. 2014). We can import these from the `data`

directory included with this package. These graphs are given for examples and convenience.

The following pathways are treated as all relationships are activating.

Here we generate simulated data for the RAF/MAP kinase cascade pathway.

#RAF_MAP_graph <- identity(RAF_MAP_graph) #plot_directed(graph, col.arrow = "#00A9FF", fill.node = "lightblue")

set.seed(9000)

RAF_MAP_graph <- identity(RAF_MAP_graph) # activating graph state <- rep(1, length(E(RAF_MAP_graph))) plot_directed(RAF_MAP_graph, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2, col.arrow = "#00A9FF", fill.node = "lightblue") mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(RAF_MAP_graph, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(RAF_MAP_graph)), rowsep = 1:length(V(RAF_MAP_graph))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(RAF_MAP_graph, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(RAF_MAP_graph)), rowsep = 1:length(V(RAF_MAP_graph))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, RAF_MAP_graph, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(RAF_MAP_graph)), rowsep = 1:length(V(RAF_MAP_graph))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(RAF_MAP_graph)), rowsep = 1:length(V(RAF_MAP_graph)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we generate simulated data for the phosphoinositide-3-kinase (Pi3K) cascade pathway.

set.seed(9000)

graph <- identity(Pi3K_graph) # activating graph state <- rep(1, length(E(Pi3K_graph))) plot_directed(Pi3K_graph, state=state, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2, col.arrow = "#00A9FF", fill.node = "lightblue") mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(Pi3K_graph, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(Pi3K_graph)), rowsep = 1:length(V(Pi3K_graph))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(Pi3K_graph, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(Pi3K_graph)), rowsep = 1:length(V(Pi3K_graph))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, Pi3K_graph, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = state) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(Pi3K_graph)), rowsep = 1:length(V(Pi3K_graph))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(Pi3K_graph)), rowsep = 1:length(V(Pi3K_graph)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we generate simulated data for the phosphoinositide-3-kinase activation of Protein kinase B (PKB) cascade (also known as Pi3k/AKT) pathway. States are imported as edge attributes from the imported graph.

Pi3K_AKT_graph <- identity(Pi3K_AKT_graph) edge_properties <- E(Pi3K_AKT_graph)$state plot_directed(Pi3K_AKT_graph, state = edge_properties, col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[edge_properties], fill.node = c("lightblue"))

set.seed(9000)

Pi3K_AKT_graph <- identity(Pi3K_AKT_graph) Pi3K_AKT_graph <- simplify(Pi3K_AKT_graph, edge.attr.comb = function(x){ ifelse(any(x %in% list(-1, 2, "inhibiting", "inhibition")), 2, 1) }) Pi3K_AKT_graph <- simplify(Pi3K_AKT_graph, edge.attr.comb = "first") edge_properties <- E(Pi3K_AKT_graph)$state # activating graph #state <- rep(1, length(E(Pi3K_AKT_graph))) plot_directed(Pi3K_AKT_graph, state = edge_properties, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2, col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[edge_properties], fill.node = "lightblue") mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(Pi3K_AKT_graph, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red")) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(Pi3K_AKT_graph, state = edge_properties, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red")) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, Pi3K_AKT_graph, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = edge_properties) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red")) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we generate simulated data for the TGFβ-Smad gene regulatory pathway with inhibitions known. States are imported as edge attributes from the imported graph.

TGFBeta_Smad_graph <- identity(TGFBeta_Smad_graph) edge_properties <- E(TGFBeta_Smad_graph)$state plot_directed(TGFBeta_Smad_graph, state = edge_properties, col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[edge_properties], fill.node = c("lightblue"))

set.seed(9000)

TGFBeta_Smad_graph <- identity(TGFBeta_Smad_graph) edge_properties <- E(TGFBeta_Smad_graph)$state # activating graph plot_directed(TGFBeta_Smad_graph, state = edge_properties, layout = layout.kamada.kawai, cex.node=2, cex.arrow=4, arrow_clip = 0.2, col.arrow = c(alpha("navyblue", 0.25), alpha("red", 0.25))[edge_properties], fill.node = "lightblue") mtext(text = "(a) Activating pathway structure", side=1, line=3.5, at=0.075, adj=0.5, cex=1.75) box() #plot relationship matrix heatmap.2(make_distance_graph(TGFBeta_Smad_graph, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "white", "red"), colsep = 1:length(V(TGFBeta_Smad_graph)), rowsep = 1:length(V(TGFBeta_Smad_graph))) mtext(text = "(b) Relationship matrix", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot sigma matrix heatmap.2(make_sigma_mat_dist_graph(TGFBeta_Smad_graph, state = edge_properties, cor = 0.8, absolute = FALSE), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(TGFBeta_Smad_graph)), rowsep = 1:length(V(TGFBeta_Smad_graph))) mtext(text = expression(paste("(c) ", Sigma, " matrix")), side=1, line=3.5, at=0, adj=0.5, cex=1.75) #simulated data expr <- generate_expression(100, TGFBeta_Smad_graph, cor = 0.8, mean = 0, comm = FALSE, dist =TRUE, absolute = FALSE, state = edge_properties) #plot simulated correlations heatmap.2(cor(t(expr)), scale = "none", trace = "none", col = colorpanel(50, "blue", "white", "red"), colsep = 1:length(V(TGFBeta_Smad_graph)), rowsep = 1:length(V(TGFBeta_Smad_graph))) mtext(text = "(d) Simulated correlation", side=1, line=3.5, at=0, adj=0.5, cex=1.75) #plot simulated expression data heatmap.2(expr, scale = "none", trace = "none", col = bluered(50), colsep = 1:length(V(TGFBeta_Smad_graph)), rowsep = 1:length(V(TGFBeta_Smad_graph)), labCol = "") mtext(text = "samples", side=1, line=1.5, at=0.2, adj=0.5, cex=1.5) mtext(text = "genes", side=4, line=1, at=-0.4, adj=0.5, cex=1.5) mtext(text = "(e) Simulated expression data (log scale)", side=1, line=3.5, at=0, adj=0.5, cex=1.75)

Here we have demonstrated that simulated datasets can be generated based on graph structures. These can be either constructed networks for modelling purposes or empirical networks such as those from curated biological databases. Various parameters are available and described in the documentation. You can alter these parameters in the examples given here to see the impact they have on the final network. It is encouraged to try different parameters and examine the results carefully, in addition to carefully considering which assumptions are appropriate for your investigations. A model or simulation is never correct, it is a tool to test your assumptions and find weaknesses in your technique, consider which conditions could your method struggle with and model these. Pathway structure in particular should be considered in biological datasets as correlations within a pathway can lead to false positive results and confounding.

The intended application for this package is modelling RNA-Seq gene expression data. However, other applications are encouraged, provided that they require multivariate normal simulations based on relationships in graph structures.

If you use this package, please cite it where appropriate to recognise the efforts of the developers.

citation("graphsim")

Please see the GitHub repository for reporting problems and requesting changes. Details on how to contribute can be found in the DESCRIPTION and README.

Here is the output of `sessionInfo()`

on the system on which this document was compiled running pandoc 2.1:

```
sessionInfo()
```

Barabási, A. L., and Oltvai, Z. N. 2004. “Network Biology: Understanding the Cell’s Functional Organization.” *Nat Rev Genet* 5 (2): 101–13.

Croft, D., Mundo, A.F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., et al. 2014. “The Reactome pathway knowledgebase.” Journal Article. *Nucleic Acids Res* 42 (database issue): D472–D477. https://doi.org/10.1093/nar/gkt1102.

Csardi, G., and Nepusz, T. 2006. “The Igraph Software Package for Complex Network Research.” *InterJournal* Complex Systems: 1695. https://igraph.org/.

Dijkstra, E.W., 1959. “A note on two problems in connexion with graphs.” *Numerische mathematik* 1 (1): 269–271.

Prim, R.C., 1957. “Shortest connection networks And some generalizations” *Bell System Technical Journal* 36 (6): 1389-1401.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.