library(parallel) library(ggraph) library(tidygraph) library(tidyr) library(gridExtra) library(RColorBrewer)
This vignette details how to infer a network, using fishes counts from the Fatala River (Barans95 data from the ade4
package), and shows how to represent a network with the EMtree
package.
The data is composed of 33 species abundances measures in 95 samples.
library(ade4) library(tibble) data(baran95) Y = as.matrix(baran95$fau) n = nrow(Y) p = ncol(Y) head(Y[,1:5]) str(Y)
The available covariates are the site and date of the samples. At each date and site, 4 samples were measured, except for kilometer 03 of January 93.
X = as_tibble(baran95$plan) str(X) table(X$date, X$site)
When dealing with abundance data, most modeling approaches switch back to the Gaussian space using different methods. EMtree is an inference procedure which only requires an estimate of a Gaussian covariance matrix, and can be used with any model which either use Gaussian latent variables, Gaussian copulas, or Gaussian data transformations.
The Poisson log-normal model, implemented and estimated in the PLNmodels
package, is easy to use and can handle both covariates and offsets efficiently. Details on covariates and offsets specification can be found here. For this example we use PLNmodels, however the user is free to provide a Gaussian correlation estimate as an input.
We create a PLNmodels
object:
library(PLNmodels) PLNfit<-PLN(Y ~ X$site)
And then run EMtree()
as follows:
library(EMtree) EMtreeFit<-EMtree(PLNfit, maxIter = 20, plot=TRUE) str(EMtreeFit)
The output matrix edges_prob
gathers the probabilities for each edge to be part of the latent random tree of the model. By construction, this matrix is constrained to sum to $(p-1)$ (number of edges in a spanning tree), and to have an average value of $2/p$.
This package involves a lot of switching between symmetric matrices and the vector of their upper triangular part. The functions ToVec()
and ToSym()
help to transition from one representation to the other.
vec_of_proba=ToVec(EMtreeFit$edges_prob) mean(vec_of_proba)-2/p sum(vec_of_proba) - (p-1)
To obtain a network and select edges, we resort to a stability selection strategy.
The function ResampleEMtree()
implements a stability selection of EMtree()
on S sub-samples. This function uses parallel computations with mclapply()
. The Pmat
output gathers all the inferred edges probabilities for each sub-sample ; it is a stack of each ToVec(EMtreeFit$edges_prob)
.
ResampEmtreeFit<-ResampleEMtree(counts=Y, covar_matrix = X$site , S=5, maxIter=10,cond.tol=1e-8, cores=1) str(ResampEmtreeFit)
Another way to specify the covariates matrix is to use the model.matrix() function as follows:
built_covar_matrix=model.matrix(~X$site+X$date)
Then built_covar_matrix
would be the covar_matrix
parameter.
The ResampleEMtree()
function can be used with another model than PLN. To do this, simply provide a function which estimates a Gaussian correlation matrix from counts (using latent variables, copulas or transformations). The function should be passed to ResampleEMtree()
through the user_covariance_estimation
parameter. The covariate matrix will also be a parameter to the user's function. An example using a dummy estimation is available in the help of ResampleEMtree()
.
For the sake of the example we continue with PLN and the "site" covariate only. Edges selection frequencies can be derived from the Pmat
output with the function freq_selec()
. This function simply summarizes the number of times an edge has a probability above the threshold Pt
. A network can then be obtained by thresholding the frequencies, to keep for example edges that are selected in more than $80\%$ of sub-samples:
freqs<-freq_selec(ResampEmtreeFit$Pmat,Pt=0.2) a_first_idea_of_network<-1*(freqs>0.8)
The problem of course is that there is no rationale for the threshold on probabilities Pt
; the optimal threshold is unknown. A way to select the threshold is to compute the stability of the frequencies for any threshold, and then select the desired stability. The function StATS()
adapts the well-known penalty selection strategy StARS in the context of threshold selection.
stab_selection=StATS(ResampEmtreeFit$Pmat, nlambda=50, stab.thresh=0.9,plot=TRUE)
We then keep the frequencies corresponding to the optimal lambda value, which is here at about $exp(-9.5)$. The optimal frequencies are stored in stab_selection$freqs_opt
, the optimal threshold is stab_selection$lambda_opt
.
The optimal frequencies still need to be thresholded in order to obtain a network. However, the task is very much simpler to understand now: the higher the frequency, the more stable the edge along the resamples.
The package EMtree
has one general plotting function for network visualization: draw_network()
. It builds from the ggraph
and tidygraph
packages, and provides with several interesting functionalities, among which:
For this section we consider three networks from the previous inference: the weighted network with all edges selection frequencies, the network obtained when applying a 0.5 threshold on frequencies, and finally the one obtained with a threshold of 0.95.
weighted_net=ToSym(stab_selection$freqs_opt) net_50=ToSym(1*(stab_selection$freqs_opt>0.5)) net_95=ToSym(1*(stab_selection$freqs_opt>0.95))
The function draw_network()
takes a possibly weighted adjacency matrix as input, and represents a network with edges widths proportional to the input weights. Several layouts are available (among which nicely, kk, fr, circle, stress... See igraph documentation or this blog post). Highlighting of nodes possessing among the highest betweenness centrality measure is done with the parameter btw_rank
. To disable this functionality, just set btw_rank=1
.
shade=TRUE
lowers the opacity of edges not linked to a highlighted node. This allows to better see the direct neighborhood of important nodes.
g1<-draw_network(weighted_net,title="Weighted", pal_edges="dodgerblue3",layout="stress", shade = TRUE, btw_rank=3)$G g2<-draw_network(net_50,title="Selected > 50%", pal_edges="dodgerblue3",layout="kk", shade = TRUE, btw_rank=3)$G g3<-draw_network(net_95,title="Selected > 95%", pal_edges="dodgerblue3",layout="fr", shade = FALSE, btw_rank=3)$G grid.arrange(g1, g2, g3, ncol=3)
To store a layout, we need to create it first, using the ggraph package for example.
a_layout = create_layout(net_95, layout="nicely") %>% data.frame()
Then the layout is taken as the stored_layout
parameter of draw_network()
, as follows:
args=list( pal_edges="dodgerblue3") g1<-draw_network(net_50,title="Selected > 50%", pal_edges="dodgerblue3", stored_layout = a_layout, btw_rank=3)$G g2<-draw_network(net_95,title="Selected > 95%", pal_edges="dodgerblue3", stored_layout = a_layout, btw_rank=3)$G grid.arrange(g1, g2, ncol=2)
It is possible to define groups of nodes and edges with node_groups
and edge_groups
respectively, and color them accordingly using pal_nodes
and pal_edges
. For example :
draw_network(net_95, title="Groups of nodes", pal_edges="gray70", node_groups=sample(3,p, replace=TRUE), pal_nodes=brewer.pal(3,"Dark2"),layout="stress", legend=TRUE)$G
draw_network(net_95, title="Groups of edges", pal_nodes="gray70", edge_groups=ToSym(sample(3,p*(p-1)/2,replace=TRUE)), pal_edges=brewer.pal(3,"Dark2"),layout="stress", btw_rank=1)$G
The draw_network()
function outputs the graph G
, but also graph_data
, which is in igraph format and thus useful if you would like to use your own graphic representation, or another package like networkD3
for example.
In the example below, we first create a graph object with nodes labels as the species names. We then extract the edges and nodes information following the tidygraph grammar, and store them in data frames. Edges information are modified so that they are zero-indexed, which is necessary for JavaScript.
Finally we use the networkD3::forceNetwork()
function to obtain an interactive network, where we colored nodes according to their degree.
library(networkD3) graph<-draw_network(net_95,nodes_label = baran95$species.names) edges_data=graph$graph_data %>% activate(edges) %>% as_tibble() %>% mutate(from=from-1, to = to-1) %>% data.frame() nodes_data=graph$graph_data %>% activate(nodes) %>% data.frame() forceNetwork(Links = edges_data, Nodes =nodes_data, Source = 'from', Target = 'to', NodeID = 'name', Group = 'deg',opacity = 1,fontSize = 15, legend=TRUE)
Another way to do this is to use the networkD3::igraph_to_networkD3()
, which takes advantage of the igraph nature of the graph_data
output. It gives the right format and automatically re-index the edges, but requires to manually add the degrees or any other information on node and edges.
degrees=graph$graph_data %>% activate(nodes) %>% dplyr::select(deg) %>% pull() graph_d3 <- igraph_to_networkD3(graph$graph_data, group=degrees) forceNetwork(Links = graph_d3$links, Nodes =graph_d3$nodes, Source = 'source', Target = 'target', NodeID = 'name', Group = 'group',opacity = 1,fontSize = 15, legend=TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.