knitr::opts_chunk$set(collapse = TRUE, comment = "#>") knitr::opts_knit$set(global.par = TRUE)
set.seed(5) par(mar = c(0, 0, 0, 0) + 0.1)
Thank you for your interest in the backbone package! This vignette illustrates how to use this package to extract the backbone of a network. A backbone is a sparse and unweighted subgraph of a network that contains only the most 'important' or 'significant' edges. A backbone can be useful when the original network is too dense, when edge weights are not needed, or when edge weights are difficult to interpret.
The backbone
package can be cited as:
Neal, Z. P. (2022). backbone: An R package to Extract Network Backbones. PLOS ONE, 17. https://doi.org/10.1371/journal.pone.0269137
For additional resources on the backbone package, please see https://www.rbackbone.net/.
If you have questions about the backbone package or would like a backbone hex sticker, please contact the maintainer Zachary Neal by email (zpneal\@msu.edu) or via Twitter (\@zpneal). Please report bugs in the backbone package at https://github.com/zpneal/backbone/issues.
The backbone package can be loaded in the usual way:
library(backbone)
Upon successful loading, a startup message will display that shows the version number, citation, ways to get help, and ways to contact us.
The primary use of the backbone package is backbone extraction, that is, identifying and preserving only the most important edges in a network. Different types of networks require different backbone extraction methods, however the basic workflow using the backbone package is the same for all methods:
You begin with some network data dat
, which may take any of the following forms: an adjacency matrix (as a matrix, Matrix, or Sparse Matrix object), an edge list (as a 2- or 3-column matrix or data frame), or an igraph object.
You run an applicable backbone function on these data (e.g., sdsm(dat)
), which yields an unweighted network of the same class as dat
.
The figure illustrates which functions are applicable for which types of network data, and highlights in blue the analytic workflow and function that is recommended for each type.
If you are unsure about your type of network data or which backbone function to use, you can run suggest.backbone(dat)
. This helper function will examine the data and make a suggestion. If you also specify a significance level or sparsification parameter s
, then the function will extract the backbone using the suggested approach and will display text describing what it did. In all cases, s
can range from 0 to 1, with smaller values yielding sparser backbones. For example:
dat <- matrix(runif(100),10,10) #Some data backbone.suggest(dat) #What should I do? backbone <- backbone.suggest(dat, s = 0.05) #Or, just do it
The sections below describe and illustrate the backbone functions applicable to different types of networks, including weighted bipartite projections, non-projection weighted networks, and unweighted networks. Additionally, the package includes several utility functions that are used by the backbone methods, but may also have independent applications.
narrative
parameter {#narrative}Most functions in the backbone package include a narrative
parameter. When narrative = TRUE
, narrative text describing the backbone extraction will be displayed with the relevant citations. This text is suitable for use in manuscripts, and ensures complete and consistent reporting of the backbone extraction. You can see an example of this type output in the backbone.suggest()
example above, as well as in some of the examples shown below.
The functions in the backbone package support many different data formats that can be used to represent a network or graph in R:
matrix - A bipartite network can be represented as an incidence matrix, and a unipartite matrix can be represented as an adjacency matrix. An incidence/adjacency matrix can be supplied to a backbone extraction function as a matrix object or Matrix object created using the Matrix package.
edgelist - A network can be represented as an edgelist, supplied as a dataframe object. The dataframe should have two or three columns; the first two columns contain the IDs of the sending and receiving nodes, respectively, while the third column (optionally) contains the edge weights.
igraph - A network can be supplied to a backbone extraction function as an igraph object from the igraph package.
Note: Earlier versions of backbone also supported network objects created using the statnet package. To reduce package dependencies, these objects are no longer supported. You can convert a network bipartite graph into an incidence matrix using network::as.matrix.network(graph, type = "incidence")
, and can convert a network unipartite graph into an adjacency matrix using network::as.matrix.network(graph, type = "adjacency")
. In either case, if the graph is weighted, the command should also include attrname = "weight"
.
A weighted unipartite network contains one type of node, and pairs of nodes are connected by edges that have weights. These weights usually indicate the intensity of the relationship between them (e.g., strength, intensity, etc.). A weighted unipartite network can be represented by a matrix W, where W~ij~ indicates the weight of the edge between node i and node j. Weighted networks can be undirected (i.e., W~ij~ = W~ji~) or directed (i.e., W~ij~ != W~ji~).
The goal of backbone extraction from a weighted network is to identify edges in W whose weights are statistically significantly strong (or weak, for signed backbones), and thus should be preserved in the backbone. One simple possibility is to apply a global threshold T, such that all edges with weights stronger than T are preserved, however this approach tends to systematically overlook nodes with low degree. Alternative approaches, such as the disparity filter [@serrano2009extracting], involve comparing edge weights to their expected values under a null model that considers each node's local neighborhood.
To illustrate extracting the backbone from a weighted network, we begin by creating a simulated weighted network, stored in a matrix W
. This network has strongly heterogeneous edge weights; some are quite weak (10) while others are quite strong (100), which gives the network a multi-scalar character. This network might represent the number of passengers flying between each of 10 airports, where there are two hub airports (one large, one small) that each serve their local areas.
W <- matrix(c(0,10,10,10,10,75,0,0,0,0, 10,0,1,1,1,0,0,0,0,0, 10,1,0,1,1,0,0,0,0,0, 10,1,1,0,1,0,0,0,0,0, 10,1,1,1,0,0,0,0,0,0, 75,0,0,0,0,0,100,100,100,100, 0,0,0,0,0,100,0,10,10,10, 0,0,0,0,0,100,10,0,10,10, 0,0,0,0,0,100,10,10,0,10, 0,0,0,0,0,100,10,10,10,0),10)
We could simply examine this network as a weighted network, however this yields a cluttered visualization, and many network analytic techniques (e.g., ERGM, SOAM) are not possible with weighted networks:
weighted <- igraph::graph_from_adjacency_matrix(W, mode = "undirected", weighted = TRUE, diag = FALSE) plot(weighted, edge.width = sqrt(igraph::E(weighted)$weight), vertex.label = NA)
We could extract a backbone by applying a global threshold using the global()
function. One possibility is to extract a backbone that preserves all edges with weights greater than 0. However, this yields a network that is very dense because all edges are preserved:
bb <- global(W, upper = 0, class = "igraph") plot(bb, vertex.label = NA)
Instead, we could choose the threshold value based on the edge weights. For example, we could extract a backbone by using the average edge weight as the global threshold, thus obtaining a backbone that preserves stronger-than-average edges. However, this yields a network that is focused only on high-degree nodes (i.e., the large hub airport), while ignoring the low-degree nodes (i.e., the smaller hub airport):
bb <- global(W, upper = function(x)mean(x), class = "igraph") plot(bb, vertex.label = NA)
A better option is to extract the backbone using a statistical null model, such as the disparity filter (currently, the only available model). Now we can clearly see the hub-and-spoke structure of the transportation system:
bb <- disparity(W, alpha = 0.05, narrative = TRUE, class = "igraph") plot(bb, vertex.label = NA)
A bipartite network contains two types of nodes, which we call agents and artifacts, and edges that exist only between nodes of different types (i.e., an edge may connect an agent to an artifact, but never an agent to another agent). A bipartite network can be represented by a matrix B, where B~ik~ = 1 if agent i is connected to artifact k, and otherwise is 0 (e.g., author i wrote paper k). The row sums of B indicate agent degrees (e.g., the number of papers written by author i), while the column sums of B indicate artifact degrees (e.g., the number of authors on paper k).
A bipartite projection is a network of agents that are connected by the fact that they share artifacts in common. A bipartite projection can be represented by a square symmetric matrix P, which is formed as P = B * B', where B' is the transpose of B. In P, P~ij~ indicates the number of artifacts k that are shared by agents i and j, and can be viewed as the weight of the edge connecting them in the network (e.g., the number of papers co-authored by authors i and j).
The goal of backbone extraction from bipartite projections is to identify edges in P whose weights are statistically significantly strong (or weak, for signed backbones), and thus should be preserved in the backbone. All methods implemented in the backbone package use the same approach for testing the statistical significance of edges:
fixedfill()
preserves the total number of 1s in B, fixedrow()
preserved the row sums of B, and fixedcol()
preserved the column sums of B. Finally, fdsm()
and sdsm()
preserve both the row and column sums of B**; the former preserves them exactly, while the latter preserves them on average.This approach relies on Monte Carlo simulation, and is used by fdsm()
. However, for the other bipartite backbone null models, the distribution of P*~ij~ is known and does not require simulation [@neal2021comparing], which makes them faster. Additionally, this process preserves only edges that are statistically significantly stronger than expected under the chosen null model, thus yielding a binary backbone. However, it is also possible to preserve edges that are statistically significantly weaker, and thus to obtain a signed backbone.
Because all bipartite backbone functions work similarly, here we will use the stochastic degree sequence model and sdsm()
function as an example, applying it to simulated data.
We begin by creating a simulated bipartite network, stored in a matrix B
. This network contains 30 agents (rows) and 75 artifacts (columns). The agents are connected to the artifacts in a way that we might expect if the agents form three cohesive communities. For example, this could represent authors from three distinct disciplines, and the papers they have written. In practice, B
can take the form of a matrix object (as in this example), but could take other forms, such as a bipartite network stored in a statnet
or igraph
class object.
B <- rbind(cbind(matrix(rbinom(250,1,.8),10), matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.2),10)), cbind(matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.8),10), matrix(rbinom(250,1,.2),10)), cbind(matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.8),10)))
Looking at a portion of the bipartite network, we can see that author 1 was an author on papers 2-5, but not paper 1:
B[1:5,1:5]
Looking at the row and column sums, we can also see that each author wrote about 30 papers, and each paper has about 10 authors, but there is some variation:
rowSums(B) colSums(B)
We can construct an ordinary weighted bipartite projection, then plot it using igraph
, but the result is a dense and uninformative network with no obvious communities:
P <- B%*%t(B) plot(igraph::graph_from_adjacency_matrix(P, mode = "undirected", diag = FALSE, weighted = TRUE), vertex.label = NA)
Instead, we can extract the backbone of the weighted bipartite projection using the stochastic degree sequence model. By choosing the SDSM, we are comparing the observed edge weights to those that would be observed under a null model where each of the authors wrote approximately the same number of papers as in B
, and each of the papers have approximately the same number of authors as in B
, but which authors wrote which papers is random:
bb <- sdsm(B, alpha = 0.075, narrative = TRUE, class = "igraph")
Notice that this function is performed on the original bipartite network B
, not on the bipartite projection P
. We use alpha = 0.075
to indicate that edges should be preserved at the 0.075 level of statistical significance; we use an unusually large value in this example only because it is a small toy network. We use narrative = TRUE
to request that suggested manuscript text and citations be displayed. Finally, we use class = "igraph"
to request that the resulting backbone be returned as an igraph
class object, to facilitate further analysis. When we plot the backbone, the three communities are now clearly visible:
plot(bb, vertex.label = NA)
An unweighted unipartite network contains one type of node, and pairs of nodes are either connected or not. An unweighted unipartite network can be represented by a matrix U where U~ij~=1 if nodes i and j are connected, and otherwise is 0. Unweighted networks can be undirected (i.e., W~ij~ = W~ji~) or directed (i.e., W~ij~ != W~ji~).
The goal of backbone extraction from an unweighted network is to identify edges in U that are important to the network's structure, and thus should be preserved in the backbone. Because the network is already unweighted, and the goal is simply to reduce the number of edges, this process is sometimes also called sparsification. The challenge is that, unlike bipartite projections and weighted networks, the edges in unweighted networks do not have weights that can be used to evaluate their importance.
To overcome this challenge, backbone models designed for unweighted networks first assign each edge a score that is intended to capture its importance. Many different possible edge scoring metrics have been proposed. For example, the jaccard coefficient assigns an edge a score that captures the amount of overlap in its two endpoints' neighborhoods. In the context of a social network, the jaccard coefficient of edge U~ij~ would indicate the extent to which i's friends are also j's friends. Similarly, the triangle count has also been proposed as an edge scoring metric, and counts the number of triangles that are completed by an edge. This metric draws on sociological theory (e.g., Simmel, Granovetter) to view triangle-completing edges as important because they form cohesive triads.
Although many sparsification models have been proposed, each one is defined by four characteristics: What metric is used for the edge score? How is the edge score normalized? How are edges filtered based on the (normalized) edge score? Are edges in the union of minimum spanning trees (UMST) preserved to ensure connectivity?
The backbone of an unweighted network can be extracted using the sparsify()
function, which includes parameters that answer each of the four questions that define a sparsification model (escore
, normalize
, filter
, and umst
), as well as a sparsification parameter s
that controls the amount of sparsification.
Specific combinations of these parameters correspond to named sparsification models that have been proposed in the literature. For example, the L-spar sparsification model [@satuluri2011local] can be obtained using sparsify(escore = "jaccard", normalize = "rank", filter = "degree", umst = FALSE)
. For this reason, the backbone package includes several wrappers for the sparsify()
function that return specific named sparsification models. For example, the sparsify.with.lspar()
function is a wrapper that uses the above sparsify()
options.
Different sparsification models are designed to perserve different structural features believed to be embedded in dense networks. To illustrate extracting the backbone from an unweighted network believed to contain communities, we begin by creating a dense unweighted network U1
. This network contains 60 nodes, but is so dense that the presence of communities is obscured:
U.with.communities <- igraph::sbm.game(60, matrix(c(.75,.25,.25,.25,.75,.25,.25,.25,.75),3,3), c(20,20,20)) plot(U.with.communities, vertex.label = NA)
However, the backbone of this dense network can be extracted using the L-spar sparsification model, which clearly reveals a three-community structure:
bb <- sparsify.with.lspar(U.with.communities, s = 0.6, narrative = TRUE) plot(bb, vertex.label = NA)
To illustrate extracting the backbone from an unweighted network believed to contain highly central hub nodes, we begin by creating another dense unweighted network U2
. This network contains 60 nodes, but is so dense that the presence of hubs is obscured:
U.with.hubs <- igraph::as.undirected(igraph::sample_pa(60, m = 3), mode = "collapse") plot(U.with.hubs, vertex.size = igraph::degree(bb), vertex.label = NA) #A hairball
However, the backbone of this dense network can be extracted using the local degree sparsification model [@hamann2016structure], which clearly reveals the hub-and-spoke structure of the network:
bb <- sparsify.with.localdegree(U.with.hubs, s = 0.3, narrative = TRUE) plot(bb, vertex.size = igraph::degree(bb), vertex.label = NA)
If the alpha
parameter is specified when using any of the bipartite projection (e.g., sdsm()
) or weighted (e.g., disparity()
) backbone functions, they return the backbone network. If alpha = NULL
, these functions instead return an S3 backbone-class object that contains p-values for each edge. The backbone.extract()
function can be applied to this object to extract backbones at different alpha significance levels.
This can be particularly useful in the case of backbones extracted using models that are computationally intensive, such as the fixed degree sequence model (FDSM). values To illustrate, we use a simulated bipartite network whose backbone should contain evidence of agents clustered in three communities, and we use fdsm()
with alpha = NULL
to compute edgewise p-values and store them in a backbone object:
B <- rbind(cbind(matrix(rbinom(250,1,.8),10), matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.2),10)), cbind(matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.8),10), matrix(rbinom(250,1,.2),10)), cbind(matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.2),10), matrix(rbinom(250,1,.8),10))) bb.object <- fdsm(B, alpha = NULL, trials = 1000) #Backbone object containing edgewise p-values
We can extract a backbone from this backbone object using a very liberal alpha significance level, which results in many edges being retained:
bb1 <- backbone.extract(bb.object, alpha = 0.5, class = "igraph") #Backbone extracted at alpha = 0.5 plot(bb1, vertex.label = NA)
Then we can extract another backbone, using a more conservative alpha significance level that retains fewer edges, from the same backbone object without needing to recompute the p-values:
bb2 <- backbone.extract(bb.object, alpha = 0.05, class = "igraph") #Backbone extracted at alpha = 0.05 plot(bb2, vertex.label = NA)
Extracting a backbone from a bipartite projection using the fixed degree sequence model and fdsm()
function requires randomly sampling matrices from the space of all binary matrices with given row and column sums, which is performed using the 'fastball' algorithm [@godard2021fastball]. Given a matrix, fastball()
returns a new matrix randomly sampled from the space of all binary matrices with the same row and column sums:
mat <- rbind(c(1,0,0), c(0,0,1), c(0,1,1)) mat fastball(mat)
Extracting a backbone from a bipartite projection using the stochastic degree sequence model and sdsm()
function requires determining the probability that B~ij~ = 1 in the space of all matrices with given row and column sums. For example, if the row sums of B are [1,1,2] and the column sums of B are [1,1,2], then in the space of all matrices with these row and column sums, the probability that B~ij~ = 1 is:
|||| |:--:|:--:|:--:| |.2|.2|.6| |.2|.2|.6| |.6|.6|.8|
These probabilities are usually unknown and must be estimated. The fastest and most accurate way to estimate them [@neal2021comparing] is using the Bipartite Configuration Model [@saracco2015]. Given a matrix mat
with specific row and column sums, bicm()
estimates these probabilities:
mat <- rbind(c(1,0,0), c(0,0,1), c(0,1,1)) bicm(mat)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.