knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)

Introduction

The netdist package currently considers two broad methodologies for network comparison, namely Netdis and NetEmd. Netdis considers multiple variants (via background expectations) to capture the dissimilarity between the local structure of networks exhibited by the occurrence of small subgraphs on 3, 4 and 5 nodes. NetEmd is also a method to capture the dissimilarity between networks using subgraph counts, but it has in addition been defined for any type of network features; for example eigen distributions. The variants of Netdis are controlled by the input selected for the background expectations, whereas the variants of NetEmd are controlled indirectly by the user by the selection of the network features being compared via NetEmd (by default this package uses subgraph counts as features).

The following shows a quick introduction to the simplest functions of the package, and to some of the variants of Netdis and NetEmd.

For other vignettes in this package see the "Menu".

Load required packages/libraries

# Load packages/libraries
library("netdist")
library("igraph")

Load protein interaction graphs included in the netdist package

The netdist package also includes examples of a few real networks. These are protein interaction networks (PPI) of a few Herpes virus and of the bacteria Escherichia Coli . In these networks, each node represents a protein and each link represents an interaction between proteins. See help(virusppi).

Although the virusppi list of PPI networks is loaded along with the netdist package, the following code shows how to read a network data from a file in disk:

# Set source directory for Virus protein-protein interaction edge files stored in the netdist package.
source_dir <- system.file(file.path("extdata", "VRPINS"), package = "netdist")

# Load query graphs as undirected igraph objects, with no loops, multiple edges or degree zero nodes.
graph_1 <- read_simple_graph(file = file.path(source_dir, "EBV.txt"),
                             format = "ncol")

graph_2 <- read_simple_graph(file = file.path(source_dir, "ECL.txt"),
                             format = "ncol")

# Herpes virus EBV protein-protein interaction graph with 60 nodes and 208 edges.
graph_1
#Note this graph is the same as
# virusppi$EBV

# Herpes virus ECL protein-protein interaction graph with 1941 nodes and 3989 edges.
graph_2
#Note this graph is the same as
# virusppi$ECL

#A simple visualization of the graphs.
plot(graph_1,vertex.size=4,vertex.label=NA)
plot(graph_2,vertex.size=4,vertex.label=NA)

Other networks loaded in this package are discussed in "NetEmd: World trade networks". You can also see ?virusppi and ?worldtradesub.

The next two sections will provide a basic introduction and usage of both NetEmd and Netdis.

Compare two networks via NetEmd

What is NetEmd?

(Extracted from Wegner et al. (2017)): NetEmd is based on the idea that the information encapsulated in the shape of the degree distribution and other network properties reflects the topological organization of the network. From an abstract point of view, NetEmd views the shape of a distribution as a property that is invariant under linear deformations i.e$.$ translations and re-scalings of the axis.

Based on the previous ideas NetEmd uses the following measure between distributions $p$ and $q$ that are supported on $\mathbb{R}$ and have non-zero, finite variances: \begin{equation}\label{emdmet} EMD^*(p,q)=\mathrm{inf}{c\in\mathbb{R}}\left( EMD\big(\tilde{p}(\cdot+c),\tilde{q}(\cdot)\big)\right), \end{equation} where $EMD$ is the earth mover's distance and $\tilde{p}$ and $\tilde{q}$ are the distributions obtained by rescaling $p$ and $q$ to have variance 1. More precisely, $\tilde{p}$ is the distribution obtained from $p$ by the transformation $x\rightarrow \frac{x}{\sigma(p)}$, where $\sigma(p)$ is the standard deviation of $p$. For probability distributions $p$ and $q$ with support in $\mathbb{R}$ and bounded absolute first moment, the $EMD$ between $p$ and $q$ is given by $EMD(p,q)=\int{-\infty}^\infty|F(x)-G(x)|\,\mathrm{d}x$, where $F$ and $G$ are the cumulative distribution functions of $p$ and $q$ respectively.

Now, for two networks $G$ and $G'$ and for a given set $T={t_1,t_2,...,t_m}$ of network features, the $NetEmd$ measure corresponding to $T$ is: \begin{equation}\label{eq:def_netemd} NetEmd_T(G,G')=\frac{1}{m}\sum_{j=1}^{m} NetEmd_{t_j} (G,G'), \end{equation} where \begin{equation} NetEmd_{t_i} (G,G')=EMD^*(p_{t_i}(G),p_{t_i}(G')), \end{equation} and where $p_{t_i}(G)$ and $p_{t_i}(G')$ are the distributions of ${t_i}$ on $G$ and $G'$ respectively. $NetEmd_{t_i}$ can be shown to be a pseudometric between graphs for any feature $t$, that is it is non-negative, symmetric and satisfies the triangle inequality.

Comparing two graphs with NetEmd

# Set source directory for Virus protein-protein interaction network edge files stored in the netdist package.
source_dir <- system.file(file.path("extdata", "VRPINS"), package = "netdist")

# Load query graphs as igraph objects
# Herpes virus EBV protein-protein interaction graph with 60 nodes and 208 edges.
graph_1 <- read_simple_graph(file.path(source_dir, "EBV.txt"),
                             format = "ncol")

# Herpes virus ECL protein-protein interaction graph with 1941 nodes and 3989 edges.
graph_2 <- read_simple_graph(file.path(source_dir, "ECL.txt"),
                             format = "ncol")

# One to one NetEmd comparison.
netemd_one_to_one(graph_1=graph_1,graph_2=graph_2,feature_type="orbit",smoothing_window_width = 1)#Use of smoothing window 1 is given for discrete integer distributions. If the network features are considered continuous variables smoothing_window_width equal to zero is recommended.

Comparing two graphs with NetEmd via their Laplacian spectrum

#Laplacian
Lapg_1 <- igraph::laplacian_matrix(graph = graph_1,normalized = FALSE,sparse = FALSE)
Lapg_2 <- igraph::laplacian_matrix(graph = graph_2,normalized = FALSE,sparse = FALSE)

#Normalized Laplacian
NLapg_1 <- igraph::laplacian_matrix(graph = graph_1,normalized = TRUE,sparse = FALSE)
NLapg_2 <- igraph::laplacian_matrix(graph = graph_2,normalized = TRUE,sparse = FALSE)

#Spectra (this may take a couple of minutes).
props_1 <- cbind(L.Spectra= eigen(Lapg_1)$values, NL.Spectra= eigen(NLapg_1)$values) 
props_2 <- cbind(L.Spectra= eigen(Lapg_2)$values, NL.Spectra= eigen(NLapg_2)$values) 

head(props_1,n=3)
head(props_2,n=3)

netemd_one_to_one(dhists_1 = props_1,dhists_2 = props_2,smoothing_window_width = 0)#If the network features are considered continuous variables smoothing_window_width equal to zero is recommended.

Compare two networks via Netdis and its variants

What is Netdis?

(Extracted from Ali et al. (2014)): Netdis counts small subgraphs $w$ on $k$ nodes for all 2-step ego-networks, $k=3,4,5$. These counts are centred by subtracting the expected number of counts $E_w$. These centred counts of each network are then compared thus leading to the Netdis statistic.

Netdis is constructed as follows:

Let $N_{w,i}(G)$ be the number of induced occurrences of small graphs $w$ in the 2-step ego network of vertex $i$. Now, bin all 2-step ego-networks of network $G$ according to their network density. Let $E_w(G,\rho)$ be the expected number of occurrences of $w$ in an ego-network whose density falls in density bin $\rho$. For a given network $G$ compute the centred subgraph counts as [ S_w(G)=\sum\limits_{i }{\bigg (N_{w,i}(G)- E_w(G, \rho(i)) \bigg )}, ] where $i$ is a node in $G$ and $\rho(i)$ the density bin of the 2-step ego-network of node $i$.

Now, to compare networks $G_1$ and $G_2$, set $$ \displaystyle netD_2^S(k) = \tfrac{1}{ \sqrt{ M(k)} } \sum\limits_{w \in A(k)} \bigg ({ \tfrac{S_w(G_1) S_w(G_2)} {\sqrt{S_w(G_1)^2 + S_w(G_2)^2}} }\bigg ), \quad k=3,4, 5, $$ where $A(k)$ is the set of connected subgraphs of size $k$, and where $M(k)$ is a normalising constant so that $netD_2^S(k)\in[-1,1]$. $M(k)$ is equal to [ M(k) = \sum\limits_{w \in A(k)} \left( \tfrac{ S_w(G_1)^2 }{\sqrt{S_w(G_1)^2 + S_w(G_2)^2}} \right) \sum\limits_{w \in A(k)} \left(\tfrac{ S_w(G_2)^2 } {\sqrt{S_w(G_1)^2 + S_w(G_2)^2}} \right) . ] The corresponding Netdis statistic is defined as $$Netdis(k)=netd_2^S(k)=\tfrac{1}{2}(1-netD_2^S(k)) \in [0,1].$$ Small values of Netdis suggest higher `similarity' between the networks. By default Netdis uses subgraphs on $k=4$ nodes.

Using Netdis with a gold-standard graph to obtain $E_w$

The selection of a gold-standard graph as a substitute for $E_w$ could be done when such graph is known to be a good proxy for $E_w$, or alternatively as a good reference point for the comparison. This option will focus on detecting discrepancies between the networks relative to the ego-network structure of the reference network / gold-standard graph and which is summarized in $E_w$.

# Lattice graphs to be used as a gold-standard reference point
goldstd_1 <- igraph::graph.lattice(c(8,8)) #Graph with 8^2 nodes
goldstd_2 <- igraph::graph.lattice(c(44,44)) #Graph with 44^2 nodes

plot(goldstd_1,vertex.size=4,vertex.label=NA)
plot(goldstd_2,vertex.size=4,vertex.label=NA)


# Netdis using the goldstd_1 graph as gold-standard reference point
netdis_one_to_one(graph_1= graph_1, graph_2= graph_2,  ref_graph = goldstd_1)

# Netdis using the goldstd_2 graph as gold-standard reference point
netdis_one_to_one(graph_1= graph_1, graph_2= graph_2,  ref_graph = goldstd_2)

Netdis-GP: Using a Geometric-Poisson approximation

(Extracted from Ospina-Forero et al. (2018)): Instead of considering an approximation based on an observed gold-standard network whose selection may be difficult, $E_w$ is computed independently for each graph, based on a Geometric-Poisson (GP) approximation of the distribution of the number of occurrences of subgraph $w$. It assumes that $N_{w,i} \sim GP(\lambda^{\rho(i)}_k, \theta^{\rho(i)}_w)$, where $\lambda^{\rho(i)}_k$ is the Poisson parameter indexed by the size of subgraph $w$ and the density bin $\rho(i)$; and where $\theta^{\rho(i)}_w$ is the geometric parameter indexed by subgraph $w$ and density bin $\rho(i)$. $E_w(G, \rho(i))$ is taken as the mean of the GP approximation, i.e. $\lambda^{\rho(i)}_k/\theta^{\rho(i)}_w$.

As $\lambda^{\rho(i)}k$ and $\theta^{\rho(i)}_w$ are not known, they are estimated as follows: Let $x{w,d}^j$ be the number of subgraphs $w$ on the 2-step ego-network $j$ of density bin $d$, and let [ \bar{X}{w,d}=\frac{1}{q} \sum{j=1}^q x_{w,d}^j, \qquad V^2_{w,d}=\frac{1}{q-1} \sum_{j=1}^q (x_{w,d}^j - \bar{X}{w,d})^2 , ] where $q$ is the number of ego-networks in density bin $d$. Then, [ \hat{\lambda}^{d}{k}= \frac{1}{l} \sum_{h \in A(k)} \frac{2 (\bar{X}{h,d})^2}{V^2{h,d}+\bar{X}{h,d}} , \qquad \hat{\theta}^{d}_w= \frac{2\bar{X}{w,d}}{V^2_{w,d}+\bar{X}_{w,d}}, ] where $l$ is the number of connected subgraphs of size $k$, for example, $l=6$ for $k=4$. These estimators are based on the moment estimators of a GP random variable and the proposal made by (Picard et al.(2008)), where the total count of each individual subgraph could be thought as the sum of the total subgraph counts over multiple ``clumps'' of edges that appear across the network.

This variant focuses on detecting more meso-level discrepancies between the ego-network structures.

#Netdis using the Geometric-Poisson approximation as a way to obtain background expectations. 
netdis_one_to_one(graph_1= graph_1, graph_2= graph_2,  ref_graph = NULL)

Using Netdis with no expectation ($E_w=0$)

Comparing the networks via their observed ego counts without centring them, (equivalent to using expectation equal to zero). This variant thus focus on detecting small discrepancies between the networks.

#Netdis using no expectations (or equivalently, expectation equal to zero).
netdis_one_to_one(graph_1= graph_1, graph_2= graph_2,  ref_graph = 0)

Bibliography



alan-turing-institute/network-comparison documentation built on June 7, 2022, 10:41 p.m.