Using the R package `motifcluster`

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#> "
)
set.seed(200401293)
old_options = options()
options(digits = 2)
library(mclust)

\tableofcontents \pagebreak

Introduction

This vignette demonstrates how to use the R package motifcluster. These methods are detailed in the paper Motif-Based Spectral Clustering of Weighted Directed Networks, which is available at arXiv:2004.01293 [@UnderwoodElliottCucuringu_2020_Motifs]. The functionality of the motifcluster package falls into a few main categories:

- building motif adjacency matrices
- sampling random weighted directed networks
- spectral embedding with motif adjacency matrices
- motif-based spectral clustering

This vignette comprehensively demonstrates all of these functionalities, showcasing the full capability of the motifcluster package. We first load some helpful packages for this tutorial:

library(devtools)
library(mclust)

The motifcluster package can be installed from GitHub with:

install_github("wgunderwood/motifcluster/R")

The package can then be loaded from within R with:

library(motifcluster)

Building motif adjacency matrices

The main novelty in the motifcluster package is its ability to build a wide variety of motif adjacency matrices (MAMs), and to do so quickly. There are several options to consider when building an MAM, which are covered in this section.

An example network

In order to demonstrate the construction of MAMs, we first need a small weighted directed network $\mathcal{G}_1$ to use as an example. Note that throughout this package we represent networks by their weighted directed adjacency matrices (possibly in sparse form). This means that for use alongside R packages such as igraph, one must manually convert between adjacency matrices and igraph objects.

G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
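As noted above, conversion between adjacency matrices and igraph objects must be done manually. Here is a hedged sketch of that round trip, assuming the igraph package is installed (the chunk is guarded so it is skipped otherwise):

```r
# Sketch: converting the adjacency matrix G1 to an igraph object and back.
# Assumes the igraph package is available; this is not motifcluster functionality.
G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)

if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_adjacency_matrix(G1, mode = "directed", weighted = TRUE)
  G1_back <- as.matrix(igraph::as_adjacency_matrix(g, attr = "weight"))
  stopifnot(all(G1_back == G1))  # round trip recovers the original matrix
}
```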

Basic motif adjacency matrix construction

The build_motif_adjacency_matrix function is the main workhorse for building MAMs with motifcluster. Let's use it to build an MAM for the network $\mathcal{G}_1$. First we must choose a motif to look for. A full list can be obtained with:

get_motif_names()

Let's use the 3-cycle motif $\mathcal{M}_1$.

build_motif_adjacency_matrix(G1, motif_name = "M1")

Note that all the entries are zero except for entries $(1,2)$, $(1,4)$, $(2,1)$, $(2,4)$, $(4,1)$, and $(4,2)$. This is because vertices 1, 2 and 4 form an exact copy of the motif $\mathcal{M}_1$ in the network $\mathcal{G}_1$, and the $(i,j)$th MAM entry simply counts the number of instances containing both vertices $i$ and $j$.
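To see concretely what is being counted, here is a brute-force base-R sketch (our own illustrative code, not the package's algorithm, which uses much faster matrix-based formulas) that finds structural 3-cycle instances in $\mathcal{G}_1$ and accumulates them into an MAM:

```r
# Brute-force structural 3-cycle MAM for G1 (illustrative only)
G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
A <- G1 > 0  # unweighted edge indicators
n <- nrow(A)
mam <- matrix(0, n, n)
for (i in 1:(n - 2)) for (j in (i + 1):(n - 1)) for (k in (j + 1):n) {
  trio <- c(i, j, k)
  # a structural 3-cycle has exactly the three cycle edges and no others
  cyc1 <- A[i, j] && A[j, k] && A[k, i] && !A[j, i] && !A[k, j] && !A[i, k]
  cyc2 <- A[i, k] && A[k, j] && A[j, i] && !A[k, i] && !A[j, k] && !A[i, j]
  if (cyc1 || cyc2) {
    # each instance adds one to every pair of vertices it contains
    mam[trio, trio] <- mam[trio, trio] + 1
  }
}
diag(mam) <- 0  # match the zero diagonal of the output above
```

Only the instance on vertices 1, 2 and 4 is found, reproducing the six nonzero entries above.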

Functional and structural motif adjacency matrices

Looking at our example network $\mathcal{G}_1$ again, you might notice that there is seemingly another instance of the motif $\mathcal{M}_1$ on the vertices 2, 3 and 4, albeit with an "extra" edge from vertex 2 to vertex 3. It was not counted because by default build_motif_adjacency_matrix looks for structural motif instances, which must match the motif exactly, with no extra edges between their vertices. If we want to also include instances which may have extra edges present, we must instead use functional motif instances:

build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func")

This time we also pick up the 3-cycle on vertices 2, 3 and 4. Vertices 2 and 4 therefore occur in two distinct instances of the motif, and so their motif adjacency matrix entries are equal to two.

Weighted motif adjacency matrices

Our example network $\mathcal{G}_1$ has weighted edges, which we have not yet used: so far our MAMs have been simply counting instances of motifs. This is because the default weighting scheme is "unweighted", assigning every instance a weight of one.

Mean-weighted instances

We could instead use the "mean" weighting scheme, where every instance is assigned a weight equal to its mean edge weight. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$:

build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "mean")

The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so its mean edge weight is 3. Similarly the 3-cycle on vertices 2, 3 and 4 has mean edge weight of 4. Vertices 2 and 4 appear in both, so their mutual MAM entries are the sum of these two mean weights, which is 7.
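This arithmetic can be checked directly, with the edge weights of each instance written out by hand from $\mathcal{G}_1$:

```r
# Edge weights of the two functional 3-cycle instances in G1
w_124 <- c(2, 3, 4)  # edges 1->2, 2->4, 4->1
w_234 <- c(3, 5, 4)  # edges 2->4, 4->3, 3->2
mean(w_124)                # 3
mean(w_234)                # 4
mean(w_124) + mean(w_234)  # 7: MAM entry (2,4), since 2 and 4 lie in both
```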

Product-weighted instances

We can also use the "product" weighting scheme, where every instance is assigned a weight equal to the product of its edge weights. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$:

build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "product")

The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so the product of its edge weights is 24. Similarly the 3-cycle on vertices 2, 3 and 4 has product of edge weights of 60. Vertices 2 and 4 appear in both, so their shared MAM entries are the sum of these two product weights, which is 84.
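Again the arithmetic is easy to verify by hand:

```r
# Edge weights of the two functional 3-cycle instances in G1
w_124 <- c(2, 3, 4)  # edges 1->2, 2->4, 4->1
w_234 <- c(3, 5, 4)  # edges 2->4, 4->3, 3->2
prod(w_124)                # 24
prod(w_234)                # 60
prod(w_124) + prod(w_234)  # 84: MAM entry (2,4), since 2 and 4 lie in both
```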

Computation method

The final argument to build_motif_adjacency_matrix is the mam_method argument. This does not affect the value returned but may impact the amount of time taken to return the MAM. In general the "sparse" method (the default) is faster on large sparse networks. The "dense" method, which uses fewer operations but on denser matrices, tends to be faster for small dense networks.

mam_sparse <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "sparse")
mam_dense  <- build_motif_adjacency_matrix(G1, motif_name = "M1", mam_method = "dense")
all(mam_sparse == mam_dense)

Sampling random weighted directed networks

Building adjacency matrices by hand is tedious, so it is useful to have methods for generating the adjacency matrices of networks drawn from some probabilistic model. We use (weighted) directed stochastic block models (DSBMs) and (weighted) bipartite stochastic block models (BSBMs).

Directed stochastic block models

First let's sample the adjacency matrix of a DSBM which has two blocks of vertices; the first containing five vertices and the second containing three. We use strong within-block connections, with the diagonal entries of the connection matrix set to $0.9$. The between-block connections are weaker, with the off-diagonal connection matrix entries set to $0.2$. Note how the resulting adjacency matrix is denser on its diagonal blocks $\{1, \dots, 5\} \times \{1, \dots, 5\}$ and $\{6, \dots, 8\} \times \{6, \dots, 8\}$, and is sparser on its off-diagonal blocks $\{1, \dots, 5\} \times \{6, \dots, 8\}$ and $\{6, \dots, 8\} \times \{1, \dots, 5\}$. The entries which lie exactly on the diagonal will always be zero, since we only consider networks without self-loops.

block_sizes = c(5, 3)
connection_matrix = matrix(c(
  0.9, 0.2,
  0.2, 0.9
), nrow = 2, byrow = TRUE)

\pagebreak

sample_dsbm(block_sizes, connection_matrix)
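As a sanity check on these parameters, the expected numbers of directed edges in each block can be computed by hand (our own back-of-envelope arithmetic, not a package function):

```r
# Expected directed edge counts under the DSBM parameters above
n1 <- 5; n2 <- 3
p_within <- 0.9; p_between <- 0.2
exp_block_11 <- n1 * (n1 - 1) * p_within  # within block 1 (no self-loops): 18
exp_block_22 <- n2 * (n2 - 1) * p_within  # within block 2: 5.4
exp_between  <- 2 * n1 * n2 * p_between   # between blocks, both directions: 6
```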

Constant-weighted directed stochastic block models

The matrix above has binary entries, indicating that it is the adjacency matrix of an unweighted directed network. The motifcluster package also allows sampling of weighted directed networks. The simplest example of this is "constant" weighting, where we simply multiply each block of the adjacency matrix by a constant.

weight_matrix = matrix(c(
  5, 2,
  2, 5
), nrow = 2, byrow = TRUE)
sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "constant")

Poisson-weighted directed stochastic block models

We can also use weights drawn randomly from a Poisson distribution, where each block in the adjacency matrix has its own mean parameter. This returns an adjacency matrix whose weights can be any non-negative integer, and which is equal in expectation to the constant version. Note that in this scheme it is possible for a weight to be zero, removing an edge which might otherwise have been present.

sample_dsbm(block_sizes, connection_matrix, weight_matrix, sample_weight_type = "poisson")
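A quick base-R simulation illustrates the two properties just mentioned: Poisson weights have the stated mean, and a weight of zero (which deletes the edge) occurs with positive probability. The mean parameter here is our own illustrative choice:

```r
# Simulate Poisson edge weights with an illustrative mean of 5
set.seed(1)
lambda <- 5
w <- rpois(1e5, lambda)
mean(w)       # close to lambda
mean(w == 0)  # proportion of "deleted" edges, about exp(-lambda)
```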

Bipartite stochastic block models

The motifcluster package can also be used to sample bipartite networks. The vertices of a bipartite network are partitioned into "source" and "destination" vertices, and edges are only permitted to go from source vertices to destination vertices. Let's sample a BSBM with a single block of two source vertices and two blocks of destination vertices, with sizes of three and two respectively. We can use a strong connection probability of 0.9 to the first block of destination vertices, and a weaker connection probability of 0.2 to the second.

source_block_sizes = c(2)
destination_block_sizes = c(3, 2)
bipartite_connection_matrix = matrix(c(0.9, 0.2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes, bipartite_connection_matrix)

Weighted bipartite stochastic block models

Similarly to the more general directed stochastic block models, we can also use constant-weighted or Poisson-weighted edges for bipartite stochastic block models.

bipartite_weight_matrix = matrix(c(7, 2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes,
  bipartite_connection_matrix, bipartite_weight_matrix,
  sample_weight_type = "poisson")

Spectral embedding with motif adjacency matrices

Spectral methods involve performing eigenvalue and eigenvector operations on matrices related to networks. We work here with weighted undirected networks (which have symmetric adjacency matrices), because motif adjacency matrices are always symmetric.

Laplacian matrices

We can construct two types of Laplacian matrix for a network using the motifcluster package. First we create an example of a weighted undirected network $\mathcal{G}_2$.

\pagebreak

G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)

Combinatorial Laplacian

The combinatorial Laplacian of an adjacency matrix $G$ is $L_\mathrm{c} = D - G$, where $D$ is the diagonal matrix of weighted vertex degrees:

build_laplacian(G2, type_lap = "comb")
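The same matrix is easy to reproduce from the definition in base R; note that each row of $L_\mathrm{c}$ sums to zero:

```r
# Combinatorial Laplacian of G2 from the definition L_c = D - G
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
D <- diag(rowSums(G2))  # diagonal matrix of weighted degrees
L_comb <- D - G2
rowSums(L_comb)         # all zero
```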

Random-walk Laplacian

The random-walk Laplacian of an adjacency matrix $G$ is $L_\mathrm{rw} = I - D^{-1}G$, where $D$ is the diagonal matrix of weighted vertex degrees and $I$ is the identity matrix:

build_laplacian(G2, type_lap = "rw")
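Again this can be reproduced from the definition; the rows of $L_\mathrm{rw}$ also sum to (numerically) zero, since each row of $D^{-1}G$ sums to one:

```r
# Random-walk Laplacian of G2 from the definition L_rw = I - D^{-1} G
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
D_inv <- diag(1 / rowSums(G2))
L_rw <- diag(4) - D_inv %*% G2
rowSums(L_rw)  # all numerically zero
```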

Laplace embedding

Once we have constructed the desired Laplacian matrix, we use it to embed each vertex into $\mathbb{R}^l$ by finding the eigenvectors associated with its first (smallest magnitude) few eigenvalues. Below we use the random-walk Laplacian, and embedding dimension $l=2$:

spectrum = run_laplace_embedding(G2, num_eigs = 2, type_lap = "rw")
spectrum$vals
spectrum$vects

For a random-walk Laplacian, the first eigenvalue is always zero (up to machine precision) and its corresponding eigenvector is constant.
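This can be verified with base R's eigen function on the random-walk Laplacian of $\mathcal{G}_2$ (for a non-symmetric matrix, eigen may return values in complex form, hence the abs and mean below):

```r
# Check: the random-walk Laplacian has a zero eigenvalue with constant eigenvector
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
L_rw <- diag(4) - diag(1 / rowSums(G2)) %*% G2
e <- eigen(L_rw)
idx <- which.min(abs(e$values))
abs(e$values[idx])     # numerically zero
v <- e$vectors[, idx]
max(abs(v - mean(v)))  # numerically zero: the eigenvector is constant
```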

Motif embedding

Motif embedding is simply the process of building an MAM and performing Laplace embedding with it. As an example we use the run_motif_embedding function on the network $\mathcal{G}_3$ below.

G3 <- matrix(c(
  0, 0, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)

An artifact of building MAMs is that although the original network may be connected, there is no guarantee that the MAM is also connected. Hence the MAM is restricted to its largest connected component before the Laplacian is formed. We observe this with the network $\mathcal{G}_3$, in which only three of the four vertices are embedded.

spectrum = run_motif_embedding(G3, motif_name = "M1", motif_type = "func",
  mam_weight_type = "unweighted", mam_method = "sparse", num_eigs = 2,
  restrict = TRUE, type_lap = "rw")
spectrum$vals
spectrum$vects
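The reason vertex 1 is dropped can be seen directly from $\mathcal{G}_3$: it has no outgoing edges, so it cannot lie on any 3-cycle, and its MAM row is therefore zero:

```r
# Out-degrees of the vertices of G3
G3 <- matrix(c(
  0, 0, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
rowSums(G3 > 0)  # vertex 1 has out-degree zero
```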

Motif-based spectral clustering

The overall aim of motifcluster is to use motifs for spectral clustering, so now we see how to extract clusters from the motif-based eigenvector embeddings. The run_motif_clustering function handles the entire process of building an MAM, restricting it to its largest connected component, performing eigenvector embedding, and extracting clusters. We therefore take the opportunity to showcase the ability of motifcluster to recover the blocks of a DSBM, demonstrating all of the methods outlined in this vignette.

Let's use a DSBM with three blocks of 10 vertices each.

block_sizes = rep(10, 3)

We use strong connections of 0.8 within the blocks, and weaker connections of 0.2 between the blocks.

connection_matrix = matrix(c(
  0.8, 0.2, 0.2,
  0.2, 0.8, 0.2,
  0.2, 0.2, 0.8
), nrow = 3)

We also set the within-block edges to be Poisson-weighted with mean 20, and the between-block edges to be Poisson-weighted with smaller mean 10.

weight_matrix = matrix(c(
  20, 10, 10,
  10, 20, 10,
  10, 10, 20
), nrow = 3)

G4 = sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "poisson")

Now we can run the motif-based spectral clustering algorithm on this network with the 3-cycle motif $\mathcal{M}_1$. We build a functional MAM, weighting the instances by their mean edge weights, using the sparse formulation. We restrict this MAM to its largest connected component. Then we construct a random-walk Laplacian and embed it using the first four eigenvalues and eigenvectors. Finally we extract three clusters.

motif_cluster = run_motif_clustering(G4, motif_name = "M1", motif_type = "func",
  mam_weight_type = "mean", mam_method = "sparse", type_lap = "rw", num_eigs = 4,
  num_clusts = 3
)

We can evaluate the performance by comparing it to the ground-truth labels using the adjusted Rand index from the mclust package:

truth = c(rep(1, 10), rep(2, 10), rep(3, 10))
mclust::adjustedRandIndex(motif_cluster$clusts, truth)

A larger value indicates better recovery of the blocks, with a value of one indicating perfect agreement.

References

options(old_options)



motifcluster documentation built on April 30, 2022, 1:06 a.m.