```
knitr::opts_chunk$set(collapse = TRUE, comment = "#> ")
set.seed(200401293)
old_options = options()
options(digits = 2)
library(mclust)
```

\tableofcontents \pagebreak

This vignette demonstrates how to use the R package `motifcluster`.
These methods are detailed in the paper
*Motif-Based Spectral Clustering of Weighted Directed Networks*,
which is available at `arXiv:2004.01293`
[@UnderwoodElliottCucuringu_2020_Motifs].
The functionality of the `motifcluster` package
falls into a few main categories:

- Building motif adjacency matrices
- Sampling random weighted directed networks
- Spectral embedding with motif adjacency matrices
- Motif-based spectral clustering

This vignette comprehensively demonstrates all of these functionalities,
showcasing the full capability of the `motifcluster` package.
We first load some helpful packages for this tutorial:

```
library(devtools)
library(mclust)
```

The `motifcluster` package can be installed from GitHub with:

```
install_github("wgunderwood/motifcluster/R")
```

The package can then be loaded from within R with:

```
library(motifcluster)
```

The main novelty in the `motifcluster` package is its ability to
build a wide variety of motif adjacency matrices (MAMs), and to do so quickly.
There are several options to consider when building an MAM,
which are covered in this section.

In order to demonstrate the construction of MAMs,
we first need a small weighted directed network
$\mathcal{G}_1$ to use as an example.
Note that throughout this package we represent networks by their
weighted directed adjacency matrices (possibly in sparse form).
This means that for use alongside R packages such as `igraph`,
one must manually convert between adjacency matrices and
`igraph` objects.

```
G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
```

The `build_motif_adjacency_matrix` function is the main workhorse
for building MAMs with `motifcluster`.
Let's use it to build an MAM for the network $\mathcal{G}_1$.
First we must choose a motif to look for.
A full list can be obtained with:

```
get_motif_names()
```

Let's use the 3-cycle motif $\mathcal{M}_1$.

```
build_motif_adjacency_matrix(G1, motif_name = "M1")
```

Note that all the entries are zero except for entries $(1,2)$, $(1,4)$, $(2,1)$, $(2,4)$, $(4,1)$, and $(4,2)$. This is because vertices 1, 2 and 4 form an exact copy of the motif $\mathcal{M}_1$ in the network $\mathcal{G}_1$, and the $(i,j)$th MAM entry simply counts the number of instances containing both vertices $i$ and $j$.
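The counting logic above can be checked by hand in base R, without any `motifcluster` functions; the helper `is_3cycle` below is purely illustrative:

```r
# Base-R check that vertices 1, 2 and 4 form a 3-cycle in G1:
# the edges 1 -> 2, 2 -> 4 and 4 -> 1 must all be present.
G1 <- matrix(c(
  0, 2, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)

# Illustrative helper: does (i, j, k) trace a directed 3-cycle?
is_3cycle <- function(adj, i, j, k) {
  adj[i, j] > 0 && adj[j, k] > 0 && adj[k, i] > 0
}

is_3cycle(G1, 1, 2, 4)  # TRUE
```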

Looking at our example network $\mathcal{G}_1$ again,
you might notice that there is seemingly another instance
of the motif $\mathcal{M}_1$
in our network $\mathcal{G}_1$,
on the vertices 2, 3 and 4,
albeit with an "extra" edge from 2 to 3.
The reason for this is that we instructed
`build_motif_adjacency_matrix`
to look for *structural* motif instances (this is the default).
Structural instances require an exact match, with no extra edges.
If we want to also include instances which may have extra edges present,
we must instead use functional motif instances:

```
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func")
```

This time we also pick up the 3-cycle on vertices 2, 3 and 4. Vertices 2 and 4 therefore occur in two distinct instances of the motif, and so their motif adjacency matrix entries are equal to two.

Our example network $\mathcal{G}_1$ has weighted edges, which we have not yet used: so far our MAMs have been simply counting instances of motifs. This is because the default weighting scheme is "unweighted", assigning every instance a weight of one.

We could instead use the "mean" weighting scheme, where every instance is assigned a weight equal to its mean edge weight. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$:

```
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "mean")
```

The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so its mean edge weight is 3. Similarly the 3-cycle on vertices 2, 3 and 4 has mean edge weight of 4. Vertices 2 and 4 appear in both, so their mutual MAM entries are the sum of these two mean weights, which is 7.
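This arithmetic can be verified directly in base R; the edge-weight vectors below are copied by hand from $\mathcal{G}_1$:

```r
# Mean instance weights, computed by hand from the edge weights of G1
w_124 <- mean(c(2, 3, 4))  # 3-cycle on 1, 2, 4: edges 1->2, 2->4, 4->1
w_234 <- mean(c(3, 5, 4))  # 3-cycle on 2, 3, 4: edges 2->4, 4->3, 3->2
w_124 + w_234              # 7: the MAM entry shared by vertices 2 and 4
```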

We can also use the "product" weighting scheme, where every instance is assigned a weight equal to the product of its edge weights. The $(i,j)$th MAM entry is then defined as the sum of these instance weights across all instances containing both vertices $i$ and $j$:

```
build_motif_adjacency_matrix(G1, motif_name = "M1", motif_type = "func",
  mam_weight_type = "product")
```

The 3-cycle on vertices 1, 2 and 4 has edge weights of 2, 3 and 4, so the product of its edge weights is 24. Similarly the 3-cycle on vertices 2, 3 and 4 has product of edge weights of 60. Vertices 2 and 4 appear in both, so their shared MAM entries are the sum of these two product weights, which is 84.
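Again this can be checked by hand in base R, using the same edge weights as before:

```r
# Product instance weights, computed by hand from the edge weights of G1
p_124 <- prod(c(2, 3, 4))  # 3-cycle on 1, 2, 4: product is 24
p_234 <- prod(c(3, 5, 4))  # 3-cycle on 2, 3, 4: product is 60
p_124 + p_234              # 84: the MAM entry shared by vertices 2 and 4
```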

The final argument to `build_motif_adjacency_matrix`
is the `mam_method` argument.
This does not affect the value returned but may impact the
amount of time taken to return the MAM.
In general the "sparse" method (the default)
is faster on large sparse networks.
The "dense" method, which uses fewer operations but on
denser matrices, tends to be faster for small dense networks.

```
mam_sparse <- build_motif_adjacency_matrix(G1, motif_name = "M1",
  mam_method = "sparse")
mam_dense <- build_motif_adjacency_matrix(G1, motif_name = "M1",
  mam_method = "dense")
all(mam_sparse == mam_dense)
```

Building adjacency matrices by hand is tedious, so it is useful to have methods for generating the adjacency matrices of networks drawn from some probabilistic model. We use (weighted) directed stochastic block models (DSBMs) and (weighted) bipartite stochastic block models (BSBMs).

First let's sample the adjacency matrix of a DSBM which has two blocks of vertices; the first containing five vertices and the second containing three. We use strong within-block connections, with the diagonal entries of the connection matrix set to $0.9$. The between-block connections are weaker, with the off-diagonal connection matrix entries set to $0.2$. Note how the resulting adjacency matrix is denser on its diagonal blocks $\{1, \dots, 5\} \times \{1, \dots, 5\}$ and $\{6, \dots, 8\} \times \{6, \dots, 8\}$, and is sparser on its off-diagonal blocks $\{1, \dots, 5\} \times \{6, \dots, 8\}$ and $\{6, \dots, 8\} \times \{1, \dots, 5\}$. The entries which lie exactly on the diagonal will always be zero, since we only consider networks without self-loops.

```
block_sizes = c(5, 3)
connection_matrix = matrix(c(
  0.9, 0.2,
  0.2, 0.9
), nrow = 2, byrow = TRUE)
```

\pagebreak

```
sample_dsbm(block_sizes, connection_matrix)
```

The matrix above has binary entries,
indicating that it is the adjacency matrix of an unweighted
directed network.
The `motifcluster` package also allows sampling of weighted directed networks.
The simplest example of this is "constant" weighting,
where we simply multiply each block of the adjacency matrix by a constant.

```
weight_matrix = matrix(c(
  5, 2,
  2, 5
), nrow = 2, byrow = TRUE)
sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "constant")
```

We can also use weights drawn randomly from a Poisson distribution, where each block in the adjacency matrix has its own mean parameter. This returns an adjacency matrix with weights which could be any natural number, but is equal in expectation to the constant version. Note that in this scheme it is possible for the weight to be zero, removing an edge which might have otherwise been present.

```
sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "poisson")
```
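The claims about expectation and zero weights can be illustrated with base R's `rpois` alone; the mean of 5 used here is an arbitrary choice for illustration:

```r
# Poisson-distributed edge weights: equal in expectation to a constant
# weight of 5, but sometimes zero (which removes the edge entirely).
set.seed(1)
w <- rpois(1e5, lambda = 5)
mean(w)       # close to 5, matching the constant-weighted version on average
mean(w == 0)  # a small but nonzero fraction of weights are zero
```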

The `motifcluster` package can also be used to sample
bipartite networks.
The vertices of a bipartite network are partitioned
into "source" and "destination" vertices,
and edges are only permitted to go from source vertices
to destination vertices.
Let's sample a BSBM with a single block of two source vertices
and two blocks of destination vertices, with
sizes of three and two respectively.
We can use a strong connection probability of 0.9 to the first block of
destination vertices,
and a weaker connection probability of 0.2 to the second.

```
source_block_sizes = c(2)
destination_block_sizes = c(3, 2)
bipartite_connection_matrix = matrix(c(0.9, 0.2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes,
  bipartite_connection_matrix)
```

Similarly to the more general directed stochastic block models, we can also use constant-weighted or Poisson-weighted edges for bipartite stochastic block models.

```
bipartite_weight_matrix = matrix(c(7, 2), nrow = 1)
sample_bsbm(source_block_sizes, destination_block_sizes,
  bipartite_connection_matrix, bipartite_weight_matrix,
  sample_weight_type = "poisson")
```

Spectral methods involve performing eigenvalue and
eigenvector operations on matrices related to networks.
We work here with weighted *undirected* networks
(which have symmetric adjacency matrices),
because motif adjacency matrices are always symmetric.

We can construct two types of Laplacian matrix for a network
using the `motifcluster` package.
First we create an example of a weighted undirected network $\mathcal{G}_2$.

\pagebreak

```
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
```

The combinatorial Laplacian of an adjacency matrix $G$ is $L_\mathrm{c} = D - G$, where $D$ is the diagonal matrix of weighted vertex degrees:

build_laplacian(G2, type_lap = "comb")

The random-walk Laplacian of an adjacency matrix $G$ is $L_\mathrm{rw} = I - D^{-1}G$, where $D$ is the diagonal matrix of weighted vertex degrees and $I$ is the identity matrix:

build_laplacian(G2, type_lap = "rw")

Once we have constructed the desired Laplacian matrix, we use it to embed each vertex into $\mathbb{R}^l$ by finding the eigenvectors associated with its first (smallest magnitude) few eigenvalues. Below we use the random-walk Laplacian, and embedding dimension $l=2$:

```
spectrum = run_laplace_embedding(G2, num_eigs = 2, type_lap = "rw")
spectrum$vals
spectrum$vects
```

For a random-walk Laplacian, the first eigenvalue is always zero (up to machine precision) and its corresponding eigenvector is constant.
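This property can be demonstrated with base R's `eigen` on a hand-built random-walk Laplacian of $\mathcal{G}_2$:

```r
# The random-walk Laplacian of G2 has a zero eigenvalue whose
# eigenvector is constant.
G2 <- matrix(c(
  0, 2, 0, 0,
  2, 0, 4, 3,
  0, 4, 0, 5,
  0, 3, 5, 0
), byrow = TRUE, nrow = 4)
L_rw <- diag(nrow(G2)) - diag(1 / rowSums(G2)) %*% G2

eig <- eigen(L_rw)
vals <- Re(eig$values)    # eigenvalues of L_rw are real
idx <- which.min(abs(vals))
vals[idx]                 # zero, up to machine precision
v <- Re(eig$vectors[, idx])
v / v[1]                  # a constant vector of ones
```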

Motif embedding is simply the process of building an MAM
and performing Laplace embedding with it.
As an example we use the `run_motif_embedding`
function on the network $\mathcal{G}_3$ below.

```
G3 <- matrix(c(
  0, 0, 0, 0,
  0, 0, 2, 3,
  0, 4, 0, 0,
  4, 0, 5, 0
), byrow = TRUE, nrow = 4)
```

An artifact of building MAMs is that although the original network may be connected, there is no guarantee that the MAM is also connected. Hence the MAM is restricted to its largest connected component before the Laplacian is formed. We observe this with the network $\mathcal{G}_3$, in which only three of the four vertices are embedded.

```
spectrum = run_motif_embedding(G3, motif_name = "M1", motif_type = "func",
  mam_weight_type = "unweighted", mam_method = "sparse", num_eigs = 2,
  restrict = TRUE, type_lap = "rw")
spectrum$vals
spectrum$vects
```

The overall aim of `motifcluster`
is to use motifs for spectral clustering,
so now we see how to extract clusters from the motif-based
eigenvector embeddings.
The `run_motif_clustering` function handles the entire process of
building an MAM, restricting it to its largest connected component,
performing eigenvector embedding, and extracting clusters.
We therefore take the opportunity to showcase the ability of
`motifcluster` to recover the blocks of a DSBM,
demonstrating all of the methods outlined in this vignette.

Let's use a DSBM with three blocks of 10 nodes each.

```
block_sizes = rep(10, 3)
```

We use strong connections of 0.8 within the blocks, and weaker connections of 0.2 between the blocks.

```
connection_matrix = matrix(c(
  0.8, 0.2, 0.2,
  0.2, 0.8, 0.2,
  0.2, 0.2, 0.8
), nrow = 3)
```

We also set the within-block edges to be Poisson-weighted with mean 20, and the between-block edges to be Poisson-weighted with smaller mean 10.

```
weight_matrix = matrix(c(
  20, 10, 10,
  10, 20, 10,
  10, 10, 20
), nrow = 3)
G4 = sample_dsbm(block_sizes, connection_matrix, weight_matrix,
  sample_weight_type = "poisson")
```

Now we can run the motif-based spectral clustering algorithm on this network with the 3-cycle motif $\mathcal{M}_1$. We build a functional MAM, weighting the instances by their mean edge weights, using the sparse formulation. We restrict this MAM to its largest connected component. Then we construct a random-walk Laplacian and embed it using the first four eigenvalues and eigenvectors. Finally we extract three clusters.

```
motif_cluster = run_motif_clustering(G4, motif_name = "M1",
  motif_type = "func", mam_weight_type = "mean", mam_method = "sparse",
  type_lap = "rw", num_eigs = 4, num_clusts = 3)
```

We can evaluate the performance by comparing it to the ground-truth
labels using the adjusted Rand index
from the `mclust` package:

```
truth = c(rep(1, 10), rep(2, 10), rep(3, 10))
mclust::adjustedRandIndex(motif_cluster$clusts, truth)
```

A larger value indicates better recovery of the blocks, with a value of one indicating perfect agreement.

```
options(old_options)
```
