knitr::opts_chunk$set(echo = TRUE) library(ggplot2)
noisySBM
is the accompanying R package of the article Powerful graph inference with false discovery rate control by Rebafka, Roquain and Villers available at https://arxiv.org/abs/1907.10176. It provides functions for generating data in the so-called noisy stochastic block model (NSBM), an estimation algorithm to fit parameters and cluster the nodes, as well as a powerful graph inference procedure.
The context is the following. We observe a dense graph where most interactions are due to noise. In practice the graph may be a similarity matrix, a correlation matrix or a matrix containing the values of a test statistic applied on all pairs of nodes. The aim is to infer the meaningful edges in the graph. This is obtained by a new random graph model and a new powerful graph inference procedure that comes with a control the number of flase discoveries.
The current version of the R package supports the entire analysis for undirected graphs in the Gaussian NSBM and the Gamma NSBM. The Poisson case as well as directed graphs are left for future developments.
To start with, we load the package:
library(noisySBM)
We introduce a new random graph model suited for the problem of graph inference. In this model the observed matrix is a perturbation of an unobserved binary graph, which is the true graph to be inferred. The binary graph is chosen to be a stochastic block model (SBM) for its capacity to model graph heterogeneity. We refer to the new model as the noisy stochastic block model (NSBM).
The definition is stated for an undirected graph without self-loops, but extensions are straightforward. Let $n\geq 2$ be the number of nodes and $\mathcal{A}={(i,j): 1\leq i <j\leq n}$ the set of all possible pairs of nodes. Denote $Q$ the number of latent node blocks. In the NSBM we observe edges $X_{i,j}, (i,j)\in\mathcal{A}$, and denote $X=(X_{i,j}){1\leq i,j\leq n}\in\mathbb R^{n^2}$ the corresponding symmetric, real-valued observation matrix. The observations $X{i,j}, (i,j)\in\mathcal{A}$, are generated by the following random layers:
First, a vector $Z=(Z_1,\dots,Z_n)$ of block memberships of the nodes is generated, such that $Z_i$, $1\leq i\leq n$, are independent with common distribution given by the probabilities $\pi_q=\mathbb P(Z_1=q), q\in{1,\dots,Q},$ for some parameter $\pi=(\pi_q){q\in{1,\dots,Q}}\in (0,1)^Q$ such that $\sum{q=1}^Q \pi_q=1$.
Conditionally on $Z$, the variables $A_{i,j}$, $(i,j)\in\mathcal A$, are independent Bernoulli variables with parameter $w_{Z_i,Z_j}$, for some $w=(w_{q,l}){q,l\in{1,\dots,Q}}\in (0,1)^{Q^2}$. As the graph is undirected, $w$ is symmetric. We denote $A=(A{i,j})_{1\leq i,j\leq n}$ the corresponding symmetric adjacency matrix, which is a standard binary SBM.
Conditionally on $(Z,A)$, the observations $X_{i,j}$, $(i,j)\in\mathcal A$ are independent and $X_{i,j}$ is sampled from the distribution
$$X_{i,j} \sim (1-A_{i,j}) g_{0,\nu_0} + A_{i,j} g_{\nu_{Z_i,Z_j}},$$
where $\nu_0\in \mathcal{V}0$ and $\nu{q,l}\in \mathcal{V}$, $1\leq q,l\leq Q$ are unknown parameters, with $\nu_{l,q}=\nu_{q,l}$ for all $q, l$, and ${g_{0,\nu},\nu\in \mathcal{V}0}$ and ${g{\nu},\nu \in \mathcal{V}}$ are given parametric density families with non-empty open subsets $\mathcal{V}_0\subset\mathbb R^{d_0}$ and $\mathcal{V}\subset\mathbb R^{d_1}$.
The relation between $A$ and the observation $X$ is that missing edges ($A_{i,j}=0$) are replaced with pure random noise modeled by the density $g_{0,\nu_0}$, also called the null density, whereas in place of present edges ($A_{i,j}=1$) there is a signal or effect. The intensity of the signal depends on the block membership of the interacting nodes in the underlying SBM, modeled by the density $g_{\nu_{q,l}}$, also called alternative density for parameter $\nu_{q,l}$.
The unknown global model parameter is $\theta=(\pi,w,\nu_0,\nu)$, where $\pi$ and $w$ come from the SBM, $\nu_0$ denotes the null parameter and $\nu=(\nu_{q,l})_{1\leq q, l\leq Q}\in \mathcal{V}^{Q^2}$ denotes the parameters of the effects. Due to the symmetry of $w$ and $\nu$, we sometimes consider, with some abuse of notation, that the $w$ and $\nu$ belong to $\mathbb R^{Q(Q+1)/2}$ rather than $\mathbb R^{Q^2}$ \tr{is this really used??}.
Our main model is the Gaussian case. It is particularly suitable for modeling situations where the observations $X_{i,j}$ correspond to test statistics or correlations, which are known to be approximately Gaussian. The Gaussian NSBM corresponds to the NSBM with the following choice of the parametric density families: \begin{equation}\label{gaussmodel} {g_{0,\nu_0},\nu_0\in \mathcal{V}0}\subset{\mathcal N(\mu,\sigma^2),\mu\in\mathbb R,\sigma>0},\qquad{g{\nu},\nu \in \mathcal{V}}\subset{\mathcal N(\mu,\sigma^2),\mu\in\mathbb R,\sigma>0}, \end{equation} where different choices for $\mathcal{V}_0$ and $\mathcal{V}$ lead to different variants. For example, often we assume that the mean of the null distribution is zero ($\mu_0=0$).
We may also consider the Gamma NSBM, where the parametric density families are chosen as follows: \begin{equation}\label{gammamodel} {g_{0,\nu},\nu\in \mathcal{V}0}={\text{Exp}(\nu_0),\nu_0>0}, \:\:\:{g{\nu},\nu \in \mathcal{V}}\subset{\Gamma(a,b),a>0,b>0}. \end{equation}
In the Poisson NSBM both parametric families are the family of Poisson distirbutions: \begin{equation}\label{possoinmodel} {g_{0,\nu_0},\nu_0\in \mathcal{V}}= {g_{\nu},\nu \in \mathcal{V}}={\text{Poi}(\nu),\nu>0}. \end{equation}
In the noisySBM
package there is an overall parameters: the graph type (directed or undirected). Let us consider the case:
directed <- FALSE
The global model parameter $\theta=(\pi,w,\nu_0,\nu)$ of the NSBM is a list and its elements are pi
, w
, nu0
and nu
.
theta <- list(pi=NULL, w=NULL, nu0=NULL, nu=NULL)
Denote Q
the number of latent blocks in the SBM. Say
Q <- 2
The parameters pi
and w
are the parameters of the latent binary SBM, where pi
is a Q
-vector indicating the mean proportions of nodes per block. The vector pi
has to be normalized:
theta$pi <- c(2,1)/3
Parameter w
is a vector.
For undirected graphs w
is of length $Q(Q+1)/2$ with the following indexing $w=(w_{1,1},w_{1,2},\dots,w_{1,Q},w_{2,2},\dots,w_{2,Q},w_{3,3},\dots,w_{Q,Q})$.
The parameter $w_{q,l}\in(0,1)$ is a Bernoulli parameter indicating the probability that a node in group $q$ is connected to a node in group $l$.
For directed graphs w
is of length $Q^2$ with the following indexing $w=(w_{1,1},w_{1,2},\dots,w_{1,Q},w_{1,2},\dots,w_{2,Q},w_{1,3},\dots,w_{Q,Q})$.
In the directed case $w_{q,l}\in(0,1)$ is the probability of a directed edge from a node in group $q$ to a node in group $l$.
For our undirected graph with two latent blocks, let us consider
theta$w <- c(.8,.1,.9)
This means that both blocks are communities (as $w_{1,1}=0.8$ and $w_{2,2}=0.9$), as nodes have a large probability to be connected to nodes in the same block, and a low probability to connect to nodes in the other group ($w_{1,2}=0.1$).
For the noisy layer, we have to choose a model, Gaussian, Gamma or Poisson.
In the Gaussian model, the null parameter nu0
is a vector of length 2, giving the mean and the standard deviation of the the Gaussian null distrubtion ($\nu_0=(\nu_{0}^\text{mean},\nu_{0}^\text{sd})$). Mostly, we choose the standard normal distribution:
theta$nu0 <- c(0,1)
Parameter nu
is a matrix with two columns: the first indicates Gaussian means $\nu_{q,l}^\text{mean}$, the second standard deviations $\nu_{q,l}^\text{sd}$ of the alternative distributions for the block pairs $(q,l)$. The number of rows of nu
is identical to the length of w
, that is, in the undirected case nu
has $Q(Q+1)/2$ rows, otherwise $Q^2$. The rows are indexed like w
, that is,
$$
\nu^{\text{undirected}}=\begin{pmatrix}\nu_{1,1}\\vdots\\nu_{1,Q}\\nu_{2,2} \\vdots\ \nu_{Q,Q}
\end{pmatrix}\in\mathbb R^{Q(Q+1)/2\times 2}\qquad
\nu^{\text{directed}}=\begin{pmatrix}\nu_{1,1}\\nu_{1,Q}\vdots\\\nu_{2,1} \\vdots\ \nu_{Q,Q}
\end{pmatrix}\in\mathbb R^{Q^2\times 2}\qquad
\text{with } \nu_{q,l}=(\nu_{q,l}^\text{mean},\nu_{q,l}^\text{sd})\in\mathbb R^2.
$$
For our undirected model with two latent blocks, we choose
theta$nu <- matrix(c(-1,5,-1, 1,1,1), ncol=2) theta$nu
To generate a data set with, say, $n=10$ nodes from the corresponding NSBM we use the function rnsbm()
:
obs <- rnsbm(n = 10, theta, modelFamily = 'Gauss') round(obs$dataMatrix, digits = 2)
We see that the generated matrix is symmetric, as we are in the undirected case.
The function rnsbm()
also provides the latent binary graph, named latentAdj
, which is also symmetric:
obs$latentAdj
as well as the latent blocks, names latentZ
:
obs$latentZ
A more convenient visualisation of the data is given by the function plotGraphs()
:
plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj)
To generate data from a directed graph, the parameters w
and nu
have to be adapted to length $Q^2$, say
theta$w <- c(.8,.1,.2,.9) theta$nu <- matrix(c(-1,5,10,-1, 1,1,1,1), ncol=2) theta$nu
To generate data from the directed NSBM, we set directed = TRUE
:
obs <- rnsbm(n = 10, theta, modelFamily = 'Gauss', directed = TRUE) plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix, digits = 2) obs$latentAdj
Now the matrices are no longer symmetric.
As the Gamma distribution also has two parameters, nu0
is a vector of length 2 and
nu
is a 2-column matrix as in the Gaussian case.
Let us take as an example an undirectd graph in a model with $Q=3$ latent blocks with equiprobable blocks:
theta$pi <- rep(1/3, 3)
Suppose that inter-block probabilities are high, while intra-block probabilities are low:
theta$w <- c(.1, .8, .8, .1, .8, .1)
The null distribution is the exponential distribution Exp$(2)$:
theta$nu0 <- c(1,2)
For the alternative distributions we also choose exponentials:
theta$nu <- matrix(c(1, 1, 1, 1, 1, 1, 1, .5, .5, 1, .5, 1), ncol=2) theta$nu
We generate a graph with 10 nodes:
obs <- rnsbm(n = 10, theta, modelFamily = 'Gamma') plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix, digits = 2) obs$latentAdj obs$latentZ
The Poisson distribution has a single parameter, so that nu0
is a scalar and nu
becomes a one-column matrix.
As an example consider a directed $Q=1$-block Poisson NSBM with parameters:
theta <- list(pi=1, w=.5, nu0=1, nu=5)
We generate a graph with 10 nodes:
obs <- rnsbm(n = 10, theta, modelFamily = 'Poisson', directed = TRUE) plotGraphs(obs$dataMatrix, binaryTruth=obs$latentAdj) round(obs$dataMatrix) obs$latentAdj obs$latentZ
fitNSBM()
and outputThe function fitNSBM()
fits the parameters of a NSBM to a data set in different models.
The function applies to a symmetric real-valued square matrix dataMatrix
. Let us see a basic example:
set.seed(1) theta1 <- list(pi=c(.5,.5), w=c(.8,.1,.2), nu0=c(0,1), nu=matrix(c(-1,5,10, 1,1,1), ncol=2)) obs1 <- rnsbm(n=30, theta1)
res_gauss <- fitNSBM(obs1$dataMatrix, nbCores=2)
fitNSBM()
applies the VEM-algorithm to estimate model parameters and compute a node clustering. By default, the algorithm explores different SBM models starting from a model with a single latent block (Q=1
) up to 4 blocks or more if appropriate. The output of fitNSBM()
is a list with estimation results for all visited models (for Qmin=1
to Qmax=4
(or more)). Here, for Q=2
for instance, we obtain the estimates of the model parameters $\theta$ by
res_gauss[[2]]$theta
A node clustering is given by
res_gauss[[2]]$clustering
The list element sbmParam
contains further results concerning the latent binary SBM, namely
the number of latent blocks:
res_gauss[[2]]$sbmParam$Q
the variational parameters, i.e. the probabilities that a node belongs to one of the latent blocks:
res_gauss[[2]]$sbmParam$clusterProba
and the conditional probabilities of an edge between a pair of nodes given the block memberships of the nodes:
res_gauss[[2]]$sbmParam$edgeProba[,1:5]
These probabilities correspond to $1-l\text{-values}$, which are used in the test procedure for graph inference.
The dimension of sbmParam$edgeProba
is $Q(Q+1)/2$ times $n(n-1)/2$ (in the undirected case):
dim(res_gauss[[2]]$sbmParam$edgeProba)
that is, the rows correspond to group pairs $(q,l)$ and the columns to node pairs $(i,j)$.
Moreover, sbmParam
contains the values of the integrated classification criterion:
res_gauss[[2]]$sbmParam$ICL
Finally, the list element convergence
provides details on the convergence of the algorithm:
res_gauss[[2]]$convergence
For the edges different distributions can be considered. The noisySBM
package supports the Gaussian NSBM and the Gamma NSBM. In the Gaussian case a number of different restrictions on the parameter space are implemented.
By default, fitNSBM()
considers a Gaussian NSBM with
model='Gauss0'
, where the null distribution is a centered normal distribution with $\nu_{0}^{\text{mean}}=0$ and standard deviation $\nu_{0}^{\text{sd}}\in\mathbb R_+$ to be estimated,
and all parameters $\nu_{q,l}=(\nu_{q,l}^\text{mean},\nu_{q,l}^\text{sd})$ for $1\leq q\leq l\leq Q$ of the alternative Gaussian distributions are estimated without any constraint. Other choices for model
in the Gaussian case are the following:
Gauss
: all Gaussian parameters of the null and the alternative distributions are unknown ; this is the Gaussian model with the maximum number of unknown parameters ($\nu_{0}^{\text{mean}}\in\mathbb R$, $\nu_{q,l}^{\text{mean}}\in\mathbb R$, $\nu_{0}^{\text{sd}}\in\mathbb R_+$, $\nu_{q,l}^{\text{sd}}\in\mathbb R_+$ for all $q\leq l$).
Gauss01
: compared to Gauss
, the null distribution is set to $\mathcal N(0,1)$ ($\nu_0^{\text{mean}}=0$, $\nu_0^{\text{sd}}=1$))
GaussEqVar
: compared to Gauss
, all Gaussian variances (of both the null and the alternative) are supposed to be equal, but unknown ($\nu_0^{\text{sd}}=\nu_{q,l}^{\text{sd}}\in\mathbb R_+$ for all $q\leq l$).
Gauss0EqVar
: compared to GaussEqVar
, the mean of the null distribution is set to 0 ($\nu_0^{\text{mean}}=0$).
Gauss0Var1
: compared to Gauss
, all Gaussian variances are set to 1 and the null distribution is set to $\mathcal N(0,1)$ ($\nu_0^{\text{mean}}=0$, $\nu_0^{\text{sd}}=\nu_{q,l}^{\text{sd}}=1$)
Gauss2distr
: the alternative distribution is a single Gaussian distribution, i.e. the block memberships of the nodes do not influence on the alternative distribution (all $\nu_{q,l}$ are identical).
GaussAffil
: compared to Gauss
, for the alternative distribution, there's a distribution for inter-group and another for intra-group interactions (there are parameters $\nu_{\text{in}}$ and $\nu_{\text{out}}$ such that $\nu_{q,q}=\nu_{\text{in}}$ for every $q$, and $\nu_{q,l}=\nu_{\text{out}}$ for every $q< l$).
The estimation procedure fitNSBM()
supports two exponential or Gamma models:
Exp
: the null and the alternatives are all exponential distributions with unknown parameters (i.e. they are Gamma distributions with shape parameter equal to one and unknown scale parameter)
ExpGamma
: the null distribution is exponential with unknown parameter, the alterantive distribution are Gamma distributions where both parameters are unknown.
Let us consider the following example:
theta2 <- list(pi=rep(1/2, 2), w=c(.2, .8, .2), nu0=c(1,5), nu=matrix(c(1, 1, 1, 1, 1/3, 1), ncol=2)) set.seed(2) obs2 <- rnsbm(n=60, theta2, modelFamily='Gamma')
Let us fit both Gamma models to the data :
res_exp <- fitNSBM(obs2$dataMatrix, model='Exp', nbCores = 2) set.seed(3) res_gamma <- fitNSBM(obs2$dataMatrix, model='ExpGamma', nbCores = 2)
The parameter estimates obtained for $Q=2$ are close to the true parameter values in both cases:
Q <- 2 res_exp[[Q]]$theta res_gamma[[Q]]$theta
When the algorithm explores several model sizes $Q$, then model selection can be performed via the ICL criterion. The function getBestQ()
gives the best model size $\hat Q$ that maximizes the ICL:
getBestQ(res_gauss)
Typically, the ICL curve is a concave function, as we can see by applying function plotICL
:
plotICL(res_gauss)
Likewise, for our exponential data, we obtain:
getBestQ(res_exp) getBestQ(res_gamma)
and for the ICL curves:
dfICLexp <- data.frame(ICL = sapply(res_exp, function(el) el$sbmParam$ICL), Q = sapply(res_exp, function(el) el$sbmParam$Q), model='Exp') dfICLgam <- data.frame(ICL = sapply(res_gamma, function(el) el$sbmParam$ICL), Q = sapply(res_gamma, function(el) el$sbmParam$Q), model='ExpGamma') dfICL <- rbind(dfICLexp, dfICLgam) ggplot(dfICL, aes(x=Q, y=ICL, group=model, colour=model) ) + geom_line() + xlim(1,4)
The argument sbmSize
of fitNSBM()
is used to precise the models to be explored. When sbmSize$Qmin
and sbmSize$Qmax
are specified, then only this range of models is explored. For a single model size, do e.g.
res4 <- fitNSBM(obs1$dataMatrix, sbmSize=list(Qmin=2, Qmax=2), nbCores=2)
The argument initParam
is a list and you may change default values to increase or descrease the number of initial points of the algorithm.
To directly save the output, a filename
has to be precised in the call of fitNSBM()
.
By default, computations are parallelized using the maximum number of available cores. However, the argument nbCores
can be used to customize this choice.
Node clusterings can be compared vie the adjusted Rand index. The package provides a function ARI()
to evaluate this indicator. On simulated data we can compare to the true clustering:
ARI(res_gauss[[3]]$clustering, obs1$latentZ)
To compare two node clusterings with different number of clusters:
ARI(res_gauss[[2]]$clustering, res_gauss[[3]]$clustering)
For the example for the Gamma NSBM, we find that the clusterings are perfect in both models (Exp
and ExpGamma
):
ARI(res_exp[[2]]$clustering, obs2$latentZ) ARI(res_gamma[[2]]$clustering, obs2$latentZ)
Graph inference is done by the function graphInference()
. The function provides the estimated binary graph at level alpha
using the approach of q-values in a NSBM. The arguments are the observed dataMatrix
, a clustering
, a parameter estimate theta
, a significance level alpha
and the model modelFamily
. It returns a list containing the inferred graph A
:
resGraph <- graphInference(obs1$dataMatrix, res_gauss[[2]]$clustering, res_gauss[[2]]$theta, alpha = .05, modelFamily='Gauss') resGraph$A[1:5, 1:5]
The ouptut also contains the q-values:
resGraph$qval[1:10]
which is a vector of length $n(n-1)/2$:
length(resGraph$qval)
To visualize the result, we can use again the function graphPlots()
that also returns the FDR and TDR:
plotGraphs(obs1$dataMatrix, resGraph$A, obs1$latentAdj)
The graph inference procedure is also implemented in the Gamma model, but only in case where the shape parameter of all Gamma distributions under the null and under the alternative is the same, so for instance in the model where all distributions are exponential distributions.
resGraph2 <- graphInference(obs2$dataMatrix, res_exp[[2]]$clustering, res_exp[[2]]$theta, alpha = .05, modelFamily='Gamma') resGraph2$A[1:5, 1:5]
The ouptut also contains the q-values:
resGraph2$qval[1:10]
which is a vector of length $n(n-1)/2$:
length(resGraph2$qval)
To visualize the result, we can use again the function graphPlots()
that also returns the FDR and TDR:
plotGraphs(obs2$dataMatrix, resGraph2$A, obs2$latentAdj)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.