In neurodata/graphstats: Graph Statistics

require(graphstats)
require(ggplot2)
require(igraph)
require(reshape2)

Here, we use the dimselect function to select the optimal number of dimensions for dimensionality reduction problems. The gist of our method consists of explicitly constructing a model for the eigenvalues and estimating the position of the “gap” or the “elbow”.

Scree plot

Plot the eigenvalues in descending order (often called a scree plot) and look for a “big gap” or an “elbow” in such a graph.

# Assumes distribution of eigenvalues follow uniform distribution
d1=runif(10,0,10) # First 10 eigenvalues
d2=runif(10,15,25) # Last 10 eigenvalues
d = c(d1,d2)
d = sort(d,decreasing = TRUE)
df <- data.frame(1:20, d)
colnames(df) <- c("Rank", "Eigenvalue")
g <- ggplot(df, aes(Rank, Eigenvalue))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Screeplot (Distribution of Eigenvalues)") +
      scale_x_continuous("Rank", breaks = seq(1, 20, 1)) + 
      theme(axis.text.x = element_text(angle=65, vjust=0.6))

Testing/Simulated Examples

Here, we run a simulation on a stochastic blockmodel (SBM). We will generate a two-block SBM, and compute the singular values of the adjacency matrix. Here we expect the number of dimensions selected by dimselect to be the number of blocks.

set.seed(123)
# SBM Params
n <- 100
num_class1 <- n/2
num_class2 <- n - num_class1
assignments <- c(rep(1, num_class1), rep(2, num_class2))

B_sbm <- matrix(c(0.8, 0.2,
                  0.2, 0.8), nrow = 2)

# 2-block simulation.
g_sbm <- igraph::sample_sbm(n, pref.matrix=B_sbm, block.sizes=c(num_class1, num_class2))

## Embed both with ASE; get singular values from adjacency matrix;
## select dimenstion with dimselect.
A_sbm <- igraph::as_adj(g_sbm)
A = matrix(A_sbm, nrow = 100)
gs.plot.plot_matrix(A, legend.name = "connection", xlabel = "vertex", 
                    ylabel = "vertex", title = "Graph Simulated from SBM",ffactor = TRUE)

Run dimselect on the vector of singular values we get from adjacency matrix:

sigma_sbm <- svd(A, n)$d
dim_sbm <- dimselect(sigma_sbm)[1]
sprintf("The number of dimension selected by dimselect.R: %d", dim_sbm)

Let's verify if there is a huge drop off after the second singular value:

df <- data.frame(1:n, sigma_sbm)
colnames(df) <- c("Rank", "Eigenvalue")
g <- ggplot(df[1:20,], aes(Rank, Eigenvalue))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Screeplot (Distribution of Singular Values)") +
      scale_x_continuous("Rank", breaks = seq(1, 20, 1)) + 
      theme(axis.text.x = element_text(angle=65, vjust=0.6))