require(graphstats)
require(ggplot2)
require(igraph)
require(reshape2)

Here, we use the dimselect function to select the optimal number of dimensions for dimensionality reduction problems. The gist of our method consists of explicitly constructing a model for the eigenvalues and estimating the position of the “gap” or the “elbow”.

Scree plot

Plot the eigenvalues in descending order (often called a scree plot) and look for a “big gap” or an “elbow” in such a graph.

# Assumes distribution of eigenvalues follow uniform distribution
d1=runif(10,0,10) # First 10 eigenvalues
d2=runif(10,15,25) # Last 10 eigenvalues
d = c(d1,d2)
d = sort(d,decreasing = TRUE)
df <- data.frame(1:20, d)
colnames(df) <- c("Rank", "Eigenvalue")
g <- ggplot(df, aes(Rank, Eigenvalue))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Screeplot (Distribution of Eigenvalues)") +
      scale_x_continuous("Rank", breaks = seq(1, 20, 1)) + 
      theme(axis.text.x = element_text(angle=65, vjust=0.6))

Testing/Simulated Examples

Here, we run a simulation on a stochastic blockmodel (SBM). We will generate a two-block SBM, and compute the singular values of the adjacency matrix. Here we expect the number of dimensions selected by dimselect to be the number of blocks.

set.seed(123)
# SBM Params
n <- 100
num_class1 <- n/2
num_class2 <- n - num_class1
assignments <- c(rep(1, num_class1), rep(2, num_class2))

B_sbm <- matrix(c(0.8, 0.2,
                  0.2, 0.8), nrow = 2)

# 2-block simulation.
g_sbm <- igraph::sample_sbm(n, pref.matrix=B_sbm, block.sizes=c(num_class1, num_class2))

## Embed both with ASE; get singular values from adjacency matrix;
## select dimenstion with dimselect.
A_sbm <- igraph::as_adj(g_sbm)
A = matrix(A_sbm, nrow = 100)
gs.plot.plot_matrix(A, legend.name = "connection", xlabel = "vertex", 
                    ylabel = "vertex", title = "Graph Simulated from SBM",ffactor = TRUE)

Run dimselect on the vector of singular values we get from adjacency matrix:

sigma_sbm <- svd(A, n)$d
dim_sbm <- dimselect(sigma_sbm)[1]
sprintf("The number of dimension selected by dimselect.R: %d", dim_sbm)

Let's verify if there is a huge drop off after the second singular value:

df <- data.frame(1:n, sigma_sbm)
colnames(df) <- c("Rank", "Eigenvalue")
g <- ggplot(df[1:20,], aes(Rank, Eigenvalue))
g + geom_bar(stat="identity", width = 0.5, fill="tomato2") + 
      labs(title="Screeplot (Distribution of Singular Values)") +
      scale_x_continuous("Rank", breaks = seq(1, 20, 1)) + 
      theme(axis.text.x = element_text(angle=65, vjust=0.6))


neurodata/graphstats documentation built on May 14, 2019, 5:19 p.m.