knitr::opts_chunk$set(echo = TRUE, cache=TRUE)
In this tutorial, we discuss our estimator of a bernoulli distribution per edge for a given graph, and the strategies to identify an incoherent subgraph from the data. Using our estimators, we develop a Bayes Plugin Classifier.
Identify the sufficient parameters to characterize the distribution of connected and disconnected edges. Identify the edges that are most likely to show a class-conditional difference, the subgraph. Use the subgraph and the related estimators to produce a bayes-plugin classifier that allows us to accurately predict the class of items.
Assume that the edge weights can be characterized by a bernoulli RV; that is:
\begin{align} \mathbb{A}{uv} \sim Bern(p{uv}) \end{align}
where $p_{uv|y}$ is the probability of edge $e_{uv}$ being connected in class $y$.
Then our likelihood function is simply:
\begin{align} L_{\mathbb{A}, Y}(A_i, y; \theta) &= \prod_{(u, v) \in \mathcal{S}} Bern(w_i(e_{uv}); p_{uv | y}) \ &= \prod_{(u, v) \in \mathcal{S}} p_{uv | y}^{w_i(e_{uv})}(1 - p_{uv | y})^{1 - w_i(e_{uv})} \end{align}
where $\mathcal{S}$ is our subgraph.
Using MLE, it is easy to see that:
\begin{align} \hat{p}{uv | y} = \frac{1}{n} \sum{i | y_i = y} w_i(e_{uv}) \end{align}
where $w_i(e_{uv}) \in {0, 1}$ is the binary edge weight of edge $e_{uv}$.
Note that if $w_i(e_{uv}) = 0 \;\forall i$, then $p_{uv} = 0$, which is undesirable since we only have a finite sample (and successive samples where $w_i(e_{uv})) \neq 0$ would lead to poor model performance), and vice versa for $p_{uv} = 1$ when $w_i(e_{uv}) = 0 \;\forall i$. Then consider the smoothed estimator:
\begin{align} \hat{p}{uv | y} = \begin{cases} n_n & max{i | y_i = y}(w_i(e_{uv})) = 0 \ 1-n_n & max_{i | y_i = y}(w_i(e_{uv})) = 1 \ \hat{p}_{uv | y} & else \end{cases} \end{align}
Here, we take the maximum likelihood estimators for the prior probabilities, which assuming our data is sampled iid from our population, should suffice:
\begin{align} \hat{\pi}_y = \frac{n_y}{n} \end{align}
where $n_y = \sum_{i =1}^n \mathbb{I}{y_i = y}$.
To estimate the incoherent subgraph, we consider the following algorithm:
incoherent_subgraph(G, e):
We can use our Bernoulli probabilities to explicitly define a Bayes-Plugin classifier:
\begin{align} h_*(G; \mathcal{T}) = \textrm{argmax}{y \in Y} \prod{(u, v) \in \hat{\mathcal{S}}} \hat{p}{uv | y}^{a{uv}}(1 - \hat{p}{uv | y})^{1 - a{uv}}\hat{\pi}_y \end{align}
where $a_{uv}$ is the $(u, v)$ edge of graph $G$, and $h_*(\cdot; \mathcal{T})$ is the hypothesis of the model constructed given training set $\mathcal{T}$.
We will evaluate our model performance with the cross-validated error:
\begin{align} \hat{L}{\hat{h}(\cdot, \mathcal{T}_n)} &= \frac{1}{C} \sum{i=1}^C \frac{1}{\left| \mathcal{T}n \setminus \mathcal{T}_C \right|} \sum{G \notin \mathcal{T}_C} \mathbb{I}\left{\hat{h} \left(G; \mathcal{T}_C \right)\right} \end{align}
where $\mathcal{T}_C$ is the set of graphs that we trained our model on.
Additionally, we can estimate a $p$ value using Monte Carlo permutations. We perform this by randomly permuting our labels $n$ times, and then using the permuted labels to construct our estimators and our bayes-plugin classifier. We then feed in our testing data and similarly compute a loss for each of our $n$ permutations. We report our $p$ value as the fraction of Monte Carlo permutations that perform better than our classifier given the correctly labelled data.
During our simulations, since we are constructing simulated data, we will know ahead of time whether an edge is or is not part of the subgraph. To quantify this performance, we consider the edge-misclassification rate:
\begin{align} R_n^x = \frac{1}{\left|\mathcal{S}\right|} \sum_{(u, v) \in \mathcal{S}}\mathbb{I}\left{(u, v) \notin \hat{\mathcal{S}}\right} \end{align}
or the fraction of edges that are part of the true subgraph $\mathcal{S}$ but not the estimated subgraph $\mathcal{\hat{S}}$.
In our basic simulation, we will use 2 classes with 4x4 probability matrices. The probability matrix for each class will be identical, except for 4 randomly selected edges, in which the probability for class 1 will be .25, and the probability for class 2 will be .75.
require(subgraphing) require(ggplot2) require(reshape2) require(fmriutils) require(Rmisc) lseq <- function(from, to, n) { return(round(exp(seq(log(from), log(to), length.out = n)))) } dim = 4 # dimensions p <- array(runif(dim^2), dim=c(dim, dim)) # p is initially random edges <- sample(1:dim^2, dim, replace = FALSE) # select 4 random edges p1 <- p; p2 <- p # initialize p1 and p2 to the same array for (edge in edges) { p1[edge] <- .25 p2[edge] <- .75 } p <- array(NaN, dim=c(dim, dim, 2)) p[,,1] <- p1 p[,,2] <- p2 # visualize the two probability matrices plot_p1 <- fmriu.plot.plot_square(p[,,1], title="True P, class 1", xlabel="vertex", ylabel="vertex", legend="p") plot_p2 <- fmriu.plot.plot_square(p[,,2], title="True P, class 2", xlabel="vertex", ylabel="vertex", legend="p") plot_diff <- fmriu.plot.plot_square(abs(p[,,1] - p[,,2]), title="P, class 1 - P, class 2", xlabel="vertex", ylabel="vertex", legend="p1 - p2") sg <- array(0, dim=c(dim, dim)) sg[edges] <- 1 plot_sg <- fmriu.plot.plot_square(sg, title="True Subgraph", xlabel="vertex", ylabel="vertex", legend="edge") multiplot(plot_p1, plot_p2, plot_diff, plot_sg, cols = 2)
As we can see, it is quite immediately clear which edges are part of the subgraph, so our classifier should have no issues. We generate 10 simulated examples per class, and examine the results of of our estimators:
ns = 100 samp <- array(NaN, dim=c(dim, dim, ns*2)) samp[,,1:ns] <- sg.bern.sample_graph(p[,,1], s=ns) samp[,,(ns+1):(2*ns)] <- sg.bern.sample_graph(p[,,2], s=ns) Y <-array(NaN, dim=c(ns*2)) Y[1:ns] <- 0 Y[(ns+1):(2*ns)] <- 1 # approximate estimators and contingency table train <- sg.bern.subgraph_train(samp, Y, 4, coherent=FALSE, tstat = "fisher") # visualize the two probability matrices plot_p1 <- fmriu.plot.plot_square(train$p[,,1], title="Est P, class 1", xlabel="vertex", ylabel="vertex", legend="p") plot_p2 <- fmriu.plot.plot_square(train$p[,,2], title="Est P, class 2", xlabel="vertex", ylabel="vertex", legend="p") plot_diff <- fmriu.plot.plot_square(abs(train$p[,,1] - train$p[,,2]), title="P, class 1 - P, class 2", xlabel="vertex", ylabel="vertex", legend="p1 - p2") estsg <- array(0, dim=c(dim, dim)) estsg[train$edges] <- 1 plot_sg <- fmriu.plot.plot_square(estsg, title="Estimated Subgraph", xlabel="vertex", ylabel="vertex", legend="edge") multiplot(plot_p1, plot_p2, plot_diff, plot_sg, cols = 2)
Given 100 examples per class, we can see that our estimated Ps are very close to our true Ps (we investigate this more thoroughly in the vignette bern_graph_estimator
), and our subgraph matches the truth perfectly.
In this simulation, we will structure our trials very similarly to the previous, however our predictions will not be quite as simple this time. Our true P between class 1 and class 2 will no longer be identical for non-signal edges, as we will add gaussian noise with a cutoff at 0 and 1 (since probabilities cannot exceed 1 nor be lower than 0). We will report cross-validated error and misclassification rate as a function of the number of training examples used, and will also investigate the impact of having an estimate of 2, 4, and 8 signal edges.
ns <- lseq(10, 500, 8) nes <- c(3, 6, 9) dim <- 4 p <- array(runif(dim^2), dim=c(dim, dim)) # p is initially random edges <- sample(1:dim^2, 6, replace = FALSE) # select 6 random edges p1 <- p; p2 <- p # initialize p1 and p2 to the same array with some noise for (edge in edges) { p1[edge] <- .3 p2[edge] <- .7 } p1 <- p1 + rnorm(dim^2, mean=0, sd=.1); p2 <- p2 + rnorm(dim^2, mean=0, sd=.1) p1[p1 > 1] <- 1 - 1/(10*ns); p1[p1 < 0] <- 1/(10*ns); p2[p2 > 1] <- 1 - 1/(10*ns); p2[p2 < 0] <- 1/(10*ns) p <- array(NaN, dim=c(dim, dim, 2)) p[,,1] <- p1 p[,,2] <- p2 # visualize the two probability matrices plot_p1 <- fmriu.plot.plot_square(p[,,1], title="True P, class 1", xlabel="vertex", ylabel="vertex", legend="p") plot_p2 <- fmriu.plot.plot_square(p[,,2], title="True P, class 2", xlabel="vertex", ylabel="vertex", legend="p") plot_diff <- fmriu.plot.plot_square(abs(p[,,1] - p[,,2]), title="P, class 1 - P, class 2", xlabel="vertex", ylabel="vertex", legend="p1 - p2") sg <- array(0, dim=c(dim, dim)) sg[edges] <- 1 plot_sg <- fmriu.plot.plot_square(sg, title="True Subgraph", xlabel="vertex", ylabel="vertex", legend="edge") multiplot(plot_p1, plot_p2, plot_diff, plot_sg, cols = 2)
Again, our task appears to be fairly simple, but it should prove far more difficult than our previous task due to the fact that there now exists a class-conditional difference in the non-signal edges as well, although slight.
results <- data.frame(n=c(), nedges=c(), error=c(), miss_edge=c()) for (sim in 1:10) { p <- array(runif(dim^2), dim=c(dim, dim)) # p is initially random edges <- sample(1:dim^2, 6, replace = FALSE) # select 6 random edges p1 <- p; p2 <- p # initialize p1 and p2 to the same array with some noise for (edge in edges) { p1[edge] <- .3 p2[edge] <- .7 } p1 <- p1 + rnorm(dim^2, mean=0, sd=.1); p2 <- p2 + rnorm(dim^2, mean=0, sd=.1) p1[p1 > 1] <- 1 - 1/(10*ns); p1[p1 < 0] <- 1/(10*ns); p2[p2 > 1] <- 1 - 1/(10*ns); p2[p2 < 0] <- 1/(10*ns) p <- array(NaN, dim=c(dim, dim, 2)) p[,,1] <- p1 p[,,2] <- p2 for (n in ns) { samp <- array(NaN, dim=c(dim, dim, n*2)) samp[,,1:n] <- sg.bern.sample_graph(p[,,1], s=n) samp[,,(n+1):(2*n)] <- sg.bern.sample_graph(p[,,2], s=n) Y <-array(NaN, dim=c(n*2)) Y[1:n] <- 0 Y[(n+1):(2*n)] <- 1 for (ne in nes) { class_res <- sg.bern.xval_classifier(samp=samp, Y=Y, nedge=ne, tstat="fisher", coherent = FALSE, xval="loo") miss_edge <- 1 - 1/length(edges)*sum(edges %in% class_res$edges) results <- rbind(results, data.frame(n=n, nedges=ne, error=class_res$error, miss_edge=miss_edge)) } } }
and we plot the missed edge rate and the leave-one-out cross validated error:
results$nedges <- factor(results$nedges) me_plot <- ggplot(results, aes(x=n, y=miss_edge, color=nedges, group=nedges)) + geom_point() + stat_summary(fun.y = mean, geom = "line", size=2) + stat_summary(fun.data = mean_se, geom = "errorbar", size=2) + scale_y_continuous(limits = c(0, 1)) + ggtitle("Proportion of Edges Missed by Incoherent Subgraph Estimator") + xlab("Number of Training Examples per Class") + ylab("Missed-Edge Rate") + theme(text=element_text(size=12)) xv_plot <- ggplot(results, aes(x=n, y=error, color=nedges, group=nedges)) + geom_point() + stat_summary(fun.y = mean, geom = "line", size=2) + stat_summary(fun.data = mean_se, geom = "errorbar", size=2) + scale_y_continuous(limits = c(0, 1)) + ggtitle("Error of Model Estimated with Leave-One-Out Cross Validation") + xlab("Number of Training Examples per Class") + ylab("Cross-Validated Error") + theme(text=element_text(size=12)) multiplot(me_plot, xv_plot, cols=1)
Intuitively, we can see that more training examples asymptotically decreases our missed-edge rate and cross-validated error, and choosing more subgraph edges gives us better ability to capture the variation that may be in the data due to signal or consistent class-to-class noise (due to us adding the normally distributed 0-mean gaussian noise to our probability matrices). We see that using 3 edges versus 6 edges gives us a significant difference in our performance metrics, which makes sense as with 4 edges we are not capturing all of our signal edges completely. When we jump from 6 edges to 9 edges in our subgraph, we see only modest improvements, which makes sense as our entire subgraph is only 6 edges total, but 9 edges can still capture some of the class-conditional noise that is also present in our probability matrices.
In this example, we will explore a situation that an incoherent subgraph classifier will not be able to appropriately handle. Consider the case where we have a far more subtle class-conditional variation, and the bulk of the class-conditional variation captured by only the edges incident a smaller number of our vertices known as the signal vertices. This situation, known as a coherent subgraph, allows us to focus on the vertices that matter most, and consider from this subset the edges that matter most (see the vignette for bern_coherent_subgraph_estimation
for more details). Due to the fact that the incoherent estimator will consider all edges, our signal edges may potentially be due to noise. In this example, we will consider 9 vertex graphs, with 23 signal edges distributed about the possible edges incident only vertices 1 and 3. For each class, the probability matrix has 0-mean gaussian noise with $\sigma=.1$ to these signal edges to add class-conditional variation. Additionally, with probability $p=.1$ we will rewire (that is, change the connection from connected to unconnected, or vice versa) each edge for each graph. This means that there will be other edges at random that also may look like signal vertices, but these edges (since they are rewired at random) will be purely noise and should not improve our predictive power on successive examples. We would thus expect that our missed-edge rate and leave-one-out cross-validated error will be artificially inflated unless we can account for the fact that most of our signal edges are incident only to one vertex.
xdim <- 9 ydim <- 9 c <- 2 # number of classes p <- array(NaN, dim=c(xdim, ydim, c)) ns = 100 # define the signal edges for our signal vertices signal_edges_1 <- c(2,3,4,5, 6, 7,9) # signal edges for vertex 3 signal_edges_3 <- c(3, 4,5, 6, 7, 8, 9) # signal edges for vertex 1 p1 <- array(runif(xdim*ydim), dim=c(xdim, ydim)) p1[upper.tri(p1, diag=FALSE)] <- 0 p2 <- p1 # we will change our signal edges in vertices 1 and 3 to have a class-conditional difference # add signal to vertex 1 of random magnitude p2[1,signal_edges_1] <- p2[1, signal_edges_1] + rnorm(length(signal_edges_1), mean=0, sd=.15) # add signal to vertex 3 of random magnitude p2[3, signal_edges_3] <- p2[3, signal_edges_3] + rnorm(length(signal_edges_3), mean=0, sd=.15) p1 <- p1 + t(p1) - diag(diag(p1)) p2 <- p2 + t(p2) - diag(diag(p2)) # fix the limits to be valid probabilities and smooth p1[p1 > 1] <- 1 - 1/(10*ns); p1[p1 < 0] <- 1/(10*ns); p2[p2 > 1] <- 1 - 1/(10*ns); p2[p2 < 0] <- 1/(10*ns) p[,,1] <- p1 p[,,2] <- p2 ns = 100 samp <- array(NaN, dim=c(xdim, xdim, ns*2)) samp[,,1:ns] <- sg.bern.sample_graph(p[,,1], s=ns, rewire=.1) samp[,,(ns+1):(2*ns)] <- sg.bern.sample_graph(p[,,2], s=ns, rewire=.1) Y <-array(NaN, dim=c(ns*2)) Y[1:ns] <- 0 Y[(ns+1):(2*ns)] <- 1 plot_p1 <- fmriu.plot.plot_square(p[,,1], title="True P, class 1", xlabel="vertex", ylabel="vertex", legend="p") plot_p2 <- fmriu.plot.plot_square(p[,,2], title="True P, class 2", xlabel="vertex", ylabel="vertex", legend="p") sg <- array(0, dim=c(xdim, xdim)) plot_diff <- fmriu.plot.plot_square(abs(p[,,1] - p[,,2]), title="P, class 1 - P, class 2", xlabel="vertex", ylabel="vertex", legend="p1 - p2") sg[1, signal_edges_1] <- 1 sg[3, signal_edges_3] <- 1 sg <- sg + t(sg) - diag(diag(sg)) plot_sg <- fmriu.plot.plot_square(sg, title="True Subgraph", xlabel="vertex", ylabel="vertex", legend="edge") multiplot(plot_p1, plot_p2, plot_diff, plot_sg, cols = 2)
As we can see, the true subgraph is concentrated about vertices 1 and 3, as these vertices contain all of the class-conditional difference observed. As we show below, however, using an incoherent estimator does not necessarily return such a favorable output.
# approximate estimators and contingency table, given the prior that there are 23 signal vertices train <- sg.bern.subgraph_train(samp, Y, 26, coherent=2, tstat = "fisher") # visualize the two probability matrices plot_p1 <- fmriu.plot.plot_square(train$p[,,1], title="Est P, class 1", xlabel="vertex", ylabel="vertex", legend="p") plot_p2 <- fmriu.plot.plot_square(train$p[,,2], title="Est P, class 2", xlabel="vertex", ylabel="vertex", legend="p") plot_diff <- fmriu.plot.plot_square(abs(train$p[,,1] - train$p[,,2]), title="P, class 1 - P, class 2", xlabel="vertex", ylabel="vertex", legend="p1 - p2") estsg <- array(0, dim=c(xdim, xdim)) estsg[train$edges] <- 1 plot_sg <- fmriu.plot.plot_square(estsg, title="Estimated Subgraph", xlabel="vertex", ylabel="vertex", legend="edge") multiplot(plot_p1, plot_p2, plot_diff, plot_sg, cols = 2)
As we can see, we get some of the edges right, but make many errors, and have not really captured the idea that the signal edges are entirely present in the first and third vertices. Let's take a look at how our performance is in a full-simulation setting:
ns <- lseq(10, 500, 8) nes <- c(20, 26, 30) results <- data.frame(n=c(), nedges=c(), error=c(), miss_edge=c()) for (sim in 1:10) { print(sim) # define the signal edges for our signal vertices signal_edges_1 <- c(2, 3, 4, 5, 7, 9) # signal edges for vertex 3 signal_edges_3 <- c(4,5, 6, 8, 9) # signal edges for vertex 1 p1 <- array(runif(xdim*ydim), dim=c(xdim, ydim)) p1[upper.tri(p1, diag=FALSE)] <- 0 p2 <- p1 # we will change our signal edges in vertices 1 and 3 to have a class-conditional difference # add signal to vertex 1 of random magnitude p2[1,signal_edges_1] <- p2[1, signal_edges_1] + rnorm(length(signal_edges_1), mean=0, sd=.15) # add signal to vertex 3 of random magnitude p2[3, signal_edges_3] <- p2[3, signal_edges_3] + rnorm(length(signal_edges_3), mean=0, sd=.15) p1 <- p1 + t(p1) - diag(diag(p1)) p2 <- p2 + t(p2) - diag(diag(p2)) # fix the limits to be valid probabilities and smooth p1[p1 > 1] <- 1 - 1/(10*ns); p1[p1 < 0] <- 1/(10*ns); p2[p2 > 1] <- 1 - 1/(10*ns); p2[p2 < 0] <- 1/(10*ns) p[,,1] <- p1 p[,,2] <- p2 for (n in ns) { samp <- array(NaN, dim=c(xdim, ydim, n*2)) samp[,,1:n] <- sg.bern.sample_graph(p[,,1], s=n, rewire=.1) samp[,,(n+1):(2*n)] <- sg.bern.sample_graph(p[,,2], s=n, rewire=.1) Y <-array(NaN, dim=c(n*2)) Y[1:n] <- 0 Y[(n+1):(2*n)] <- 1 for (ne in nes) { class_res <- sg.bern.xval_classifier(samp=samp, Y=Y, nedge=ne, tstat="fisher", coherent = FALSE, xval="loo") true_edges <- c(2, 3, 4, 5, 6, 7, 9, 10, 19, 28, 37, 46, 55, 73, 21, 22, 23, 24, 25, 26, 27, 30, 39, 48, 57, 66, 75) miss_edge <- 1 - 1/length(true_edges)*sum(true_edges %in% class_res$edges) results <- rbind(results, data.frame(n=n, nedges=ne, error=class_res$error, miss_edge=miss_edge)) } } }
results$nedges <- factor(results$nedges) me_plot <- ggplot(results, aes(x=n, y=miss_edge, color=nedges, group=nedges)) + geom_point() + stat_summary(fun.y = mean, geom = "line", size=2) + stat_summary(fun.data = mean_se, geom = "errorbar", size=2) + scale_y_continuous(limits = c(0, 1)) + ggtitle("Proportion of Edges Missed by Incoherent Subgraph Estimator") + xlab("Number of Training Examples per Class") + ylab("Missed-Edge Rate") + theme(text=element_text(size=12)) xv_plot <- ggplot(results, aes(x=n, y=error, color=nedges, group=nedges)) + geom_point() + stat_summary(fun.y = mean, geom = "line", size=2) + stat_summary(fun.data = mean_se, geom = "errorbar", size=2) + scale_y_continuous(limits = c(0, 1)) + ggtitle("Error of Model Estimated with Leave-One-Out Cross Validation") + xlab("Number of Training Examples per Class") + ylab("Cross-Validated Error") + theme(text=element_text(size=12)) multiplot(me_plot, xv_plot, cols=1)
As we can see, we do a poor job of estimating signal edges, and are not able to develop a very good classifier.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.