library(devtools)
devtools::load_all()

Overview: Conditional Probability Queries

Here is an example of a simple joint distribution on two binary variables, and some queries for different evidence.

P_1 <- normalize_factor(create_factor(c(1,1,2,2), list(c(1,2), c(2,2))))  # joint on two binary variables
normalize_factor(factor_reduction(P_1, 2, 1))  # P(X_1 | X_2 = 1)
normalize_factor(factor_reduction(P_1, 2, 2))  # P(X_1 | X_2 = 2)
normalize_factor(factor_reduction(P_1, 1, 1))  # P(X_2 | X_1 = 1)
normalize_factor(factor_reduction(P_1, 1, 2))  # P(X_2 | X_1 = 2)

NP-hard problems:

In many cases one can efficiently get results.

Sum-Product for $P(Y \mid E = e)$: sum over the product of factors to get a marginal distribution. To incorporate the evidence, first reduce the factors. We can renormalize at the end, so there is no need to incorporate the partition function at the start.
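A minimal sketch of the whole pipeline (the helper name sum_product_query and the c(var, val) encoding of evidence are my own; the factor functions are the ones used throughout this document):

# Sum-product query: reduce by evidence, multiply, sum out, renormalize.
# factors: list of factors; evidence: list of c(var, val) pairs;
# hide: the variables to sum out (everything but the query variables).
sum_product_query <- function(factors, evidence, hide) {
  for (ev in evidence)
    factors <- lapply(factors, function(f) factor_reduction(f, ev[1], ev[2]))
  d_prod <- Reduce(factor_product, factors)  # product of the reduced factors
  normalize_factor(Reduce(factor_marginaliztion, hide, d_prod))  # sum out and renormalize
}
# e.g. for the student network defined below:
# sum_product_query(ds, list(c(3, 1)), c(1, 4, 5))  # P(G | I = 1)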


We'll show a couple of examples using the student network, with all variables binary: D for the difficulty of the course (1 = easy, 2 = hard), G for the grade in the course (1 = low, 2 = high), I for the student's intelligence (1 = low, 2 = high), L for the quality of the letter (1 = not good, 2 = good), and S for the SAT score (1 = low, 2 = high).

library(igraph)
G_student <- graph_from_literal(D -+ G, I -+ G, I -+ S, G -+ L)

layout <- matrix(c(0,1, 1,0, 2,1, 3,0, 1,-1), ncol = 2, byrow = TRUE)
plot(G_student, layout = layout)

A 50% chance the course is hard:

d_D <- create_factor(c(0.5,0.5), list(c(1,2)))
factor2df(d_D, names(V(G_student)))

A 90% chance the student is intelligent:

d_I <- create_factor(c(0.1, 0.9), list(c(3,2)))
factor2df(d_I, names(V(G_student)))
d_G <- create_factor(c(0.5, 0.5, 0.8, 0.2, 0.2, 0.8, 0.4, 0.6),
                     list(c(2,2), c(1,2), c(3,2)))
factor2df(d_G, names(V(G_student)))

To see these are not unreasonable, notice that if we incorporate the evidence that the student is not smart we get

factor2df(normalize_factor(factor_reduction(d_G, 3, 1), 1), names(V(G_student)))

Compare this with the following, which conditions on the student being smart:

factor2df(normalize_factor(factor_reduction(d_G, 3, 2), 1), names(V(G_student)))

And here is the grade distribution for a not-smart student in a difficult course (80% probability of a low grade):

factor2df(factor_reduction(factor_reduction(d_G, 3, 1), 1, 2), names(V(G_student)))
d_S <- create_factor(c(0.8, 0.2, 0.1, 0.9), list(c(4,2), c(3,2)))
factor2df(d_S, names(V(G_student)))

For the letter you can see that a low grade leads to an 80% probability of a bad letter, and a high grade leads to a 90% probability of a good letter.

d_L <- create_factor(c(0.8, 0.2, 0.1, 0.9), list(c(5,2), c(2,2)))
factor2df(d_L, names(V(G_student)))

With this we have distributions that look reasonable for all nodes in the student graph:

ds <- list(d_D, d_I, d_G, d_S, d_L)

Now we can compute some queries with the sum-product method.

First with no evidence, what is the distribution on grades?

ds_red <- ds    # no reductions needed since no evidence
d_prod <- Reduce(factor_product, ds_red)
factor2df(d_prod, names(V(G_student)))    ## full intermediate factor
a <- Reduce(factor_marginaliztion, c(1,3,4,5), d_prod)
factor2df(a, names(V(G_student)))

67% probability on a good grade. What if the student is not so smart?

ds_red <- lapply(ds, function(fact) factor_reduction(fact, 3, 1))  # evidence: I = 1 (not smart)
d_prod <- Reduce(factor_product, ds_red)
a <- normalize_factor(Reduce(factor_marginaliztion, c(1,4,5), d_prod))
factor2df(a, names(V(G_student)))

65% probability on a bad grade.

What if the student is not so smart, but the course is easy?

ds_red <- lapply(ds, function(fact) factor_reduction(factor_reduction(fact, 3, 1), 1, 1))  # evidence: I = 1 (not smart), D = 1 (easy)
d_prod <- Reduce(factor_product, ds_red)
a <- normalize_factor(Reduce(factor_marginaliztion, c(1,4,5), d_prod))
factor2df(a, names(V(G_student)))

50% probability on a good grade, a 15 percentage point rise compared with an arbitrary course. If the course is hard, however:

ds_red <- lapply(ds, function(fact) factor_reduction(factor_reduction(fact, 3, 1), 1, 2))  # evidence: I = 1 (not smart), D = 2 (hard)
d_prod <- Reduce(factor_product, ds_red)
a <- normalize_factor(Reduce(factor_marginaliztion, c(1,4,5), d_prod))
factor2df(a, names(V(G_student)))

80% probability on a bad grade.


Three ways to compute these queries:

Push summations into the factor product (variable elimination; a worked example follows below)

Message passing over a graph

Random sampling of instantiations
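As an illustration of the first approach (my own example, using the student-network factors defined above), consider computing $P(L)$: instead of summing the product $\phi_D \phi_I \phi_G \phi_S \phi_L$ over all $2^4$ joint assignments of $D, I, G, S$, push each summation in as far as it will go:

$$P(L) = \sum_G \phi_L(L, G) \sum_D \phi_D(D) \sum_I \phi_I(I)\,\phi_G(G, D, I) \sum_S \phi_S(S, I)$$

Each intermediate sum now involves only small factors; this is exactly what variable elimination (below) automates.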

Overview: MAP Inference

Maximum a Posteriori

The MAP assignment is not, in general, equal to the assignment obtained by maximizing each marginal separately.
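A small example (my own; I assume create_factor lists values with the first variable varying fastest, as the earlier snippets suggest):

# joint on binary X (var 1) and Y (var 2):
# P(1,1) = 0.4, P(2,1) = 0.3, P(1,2) = 0, P(2,2) = 0.3
P_xy <- create_factor(c(0.4, 0.3, 0, 0.3), list(c(1,2), c(2,2)))
factor2df(factor_marginaliztion(P_xy, 2), c("X", "Y"))  # X marginal: (0.4, 0.6), argmax X = 2
factor2df(factor_marginaliztion(P_xy, 1), c("X", "Y"))  # Y marginal: (0.7, 0.3), argmax Y = 1
# maximizing the marginals separately gives (X = 2, Y = 1) with probability 0.3,
# but the MAP assignment is (X = 1, Y = 1) with probability 0.4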

NP-hard problems:

In many cases one can efficiently get results.

Max-Product for $\mathrm{MAP}(Y \mid E = e)$: the denominator that normalizes w.r.t. the evidence is constant, so it can be left out; just work with the reduced factors. The partition function can again be left out as well.
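In symbols (writing $\phi_i[E = e]$ for a factor reduced by the evidence; both $P(e)$ and the partition function $Z$ are constant in $y$):

$$\arg\max_y P(y \mid e) = \arg\max_y \frac{P(y, e)}{P(e)} = \arg\max_y \frac{1}{Z\,P(e)} \prod_i \phi_i[E = e](y) = \arg\max_y \prod_i \phi_i[E = e](y)$$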

Variable Elimination Algorithm

Apply the evidence, then push the sums in as far as they will go, then perform the sums. This can be done in an unnormalized context, normalizing at the end.

library(igraph)
library(pryr)

# the standard student network
G_student <- graph_from_literal(D -+ G, I -+ G, I -+ S, G -+ L)

layout <- matrix(c(0,1, 1,0, 2,1, 3,0, 1,-1), ncol = 2, byrow = TRUE)
plot(G_student, layout = layout)
G_student

Eliminate-Var Z: multiply all factors whose scope contains $Z$ into a single factor $\psi$, then sum $Z$ out of $\psi$ to obtain $\tau$, and replace the factors used by $\tau$; see the sketch below.
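A sketch in the style of the earlier snippets; factor_vars, which should return the variable ids in a factor's scope, is a hypothetical helper that does not appear elsewhere in this document:

eliminate_var <- function(factors, z) {
  # factor_vars: hypothetical accessor for a factor's scope
  uses_z <- vapply(factors, function(f) z %in% factor_vars(f), logical(1))
  psi <- Reduce(factor_product, factors[uses_z])  # product of all factors mentioning z
  tau <- factor_marginaliztion(psi, z)            # sum z out
  c(factors[!uses_z], list(tau))                  # replace the used factors by tau
}
# running this for every non-query variable and multiplying the remaining
# factors reproduces the sum-product queries from the previous section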

Variable Elimination Alg: reduce all factors by the evidence, run Eliminate-Var for each variable not in the query, then multiply the remaining factors and renormalize.

Complexity of Variable Elimination Algorithm

In computing $\psi_k(X_k) := \prod_{i=1}^{m_k} \phi_i$, each row of $\psi_k$ is the product of one value from each of the $m_k$ factors, so we get a cost of $(m_k - 1) \cdot |\mathrm{Val}(X_k)|$ multiplications.

In computing $\tau_k(X_k \setminus \{Z\}) := \sum_Z \psi_k(X_k)$, every number in $\psi_k$ is added into one of the values of $\tau_k$, so we get a cost of $|\mathrm{Val}(X_k)|$ additions.

Write $N_k = |\mathrm{Val}(X_k)|$.

The total number of factors used in the procedure is $\leq m + n$, where $m$ is the number of factors in the initial network (each is used in at most one elimination step, after which one new factor is introduced) and $n$ is the number of variables (an upper bound on the number of elimination steps).
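Putting these together (a standard bound, not spelled out in the original notes): writing $N = \max_k N_k$ and $m_k$ for the number of factors multiplied in step $k$, so that $\sum_k m_k \leq m + n$,

$$\#\text{multiplications} \le \sum_k (m_k - 1) N_k \le (m + n) \cdot N, \qquad \#\text{additions} \le \sum_k N_k \le n \cdot N.$$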

In a Bayesian network we have one factor per variable, but in a Markov network we can start with many more factors.

But the size of a factor is exponential in its scope: $N_k = O(d^{r_k})$ where $d = \max_i |\mathrm{Val}(X_i)|$ and $r_k = |X_k|$. This is an opportunity for exponential blowup; for example, with binary variables ($d = 2$) a factor over 30 variables already has $2^{30} \approx 10^9$ entries. The size of the blowup depends heavily on the elimination ordering.

Graph-Based Perspective

Moralize a graph from a Bayesian network: make the edges undirected, and connect the parents in v-structures (this gives the induced Markov network for the given set of factors).
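For the student network this is a one-liner (my own snippet, reusing G_student and layout from the chunk above): G has parents D and I, the one v-structure, so moralizing adds the edge between D and I.

H_student <- add_edges(as.undirected(G_student), c("D", "I"))  # moralized student network
plot(H_student, layout = layout)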

As we eliminate variables, we remove nodes and edges. Sometimes we need to add a fill edge, because a factor is introduced between nodes that were not connected before: all of the neighbors of a variable that is eliminated are directly connected afterwards.

Induced graph $I_{\Phi, \alpha}$ over factors $\Phi$ and ordering $\alpha$: the undirected graph with an edge between $X_i$ and $X_j$ whenever they appear together in some factor generated by running VE with ordering $\alpha$.

Thm: The scope of every factor produced during VE is a clique in the induced graph.

clique: maximal fully connected subgraph.

Thm: Every (maximal) clique in the induced graph is a factor produced during VE.

Proof: Take a clique and look at the first variable from it to be eliminated. After its elimination no new neighbors are added to it. At the time it was eliminated it had all clique members as neighbors, which means it had factors involving all its neighbors, i.e. it participated in factors with all these other nodes. So when we multiply these factors together, we get a factor whose scope contains all of them. QED

The width of an induced graph is the number of nodes in the largest clique in the graph minus 1.

Minimal induced width of a graph $K$ is $\min_\alpha \mathrm{width}(I_{K, \alpha})$.

This provides a lower bound on the best performance of VE for a model factorizing over $K$.

Finding Elimination Orderings

Thm: For a graph $H$, determining whether there exists an elimination ordering for $H$ with induced width $\leq k$ is NP-complete.

Note: even given the optimal ordering, inference may still be exponential.

Greedy search using a heuristic cost function (at each point, eliminate the node with the smallest cost); a sketch of one such heuristic follows below.
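Here is a sketch (my own code on top of igraph, not part of the package) of one standard cost function, min-fill: at each step eliminate the node whose elimination would add the fewest fill edges.

count_fill <- function(g, v) {
  # number of fill edges that eliminating v would add
  nbs <- as.integer(neighbors(g, v))
  if (length(nbs) < 2) return(0)
  sum(!apply(combn(nbs, 2), 2, function(p) are_adjacent(g, p[1], p[2])))
}

min_fill_ordering <- function(g) {
  ordering <- character(0)
  while (vcount(g) > 0) {
    fills <- sapply(seq_len(vcount(g)), function(v) count_fill(g, v))
    v <- which.min(fills)
    nbs <- as.integer(neighbors(g, v))
    if (length(nbs) >= 2) {
      # connect all neighbors of v: these are the fill edges
      pairs <- combn(nbs, 2)
      missing <- !apply(pairs, 2, function(p) are_adjacent(g, p[1], p[2]))
      if (any(missing)) g <- add_edges(g, as.vector(pairs[, missing, drop = FALSE]))
    }
    ordering <- c(ordering, V(g)$name[v])
    g <- delete_vertices(g, v)   # eliminate v
  }
  ordering
}

min_fill_ordering(H_student)   # an ordering for the moralized student network from above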

Thm: the induced graph is triangulated.

Proof: Towards a contradiction, suppose the induced graph contains a loop of length greater than 3 with no chord. Look at the first variable in the loop to be eliminated. When it is eliminated, its two neighbors in the loop are still present, so a fill edge is introduced between them; this edge is in the induced graph and is a chord of the loop, a contradiction. QED

Can find elimination ordering by finding a low-width triangulation of the original graph.


