Next, we turn our attention to the problem of inference in graphical models. Given a probabilistic model (such as a Bayes net or an MRF), we are interested in using it to answer useful questions, e.g., determining the probability that a given email is spam. More formally, we will be focusing on two types of questions:
- Marginal inference: what is the probability of a given variable in our model after we sum everything else out (e.g. probability of spam vs non-spam)?
- Maximum a posteriori (MAP) inference: what is the most likely assignment to the variables in the model (possibly conditioned on evidence).
It turns out that inference is a challenging task. For many probabilities of interest, it will be NP-hard to answer any of these questions. Crucially, whether inference is tractable will depend on the structure of the graph that describes that probability. If a problem is intractable, we will still be able to obtain useful answers via approximate inference methods.
This chapter covers the first exact inference algorithm, variable elimination. We will discuss approximate inference in later chapters.
An illustrative example
Consider first the problem of marginal inference. Suppose for simplicity that we are given a chain Bayesian network, i.e. a probability of the form
We are interested in computing the marginal probability . We will assume for the rest of the chapter that the are discrete variables taking possible values eachThe principles behind variable elimination also extend to many continuous distributions (e.g. Gaussians), but we will not discuss these extensions here..
The naive way of performing this is to sum the probability over all the assignments to :
However, we can do much better by leveraging the factorization of our probability distribution. We may rewrite the sum in a way that “pushes in” certain variables deeper into the product.
We sum the inner terms first, starting from and ending with . Concretely, we start by computing an intermediary factor by summing out . This takes time because we must sum over for each assignment to . The resulting factor can be thought of as a table of values (though not necessarily probabilities), with one entry for each assignment to (just as factor can be represented as a table). We may then rewrite the marginal probability using as
Note that this has the same form as the initial expression, except that we are summing over one fewer variableThis technique is a special case of dynamic programming, a general algorithm design approach in which we break apart a larger problem into a sequence of smaller ones.. We may therefore compute another factor , and repeat the process until we are only left with . Since each step takes time, and we perform steps, inference now takes time, which is much better than our naive solution.
Also, at each time, we are eliminating a variable, which gives the algorithm its name.
Having established some intuitions, with a special case, we now introduce the variable elimination algorithm in its general form.
We will assume that we are given a graphical model as a product of factors
Recall that we can view a factor as a multi-dimensional table assigning a value to each assignment of a set of variables . In the context of a Bayesian network, the factors correspond to conditional probability distributions; however, this definition also makes our algorithm equally applicable to Markov Random Fields. In this latter case, the factors encode an unnormalized distribution; to compute marginals, we first calculate the partition function (also using variable elimination), then we compute marginals using the unnormalized distribution, and finally we divide the result by the partition constant to construct a valid marginal probability.
The variable elimination algorithm will repeatedly perform two factor operations: product and marginalization. We have been implicitly performing these operations in our chain example.
The factor product operation simply defines the product of two factors as
The scope of is defined as the union of the variables in the scopes of ; also denotes an assignment to the variables in the scope of defined by the restriction of to that scope. For example, we define .
Next, the marginalization operation “locally” eliminates a set of variables from a factor. If we have a factor over two sets of variables , marginalizing produces a new factor
where the sum is over all joint assignments to the set of variables .
Here, we are marginalizing out variable from factor
We use to refer to the marginalized factor. It is important to understand that this factor does not necessarily correspond to a probability distribution, even if was a CPD.
Finally, the variable elimination algorithm requires an ordering over the variables according to which variables will be “eliminated”. In our chain example, we took the ordering implied by the DAG. It is important note that:
- Different orderings may dramatically alter the running time of the variable elimination algorithm.
- It is NP-hard to find the best ordering.
We will come back to these complications later, but for now let the ordering be fixed.
The variable elimination algorithm
We are now ready to formally define the variable elimination (VE) algorithm. Essentially, we loop over the variables as ordered by and eliminate them in that ordering. Intuitively, this corresponds to choosing a sum and “pushing it in” as far as possible inside the product of the factors, as we did in the chain example.
More formally, for each variable (ordered according to ),
- Multiply all factors containing
- Marginalize out to obtain new factor
- Replace the factors in by
A former CS228 student has created an interactive web simulation for visualizing the variable elimination algorithm. Feel free to play around with it and, if you do, please submit any feedback or bugs through the Feedback button on the web app.
Let’s try to understand what these steps correspond to in our chain example. In that case, the chosen ordering was . Starting with , we collected all the factors involving , which were and . We then used them to construct a new factor . This can be seen as the results of steps 2 and 3 of the VE algorithm: first we form a large factor ; then we eliminate from that factor to produce . Then, we repeat the same procedure for , except that the factors are now .
For a slightly more complex example, recall the graphical model of a student’s grade that we introduced earlier.
Bayes net model of a student’s grade on an exam; in addition to , we also model other aspects of the problem, such as the exam’s difficulty , the student’s intelligence , his SAT score , and the quality of a reference letter from the professor who taught the course. Each variable is binary, except for , which takes 3 possible values. The probability specified by the model is of the form
Let’s suppose that we are computing and are eliminating variables in their topological ordering in the graph. First, we eliminate , which corresponds to creating a new factor . Next, we eliminate to produce a factor ; then we eliminate yielding , and so forth. Note that these operations are equivalent to summing out the factored probability distribution as follows:
Note that this example requires computing at most operations per step, since each factor is at most over 2 variables, and one variable is summed out at each step (the dimensionality in this example is either 2 or 3).
A closely related and equally important problem is computing conditional probabilities of the form
where is a probability distribution, over sets of query variables , observed evidence variables , and unobserved variables .
We can compute this probability by performing variable elimination once on and then once more on .
To compute , we simply take every factor which has scope over variables that are also found in , and we set their values as specified by . Then we perform standard variable elimination over to obtain a factor over only .
Running Time of Variable Elimination
It is very important to understand that the running time of Variable Elimination depends heavily on the structure of the graph.
In the previous example, suppose we eliminated first. Then, we would have had to transform the factors into a big factor over 3 variables, which would require time to compute. If we had a factor , then we would have had to eliminate as well, producing a single giant factor in time. Then, eliminating any variable from this factor would require almost as much work as if we had started with the original distribution, since all the variables have become coupled.
Clearly some orderings are more efficient than others. In fact, the running time of Variable Elimination is , where is the maximum size of any factor during the elimination process and is the number of variables.
Choosing variable elimination orderings
Unfortunately, choosing the optimal VE ordering is an NP-hard problem. However, in practice, we may resort to the following heuristics:
- Min-neighbors: Choose a variable with the fewest dependent variables.
- Min-weight: Choose variables to minimize the product of the cardinalities of its dependent variables.
- Min-fill: Choose vertices to minimize the size of the factor that will be added to the graph.
These methods often result in reasonably good performance in many interesting settings.