We begin with the topic of representation: how do we choose a probability distribution to model some interesting aspect of the world? Coming up with a good model is not always easy: we have seen in the introduction that a naive model for spam classification would require us to specify a number of parameters that is exponential in the number of words in the English language!
In this chapter, we will learn about one way to avoid these kinds of complications. We are going to:
- Learn an effective and general technique for parameterizing probability distributions using only a few parameters.
- See how the resulting models can be elegantly described via directed acyclic graphs (DAGs).
- Study connections between the structure of a DAG and the modeling assumptions made by the distribution that it describes; this will not only make these modeling assumptions more explicit, but will also help us design more efficient inference algorithms.
The kinds of models that we will see here are referred to as Bayesian networks. In the next chapter, we will also see a second approach, which involves undirected graphs, also known as Markov random fields (MRFs). Bayesian networks effectively show causality, whereas MRFs cannot. Thus, MRFs are preferable for problems where there is no clear causality between random variables.
Probabilistic modeling with Bayesian networks
Directed graphical models (a.k.a. Bayesian networks) are a family of probability distributions that admit a compact parametrization that can be naturally described using a directed graph.
The general idea behind this parametrization is surprisingly simple.
Recall that by the chain rule, we can write any probability as:
A compact Bayesian network is a distribution in which each factor on the right hand side depends only on a small number of ancestor variables :
For example, in a model with five variables, we may choose to approximate the factor with . In this case, we write .
When the variables are discrete (which will often be the case in the problem we will consider), we may think of the factors as probability tables, in which rows correspond to assignments to and columns correspond to values of ; the entries contain the actual probabilities . If each variable takes values and has at most ancestors, then the entire table will contain at most entries. Since we have one table per variable, the entire probability distribution can be compactly described with only parameters (compared to with a naive approach).
Distributions of this form can be naturally expressed as directed acyclic graphs, in which vertices correspond to variables and edges indicate dependency relationships. In particular we set the parents of each node to its ancestors .
As an example, consider a model of a student’s grade on an exam. This grade depends on the exam’s difficulty and the student’s intelligence ; it also affects the quality of the reference letter from the professor who taught the course. The student’s intelligence affects the SAT score as well. Each variable is binary, except for , which takes 3 possible values.
Bayes net model describing the performance of a student on an exam. The distribution can be represented a product of conditional probability distributions specified by tables. The form of these distributions is described by edges in the graph. The joint probability distribution over the 5 variables naturally factorizes as follows:
The graphical representation of this distribution is a DAG that visually specifies how random variables depend on each other. The graph clearly indicates that the letter depends on the grade, which in turn depends on the student’s intelligence and the difficulty of the exam.
Another way to interpret directed graphs is in terms of stories for how the data was generated. In the above example, to determine the quality of the reference letter, we may first sample an intelligence level and an exam difficulty; then, a student’s grade is sampled given these parameters; finally, the recommendation letter is generated based on that grade.
In the previous spam classification example, we implicitly postulated that email is generated according to a two-step process: first, we choose a spam/non-spam label ; then we sample independently whether each word is present, conditioned on that label.
Formally, a Bayesian network is a directed graph together with
- A random variable for each node .
- One conditional probability distribution (CPD) per node, specifying the probability of conditioned on its parents’ values.
Thus, a Bayesian network defines a probability distribution . Conversely, we say that a probability factorizes over a DAG if it can be decomposed into a product of factors, as specified by .
It is not hard to see that a probability represented by a Bayesian network will be valid: clearly, it will be non-negative and one can show using an induction argument (and using the fact that the CPDs are valid probabilities) that the sum over all variable assignments will be one. Conversely, we can also show by counter-example that when contains cycles, its associated probability may not sum to one.
The dependencies of a Bayes net
To summarize, Bayesian networks represent probability distributions that can be formed via products of smaller, local conditional probability distributions (one for each variable). By expressing a probability in this form, we are introducing into our model assumptions that certain variables are independent.
This raises the question: which independence assumptions are we exactly making by using a Bayesian network model with a given structure described by ? This question is important for two reasons: we should know precisely what model assumptions we are making (and whether they are correct); also, this information will help us design more efficient inference algorithms later on.
Let us use the notation to denote the set of all independencies that hold for a joint distribution . For example, if , then we say that .
Independencies described by directed graphs
It turns out that a Bayesian network very elegantly describes many independencies in ; these independencies can be recovered from the graph by looking at three types of structures.
For simplicity, let’s start by looking at a Bayes net with three nodes: , , and . In this case, essentially has only three possible structures, each of which leads to different independence assumptions. The interested reader can easily prove these results using a bit of algebra.
Bayesian networks over three variables, encoding different types of dependencies: cascade (a,b), common parent (c), and v-structure (d). Common parent. If is of the form , and is observed, then . However, if is unobserved, then . Intuitively this stems from the fact that contains all the information that determines the outcomes of and ; once it is observed, there is nothing else that affects these variables’ outcomes.
- Cascade: If equals , and is again observed, then, again . However, if is unobserved, then . Here, the intuition is again that holds all the information that determines the outcome of ; thus, it does not matter what value takes.
- V-structure (also known as explaining away): If is , then knowing couples and . In other words, if is unobserved, but if is observed.
The latter case requires additional explanation. Suppose that is a Boolean variable that indicates whether our lawn is wet one morning; and are two explanations for it being wet: either it rained (indicated by ), or the sprinkler turned on (indicated by ). If we know that the grass is wet ( is true) and the sprinkler didn’t go on ( is false), then the probability that is true must be one, because that is the only other possible explanation. Hence, and are not independent given .
These structures clearly describe the independencies encoded by a three-variable Bayesian net. We can extend them to general networks by applying them recursively over any larger graph. This leads to a notion called -separation (where stands for directed).
Let , , and be three sets of nodes in a Bayesian Network . We say that and are -separated given (i.e. the variables are observed) if and are not connected by an active path. An undirected path in is called active given observed variables if for every consecutive triple of variables on the path, one of the following holds:
- , and is unobserved
- , and is unobserved
- , and is unobserved
- , and or any of its descendants are observed.
In this example, and are -separated given .
However, are not -separated given . There is an active pass which passed through the V-structure created when is observed.
For example, in the graph below, and are -separated given . However, are not -separated given , because we can find an active path
A former CS228 student has created an interactive web simulation for testing -separation. Feel free to play around with it and, if you do, please submit any feedback or bugs through the Feedback button on the web app.
The notion of -separation is useful, because it lets us describe a large fraction of the dependencies that hold in our model. Let be a set of variables that are -separated in .
FactWe will not formally prove this, but the intuition is that if and are mutually dependent, so are . Thus we can look at adjacent nodes and propagate dependencies according to the local dependency structures outlined above.: If factorizes over , then . In this case, we say that is an -map (independence map) for .
In other words, all the independencies encoded in are sound: variables that are -separated in are truly independent in . However, the converse is not true: a distribution may factorize over , yet have independencies that are not captured in .
In a way this is almost a trivial statement. If , then this distribution still factorizes over the graph , since we can always write it as with a CPD in which the probability of does not actually vary with . However, we can construct a graph that matches the structure of by simply removing that unnecessary edge.
The representational power of directed graphs
This raises our last and perhaps most important question: can directed graphs express all the independencies of any distribution ? More formally, given a distribution , can we construct a graph such that ?
First, note that it is easy to construct a such that . A fully connected DAG is an -map for any distribution since .
A fully connected Bayesian network over four variables. There are no independencies in this model, and it is an -map for any distribution.
A more interesting question is whether we can find a minimal -map for , i.e. an -map such that the removal of even a single edge from will result in it no longer being an -map. This is quite easy: we may start with a fully connected and remove edges until is no longer an -map. One way to do this is by following the natural topological ordering of the graph, and removing node ancestors until this is no longer possible; we will revisit this pruning method towards the end of course when performing structure learning.
However, what we are truly interested in determining is whether the probability distribution admits a perfect map for which . Unfortunately, the answer is no. For example, consider the following distribution over three variables : we sample from a Bernoulli distribution, and we set (we call this the noisy-xor example). One can check using some algebra but . Thus, is an -map for , but none of the 3-node graph structures that we discussed perfectly describes , and hence this distribution doesn’t have a perfect map.
A related question is whether perfect maps are unique when they exist. Again, this is not the case, as and encode the same independencies, yet form different graphs. More generally, we say that two Bayes nets are -equivalent if they encode the same dependencies .
When are two Bayesian nets -equivalent? To answer this, let’s return to a simple example with three variables. Each of the graphs below have the same skeleton, meaning that if we drop the directionality of the arrows, we obtain the same undirected graph in each case.
The cascade-type structures (a,b) are clearly symmetric and the directionality of arrows does not matter. In fact, (a,b,c) encode exactly the same dependencies. We can change the directions of the arrows as long as we don’t turn them into a V-structure (d). When we do have a V-structure, however, we cannot change any arrows: structure (d) is the only one that describes the dependency . These examples provide intuition for the following general result on -equivalence.
Fact: If have the same skeleton and the same v-structures, then
Again, it is easy to understand intuitively why this is true. Two graphs are -equivalent if the -separation between variables is the same. We can flip the directionality of any edge, unless it forms a v-structure, and the -connectivity of the graph will be unchanged. We refer the reader to the textbook of Koller and Friedman for a full proof.