In practice, the probabilistic models that we use are often quite complex, and simple algorithms like variable elimination may be too slow for them. In fact, many interesting classes of models may not admit exact polynomial-time solutions at all, and for this reason, much research effort in machine learning is spent on developing algorithms that yield approximate solutions to the inference problem. This section begins our study of such algorithms.
There exist two main families of approximate algorithms: variational methodsVariational inference methods take their name from the calculus of variations, which deals with optimizing functions that take other functions as arguments., which formulate inference as an optimization problem, as well as sampling methods, which produce answers by repeatedly generating random numbers from a distribution of interest.
Sampling methods can be used to perform both marginal and MAP inference queries; in addition, they can compute various interesting quantities, such as expectations of random variables distributed according to a given probabilistic model. Sampling methods have historically been the main way of performing approximate inference, although over the past 15 years variational methods have emerged as viable (and often superior) alternatives.
Sampling from a probability distribution
As a warm-up, let’s think for a minute how we might sample from a multinomial distribution with possible outcomes and associated probabilities .
Sampling, in general, is not an easy problem. Our computers can only generate samples from very simple distributionsEven those samples are not truly random. They are actually taken from a deterministic sequence whose statistical properties (e.g. running averages) are indistinguishable from a truly random one. We call such sequences pseudorandom., such as the uniform distribution over . All sampling techniques involve calling some kind of simple subroutine multiple times in a clever way.
In our case, we may reduce sampling from a multinomial variable to sampling a single uniform variable by subdividing a unit interval into regions with region having size . We then sample uniformly from and return the value of the region in which our sample falls.
Bayes net model describing the performance of a student on an exam. The distribution can be represented a product of conditional probability distributions specified by tables.
Our technique for sampling from multinomials naturally extends to Bayesian networks with multinomial variables, via a method called ancestral (or forward) sampling. Given a probability specified by a Bayes net, we sample variables in topological order. We start by sampling the variables with no parents; then we sample from the next generation by conditioning these variables’ CPDs to values sampled at the first step. We proceed like this until all variables have been sampled. Importantly, in a Bayesian network over variables, forward sampling allows us to sample from the joint distribution in linear time by taking exactly 1 multinomial sample from each CPD.
In our earlier model of a student’s grade, we would first sample an exam difficulty and an intelligence level . Then, once we have samples and , we generate a student grade from . At each step, we simply perform standard multinomial sampling.
A former CS228 student has created an interactive web simulation for visualizing Bayesian network forward sampling methods. Feel free to play around with it and, if you do, please submit any feedback or bugs through the Feedback button on the web app.
“Forward sampling” can also be performed efficiently on undirected models if the model can be represented by a clique tree with a small number of variables per node. Calibrate the clique tree, which gives us the marginal distribution over each node, and choose a node to be the root. Then, marginalize over variables in the root node to get the marginal for a single variable. Once the marginal for a single variable has been sampled from the root node, the newly sampled value can be incorporated as evidence. Finish sampling other variables from the same node, each time incorporating the newly sampled nodes as evidence, i.e. and and so on. When moving down the tree to sample variables from other nodes, each node must send an updated message containing the values of the sampled variables.
Monte Carlo estimation
Sampling from a distribution lets us perform many useful tasks, including marginal and MAP inference, as well as computing integrals of the form
If does not have special structure that matches the Bayes net structure of , this integral will be impossible to perform analytically; instead, we will approximate it using a large number of samples from . Algorithms that construct solutions based on a large number of samples from a given distribution are referred to as Monte Carlo (MC) methodsThe name Monte Carlo refers to a famous casino in the city of Monaco. The term was originally coined as a codeword by physicists working on the atomic bomb as part of the secret Manhattan project..
Monte Carlo integration is an important instantiation of the general Monte Carlo principle. This technique approximates a target expectation with
where are samples drawn according to . It can be shown that
The first equation says that the MC estimate is an unbiased estimator for . The two equations together show that as ; in particular, the variance of can be made arbitrarily small with enough samples.
Graphical illustration of rejection sampling. We may compute the area of circle by drawing uniform samples from the square; the fraction of points that fall in the circle represents its area. This method breaks down if the size of the circle is small relative to the size of the square. A special case of Monte Carlo integration is rejection sampling. We may use it to compute the area of a region by sampling in a larger region with a known area and recording the fraction of samples that falls within .
For example, suppose we have a Bayesian network over the set of variables . We may use rejection sampling to compute marginal probabilities . We can rewrite the probability as
and then take the Monte Carlo approximation. This will amount to sampling many samples from and keeping ones that are consistent with the value of the marginal.
Unfortunately, rejection sampling can be very wasteful. If equals, say, 1%, then we will discard 99% of all samples.
A better way of computing such integrals is via an approach called importance sampling. The main idea is to sample from a distribution (hopefully roughly proportional to ), and then reweigh the samples in a principled way, so that their sum still approximates the desired integral.
More formally, suppose we are interested in computing . We may rewrite this integral as
where and the samples are drawn from . In other words, we may instead take samples from and reweigh them with ; the expected value of this Monte Carlo approximation will be the original integral.
The variance of this new estimator is
Note that we can set the variance to zero by choosing ; this means that if we can sample from this (and evaluate the corresponding weight), all the Monte Carlo samples will be equal and correspond to the true value of our integral. Of course, sampling from such a would be NP-hard in general, but this at least gives us an indication for what to strive for.
In the context of our previous example for computing , we may take to be the uniform distribution and apply importance sampling as follows:
where . Unlike rejection sampling, this will use all the examples; if is not too far from uniform, this will converge to the true probability after only a very small number of samples.
Normalized importance sampling
Unfortunately, unnormalized importance sampling is not suitable for estimating conditional probabilities of the form
Note that using unnormalized importance sampling, we could estimate the numerator as
where . The denominator is the same as the result we derived earlier:
If we estimate the numerator and the denominator with different and independent samples of , then the errors in the two approximations may compound. For example, if the numerator is an under-estimate and the denominator is an over-estimate, the final probability could be a severe under-estimate.
However, if we use the same set of samples for both the numerator and denominator, we avoid this issue of compounding errors. Thus, the final form of normalized importance sampling is
Unfortunately, there is one drawback to the normalized importance sampling estimator, which is that it is biased. If , then we have
Fortunately, because the numerator and denominator are both unbiased, the normalized importance sampling estimator remains asymptotically unbiased, meaning that
Markov chain Monte Carlo
Let us now turn our attention from computing expectations to performing marginal and MAP inference using sampling. We will solve these problems using a very powerful technique called Markov chain Monte CarloMarkov chain Monte Carlo is another algorithm that was developed during the Manhattan project and eventually republished in the scientific literature some decades later. It is so important, that is was recently named as one of the 10 most important algorithms of the XXth century. (MCMC).
A key concept in MCMC is that of a Markov chain. A (discrete-time) Markov chain is a sequence of random variables with each random variable taking one of possible values, intuitively representing the state of a system. The initial state is distributed according to a probability ; all subsequent states are generated from a conditional probability distribution that depends only on the previous random state, i.e. is distributed according to .
The probability is the same at every step ; this means that the transition probabilities at any time in the entire process depend only on the given state and not on the history of how we got there. This is called the Markov assumption.
A Markov chain over three states. The weighted directed edges indicate probabilities of transitioning to a different state. It is very convenient to represent the transition probability as a matrix
If the initial state is drawn from a vector probabilities , we may represent the probability of ending up in each state after steps as
where denotes matrix exponentiation (apply the matrix operator times).
The limit (when it exists) is called a stationary distribution of the Markov chain. We will construct below Markov chain with a stationary distribution that exists and is the same for all ; we will refer to such as the stationary distribution of the chain.
A sufficient condition for a stationary distribution is called detailed balance:
It is easy to show that such a must form a stationary distribution (just sum both sides of the equation over and simplify). However, the reverse may not hold and indeed it is possible to have MCMC without satisfying detailed balance.
Existence of a stationary distribution
The high-level idea of MCMC will be to construct a Markov chain whose states will be joint assignments to the variables in the model and whose stationary distribution will equal the model probability .
In order to construct such a chain, we first need to understand when stationary distributions exist. This turns out to be true under two sufficient conditions:
- Irreducibility: It is possible to get from any state to any other state with probability > 0 in a finite number of steps.
- Aperiodicity: It is possible to return to any state at any time, i.e. there exists an such that for all and all , .
The first condition is meant to prevent absorbing states, i.e. states from which we can never leave. In the example below, if we start in states , we will never reach state 4. Conversely, if we start in state 4, then we will never reach states 1,2. If we start the chain in the middle (in state 3), then clearly it cannot have a single limiting distribution.
The second condition is necessary to rule out transition operators such as
Note that this chain alternates forever between states 1 and 2 without ever settling in a stationary distribution.
Fact: An irreducible and aperiodic finite-state Markov chain has a stationary distribution.
In the context of continuous variables, the Markov chain must be ergodic, which is slightly stronger condition than the above (and which requires irreducibility and aperiodicity). For the sake of generality, we will simply require our Markov Chain to be ergodic.
Markov Chain Monte Carlo
As we said, the idea of MCMC algorithms is to construct a Markov chain over the assignments to a probability function ; the chain will have a stationary distribution equal to itself; by running the chain for some number of time, we will thus sample from .
At a high level, MCMC algorithms will have the following structure. They take as argument a transition operator specifying a Markov chain whose stationary distribution is , and an initial assignment to the variables of . An MCMC algorithm then perform the following steps.
- Run the Markov chain from for burn-in steps.
- Run the Markov chain from for sampling steps and collect all the states that it visits.
Assuming is sufficiently large, the latter collection of states will form samples from . We may then use these samples for Monte Carlo integration (or in importance sampling). We may also use them to produce Monte Carlo estimates of marginal probabilities. Finally, we may take the sample with the highest probability and use it as an estimate of the mode (i.e. perform MAP inference).
The Metropolis-Hastings (MH) algorithm is our first way to construct Markov chains within MCMC. The MH method constructs a transition operator from two components:
- A transition kernel , specified by the user
- An acceptance probability for moves proposed by , specified by the algorithm as
At each step of the Markov chain, we choose a new point according to . Then, we either accept this proposed change (with probability ), or with probability we remain at our current state.
Notice that the acceptance probability encourages us to move towards more likely points in the distribution (imagine for example that is uniform); when suggests that we move into a low-probability region, we follow that move only a certain fraction of the time.
In practice, the distribution is taken to be something simple, like a Gaussian centered at if we are dealing with continuous variables.
Given any the MH algorithm will ensure that will be a stationary distribution of the resulting Markov Chain. More precisely, will satisfy the detailed balance condition with respect to the MH Markov chain.
To see that, first observe that if , then and thus . When , this lets us write:
which is simply the detailed balance condition. We used to denote the full transition operator of MH (obtained by applying both and ). Thus, if the MH Markov chain is ergodic, its stationary distribution will be .
A widely-used special case of the Metropolis-Hastings methods is Gibbs sampling. Given an ordered set of variables and a starting configuration , consider the following procedure.
Repeat until convergence for :
- Set .
- For each variable in the order we fixed:
We use to denote all variables in except . It is often very easy to performing each sampling step, since we only need to condition on its Markov blanket, which is typically small. Note that when we update , we immediately use its new value for sampling other variables .
Gibbs sampling can be seen as a special case of MH with proposal It is easy check that the acceptance probability simplifies to one.
Assuming the right transition operator, both Gibbs sampling and MH will eventually produce samples from their stationary distribution, which by construction is .
There exist simple ways of ensuring that this will be the case
- To ensure irreducibility, the transition operator with MH should be able to potentially move to every state. In the case of Gibbs sampling, we would like to make sure that every can get sampled from .
- To ensure aperiodicity, it is enough to let the chain transition stay in its state with some probability.
In practice, it is not difficult to ensure these requirements are met.
Running time of MCMC
A key parameter to this algorithm in the number of burn-in steps . Intuitively, this corresponds to the number of steps needed to converge to our limit (stationary) distribution. This is called the mixing time of the Markov chainThere is a technical definition of this quantity, which we will not cover here.. Unfortunately, this time may vary dramatically, and may sometimes take essentially forever. For example, if the transition matrix is
then for small it will take a very long time to reach the stationary distribution, which is close to . At each step, we will stay in the same state with overwhelming probability; very rarely, we will transition to another state, and then stay there for a very long time. The average of these states will converge to , but the convergence will be very slow.
This problem will also occur with complicated distributions that have two distinct and narrow modes; with high probability, the algorithm will sample from a given mode for a very long time. These examples are indications that sampling is a hard problem in general, and MCMC does not give us a free lunch. Nonetheless, for many real-world distributions, sampling will produce very useful solutions.
Another, perhaps more important problem, is that we may not know when to end the burn-in period, even if it is theoretically not too long. There exist many heuristics to determine whether a Markov chain has mixed; however, typically these heuristics involve plotting certain quantities and estimating them by eye; even the quantitative measures are not significantly more reliable than this approach.
In summary, even though MCMC is able to sample from the right distribution (which in turn can be used to solve any inference problem), doing so may sometimes require a very long time, and there is no easy way to judge the amount of computation that we need to spend to find a good solution.