In practice, the probabilistic models that we use are often quite complex, and simple algorithms like variable elimination may be too slow for them. In fact, many interesting classes of models may not admit exact polynomial-time solutions at all, and for this reason, much research effort in machine learning is spent on developing algorithms that yield approximate solutions to the inference problem. This section begins our study of such algorithms.
There exist two main families of approximate algorithms: variational methodsVariational inference methods take their name from the calculus of variations, which deals with optimizing functions that take other functions as arguments. , which formulate inference as an optimization problem, as well as sampling methods, which produce answers by repeatedly generating random numbers from a distribution of interest.
Sampling methods can be used to perform both marginal and MAP inference queries; in addition, they can compute various interesting quantities, such as expectations of random variables distributed according to a given probabilistic model. Sampling methods have historically been the main way of performing approximate inference, although over the past 15 years variational methods have emerged as viable (and often superior) alternatives.
Sampling from a probability distribution
As a warm-up, let’s think for a minute how we might sample from a multinomial distribution with possible outcomes and associated probabilities .
Sampling, in general, is not an easy problem. Our computers can only generate samples from very simple distributionsEven those samples are not truly random. They are actually taken from a deterministic sequence whose statistical properties (e.g. running averages) are indistinguishable from a truly random one. We call such sequences pseudorandom. , such as the uniform distribution over . All sampling techniques involve calling some kind of simple subroutine multiple times in a clever way.
In our case, we may reduce sampling from a multinomial variable to sampling a single uniform variable by subdividing a unit interval into regions with region having size . We then sample uniformly from and return the value of the region in which our sample falls.
Sampling from directed graphical models
Bayes net model describing the performance of a student on an exam. The distribution can be represented a product of conditional probability distributions specified by tables.
Our technique for sampling from multinomials naturally extends to Bayesian networks with multinomial variables, via a method called ancestral (or forward) sampling. Given a probability specified by a Bayes net, we sample variables in topological order. In other words, we start by sampling the variables with no parents; then we sample from the next generation by conditioning these variables’ CPDs to values sampled at the first step. We proceed like this until the variables have been sampled.
In our earlier model of a student’s grade, we would first sample an exam difficulty and an intelligence level . Then, once we have samples, we generate a student grade from . At each step, we simply perform standard multinomial sampling.
A former CS228 student has created an interactive web simulation for visualizing Bayesian network forward sampling methods. Feel free to play around with it and, if you do, please submit any feedback or bugs through the Feedback button on the web app.
Monte Carlo estimation
Sampling from a distribution lets us perform many useful tasks, including marginal and MAP inference, as well as computing integrals of the form
If does not have a special structure that matches the Bayes net structure of , this integral will be impossible to perform analytically; instead, we will approximate it using a large number of samples from . Algorithms that construct solutions based on a large number of samples from a given distribution are referred to as Monte Carlo (MC) methodsThe name Monte Carlo refers to a famous casino in the city of Monaco. The term was originally coined as a codeword by physicists working on the atomic bomb as part of the secret Manhattan project. .
Monte Carlo integration is an important instantiation of the general Monte Carlo principle. This technique approximates a target expectation with
where are samples drawn according to .
It is easy to show that the expected value of , the MC estimate, equals the true integral. We say that is an unbiased estimator for . Moreover as as . Finally, we can show that the variance of equals , which can be made arbitrarily small with .
Graphical illustration of rejection sampling. We may compute the area of circle by drawing uniform samples from the square; the fraction of points that fall in the circle represents its area. This method breaks down if the size of the circle is small relative to the size of the square. A special case of Monte Carlo integration is rejection sampling. We may use it to compute the area of a region by sampling in a larger region with a known area and recording the fraction of samples that falls within .
For example, we may use rejection sampling to compute marginal probabilities of the form : we may write this probability as and then take the Monte Carlo approximation. This will amount to sampling many samples from and keeping ones that are consistent with the value of the marginal.
Unfortunately, this procedure is very wasteful. If equals, say, 1%, then we will discard 99% of all samples.
A better way of computing such integrals is via an approach called importance sampling. The main idea is to sample from a distribution (hopefully roughly proportional to ), and then reweight the samples in a principled way, so that their sum still approximates the desired integral.
More formally, suppose we are interested in computing . We may rewrite this integral as
where . In other words, we may instead take samples from and reweigh them with ; the expected value of this Monte Carlo approximation will be the original integral.
Now the variance of this new estimator equals
Note that we can set the variance to zero by choosing ; this means that if we can sample from this (and evaluate the corresponding weight), all the Monte Carlo samples will be equal and correspond to the true value of our integral. Of course, sampling from such a would be NP-hard in general, but this at least gives us an indication for what to strive for.
In the context of our previous example for computing , we may take to be the uniform distribution and apply importance sampling as follows:
Unlike rejection sampling, this will use all the examples; if is not too far from uniform, this will converge to the true probability after only a very small number of samples.
Normalized importance sampling
Unfortunately, unnormalized importance sampling is not suitable for estimating conditional probabilities.
Because we would estimate the numerator using one approximation and the denominator using another approximation, the errors in the approximations may compound. For example, if the numerator is an under-estimate and the denominator is an over-estimate, the final probability could be a severe under-estimate.
To avoid this issue, we first generate samples to approximate the denominator . The numerator can then be estimated by counting the weighted number of samples with values . Since the numerator and denominator are both approximated using the same samples, the potential issue of errors compounding is avoided.
The final form of normalized importance sampling is thus
Markov chain Monte Carlo
Let us now turn our attention from computing expectations to performing marginal and MAP inference using sampling. We will solve these problems using a very powerful technique called Markov chain Monte CarloMarkov chain Monte Carlo is another algorithm that was developed during the Manhattan project and eventually republished in the scientific literature some decades later. It is so important, that is was recently named as one of the 10 most important algorithms of the XXth century. (MCMC).
A key concept in MCMC is that of a Markov chain. A (discrete-time) Markov chain is a sequence of random variables with each random variable taking one of possible values, intuitively representing the state of a system. The initial state is distributed according to a probability ; all subsequent states are generated from a conditional probability distribution that depends only on the previous random state, i.e. is distributed according to .
The probability is the same at every step ; this means that the transition probabilities at any time in the entire process depend only on the given state and not on the history of how we got there. This is called the Markov assumption.
A Markov chain over three states. The weighted directed edges indicate probabilities of transitioning to a different state. It is very convenient to represent the transition probability as a matrix
If the initial state is drawn from a vector probabilities , we may represent the probability of ending up in each state after steps as
where denotes matrix exponentiation (we apply the matrix operator times).
The limit (when it exists) is called a stationary distribution of the Markov chain. We will construct below Markov chain with a stationary distribution that exists and is the same for all ; we will refer to such as the stationary distribution* of the chain.
A sufficent condition for a stationary distribution is called detailed balance:
It is easy to show that such a must form a stationary distribution (just sum both sides of the equation over and simplify). However, the reverse may not hold and indeed it is possible to have MCMC without satisfying detailed balance.
Existence of a stationary distribution
The high-level idea of MCMC will be to construct a Markov chain whose states will be joint assignments to the variables in the model and whose stationary distribution will equal the model probability .
In order to construct such a chain, we first need to understand when stationary distributions exist. This turns out to be true under two sufficient conditions:
- Irreducibility: It is possible to get from any state to any other state with probability > 0 in a finite number of steps.
- Aperiodicity: It is possible to return to any state at any time, i.e. there exists an such that for all and all , .
The first condition is meant to prevent absorbing states, i.e. states from which we can never leave. In the example below, if we start in states , we will never reach state 4. Conversely, if we start in state 4, then we will never reach states 1,2. If we start the chain in the middle (in state 3), then clearly it cannot have a single limiting distribution.
The second condition is necessary to rule out transition operators such as
Note that this chain alternates forever between states 1 and 2 without ever settling in a stationary distribution.
Fact: An irreducible and aperiodic finite-state Markov chain has a stationary distribution.
In the context of continuous variables, the Markov chain must be ergodic, which is slightly stronger condition than the above (and which requires irreducibility and aperiodicity). For the sake of generality, we will simply require our Markov Chain to be ergodic.
Markov Chain Monte Carlo
As we said, the idea of MCMC algorithms is to construct a Markov chain over the assignments to a probability function ; the chain will have a stationary distribution equal to itself; by running the chain for some number of time, we will thus sample from .
At a high level, MCMC algorithms will have the following structure. They take as argument a transition operator specifying a Markov chain whose stationary distribution is , and an initial assignment to the variables of . An MCMC algorithm then perform the following steps.
- Run the Markov chain from for burn-in steps.
- Run the Markov chain from for sampling steps and collect all the states that it visits.
Assuming is sufficiently large, the latter collection of states will form samples from . We may then use these samples for Monte Carlo integration (or in importance sampling). We may also use them to produce Monte Carlo estimates of marginal probabilities. Finally, we may take the sample with the highest probability and use it as an estimate of the mode (i.e. perform MAP inference).
The Metropolis-Hastings (MH) algorithm is our first way to construct Markov chains within MCMC. The MH method constructs a transition operator from two components:
- A transition kernel , specified by the user
- An acceptance probability for moves proposed by , specified by the algorithm as
At each step of the Markov chain, we choose a new point according to . Then, we either accept this proposed change (with probability ), or with probability we remain at our current state.
Notice that the acceptance probability encourages us to move towards more likely points in the distribution (imagine for example that is uniform); when suggests that we move into a low-probability region, we follow that move only a certain fraction of the time.
In practice, the distribution is taken to be something simple, like a Gaussian centered at if we are dealing with continuous variables.
Given any the MH algorithm will ensure that will be a stationary distribution of the resulting Markov Chain. More precisely, will satisfy the detailed balance condition with respect to the MH Markov chain.
To see that, first observe that if , then and thus . When , this lets us write:
which is simply the detailed balance condition. We used to denote the full transition operator of MH (obtained by applying both and ). Thus, if the MH Markov chain is ergodic, its stationary distribution will be .
A widely-used special case of the Metropolis-Hastings methods is Gibbs sampling. Given an ordered set of variables and a starting configuration , consider the following procedure.
Repeat until convergence for :
- Set .
- For each variable in the order we fixed:
We use to denote all variables in except . It is often very easy to performing each sampling step, since we only need to condition on its Markov blanket, which is typically small. Note that when we update , we immediately use its new value for sampling other variables .
Gibbs sampling can be seen as a special case of MH with proposal It is easy check that the acceptance probability simplifies to one.
Assuming the right transition operator, both Gibbs sampling and MH will eventually produce samples from their stationary distribution, which by construction is .
There exist simple ways of ensuring that this will be the case
- To ensure irreducibility, the transition operator with MH should be able to potentially move to every state. In the case of Gibbs sampling, we would like to make sure that every can get sampled from .
- To ensure aperiodicity, it is enough to let the chain transition stay in its state with some probability.
In practice, it is not difficult to ensure these requirements are met.
Running time of MCMC
A key parameter to this algorithm in the number of burn-in steps . Intuitively, this corresponds to the number of steps needed to converge to our limit (stationary) distribution. This is called the mixing time of the Markov chainThere is a technical definition of this quantity, which we will not cover here. . Unfortunately, this time may vary dramatically, and may sometimes take essentially forever. For example, if the transition matrix is
then for small it will take a very long time to reach the stationary distribution, which is close to . At each step, we will stay in the same state with overwhelming probability; very rarely, we will transition to another state, and then stay there for a very long time. The average of these states will converge to , but the convergence will be very slow.
This problem will also occur with complicated distributions that have two distinct and narrow modes; with high probability, the algorithm will sample from a given mode for a very long time. These examples are indications that sampling is a hard problem in general, and MCMC does not give us a free lunch. Nonetheless, for many real-world distributions, sampling will produce very useful solutions.
Another, perhaps more important problem, is that we may not know when to end the burn-in period, even if it is theoretically not too long. There exist many heuristics to determine whether a Markov chain has mixed; however, typically these heuristics involve plotting certain quantities and estimating them by eye; even the quantitative measures are not significantly more reliable than this approach.
In summary, even though MCMC is able to sample from the right distribution (which in turn can be used to solve any inference problem), doing so may sometimes require a very long time, and there is no easy way to judge the amount of computation that we need to spend to find a good solution.