Metropolis-Hastings
Recap of Markov chains
In the last lecture, we learned that if a Markov chain is irreducible and aperiodic, then the Markov chain will converge to its unique stationary distribution, regardless of the initial state. Mathematically, $\lim_{n \to \infty} P^n = \Pi$, a matrix where every row is equal to some vector $\pi$.
Recall that $P$ is a stochastic matrix. $P$ has a left eigenvector $\pi$ with eigenvalue 1, and $P$ has no other eigenvalues with modulus 1. Let $v_1 = \pi, v_2, \ldots, v_n$ be a basis of left eigenvectors for $P$, where each $v_i$ satisfies $v_i P = \lambda_i v_i$. When we start with an initial distribution $\mu_0 = \sum_i c_i v_i$, then $\mu_0 P^t = \sum_i c_i \lambda_i^t v_i$, and $\lambda_i^t \to 0$ for the eigenvalues corresponding to all eigenvectors except $v_1 = \pi$. The speed at which we have convergence is determined by the size of the second-largest eigenvalue modulus $|\lambda_2|$.
Our proof of the theorem last time was based on the fact that every time we apply $P$, we decrease the gap between the smallest and largest value of the state vector, and hence the state vector converges to a constant vector. Note that if $\delta = \min_{x, y} P(x, y)$ is larger, then the Markov chain will converge more quickly. By the definition of $\delta$, the Markov chain will converge more quickly if there is no pair of states with a very low probability of transitioning between them.
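To make the eigenvalue picture concrete, here is a minimal sketch in NumPy. The 3-state matrix $P$ below is a made-up example, not one from lecture: the sketch extracts $\pi$ as the left eigenvector with eigenvalue 1 and reports the second-largest eigenvalue modulus that controls the convergence speed.

```python
import numpy as np

# A small, irreducible, aperiodic stochastic matrix (rows sum to 1).
# This particular P is an arbitrary example for illustration.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Left eigenvectors of P are right eigenvectors of P^T.
eigvals, eigvecs = np.linalg.eig(P.T)
order = np.argsort(-np.abs(eigvals))        # sort by modulus, largest first
v = np.real(eigvecs[:, order[0]])           # eigenvector for eigenvalue 1
pi = v / v.sum()                            # normalize into a distribution
print("stationary pi:", pi)
print("second-largest |eigenvalue|:", np.abs(eigvals[order[1]]))

# Any initial distribution converges to pi under repeated application of P.
mu = np.array([1.0, 0.0, 0.0])
for _ in range(50):
    mu = mu @ P
print("mu after 50 steps:", mu)             # approximately equals pi
```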
Markov Chain Monte Carlo (MCMC)
Our goal in Markov Chain Monte Carlo (MCMC) is to sample from a probability distribution $\pi$. We want to construct a Markov chain that reaches the limiting distribution $\pi$ as fast as possible. The main question is how to design a transition matrix $P$ so that $\pi$ is a left eigenvector of $P$ with eigenvalue 1.
Note that in the high-dimensional case, we cannot even store all of the entries of $\pi$. We will only assume we can evaluate an unnormalized density $\tilde{\pi}(x) \propto \pi(x)$ for any state $x$. This also means that we do not need to calculate the normalization constant $Z = \sum_x \tilde{\pi}(x)$.
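As a quick illustration of why the normalization constant is never needed, here is a tiny sketch (the Gaussian-shaped `tilde_pi` is an assumed toy target): any ratio $\pi(y)/\pi(x)$ can be computed from unnormalized evaluations alone, because $Z$ cancels.

```python
import numpy as np

def tilde_pi(x):
    # Unnormalized density: equal to pi(x) times an unknown constant Z.
    return np.exp(-0.5 * (x - 1.0) ** 2 / 4.0)

# The ratio pi(y) / pi(x) equals tilde_pi(y) / tilde_pi(x): Z cancels.
x, y = 0.0, 2.0
print(tilde_pi(y) / tilde_pi(x))
```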
The key assumption we will make is that the Markov chain is reversible. A Markov chain is reversible if there exists a distribution $\pi$ which satisfies the detailed balance conditions: $\pi(x) P(x, y) = \pi(y) P(y, x)$ for all states $x, y$.
Theorem: If a Markov chain is reversible with respect to a distribution $\pi$, then $\pi$ is a stationary distribution.
Proof: For any state $y$, we have
$$\sum_x \pi(x) P(x, y) = \sum_x \pi(y) P(y, x) = \pi(y) \sum_x P(y, x) = \pi(y).$$
Therefore, $\pi P = \pi$.
Since we want the stationary distribution of the Markov chain to be $\pi$, it suffices to design the transition matrix $P$ so the Markov chain satisfies detailed balance with respect to $\pi$.
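The proof is easy to check numerically. Below is a small sketch (the reversible 3-state chain is constructed by hand purely for illustration): detailed balance holds entry-wise, and summing over rows recovers $\pi P = \pi$.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
# Build a reversible P from a symmetric "flow" matrix F(x, y) = pi(x) P(x, y);
# each row of F sums to pi(x), so the resulting P is stochastic.
F = np.array([[0.10, 0.04, 0.06],
              [0.04, 0.20, 0.06],
              [0.06, 0.06, 0.38]])
P = F / pi[:, None]                       # P(x, y) = F(x, y) / pi(x)

assert np.allclose(P.sum(axis=1), 1.0)    # rows sum to 1
# Detailed balance: pi(x) P(x, y) = pi(y) P(y, x) for all x, y.
assert np.allclose(pi[:, None] * P, (pi[:, None] * P).T)
print(np.allclose(pi @ P, pi))            # True: pi is stationary
```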
Metropolis-Hastings
Metropolis-Hastings is an MCMC method for sampling from a probability distribution $\pi$ by using a proposal distribution to propose moves and then accepting or rejecting the proposed moves between states with some probability.
First, let $Q$ be any proposal distribution, where $Q(x, y)$ is the probability of proposing a move to some state $y$ given the current state $x$. Then we will construct the transition matrix as
$$P(x, y) = Q(x, y) A(x, y) \quad \text{for } x \neq y, \qquad P(x, x) = 1 - \sum_{y \neq x} P(x, y),$$
where
$$A(x, y) = \min\left(1, \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)}\right)$$
is the acceptance probability for accepting a proposed move from state $x$ to state $y$. Note that while we cannot evaluate $\pi$ exactly, we can evaluate ratios $\pi(y)/\pi(x) = \tilde{\pi}(y)/\tilde{\pi}(x)$, so $A$ can be computed without the normalization constant.
We want to show that $P$ satisfies detailed balance $\pi(x) P(x, y) = \pi(y) P(y, x)$ for all $x \neq y$. By the definition of $A$, without loss of generality, assume that $\pi(y) Q(y, x) \leq \pi(x) Q(x, y)$, so that $A(x, y) = \frac{\pi(y) Q(y, x)}{\pi(x) Q(x, y)}$ and $A(y, x) = 1$. Then
$$\pi(x) P(x, y) = \pi(x) Q(x, y) A(x, y) = \pi(y) Q(y, x) = \pi(y) Q(y, x) A(y, x) = \pi(y) P(y, x).$$
Therefore, $P$ satisfies detailed balance and $\pi$ is a stationary distribution.
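For a small discrete state space we can build the whole Metropolis-Hastings matrix explicitly and verify the theorem. A sketch (the target $\pi$ and the uniform proposal $Q$ below are arbitrary choices for illustration):

```python
import numpy as np

pi = np.array([0.1, 0.2, 0.7])            # target distribution
Q = np.full((3, 3), 1.0 / 3.0)            # uniform proposal

# Acceptance probabilities A(x, y) = min(1, pi(y) Q(y, x) / (pi(x) Q(x, y))).
A = np.minimum(1.0, (pi[None, :] * Q.T) / (pi[:, None] * Q))

P = Q * A                                 # off-diagonal entries P(x, y)
np.fill_diagonal(P, 0.0)
np.fill_diagonal(P, 1.0 - P.sum(axis=1))  # rejected moves stay at x

print(np.allclose(pi @ P, pi))            # True: pi P = pi
print(np.allclose(pi[:, None] * P, (pi[:, None] * P).T))  # detailed balance
```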
Now we just need to ensure that the Markov chain is irreducible and aperiodic. This depends on our choice of the proposal distribution $Q$ and the target probability distribution $\pi$, since $P(x, y) = Q(x, y) A(x, y)$ defines the probability of transitioning from $x$ to $y$.
The intuition behind MH is that sampling from the Markov chain requires spending the right amount of time in each state so that our samples are accurate draws from $\pi$. We need to balance the transitions between higher-probability and lower-probability states under $\pi$ against the tendency of the proposal distribution to move to certain states. Given $Q$ and $\pi$, MH sets the acceptance probability $A$ so that $P$ satisfies detailed balance.
Metropolis-Hastings algorithm
The procedure for the Metropolis-Hastings algorithm is as follows:
- Initialize the starting state $x_0$ and set $t = 0$
- Draw a sample $x' \sim Q(x_t, \cdot)$
- Accept the move with probability $A(x_t, x')$, where $A(x_t, x') = \min\left(1, \frac{\pi(x') Q(x', x_t)}{\pi(x_t) Q(x_t, x')}\right)$. If accepted, let $x_{t+1} = x'$. Otherwise, let $x_{t+1} = x_t$.
- Repeat steps 2-3 to draw samples $x_1, x_2, \ldots$ (a runnable sketch of this loop follows below)
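Here is a minimal sketch of the loop above in NumPy. The bimodal unnormalized target and the Gaussian random-walk proposal are assumptions for illustration; since this proposal is symmetric, the $Q$ ratio in $A$ cancels (the original Metropolis special case), and working with log densities avoids numerical underflow.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_tilde_pi(x):
    # Log of an unnormalized bimodal target: a mixture of Gaussians
    # centered at -2 and +2 (an assumed toy example).
    return np.logaddexp(-0.5 * (x + 2.0) ** 2, -0.5 * (x - 2.0) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0):
    samples = np.empty(n_samples)
    x = x0
    for t in range(n_samples):
        x_prop = x + step * rng.standard_normal()   # draw x' ~ Q(x_t, .)
        # Symmetric proposal, so A = min(1, pi(x') / pi(x_t)).
        log_accept = log_tilde_pi(x_prop) - log_tilde_pi(x)
        if np.log(rng.random()) < log_accept:       # accept w.p. A
            x = x_prop                              # x_{t+1} = x'
        samples[t] = x                              # else x_{t+1} = x_t
    return samples

samples = metropolis_hastings(50_000)
print(samples.mean(), samples.std())  # roughly 0 and roughly sqrt(5) = 2.24
```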
In practice, we run a burn-in period of $B$ iterations and start collecting samples only after the burn-in period, hoping that the Markov chain has converged by then. However, determining when the Markov chain has converged is a hard problem. One heuristic is to randomly initialize several Markov chains, plot some scalar function of the state of each Markov chain over time, and check whether the scalar traces look similar. Note that this does not always guarantee convergence. For example, with a bimodal distribution, chains stuck near the same mode may produce similar traces without ever exploring the other mode, unlike the case of a singly peaked distribution.
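A sketch of the multiple-chains heuristic, reusing the `metropolis_hastings` sampler defined above: start chains from scattered initializations and compare a scalar summary of each trace (here, the running mean).

```python
import numpy as np

# Run three chains from dispersed starting points (values chosen arbitrarily).
chains = [metropolis_hastings(20_000, x0=x0) for x0 in (-10.0, 0.0, 10.0)]
for i, chain in enumerate(chains):
    running_mean = np.cumsum(chain) / np.arange(1, len(chain) + 1)
    print(f"chain {i}: final running mean = {running_mean[-1]:.3f}")
# If the chains have mixed, the running means should roughly agree; for a
# bimodal target, chains trapped in different modes would disagree.
```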
The samples drawn from the Markov chain are not i.i.d. In general, samples drawn close to each other in time can be highly correlated, since Metropolis-Hastings moves tend to be local. Asymptotically, however, the samples drawn from the Markov chain are unbiased and all come from the right distribution; the variance of estimates computed from them just does not decrease as fast as it would with independent samples.
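One way to quantify this loss relative to independent samples is an effective sample size estimate. A sketch, continuing from the `samples` drawn above (the truncation at the first negative autocorrelation is a common heuristic, not the only choice):

```python
import numpy as np

def autocorr(x, max_lag=200):
    # Empirical lag-k autocorrelations of a 1-D chain.
    x = x - x.mean()
    return np.array([np.dot(x[:len(x) - k], x[k:]) / np.dot(x, x)
                     for k in range(max_lag + 1)])

acf = autocorr(samples)
# Truncate the sum at the first negative autocorrelation.
cutoff = np.argmax(acf < 0) if (acf < 0).any() else len(acf)
n_eff = len(samples) / (1.0 + 2.0 * acf[1:cutoff].sum())
print(f"n = {len(samples)}, effective n = {n_eff:.0f}")
```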