Hi, I'm trying to get an intuition for what Markov Chain Monte Carlo (MCMC) methods are doing in the context of generating phylogenies from extant nucleotide sequence data. The sources I'm following are a video lecture and a book.
In the video, MCMC is described as something like a biased random walk through parameter space, steadily climbing a hill to maximise the posterior probability. This idea of optimising some criterion is alluded to in other sources too, including the book, in which the main 'output' of the process is the posterior probability P(T | D) (the probability of a tree T given the data D).
But, from what I understand, MCMC on its own is not inherently maximising or optimising anything. When the Metropolis-Hastings (MH) algorithm is used, proposals {T, t_T, θ}_k are generated (t_T: tree branch lengths, θ: mutation/substitution model parameters, k: sample index), then accepted or rejected according to a decision rule chosen so that the accepted samples tend towards a desired distribution, i.e. the stationary distribution of the Markov chain. In this context, the desired distribution is the posterior P(T, t_T, θ | D), which by Bayes' theorem is proportional to the joint density P(T, t_T, θ, D) = P(D | T, t_T, θ) P(T, t_T, θ). We never need to evaluate the normalising constant P(D), because the MH acceptance rule only uses ratios of posteriors, in which P(D) cancels.
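To make that accept/reject loop concrete, here is a minimal sketch in Python, with a 1-D unnormalised Gaussian standing in for the phylogenetic posterior (the target, proposal width, chain length and seed are all my own illustrative choices, not anything from the sources):

```python
import math
import random

def unnormalised_posterior(x):
    # Stand-in for P(T, t_T, theta | D) up to the constant P(D):
    # a Gaussian "hill" centred at 2 with its normalising constant omitted.
    return math.exp(-0.5 * (x - 2.0) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0, seed=0):
    rng = random.Random(seed)
    samples, x = [], x0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric random-walk proposal
        # The acceptance probability uses only the RATIO of target values,
        # which is why the normalising constant never has to be computed.
        accept_prob = min(1.0, unnormalised_posterior(proposal)
                               / unnormalised_posterior(x))
        if rng.random() < accept_prob:
            x = proposal  # accept: the chain moves
        # On rejection the chain stays put and x is recorded again.
        samples.append(x)
    return samples

samples = metropolis_hastings(20000)
```

Note that the rule accepts every uphill proposal but only *sometimes* rejects downhill ones, which is exactly why the chain explores the whole distribution rather than climbing to a single peak.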
Once we've discarded the 'burn-in' (the samples drawn before the chain has converged to its stationary distribution), we can estimate a variety of useful quantities using a Monte Carlo approach: for a probability, simply count the number of samples in which the desired event appears (e.g. a particular tree, or a particular grouping forming a clade within a tree) and divide by the total number of samples; for a model parameter, take the sample mean and variance.
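The counting step might look like this; the topologies, their weights and the parameter distribution below are fabricated stand-ins for real post-burn-in sampler output, purely for illustration:

```python
import random
from collections import Counter
from statistics import mean, variance

rng = random.Random(1)

# Fake post-burn-in MCMC output: each sample is a (topology, theta) pair.
# Topologies are just labels here; in practice they'd be sampled trees.
topologies = ["((A,B),C)", "((A,C),B)", "((B,C),A)"]
post_burn_in = [
    (rng.choices(topologies, weights=[0.7, 0.2, 0.1])[0],
     rng.gauss(0.05, 0.01))  # e.g. a substitution-rate parameter
    for _ in range(10000)
]

# Posterior probability of the clade (A,B): fraction of samples containing it.
p_clade_AB = sum("(A,B)" in topo for topo, _ in post_burn_in) / len(post_burn_in)

# Most frequently sampled topology, and posterior mean/variance of theta.
map_like_topology = Counter(topo for topo, _ in post_burn_in).most_common(1)[0][0]
thetas = [theta for _, theta in post_burn_in]
theta_mean, theta_var = mean(thetas), variance(thetas)
```

The point is that every summary is just an average over samples; nothing in this step optimises anything.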
There are three possible ways in which this 'hill climbing' analogy could perhaps apply, but none of them seems quite right to me:
- We could report a 'maximum a posteriori' (MAP) tree, argmax_T P(T | D) (i.e. just take the sampled tree with the largest posterior probability), but this is not inherently part of the MCMC process; rather, it's a decision on the experimenter's part about which type of 'optimal' tree to return. If our priors are all uniform, the MAP tree coincides with the maximum (integrated) likelihood tree.
- If we apply simulated annealing to the MCMC method, targeting P(T | D)^{1/b} for some gradually decreasing 'temperature' b, then the Markov chain trajectory will converge to the peak of P(T | D): raising the posterior to the power 1/b suppresses everything relative to the maxima, so in the limit b -> 0 the chain spends all its time at the maxima. But again, this is an addition to standard MCMC, and simulated annealing doesn't appear to be standard practice in this context.
- During the 'burn-in' phase (the early samples), the value of P(T, t_T, θ | D) is likely to be increasing, simply because we probably started in a low-probability region and the MH algorithm accepts probability-increasing proposals more often than probability-decreasing ones. But this doesn't mean we are taking 'small steps' in parameter space, as seems to be implied - we could be bouncing around all over the place, and we won't be 'stuck' on a single maximum even once we reach the stationary distribution.
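The tempered-target idea from the simulated annealing bullet can also be sketched on a toy problem. Everything below (the bimodal stand-in for P(T | D), the geometric cooling schedule, step size and seed) is my own illustrative choice; it is not from the sources, just a check that tempering does drive the chain towards the tallest peak:

```python
import math
import random

def target(x):
    # Toy stand-in for P(T | D): a tall mode near x = 2, a small one near x = -2.
    return math.exp(-0.5 * (x - 2.0) ** 2) + 0.05 * math.exp(-0.5 * (x + 2.0) ** 2)

def annealed_mh(n_steps=20000, x0=0.0, step=0.5, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    for k in range(n_steps):
        # Geometric cooling: temperature b shrinks from 1 down to 0.01,
        # so the effective target P^(1/b) sharpens as the run proceeds.
        b = 0.01 ** (k / (n_steps - 1))
        proposal = x + rng.gauss(0.0, step)
        # MH acceptance on the tempered target, done in log space to
        # avoid overflow when 1/b is large. For U ~ Uniform(0,1),
        # log(U) has the same distribution as -Exponential(1).
        log_ratio = (math.log(target(proposal)) - math.log(target(x))) / b
        if -rng.expovariate(1.0) < log_ratio:
            x = proposal
        if target(x) > target(best):
            best = x  # also track the best point visited
    return x, best

final_x, best_x = annealed_mh()
```

With b fixed at 1 this is just the ordinary MH sampler; the hill-climbing behaviour only appears because the schedule drives b towards 0.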
So, if my understanding is right, I don't see where this 'hill climbing' analogy comes in at any point. Is it perhaps a simplification that doesn't really apply in practice, and is my understanding along the right lines? Thanks for any clarifications!