
Markov Chain Monte Carlo (MCMC) methods

by Marco Taboga, PhD

Markov Chain Monte Carlo (MCMC) methods are very powerful Monte Carlo methods that are often used in Bayesian inference.

While "classical" Monte Carlo methods rely on computer-generated samples made up of independent observations, MCMC methods are based on techniques that allow us to generate sequences of dependent observations (these sequences are Markov chains, hence the name of the methods).

Throughout this lecture we are going to assume that you are familiar with the basics of Markov chains and Monte Carlo methods.


How MCMC works

The purpose of any Monte Carlo method is to approximate some feature (e.g., the mean) of a given probability distribution. This is accomplished by using a computer-generated sample of draws from the given distribution to compute a plug-in estimate of the feature to be approximated. In particular, suppose we are given a random vector $X$ with joint distribution function $F_{X}$, and we want to approximate a feature $T(F_{X})$ of the distribution $F_{X}$.

In a Markov Chain Monte Carlo method, we generate through a computer algorithm a sample $x_{1},\ldots,x_{n}$ of realizations of $n$ random variables $X_{1},\ldots,X_{n}$.

The algorithm is devised in such a way that the sequence $X_{1},X_{2},\ldots$ is a Markov chain converging to the stationary distribution $F_{X}$.

The sample is used, as in standard Monte Carlo methods, to produce a plug-in estimate $T(F_{n})$ of the feature $T(F_{X})$, where $F_{n}$ is the empirical distribution of the sample (i.e., a probability distribution that assigns probability $1/n$ to each one of the values $x_{1},\ldots,x_{n}$).

Unlike in standard Monte Carlo methods, the variables $X_{1},\ldots,X_{n}$ are in general not independent. We need to take this fact into account when we assess the convergence of $T(F_{n})$ to $T(F_{X})$. For example, if $T(F_{X})$ is the expected value of $X$, and $T(F_{n})$ is the sample mean $\frac{1}{n}\sum_{i=1}^{n}x_{i}$, we are usually able to show that an ergodic theorem (a Law of Large Numbers for dependent observations) applies to the sample mean, so that it converges to the expected value of $X$.
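To illustrate this with a toy example (not part of the original lecture), consider a Gaussian AR(1) chain: its draws are dependent, but its stationary distribution is known to be standard normal, so an ergodic theorem makes the plug-in sample mean converge to the true mean of zero. A minimal Python sketch:

```python
import math
import random

def ar1_chain(n, rho=0.8, seed=0):
    """Generate n draws from a Gaussian AR(1) chain.

    Consecutive draws are dependent, but the chain's stationary
    distribution is the standard normal N(0, 1)."""
    rng = random.Random(seed)
    x = 0.0
    draws = []
    for _ in range(n):
        x = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        draws.append(x)
    return draws

draws = ar1_chain(100_000)
# Plug-in estimate of the mean: despite the dependence between
# draws, the sample mean converges to E[X] = 0.
mean_estimate = sum(draws) / len(draws)
```

The `rho` parameter controls how strongly consecutive draws are correlated; the stronger the dependence, the more draws are needed for a given accuracy.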


Two popular examples of MCMC algorithms are provided below:

Example 1 - Metropolis-Hastings

One of the most popular MCMC algorithms is the Metropolis-Hastings (M-H) algorithm.

Denote by $f(x)$ the density (or mass) function of the target distribution, that is, the distribution from which we wish to extract a sequence of draws (e.g., $f$ could be the posterior density in Bayesian inference).

Denote by $q(x'\mid x)$ a conditional distribution from which we are able to extract samples of IID draws ($x$, $x'$ and the argument of the target distribution all have the same dimension).

The M-H algorithm starts from any value $x_{1}$ belonging to the support of the target distribution and generates the subsequent values $x_{i}$ as follows:

  1. generate a proposal $y_{i}$ from the proposal distribution $q(y_{i}\mid x_{i-1})$;

  2. compute the acceptance probability $p_{i}=\min\left(1,\dfrac{f(y_{i})\,q(x_{i-1}\mid y_{i})}{f(x_{i-1})\,q(y_{i}\mid x_{i-1})}\right)$;

  3. draw a uniform random variable $u_{i}$ (on $[0,1]$);

  4. if $u_{i}\leq p_{i}$, accept the proposal and set $x_{i}=y_{i}$; otherwise, reject the proposal and set $x_{i}=x_{i-1}$.
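The four steps above can be sketched in Python. This is a minimal illustration, assuming a symmetric Gaussian random-walk proposal (so the proposal densities $q$ cancel from the acceptance ratio) and a hypothetical target `h` known only up to a multiplicative constant:

```python
import math
import random

def metropolis_hastings(h, n, x0=0.0, scale=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler.

    h: target density, known only up to a multiplicative constant.
    Returns n draws forming a Markov chain whose stationary
    distribution has density proportional to h."""
    rng = random.Random(seed)
    x = x0
    draws = []
    for _ in range(n):
        y = x + rng.gauss(0.0, scale)   # 1. generate a proposal
        p = min(1.0, h(y) / h(x))       # 2. acceptance probability
                                        #    (symmetric proposal: q cancels)
        u = rng.random()                # 3. uniform draw on [0, 1]
        if u <= p:                      # 4. accept ...
            x = y                       #    ... or keep the previous value
        draws.append(x)
    return draws

# Unnormalized standard-normal density: the constant 1/sqrt(2*pi)
# is deliberately omitted -- M-H does not need it.
h = lambda x: math.exp(-0.5 * x * x)
draws = metropolis_hastings(h, 50_000)
```

Note that when a proposal is rejected, the previous value is recorded again; repeated values are a normal feature of an M-H chain, not a bug.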

It can be proved that, provided some technical conditions are met, the sequence $x_{1},x_{2},\ldots$ thus generated is the realization of a Markov chain that converges to the stationary distribution $f(x)$. Furthermore, $\frac{1}{n}\sum_{i=1}^{n}g(x_{i})\overset{a.s.}{\longrightarrow}\mathrm{E}\left[g(X)\right]$, where $g$ is any function such that the expected value $\mathrm{E}\left[g(X)\right]$ exists and is finite, and $\overset{a.s.}{\longrightarrow}$ denotes almost sure convergence as $n$ tends to infinity. In other words, a Strong Law of Large Numbers (an ergodic theorem) holds for the sample mean $\frac{1}{n}\sum_{i=1}^{n}g(x_{i})$.

The power of this algorithm lies in the fact that you need to know the function $f$ only up to a multiplicative constant. For example, in Bayesian inference it is very common to know the posterior distribution only up to a multiplicative constant, because the likelihood and the prior are known but the marginal distribution is not. Suppose $f(x)=c\,h(x)$ and we know $h(x)$ but not $c$. Then the acceptance probability in the M-H algorithm is $p_{i}=\min\left(1,\dfrac{h(y_{i})\,q(x_{i-1}\mid y_{i})}{h(x_{i-1})\,q(y_{i}\mid x_{i-1})}\right)$.

As a consequence, the acceptance probability, which is the only quantity that depends on $f$, can be computed without knowing the constant $c$. This is the beauty of the Metropolis-Hastings algorithm: we can generate draws from a distribution even if we do not fully know the density of that distribution!

For more details see the lecture on the Metropolis-Hastings algorithm.

Example 2 - Gibbs sampling

Another popular MCMC algorithm is the so-called Gibbs sampling algorithm.

Suppose you want to generate draws of a random vector $x_{\bullet}$ having joint density $f(x_{\bullet,1},\ldots,x_{\bullet,B})$, where $x_{\bullet,1},\ldots,x_{\bullet,B}$ are $B$ blocks of entries (or single entries) of $x_{\bullet}$.

Given a block $x_{\bullet,b}$, denote by $x_{\bullet,-b}$ the vector comprising all blocks except $x_{\bullet,b}$.

Suppose you are able to generate draws from the $B$ conditional distributions $f(x_{\bullet,1}\mid x_{\bullet,-1})$, ..., $f(x_{\bullet,B}\mid x_{\bullet,-B})$. In MCMC jargon, these are called the full conditional distributions.

The Gibbs sampling algorithm starts from any vector $x_{1}$ belonging to the support of the target distribution and generates the subsequent values $x_{i}$ as follows:

  1. for $b=1,\ldots,B$, generate $x_{i,b}$ from the conditional distribution with density $f(x_{\bullet,b}\mid x_{i,1},\ldots,x_{i,b-1},x_{i-1,b+1},\ldots,x_{i-1,B})$.

In other words, at each iteration, the blocks are extracted one by one from their full conditional distributions, conditioning on the latest draws of all the other blocks.

Note that at iteration i, when you extract the $b$-th block, the latest draws of blocks 1 to $b-1$ are those already extracted in iteration i, while the latest draws of blocks $b+1$ to $B$ are those extracted in the previous iteration ($i-1$).
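The iteration described above can be sketched in Python for a simple target chosen here for illustration: a bivariate normal with zero means, unit variances and correlation `rho` ($B=2$ blocks), whose full conditionals are known in closed form:

```python
import random

def gibbs_bivariate_normal(n, rho=0.5, seed=0):
    """Gibbs sampler for a bivariate normal with zero means,
    unit variances and correlation rho (B = 2 blocks).

    Full conditionals: x1 | x2 ~ N(rho * x2, 1 - rho^2)
                       x2 | x1 ~ N(rho * x1, 1 - rho^2)."""
    rng = random.Random(seed)
    sd = (1 - rho ** 2) ** 0.5
    x1, x2 = 0.0, 0.0
    draws = []
    for _ in range(n):
        # Block 1: condition on the latest draw of block 2 (iteration i-1).
        x1 = rng.gauss(rho * x2, sd)
        # Block 2: condition on the draw of block 1 just made (iteration i).
        x2 = rng.gauss(rho * x1, sd)
        draws.append((x1, x2))
    return draws

draws = gibbs_bivariate_normal(50_000)
```

Note how the sampler never evaluates the joint density itself: each iteration only needs draws from the full conditionals, exactly as the algorithm prescribes.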

It can be proved that Gibbs sampling is a special case of Metropolis-Hastings. Therefore, as in M-H, the sequence $x_{1},x_{2},\ldots$ generated by the algorithm is the realization of a Markov chain that converges to the stationary distribution $f(x)$. Furthermore, an ergodic theorem for the sample means $\frac{1}{n}\sum_{i=1}^{n}g(x_{i})$ holds.

Burn-in sample

A common practice is to discard the first draws of an MCMC sample. For example, if we have 110,000 draws, we discard the first 10,000 and keep the remaining 100,000.

The set of draws that are discarded is called the burn-in sample.

Why do we do this? If the starting value $x_{1}$ is extracted from a distribution that is very different from the target distribution $F_{X}$, then the distributions of the subsequent draws $x_{2}$, $x_{3}$, ... will also be very different from $F_{X}$. However, the differences become smaller and smaller as the chain converges to the target distribution. By discarding a burn-in sample, we eliminate draws whose distributions are far from the target distribution and keep draws whose distributions are closer to it. This reduces the bias of any Monte Carlo approximation performed with the MCMC sample.
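The effect can be sketched with a toy chain (a Gaussian AR(1) process used here as an illustrative stand-in for an MCMC sampler) deliberately started far from its stationary distribution:

```python
import math
import random

def chain_from(x0, n, rho=0.95, seed=0):
    """AR(1) chain with stationary distribution N(0, 1), started at x0."""
    rng = random.Random(seed)
    x, draws = x0, []
    for _ in range(n):
        x = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0.0, 1.0)
        draws.append(x)
    return draws

# Badly chosen starting value, far from the stationary mean of 0.
draws = chain_from(x0=1000.0, n=110_000)

burn_in = 10_000
naive = sum(draws) / len(draws)   # biased upward by the early draws
kept = draws[burn_in:]            # discard the burn-in sample
better = sum(kept) / len(kept)    # much closer to the true mean of 0
```

Discarding the first 10,000 draws removes the transient in which the chain is still travelling from the starting value toward the target distribution, which is exactly the bias-reduction argument above.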

How to cite

Please cite as:

Taboga, Marco (2021). "Markov Chain Monte Carlo (MCMC) methods", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
