A hierarchical Bayesian model is a model in which the prior distribution of some of the model parameters depends on other parameters, which are also assigned a prior.
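Given the observed data $x$, in a hierarchical Bayesian model, the likelihood depends on two parameter vectors $\theta$ and $\phi$,
$$p(x \mid \theta, \phi),$$
and the prior
$$p(\theta, \phi) = p(\theta \mid \phi)\, p(\phi)$$
is specified by separately specifying the conditional distribution $p(\theta \mid \phi)$ and the distribution $p(\phi)$.
In the literature it is often required that the likelihood does not depend on $\phi$, that is,
$$p(x \mid \theta, \phi) = p(x \mid \theta). \tag{1}$$
In this special case, the parameter $\phi$ is called hyper-parameter and the prior $p(\phi)$ is called hyper-prior.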
We use a broader definition of hierarchical model, one that does not necessarily include assumption (1), because it allows for a unified treatment of several interesting models.
The following examples illustrate two popular models that fall within our definition.
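As a first example, suppose the sample $x = (x_1, \dots, x_n)$ is a vector of draws from normal distributions having different unknown means $\mu_1, \dots, \mu_n$ and a known common variance $\sigma^2$:
$$x_i \mid \mu_i \sim N(\mu_i, \sigma^2), \qquad i = 1, \dots, n.$$
Denote by $\mu$ the vector of means:
$$\mu = (\mu_1, \dots, \mu_n).$$
Conditional on $\mu$, the observations are assumed to be independent. As a consequence, the likelihood of the whole sample, conditional on $\mu$, can be written as
$$p(x \mid \mu) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma^2}\right).$$
Now, assume the means $\mu_1, \dots, \mu_n$ are a sample of IID draws from a normal distribution with unknown mean $m$ and known variance $\tau^2$, so that
$$p(\mu \mid m) = \prod_{i=1}^{n} (2\pi\tau^2)^{-1/2} \exp\left(-\frac{(\mu_i - m)^2}{2\tau^2}\right).$$
Finally, we assign a normal prior (with known mean $m_0$ and variance $u^2$) to the hyper-parameter $m$:
$$p(m) = (2\pi u^2)^{-1/2} \exp\left(-\frac{(m - m_0)^2}{2u^2}\right).$$
The model just described is a hierarchical model. With the notation used in the definition, we have $\theta = \mu$, $\phi = m$, and the added assumption that
$$p(x \mid \theta, \phi) = p(x \mid \theta).$$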
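One way to make the layered structure of this first example concrete is to simulate it from the top down: draw the hyper-parameter, then the means given the hyper-parameter, then the data given the means. The sketch below assumes NumPy is available; the function name, the argument names (`m0`, `u2`, `tau2`, `sigma2`) and the numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_hierarchical_normal(n, m0, u2, tau2, sigma2, rng):
    """Draw (m, mu, x) top-down from the three-layer normal model."""
    # Top layer: hyper-parameter m ~ N(m0, u2)
    m = rng.normal(m0, np.sqrt(u2))
    # Middle layer: means mu_i | m ~ N(m, tau2), IID
    mu = rng.normal(m, np.sqrt(tau2), size=n)
    # Bottom layer: data x_i | mu_i ~ N(mu_i, sigma2), independent given mu
    x = rng.normal(mu, np.sqrt(sigma2))
    return m, mu, x

# Hypothetical values, for illustration only
rng = np.random.default_rng(0)
m, mu, x = sample_hierarchical_normal(
    n=5, m0=0.0, u2=1.0, tau2=0.5, sigma2=0.25, rng=rng
)
print("m  =", m)
print("mu =", mu)
print("x  =", x)
```

Note that the data are generated from the means alone: once the means are known, the hyper-parameter plays no further role, which is exactly the added assumption that the likelihood does not depend on the hyper-parameter.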
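As a second example, suppose that the sample $x = (x_1, \dots, x_n)$ is a vector of IID draws from a normal distribution having unknown mean $\mu$ and unknown variance $\sigma^2$.
The likelihood of the whole sample, conditional on $\mu$ and $\sigma^2$, is
$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right).$$
Now, assume that the mean $\mu$ is itself normal with known mean $\mu_0$ and variance $\sigma^2/\nu$, where $\nu$ is a known parameter:
$$\mu \mid \sigma^2 \sim N\left(\mu_0, \frac{\sigma^2}{\nu}\right).$$
As the name of the model indicates, the variance $\sigma^2$ is in turn assigned an inverse-Gamma prior.
This is a very popular model, known as the normal-inverse-Gamma model.
It fits the above definition of a hierarchical model with $\theta = \mu$, $\phi = \sigma^2$. Here the likelihood depends on both $\theta$ and $\phi$, so assumption (1) does not hold.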
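The computation of the posterior distribution is usually performed in steps: first $\phi$ is taken as given, and a conditional distribution for $\theta$ is derived; then a posterior for $\phi$ is computed.
The steps are as follows.
1. Conditional on $\phi$ (i.e., by keeping it fixed), compute:
   - the prior predictive distribution of $x$:
     $$p(x \mid \phi) = \int p(x \mid \theta, \phi)\, p(\theta \mid \phi)\, d\theta$$
   - the posterior distribution of $\theta$:
     $$p(\theta \mid x, \phi) = \frac{p(x \mid \theta, \phi)\, p(\theta \mid \phi)}{p(x \mid \phi)}$$
2. By using $p(x \mid \phi)$ from step 1, compute:
   - the prior predictive distribution of $x$:
     $$p(x) = \int p(x \mid \phi)\, p(\phi)\, d\phi$$
   - the posterior marginal distribution of $\phi$:
     $$p(\phi \mid x) = \frac{p(x \mid \phi)\, p(\phi)}{p(x)}$$
3. Compute the posterior joint distribution of $\theta$ and $\phi$:
   $$p(\theta, \phi \mid x) = p(\theta \mid x, \phi)\, p(\phi \mid x)$$
4. Compute the posterior marginal distribution of $\theta$:
   $$p(\theta \mid x) = \int p(\theta, \phi \mid x)\, d\phi$$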
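As an illustration, the steps can be carried out in closed form for the first example (normal observations with unknown normal means whose common mean has a normal prior), because every distribution involved is Gaussian. The sketch below assumes NumPy; the function name and the argument names are hypothetical, with `sigma2`, `tau2` and `u2` standing for the three known variances, `m0` for the known prior mean, and `m` for the unknown common mean.

```python
import numpy as np

def posterior_example1(x, sigma2, tau2, m0, u2):
    """Closed-form posterior moments for the normal hierarchical example."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # Step 1 (m kept fixed).
    # Prior predictive: x_i | m ~ N(m, sigma2 + tau2), independently.
    s2 = sigma2 + tau2
    # Posterior of each mean: mu_i | x, m ~ N(w*x_i + (1 - w)*m, v).
    v = 1.0 / (1.0 / sigma2 + 1.0 / tau2)
    w = v / sigma2

    # Step 2: combine p(x | m) with the N(m0, u2) prior on m.
    var_m = 1.0 / (1.0 / u2 + n / s2)
    mean_m = var_m * (m0 / u2 + x.sum() / s2)

    # Steps 3-4: integrate m out of p(mu_i | x, m) p(m | x).
    # The result is normal; its moments follow from the laws of
    # total expectation and total variance.
    mean_mu = w * x + (1.0 - w) * mean_m
    var_mu = v + (1.0 - w) ** 2 * var_m
    return mean_m, var_m, mean_mu, var_mu

# Hypothetical data and parameter values, for illustration only
mean_m, var_m, mean_mu, var_mu = posterior_example1(
    x=[1.0, 2.5, -0.3, 0.8], sigma2=0.5, tau2=2.0, m0=1.0, u2=3.0
)
print("posterior mean and variance of m:", mean_m, var_m)
print("posterior means of mu_i:", mean_mu)
print("posterior variance of each mu_i:", var_mu)
```

Each posterior mean of an individual mean is a weighted average of its own observation and the posterior mean of the common mean, so the estimates are pulled towards one another.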
When we are not able to carry out the integrations required to derive the predictive distributions, or when we cannot compute posteriors with Bayes' rule, we can use other computational methods (e.g., the factorization method illustrated in the lecture on Bayesian inference). In these cases, the steps of the above procedure remain valid: we first derive posterior and predictive distributions given $\phi$, by using whatever method is available to us; then, we use the conditional distributions thus derived to compute the posterior of $\phi$.
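As one possible illustration of mixing methods, the sketch below treats the normal-inverse-Gamma example with the variance as the hyper-parameter: step 1 is done analytically (the mean is integrated out given the variance), while step 2 is done numerically on a grid. Grid approximation is a technique swapped in here for illustration; it is not the factorization method mentioned above. The inverse-Gamma prior is assumed to have density proportional to `s2**(-a-1) * exp(-b/s2)`, and the names `a`, `b`, `mu0`, `nu` and all numerical values are hypothetical.

```python
import numpy as np
from math import lgamma

def log_pred_given_var(x, s2, mu0, nu):
    """log p(x | sigma2), with mu ~ N(mu0, sigma2/nu) integrated out."""
    d = np.asarray(x, dtype=float) - mu0
    n = d.size
    # Quadratic form of a normal vector with covariance s2 * (I + 11'/nu)
    q = np.sum(d ** 2) - np.sum(d) ** 2 / (nu + n)
    return (-0.5 * n * np.log(2 * np.pi * s2)
            - 0.5 * np.log(1 + n / nu)
            - q / (2 * s2))

def log_inv_gamma(s2, a, b):
    """Log-density of the assumed inverse-Gamma prior (shape a, scale b)."""
    return a * np.log(b) - lgamma(a) - (a + 1) * np.log(s2) - b / s2

def grid_posterior_var(x, mu0, nu, a, b, grid):
    """Approximate p(sigma2 | x) on an evenly spaced grid of sigma2 values."""
    logp = log_pred_given_var(x, grid, mu0, nu) + log_inv_gamma(grid, a, b)
    weights = np.exp(logp - logp.max())   # subtract the max to avoid overflow
    step = grid[1] - grid[0]
    return weights / (weights.sum() * step)   # normalize to integrate to one

# Hypothetical data and parameter values, for illustration only
x = np.array([2.1, 3.4, 1.2, 4.8, 2.9, 3.3, 0.7, 5.1, 2.2, 3.8])
grid = np.linspace(0.01, 200.0, 200001)
density = grid_posterior_var(x, mu0=3.0, nu=1.0, a=3.0, b=2.0, grid=grid)
post_mean = np.sum(grid * density) * (grid[1] - grid[0])
print("approximate posterior mean of the variance:", post_mean)
```

In this particular model the posterior of the variance is also available in closed form, which makes it a convenient check on the grid approximation; the numerical approach pays off in models where no such closed form exists.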
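In the definition above, there were only two levels: a parameter $\theta$ and a hyper-parameter $\phi$.
The definition can be generalized to more than two levels. For example, we could have a third parameter $\lambda$, the likelihood
$$p(x \mid \theta, \phi, \lambda)$$
and the prior
$$p(\theta, \phi, \lambda) = p(\theta \mid \phi, \lambda)\, p(\phi \mid \lambda)\, p(\lambda),$$
which is specified by separately specifying the conditional distributions $p(\theta \mid \phi, \lambda)$, $p(\phi \mid \lambda)$ and the distribution $p(\lambda)$.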
With more than two levels, the computation strategy is similar to that illustrated in the previous section. First, we take all parameters but one as given, and we derive the prior predictive distribution of $x$, conditional on the parameters that have been kept fixed. Then, we use the predictive distribution thus obtained as likelihood, and we use it to obtain another prior predictive distribution for $x$, conditional on a smaller number of parameters than in the previous step. And so on.
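For three levels, the recursion can be sketched as the following chain of integrals. The symbols are chosen here for illustration: $x$ denotes the data and $\theta$, $\phi$, $\lambda$ the three parameters, ordered from the lowest level to the highest. Each line integrates out one parameter and uses the predictive distribution from the previous line as its likelihood.

```latex
\begin{align*}
p(x \mid \phi, \lambda) &= \int p(x \mid \theta, \phi, \lambda)\, p(\theta \mid \phi, \lambda)\, d\theta \\
p(x \mid \lambda)       &= \int p(x \mid \phi, \lambda)\, p(\phi \mid \lambda)\, d\phi \\
p(x)                    &= \int p(x \mid \lambda)\, p(\lambda)\, d\lambda
\end{align*}
```

The posteriors are then obtained in the reverse order, for instance $p(\lambda \mid x) = p(x \mid \lambda)\, p(\lambda) / p(x)$ for the highest-level parameter.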