This lecture shows how to apply the basic principles of Bayesian inference to the problem of estimating the parameters (mean and variance) of a normal distribution.
The observed sample used to carry out inferences is a vector $x=(x_1,\dots,x_n)$ whose entries are independent and identically distributed draws from a normal distribution.
In this section, we are going to assume that the mean $\mu$ of the distribution is unknown, while its variance $\sigma^2$ is known.
In the next section, the variance $\sigma^2$ will also be treated as unknown.
The probability density function of a generic draw is
$$p(x_i\mid\mu)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right),$$
where we use the notation $p(x_i\mid\mu)$ to highlight the fact that the density depends on the unknown parameter $\mu$.
Since $x_1,\dots,x_n$ are independent, the likelihood is
$$p(x\mid\mu)=\prod_{i=1}^n p(x_i\mid\mu)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right).$$
The prior is
$$p(\mu)=\frac{1}{\sqrt{2\pi\tau_0^2}}\exp\left(-\frac{(\mu-\mu_0)^2}{2\tau_0^2}\right),$$
that is, $\mu$ has a normal distribution with mean $\mu_0$ and variance $\tau_0^2$.
This prior is used to express the statistician's belief that the unknown parameter $\mu$ is most likely equal to $\mu_0$ and that values of $\mu$ very far from $\mu_0$ are quite unlikely (how unlikely depends on the variance $\tau_0^2$).
Given the prior and the likelihood specified above, the posterior is
$$p(\mu\mid x)=\frac{1}{\sqrt{2\pi\tau_n^2}}\exp\left(-\frac{(\mu-\mu_n)^2}{2\tau_n^2}\right),$$
where
$$\tau_n^2=\left(\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}\right)^{-1},\qquad
\mu_n=\tau_n^2\left(\frac{\mu_0}{\tau_0^2}+\frac{n\bar{x}}{\sigma^2}\right),\qquad
\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i.$$
Write the joint distribution as
$$p(x,\mu)=p(x\mid\mu)\,p(\mu).$$
Note that
$$\sum_{i=1}^n(x_i-\mu)^2
\overset{A}{=}\sum_{i=1}^n\left[(x_i-\bar{x})+(\bar{x}-\mu)\right]^2
\overset{B}{=}\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu)^2,$$
where in step $A$ we added and subtracted the sample mean $\bar{x}$ and in step $B$ we used the fact that $\sum_{i=1}^n(x_i-\bar{x})=0$, so that the cross term vanishes. We can use this result to write the likelihood as
$$p(x\mid\mu)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\bar{x})^2\right)\exp\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right).$$
Multiplying the likelihood by the prior and completing the square in $\mu$, we obtain
$$\frac{n(\bar{x}-\mu)^2}{\sigma^2}+\frac{(\mu-\mu_0)^2}{\tau_0^2}
=\frac{(\mu-\mu_n)^2}{\tau_n^2}+\frac{(\bar{x}-\mu_0)^2}{\sigma^2/n+\tau_0^2},$$
where we have defined
$$\tau_n^2=\left(\frac{1}{\tau_0^2}+\frac{n}{\sigma^2}\right)^{-1},\qquad
\mu_n=\tau_n^2\left(\frac{\mu_0}{\tau_0^2}+\frac{n\bar{x}}{\sigma^2}\right).$$
We can put together the results obtained so far and get
$$p(x,\mu)=f(x)\,g(\mu\mid x),$$
where $f$ is a function that depends on $x$ but not on $\mu$, and $g(\mu\mid x)$ is a probability density function if considered as a function of $\mu$ for any given $x$ (note that $g$ depends on $x$ through $\mu_n$). In fact, $g(\mu\mid x)$ is the density of a normal distribution with mean $\mu_n$ and variance $\tau_n^2$. By a standard result on the factorization of probability density functions (see also the introduction to Bayesian inference), we have that
$$p(\mu\mid x)=g(\mu\mid x),\qquad p(x)=f(x).$$
Therefore, the posterior distribution is a normal distribution with mean $\mu_n$ and variance $\tau_n^2$. We have yet to figure out what $f(x)$ is. This will be done in the next proof.
Thus, the posterior distribution of $\mu$ is a normal distribution with mean $\mu_n$ and variance $\tau_n^2$.
Note that the posterior mean $\mu_n$ is a weighted average of two signals:
the sample mean $\bar{x}$ of the observed data;
the prior mean $\mu_0$.
Both the prior and the sample mean convey some information (a signal) about $\mu$. The signals are combined linearly, but more weight is given to the signal that has higher precision (smaller variance):
$$\mu_n=\frac{n/\sigma^2}{n/\sigma^2+1/\tau_0^2}\,\bar{x}+\frac{1/\tau_0^2}{n/\sigma^2+1/\tau_0^2}\,\mu_0.$$
The greater the precision of a signal, the higher its weight is.
The weight given to the sample mean increases with the sample size $n$, while the weight given to the prior mean does not. As a consequence, when the sample size becomes large, more and more weight is given to the sample mean. In the limit, all weight is given to the information coming from the sample and no weight is given to the prior.
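As a numeric illustration, the posterior update and the behavior of the weights can be sketched in Python (the function name and the parameter values below are ours, not part of the lecture):

```python
import numpy as np

def posterior_mu(x, sigma2, mu0, tau02):
    """Posterior of mu for N(mu, sigma2) data with known sigma2
    and a N(mu0, tau02) prior on mu. Returns (mu_n, tau_n2)."""
    n = len(x)
    xbar = np.mean(x)
    tau_n2 = 1.0 / (1.0 / tau02 + n / sigma2)          # posterior variance
    mu_n = tau_n2 * (mu0 / tau02 + n * xbar / sigma2)  # posterior mean
    return mu_n, tau_n2

rng = np.random.default_rng(0)
mu_true, sigma2 = 3.0, 4.0   # hypothetical data-generating values
mu0, tau02 = 0.0, 1.0        # hypothetical prior hyperparameters

for n in (5, 50, 5000):
    x = rng.normal(mu_true, np.sqrt(sigma2), size=n)
    mu_n, tau_n2 = posterior_mu(x, sigma2, mu0, tau02)
    # weight on the sample mean grows toward 1 as n increases
    w = (n / sigma2) / (n / sigma2 + 1 / tau02)
    print(n, round(w, 3), round(mu_n, 3), round(tau_n2, 5))
```

As $n$ grows, the printed weight approaches 1 and the posterior mean approaches the sample mean, matching the limit described above.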
The prior predictive distribution is
$$x\sim N\left(\mu_0\mathbf{1}_n,\;\sigma^2 I_n+\tau_0^2\mathbf{1}_n\mathbf{1}_n^\top\right),$$
where $\mathbf{1}_n$ is an $n\times 1$ vector of ones, and $I_n$ is the $n\times n$ identity matrix.
From the previous proof we know that
$$p(x)=f(x),$$
where $f(x)$ collects all the factors of the joint density $p(x,\mu)$ that do not depend on $\mu$:
$$f(x)=(2\pi\sigma^2)^{-n/2}(2\pi\tau_0^2)^{-1/2}(2\pi\tau_n^2)^{1/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\bar{x})^2-\frac{(\bar{x}-\mu_0)^2}{2(\sigma^2/n+\tau_0^2)}\right).$$
By defining $\Sigma=\sigma^2 I_n+\tau_0^2\mathbf{1}_n\mathbf{1}_n^\top$, we can verify that $f(x)$ is the density of a multivariate normal distribution with mean $\mu_0\mathbf{1}_n$ and covariance matrix $\Sigma$. First, by the matrix determinant lemma,
$$\det(\Sigma)=(\sigma^2)^n\left(1+\frac{n\tau_0^2}{\sigma^2}\right),$$
and, since $\tau_n^2/\tau_0^2=\left(1+n\tau_0^2/\sigma^2\right)^{-1}$, the constant in front of the exponential in $f(x)$ equals $(2\pi)^{-n/2}\det(\Sigma)^{-1/2}$. Second, by the Sherman–Morrison formula,
$$\Sigma^{-1}=\frac{1}{\sigma^2}\left(I_n-\frac{\tau_0^2}{\sigma^2+n\tau_0^2}\mathbf{1}_n\mathbf{1}_n^\top\right),$$
so that
$$(x-\mu_0\mathbf{1}_n)^\top\Sigma^{-1}(x-\mu_0\mathbf{1}_n)
=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\mu_0)^2-\frac{n^2\tau_0^2(\bar{x}-\mu_0)^2}{\sigma^2(\sigma^2+n\tau_0^2)}
=\frac{1}{\sigma^2}\sum_{i=1}^n(x_i-\bar{x})^2+\frac{(\bar{x}-\mu_0)^2}{\sigma^2/n+\tau_0^2},$$
where in the last step we have used the decomposition $\sum_{i=1}^n(x_i-\mu_0)^2=\sum_{i=1}^n(x_i-\bar{x})^2+n(\bar{x}-\mu_0)^2$. Now, putting together all the pieces, we have
$$f(x)=(2\pi)^{-n/2}\det(\Sigma)^{-1/2}\exp\left(-\frac{1}{2}(x-\mu_0\mathbf{1}_n)^\top\Sigma^{-1}(x-\mu_0\mathbf{1}_n)\right).$$
Thus, the prior predictive distribution of $x$ is multivariate normal with mean $\mu_0\mathbf{1}_n$ and covariance matrix $\sigma^2 I_n+\tau_0^2\mathbf{1}_n\mathbf{1}_n^\top$.
Under this distribution, a draw $x_i$ has prior mean $\mu_0$, variance $\sigma^2+\tau_0^2$ and covariance with the other draws equal to $\tau_0^2$. The covariance is positive because the draws $x_1,\dots,x_n$, despite being independent conditional on $\mu$, all share the same mean parameter $\mu$, which is random.
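The prior predictive density can be checked numerically against direct marginalization of $\mu$. The following Python sketch (parameter values are hypothetical) builds the covariance matrix $\sigma^2 I_n+\tau_0^2\mathbf{1}_n\mathbf{1}_n^\top$ and compares the multivariate normal density with the integral $\int p(x\mid\mu)p(\mu)\,d\mu$:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Hypothetical parameter values chosen for illustration.
n, mu0, tau02, sigma2 = 4, 1.0, 0.5, 2.0
x = np.array([0.3, 1.7, 2.2, -0.4])

# Prior predictive: multivariate normal with mean mu0 * 1
# and covariance sigma2 * I + tau02 * 1 * 1'.
mean = np.full(n, mu0)
cov = sigma2 * np.eye(n) + tau02 * np.ones((n, n))
pred_pdf = stats.multivariate_normal(mean, cov).pdf(x)

# Direct marginalization: integrate likelihood(x | mu) * prior(mu) over mu.
integrand = lambda mu: (np.prod(stats.norm.pdf(x, mu, np.sqrt(sigma2)))
                        * stats.norm.pdf(mu, mu0, np.sqrt(tau02)))
marginal, _ = quad(integrand, -np.inf, np.inf)

print(pred_pdf, marginal)  # the two values agree
```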
Assume that $m$ new observations $x_{n+1},\dots,x_{n+m}$ are drawn independently from the same normal distribution from which $x_1,\dots,x_n$ have been extracted.
The posterior predictive distribution of the vector $y=(x_{n+1},\dots,x_{n+m})$ is
$$y\mid x\sim N\left(\mu_n\mathbf{1}_m,\;\sigma^2 I_m+\tau_n^2\mathbf{1}_m\mathbf{1}_m^\top\right),$$
where $I_m$ is the $m\times m$ identity matrix and $\mathbf{1}_m$ is an $m\times 1$ vector of ones.
So, $y$ has a multivariate normal distribution with mean $\mu_n\mathbf{1}_m$ (where $\mu_n$ is the posterior mean of $\mu$) and covariance matrix $\sigma^2 I_m+\tau_n^2\mathbf{1}_m\mathbf{1}_m^\top$ (where $\tau_n^2$ is the posterior variance of $\mu$).
The derivation is almost identical to the derivation of the prior predictive distribution of $x$. The posterior $p(\mu\mid x)$ is used as a new prior. The likelihood of the new observations is the same as before because $y$ is independent of $x$ conditional on $\mu$. Therefore, we can perform the factorization
$$p(y,\mu\mid x)=p(y\mid\mu)\,p(\mu\mid x)$$
and derive $p(y\mid x)$ by following the same procedure we followed to derive $p(x)$. The main difference is that we need to replace the prior mean $\mu_0$ with the posterior mean $\mu_n$ and the prior variance $\tau_0^2$ with the posterior variance $\tau_n^2$.
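The replacement just described can be sketched directly in code. This Python function (names are our own) computes the mean vector and covariance matrix of the posterior predictive distribution of $m$ future draws:

```python
import numpy as np

def posterior_predictive(x, m, sigma2, mu0, tau02):
    """Mean vector and covariance matrix of m future draws, given data x,
    known variance sigma2, and a N(mu0, tau02) prior on the mean."""
    n = len(x)
    # posterior of mu, as derived above
    tau_n2 = 1.0 / (1.0 / tau02 + n / sigma2)
    mu_n = tau_n2 * (mu0 / tau02 + n * np.mean(x) / sigma2)
    # same form as the prior predictive, with (mu0, tau02) -> (mu_n, tau_n2)
    mean = np.full(m, mu_n)
    cov = sigma2 * np.eye(m) + tau_n2 * np.ones((m, m))
    return mean, cov

mean, cov = posterior_predictive([1.0, 2.0, 3.0], m=2,
                                 sigma2=4.0, mu0=0.0, tau02=1.0)
print(mean)  # both entries equal the posterior mean mu_n
print(cov)   # diagonal sigma2 + tau_n2, off-diagonal tau_n2
```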
As in the previous section, the sample $x=(x_1,\dots,x_n)$ is assumed to be a vector of IID draws from a normal distribution.
However, we now assume that not only the mean $\mu$, but also the variance $\sigma^2$ is unknown.
The probability density function of a generic draw is
$$p(x_i\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right).$$
The notation $p(x_i\mid\mu,\sigma^2)$ highlights the fact that the density depends on the two unknown parameters $\mu$ and $\sigma^2$.
Since $x_1,\dots,x_n$ are independent, the likelihood is
$$p(x\mid\mu,\sigma^2)=(2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right).$$
The prior is hierarchical.
First, we assign the following prior to the mean, conditional on the variance:
$$\mu\mid\sigma^2\sim N\left(\mu_0,\frac{\sigma^2}{\kappa_0}\right),$$
that is, $\mu$ has a normal distribution with mean $\mu_0$ and variance $\sigma^2/\kappa_0$.
Note that the variance of the parameter $\mu$ is assumed to be proportional to the unknown variance $\sigma^2$ of the data points. The constant of proportionality $1/\kappa_0$ determines how tight the prior is, that is, how probable we deem that $\mu$ is very close to the prior mean $\mu_0$: the larger $\kappa_0$, the tighter the prior.
Then, we assign the following prior to the variance:
$$p(\sigma^2)=\frac{b_0^{a_0}}{\Gamma(a_0)}(\sigma^2)^{-a_0-1}\exp\left(-\frac{b_0}{\sigma^2}\right),$$
that is, $\sigma^2$ has an inverse-Gamma distribution with parameters $a_0$ and $b_0$ (i.e., the precision $1/\sigma^2$ has a Gamma distribution with shape parameter $a_0$ and rate parameter $b_0$).
By the properties of the Gamma distribution, the prior mean of the precision is
$$\operatorname{E}\left[\frac{1}{\sigma^2}\right]=\frac{a_0}{b_0}$$
and its variance is
$$\operatorname{Var}\left[\frac{1}{\sigma^2}\right]=\frac{a_0}{b_0^2}.$$
We can think of the ratio $a_0/b_0$ as our best guess of the precision of the data-generating distribution, and of $a_0$ as the parameter that expresses our degree of confidence in this guess. For a fixed ratio $a_0/b_0$, the greater $a_0$ is, the tighter our prior about $1/\sigma^2$ is, and the more probable we deem it that $1/\sigma^2$ is close to $a_0/b_0$.
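These moments are easy to verify numerically. A small sketch using SciPy (the hyperparameter values are hypothetical; note that SciPy parameterizes the Gamma distribution by shape and scale, so a rate of $b_0$ corresponds to scale $1/b_0$):

```python
from scipy import stats

# Hypothetical hyperparameters: a0/b0 is the guess for the precision,
# a0 governs how confident we are in that guess.
a0, b0 = 3.0, 6.0

# Gamma(shape=a0, rate=b0) is specified in scipy with scale = 1/b0.
precision = stats.gamma(a=a0, scale=1.0 / b0)
print(precision.mean(), precision.var())  # a0/b0 = 0.5, a0/b0**2 ≈ 0.0833
```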
Conditional on $\sigma^2$, the posterior distribution of $\mu$ is
$$p(\mu\mid\sigma^2,x)=\sqrt{\frac{\kappa_n}{2\pi\sigma^2}}\exp\left(-\frac{\kappa_n(\mu-\mu_n)^2}{2\sigma^2}\right),$$
where
$$\kappa_n=\kappa_0+n,\qquad \mu_n=\frac{\kappa_0\mu_0+n\bar{x}}{\kappa_0+n}.$$
This can be derived from the case in which $\sigma^2$ is known (see above). In that case, the posterior of $\mu$ is normal with variance $\tau_n^2=\left(1/\tau_0^2+n/\sigma^2\right)^{-1}$ and mean $\mu_n=\tau_n^2\left(\mu_0/\tau_0^2+n\bar{x}/\sigma^2\right)$. Now, the prior variance is $\tau_0^2=\sigma^2/\kappa_0$. So,
$$\tau_n^2=\left(\frac{\kappa_0}{\sigma^2}+\frac{n}{\sigma^2}\right)^{-1}=\frac{\sigma^2}{\kappa_0+n}=\frac{\sigma^2}{\kappa_n}$$
and
$$\mu_n=\frac{\sigma^2}{\kappa_0+n}\left(\frac{\kappa_0\mu_0}{\sigma^2}+\frac{n\bar{x}}{\sigma^2}\right)=\frac{\kappa_0\mu_0+n\bar{x}}{\kappa_0+n}.$$
Thus, conditional on $\sigma^2$ and $x$, $\mu$ is normal with mean $\mu_n$ and variance $\sigma^2/\kappa_n$.
Conditional on $\sigma^2$, the prior predictive distribution of $x$ is
$$x\mid\sigma^2\sim N\left(\mu_0\mathbf{1}_n,\;\sigma^2\left(I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top\right)\right),$$
where $\mathbf{1}_n$ is an $n\times 1$ vector of ones, and $I_n$ is the $n\times n$ identity matrix.
This can be derived from the case in which $\sigma^2$ is known (see above). In that case,
$$x\sim N\left(\mu_0\mathbf{1}_n,\;\sigma^2 I_n+\tau_0^2\mathbf{1}_n\mathbf{1}_n^\top\right),$$
where $\tau_0^2$ is the prior variance of $\mu$. So, substituting $\tau_0^2=\sigma^2/\kappa_0$,
$$x\mid\sigma^2\sim N\left(\mu_0\mathbf{1}_n,\;\sigma^2\left(I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top\right)\right).$$
The posterior distribution of the variance is
$$p(\sigma^2\mid x)=\frac{b_n^{a_n}}{\Gamma(a_n)}(\sigma^2)^{-a_n-1}\exp\left(-\frac{b_n}{\sigma^2}\right),$$
where
$$a_n=a_0+\frac{n}{2},\qquad
b_n=b_0+\frac{1}{2}\sum_{i=1}^n(x_i-\bar{x})^2+\frac{\kappa_0 n(\bar{x}-\mu_0)^2}{2(\kappa_0+n)}.$$
Consider the joint distribution
$$p(x,\sigma^2)=p(x\mid\sigma^2)\,p(\sigma^2),$$
where $p(x\mid\sigma^2)$ is the conditional prior predictive distribution derived above. By the Sherman–Morrison formula,
$$\left(I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top\right)^{-1}=I_n-\frac{1}{\kappa_0+n}\mathbf{1}_n\mathbf{1}_n^\top,$$
so the quadratic form in the exponent of $p(x\mid\sigma^2)$ can be written as
$$(x-\mu_0\mathbf{1}_n)^\top\left(I_n-\frac{1}{\kappa_0+n}\mathbf{1}_n\mathbf{1}_n^\top\right)(x-\mu_0\mathbf{1}_n)
=\sum_{i=1}^n(x_i-\bar{x})^2+\frac{\kappa_0 n(\bar{x}-\mu_0)^2}{\kappa_0+n}.$$
Collecting the powers of $\sigma^2$ and the terms in the exponent, we can write
$$p(x,\sigma^2)=f(x)\,g(\sigma^2\mid x),$$
where $f$ is a function that depends on $x$ (via $b_n$) but not on $\sigma^2$, and
$$g(\sigma^2\mid x)=\frac{b_n^{a_n}}{\Gamma(a_n)}(\sigma^2)^{-a_n-1}\exp\left(-\frac{b_n}{\sigma^2}\right)$$
is a probability density function if considered as a function of $\sigma^2$ for any given $x$. In particular, $g(\sigma^2\mid x)$ is the density of an inverse-Gamma distribution with parameters $a_n$ and $b_n$ defined above. Thus, by a well-known result on the factorization of joint probability density functions, we have that
$$p(\sigma^2\mid x)=g(\sigma^2\mid x),\qquad p(x)=f(x).$$
Therefore, the posterior distribution is inverse-Gamma with parameters $a_n$ and $b_n$. What distribution $f(x)$ is will be shown in the next proof.
Thus, the precision $1/\sigma^2$ has a posterior Gamma distribution with shape parameter $a_n$ and rate parameter $b_n$.
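The posterior hyperparameter updates derived so far can be collected in a small Python function (a sketch under the notation used here; the function name and the example values are ours):

```python
import numpy as np

def nig_posterior(x, mu0, kappa0, a0, b0):
    """Posterior hyperparameters of the normal-inverse-Gamma model.
    Prior: mu | sigma2 ~ N(mu0, sigma2/kappa0), sigma2 ~ InvGamma(a0, b0).
    Returns (mu_n, kappa_n, a_n, b_n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    a_n = a0 + n / 2
    b_n = (b0 + 0.5 * np.sum((x - xbar) ** 2)
           + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n))
    return mu_n, kappa_n, a_n, b_n

mu_n, kappa_n, a_n, b_n = nig_posterior([1.0, 2.0, 3.0],
                                        mu0=0.0, kappa0=1.0, a0=2.0, b0=2.0)
print(mu_n, kappa_n, a_n, b_n)
```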
The prior predictive distribution of $x$ is
$$x\sim t_{2a_0}\left(\mu_0\mathbf{1}_n,\;\frac{b_0}{a_0}\left(I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top\right)\right),$$
that is, a multivariate Student's t distribution with mean $\mu_0\mathbf{1}_n$, scale matrix $\frac{b_0}{a_0}\left(I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top\right)$ and $2a_0$ degrees of freedom.
The prior predictive distribution has already been derived in the previous proof: it is the function $f(x)$ obtained there. We just need to do a little bit of algebra to clearly show that it is a multivariate Student's t distribution with mean $\mu_0\mathbf{1}_n$, scale matrix $\frac{b_0}{a_0}V$ and $2a_0$ degrees of freedom, where $V=I_n+\frac{1}{\kappa_0}\mathbf{1}_n\mathbf{1}_n^\top$. Writing $Q=(x-\mu_0\mathbf{1}_n)^\top V^{-1}(x-\mu_0\mathbf{1}_n)$ and keeping track of the normalizing constants, we have
$$f(x)=\int_0^\infty p(x\mid\sigma^2)\,p(\sigma^2)\,d\sigma^2
=(2\pi)^{-n/2}\det(V)^{-1/2}\frac{b_0^{a_0}}{\Gamma(a_0)}\int_0^\infty(\sigma^2)^{-(a_0+n/2)-1}\exp\left(-\frac{b_0+Q/2}{\sigma^2}\right)d\sigma^2$$
$$=(2\pi)^{-n/2}\det(V)^{-1/2}\frac{\Gamma(a_0+n/2)}{\Gamma(a_0)}\frac{b_0^{a_0}}{(b_0+Q/2)^{a_0+n/2}}
\propto\left(1+\frac{Q}{2b_0}\right)^{-(2a_0+n)/2},$$
which is the kernel of the density of a multivariate Student's t distribution with $2a_0$ degrees of freedom, location $\mu_0\mathbf{1}_n$ and scale matrix $\frac{b_0}{a_0}V$, because
$$\frac{Q}{2b_0}=\frac{1}{2a_0}(x-\mu_0\mathbf{1}_n)^\top\left(\frac{b_0}{a_0}V\right)^{-1}(x-\mu_0\mathbf{1}_n).$$
The posterior distribution of the mean is
$$p(\mu\mid x)=\frac{\sqrt{\kappa_n}}{\sqrt{2b_n}\,B\!\left(a_n,\tfrac{1}{2}\right)}\left(1+\frac{\kappa_n(\mu-\mu_n)^2}{2b_n}\right)^{-(2a_n+1)/2},$$
where $B$ is the Beta function.
We have already proved that, conditional on $x$ and $\sigma^2$, $\mu$ is normal with mean $\mu_n$ and variance $\sigma^2/\kappa_n$. We have also proved that, conditional on $x$, the precision $h=1/\sigma^2$ has a Gamma distribution with shape $a_n$ and rate $b_n$. Thus, we can write
$$\mu=\mu_n+\sqrt{\frac{\sigma^2}{\kappa_n}}\,Z=\mu_n+\frac{Z}{\sqrt{\kappa_n h}},$$
where $Z$ is standard normal conditional on $x$ and $\sigma^2$, and $h$ has a Gamma distribution with shape $a_n$ and rate $b_n$. Now, note that, by the properties of the Gamma distribution, $2b_n h$ has a Gamma distribution with shape $a_n$ and rate $1/2$, that is, a Chi-square distribution with $2a_n$ degrees of freedom. We can write
$$\mu=\mu_n+\sqrt{\frac{b_n}{a_n\kappa_n}}\cdot\frac{Z}{\sqrt{2b_n h/(2a_n)}}.$$
But
$$T=\frac{Z}{\sqrt{2b_n h/(2a_n)}}$$
has a standard Student's t distribution with $2a_n$ degrees of freedom (see the lecture on the t distribution). As a consequence, $\mu$ has a Student's t distribution with mean $\mu_n$, scale parameter $b_n/(a_n\kappa_n)$ and $2a_n$ degrees of freedom. Thus, its density is
$$p(\mu\mid x)=\frac{\sqrt{\kappa_n}}{\sqrt{2b_n}\,B\!\left(a_n,\tfrac{1}{2}\right)}\left(1+\frac{\kappa_n(\mu-\mu_n)^2}{2b_n}\right)^{-(2a_n+1)/2},$$
where $B$ is the Beta function.
In other words, $\mu$ has a t distribution with mean $\mu_n$, scale parameter $b_n/(a_n\kappa_n)$ and $2a_n$ degrees of freedom.
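This closed-form density can be cross-checked against a library implementation of the t distribution. A Python sketch (the posterior hyperparameter values are hypothetical examples):

```python
import numpy as np
from scipy import stats
from scipy.special import beta

# Hypothetical posterior hyperparameters.
mu_n, kappa_n, a_n, b_n = 1.5, 4.0, 3.5, 4.5

# Marginal posterior of mu: Student's t with 2*a_n degrees of freedom,
# location mu_n and scale sqrt(b_n / (a_n * kappa_n)).
scale = np.sqrt(b_n / (a_n * kappa_n))
post = stats.t(df=2 * a_n, loc=mu_n, scale=scale)

# Compare scipy's density with the Beta-function formula in the text.
mu = 2.0
pdf_formula = (np.sqrt(kappa_n) / (np.sqrt(2 * b_n) * beta(a_n, 0.5))
               * (1 + kappa_n * (mu - mu_n) ** 2 / (2 * b_n)) ** (-(2 * a_n + 1) / 2))
print(post.pdf(mu), pdf_formula)  # the two values agree
```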