# Multivariate normal distribution - Maximum Likelihood Estimation

In this lecture we show how to derive the maximum likelihood estimators of the two parameters of a multivariate normal distribution: the mean vector and the covariance matrix.

In order to understand the derivation, you need to be familiar with the concept of trace of a matrix.

## Setting

Suppose we observe the first $n$ terms of an IID sequence of $K$-dimensional multivariate normal random vectors.

The joint probability density function of the $j$-th term of the sequence is
$$f_X(x_j) = (2\pi)^{-K/2} \det(\Sigma)^{-1/2} \exp\!\left( -\frac{1}{2} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu) \right)$$
where $\mu$ is the $K \times 1$ mean vector and $\Sigma$ is the $K \times K$ covariance matrix.

The covariance matrix $\Sigma$ is assumed to be positive definite, so that its determinant $\det(\Sigma)$ is strictly positive.

We use $x_1, \dots, x_n$, that is, the realizations of the first $n$ random vectors in the sequence, to estimate the two unknown parameters $\mu$ and $\Sigma$.

## The likelihood function

The likelihood function is
$$L(\mu, \Sigma ; x_1, \dots, x_n) = (2\pi)^{-nK/2} \det(\Sigma)^{-n/2} \exp\!\left( -\frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu) \right)$$

Proof

Since the terms in the sequence are independent, their joint density is equal to the product of their marginal densities. As a consequence, the likelihood function can be written as
$$L(\mu, \Sigma ; x_1, \dots, x_n) = \prod_{j=1}^{n} f_X(x_j) = \prod_{j=1}^{n} (2\pi)^{-K/2} \det(\Sigma)^{-1/2} \exp\!\left( -\frac{1}{2} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu) \right)$$
which equals the expression above, because the product of the exponentials is the exponential of the sum.

## The log-likelihood function

The log-likelihood function is
$$l(\mu, \Sigma ; x_1, \dots, x_n) = -\frac{nK}{2} \ln(2\pi) - \frac{n}{2} \ln \det(\Sigma) - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu)$$

Proof

The log-likelihood is obtained by taking the natural logarithm of the likelihood function:
$$l(\mu, \Sigma ; x_1, \dots, x_n) = \ln L(\mu, \Sigma ; x_1, \dots, x_n) = -\frac{nK}{2} \ln(2\pi) - \frac{n}{2} \ln \det(\Sigma) - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu)$$

Note that the log-likelihood function is well-defined only if $\det(\Sigma)$ is strictly positive. This reflects the assumption made above that the true parameter $\Sigma$ is positive definite, which implies that the search for a maximum likelihood estimator of $\Sigma$ is restricted to the space of positive definite matrices.

For convenience, we can also define the log-likelihood in terms of the precision matrix $\Sigma^{-1}$:
$$l(\mu, \Sigma^{-1} ; x_1, \dots, x_n) = -\frac{nK}{2} \ln(2\pi) + \frac{n}{2} \ln \det(\Sigma^{-1}) - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu)$$
where we have used the property of the determinant
$$\det(\Sigma^{-1}) = \frac{1}{\det(\Sigma)}$$
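As a numerical sketch (not part of the lecture; all names below are my own, and NumPy is assumed), the two parametrizations of the log-likelihood can be checked to coincide on simulated data:

```python
import numpy as np

# Check that the log-likelihood written in terms of Sigma and in terms of
# the precision matrix Sigma^{-1} give the same value, which follows from
# det(Sigma^{-1}) = 1 / det(Sigma).
rng = np.random.default_rng(42)

n, K = 50, 3
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((K, K))
Sigma = A @ A.T + K * np.eye(K)                  # positive definite covariance
x = rng.multivariate_normal(mu, Sigma, size=n)   # n observations, each K-dimensional

prec = np.linalg.inv(Sigma)                      # precision matrix
diffs = x - mu                                   # rows are (x_j - mu)'
quad = np.einsum("ij,jk,ik->", diffs, prec, diffs)  # sum_j (x_j-mu)' Sigma^{-1} (x_j-mu)

# Form 1: in terms of Sigma
ll_sigma = (-n * K / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(np.linalg.det(Sigma))
            - quad / 2)

# Form 2: in terms of the precision matrix
ll_prec = (-n * K / 2 * np.log(2 * np.pi)
           + n / 2 * np.log(np.linalg.det(prec))
           - quad / 2)
```

The two values agree up to floating-point error.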

## Preliminaries

Before deriving the maximum likelihood estimators, we need to state some facts about matrices, their trace and their derivatives:

• if $a$ is a scalar, then it is equal to its trace: $a = \operatorname{tr}(a)$;

• if two matrices $A$ and $B$ are such that the products $AB$ and $BA$ are both well defined, then $\operatorname{tr}(AB) = \operatorname{tr}(BA)$;

• the trace is a linear operator: if $A$ and $B$ are two matrices and $a$ and $b$ are two scalars, then $\operatorname{tr}(aA + bB) = a \operatorname{tr}(A) + b \operatorname{tr}(B)$;

• the gradient of the trace of the product of two matrices $A$ and $B$ with respect to $A$ is $\nabla_A \operatorname{tr}(AB) = B^{\top}$;

• the gradient of the natural logarithm of the determinant of an invertible matrix $A$ is $\nabla_A \ln \det(A) = \left(A^{-1}\right)^{\top}$;

• if $x$ is a $K \times 1$ vector and $A$ is a $K \times K$ symmetric matrix, then $\nabla_x \, x^{\top} A x = 2 A x$.
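These matrix facts are easy to verify numerically. The following sketch (my own illustration, assuming NumPy) checks the cyclic property of the trace, its linearity, and the gradient of the quadratic form against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# tr(AB) = tr(BA), even though AB is 3x3 while BA is 4x4
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))
tr_ab = np.trace(A @ B)
tr_ba = np.trace(B @ A)

# Linearity: tr(aC + bD) = a tr(C) + b tr(D)
C = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))
a, b = 2.0, -0.7
lin_lhs = np.trace(a * C + b * D)
lin_rhs = a * np.trace(C) + b * np.trace(D)

# Gradient of x'Sx is 2Sx when S is symmetric: compare the closed form
# with central finite differences of the scalar function x -> x'Sx.
S = C + C.T                                  # symmetric matrix
x = rng.standard_normal(3)
grad_exact = 2 * S @ x
eps = 1e-6
grad_fd = np.array([
    ((x + eps * e) @ S @ (x + eps * e)
     - (x - eps * e) @ S @ (x - eps * e)) / (2 * eps)
    for e in np.eye(3)                       # perturb one coordinate at a time
])
```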

## The maximum likelihood estimators

The maximum likelihood estimators of the mean $\mu$ and the covariance matrix $\Sigma$ are
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \widehat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \widehat{\mu})(x_j - \widehat{\mu})^{\top}$$
that is, the sample mean vector and the (unadjusted) sample covariance matrix.

Proof

We need to solve the following maximization problem
$$\max_{\mu, \Sigma} \; l(\mu, \Sigma ; x_1, \dots, x_n)$$
The first order conditions for a maximum are
$$\nabla_{\mu} \, l(\mu, \Sigma ; x_1, \dots, x_n) = 0 \qquad \nabla_{\Sigma^{-1}} \, l(\mu, \Sigma ; x_1, \dots, x_n) = 0$$
The gradient of the log-likelihood with respect to the mean vector is
$$\nabla_{\mu} \, l(\mu, \Sigma ; x_1, \dots, x_n) = \Sigma^{-1} \sum_{j=1}^{n} (x_j - \mu)$$
which is equal to zero only if
$$\sum_{j=1}^{n} (x_j - \mu) = 0$$
Therefore, the first of the two first-order conditions implies
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j$$
The gradient of the log-likelihood with respect to the precision matrix is
$$\nabla_{\Sigma^{-1}} \, l(\mu, \Sigma ; x_1, \dots, x_n) = \frac{n}{2} \Sigma^{\top} - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^{\top}$$
where we have written each quadratic form as a trace, $(x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu) = \operatorname{tr}\!\left( \Sigma^{-1} (x_j - \mu)(x_j - \mu)^{\top} \right)$, and then applied the rules for the gradients of the trace and of the log-determinant stated above. By transposing the whole expression and setting it equal to zero, we get
$$\Sigma = \frac{1}{n} \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^{\top}$$
Thus, the system of first order conditions is solved by
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \widehat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \widehat{\mu})(x_j - \widehat{\mu})^{\top}$$
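The closed-form estimators can be sketched in a few lines of NumPy (my own illustration, not from the lecture). Since they maximize the log-likelihood over all positive definite covariance matrices, any perturbation of the solution should yield a lower log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 500, 2
mu_true = np.array([0.5, -1.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
x = rng.multivariate_normal(mu_true, Sigma_true, size=n)

# Maximum likelihood estimators: sample mean and (1/n)-scaled sample covariance
mu_hat = x.mean(axis=0)                 # (1/n) sum_j x_j
d = x - mu_hat
Sigma_hat = d.T @ d / n                 # (1/n) sum_j (x_j - mu_hat)(x_j - mu_hat)'

def loglik(mu, Sigma):
    """Multivariate normal log-likelihood of the sample x."""
    diffs = x - mu
    quad = np.einsum("ij,jk,ik->", diffs, np.linalg.inv(Sigma), diffs)
    return (-n * K / 2 * np.log(2 * np.pi)
            - n / 2 * np.log(np.linalg.det(Sigma))
            - quad / 2)

ll_at_mle = loglik(mu_hat, Sigma_hat)
# A perturbed parameter pair (still positive definite) must do worse.
ll_perturbed = loglik(mu_hat + 0.05, Sigma_hat + 0.05 * np.eye(K))
```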

## Information matrix

We are now going to give a formula for the information matrix of the multivariate normal distribution, which will be used to derive the asymptotic covariance matrix of the maximum likelihood estimators.

Denote by $\theta$ the $\left(K + K^2\right) \times 1$ column vector of all parameters:
$$\theta = \begin{pmatrix} \mu \\ \operatorname{vec}(\Sigma) \end{pmatrix}$$
where $\operatorname{vec}(\Sigma)$ converts the matrix $\Sigma$ into a column vector whose entries are taken from the first column of $\Sigma$, then from the second, and so on.

The log-likelihood of one observation from the sample can be written as
$$l(\theta ; x_j) = -\frac{K}{2} \ln(2\pi) - \frac{1}{2} \ln \det(\Sigma) - \frac{1}{2} (x_j - \mu)^{\top} \Sigma^{-1} (x_j - \mu)$$

The information matrix is
$$I(\theta) = \operatorname{E}\!\left[ \nabla_{\theta} \, l(\theta ; X_j) \, \nabla_{\theta} \, l(\theta ; X_j)^{\top} \right]$$

Denote by $\theta_k$ the $k$-th entry of $\theta$. Define the $K \times 1$ vector
$$\mu_k = \frac{\partial \mu}{\partial \theta_k}$$

Thus:

• if $\theta_k$ is an element of $\mu$, say the $j$-th, then the $j$-th entry of the vector $\mu_k$ is equal to $1$ and all the other entries are equal to $0$;

• if $\theta_k$ is not an element of $\mu$, then all the entries of the vector $\mu_k$ are equal to $0$.

Define the $K \times K$ matrix
$$\Sigma_k = \frac{\partial \Sigma}{\partial \theta_k}$$

Note that:

• if $\theta_k$ is an element of $\Sigma$, say the $(i,j)$-th, then the $(i,j)$-th entry of the matrix $\Sigma_k$ is equal to $1$ and all the other entries are equal to $0$;

• if $\theta_k$ is not an element of $\Sigma$, then all the entries of the matrix $\Sigma_k$ are equal to $0$.

It can be proved (see, e.g., Pistone and Malagò 2015) that the $(k,l)$-th element of the information matrix is
$$I(\theta)_{kl} = \mu_k^{\top} \Sigma^{-1} \mu_l + \frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \Sigma_k \Sigma^{-1} \Sigma_l \right)$$
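As an illustration (my own sketch, assuming NumPy and the standard Gaussian Fisher-information formula $I(\theta)_{kl} = \mu_k^{\top} \Sigma^{-1} \mu_l + \tfrac{1}{2} \operatorname{tr}(\Sigma^{-1} \Sigma_k \Sigma^{-1} \Sigma_l)$), the formula can be evaluated for a small example with $K = 2$. The derivative vectors $\mu_k$ and matrices $\Sigma_k$ are exactly the indicator vectors/matrices described in the two bullet lists:

```python
import numpy as np

K = 2
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
P = K + K * K                      # length of theta = (mu, vec(Sigma))

# mu_k = d mu / d theta_k: indicator vectors for the first K parameters
mu_d = [np.zeros(K) for _ in range(P)]
for j in range(K):
    mu_d[j][j] = 1.0

# Sigma_k = d Sigma / d theta_k: single-entry matrices, in vec (column-major) order
Sig_d = [np.zeros((K, K)) for _ in range(P)]
for c in range(K):
    for r in range(K):
        Sig_d[K + c * K + r][r, c] = 1.0

# Assemble the information matrix element by element
I = np.zeros((P, P))
for k in range(P):
    for l in range(P):
        I[k, l] = (mu_d[k] @ Sigma_inv @ mu_d[l]
                   + 0.5 * np.trace(Sigma_inv @ Sig_d[k] @ Sigma_inv @ Sig_d[l]))
```

Two consequences of the formula are visible in the result: the mean block of $I(\theta)$ equals $\Sigma^{-1}$, and the cross blocks between $\mu$ and $\operatorname{vec}(\Sigma)$ vanish.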

## Asymptotic variance

The vector
$$\widehat{\theta}_n = \begin{pmatrix} \widehat{\mu}_n \\ \operatorname{vec}\left(\widehat{\Sigma}_n\right) \end{pmatrix}$$
is asymptotically normal with asymptotic mean equal to the true parameter $\theta_0$ and asymptotic covariance matrix equal to
$$V = I(\theta_0)^{-1}$$

In more formal terms, $\sqrt{n} \left( \widehat{\theta}_n - \theta_0 \right)$ converges in distribution to a multivariate normal distribution with zero mean and covariance matrix $V$.

In other words, the distribution of the vector $\widehat{\theta}_n$ can be approximated by a multivariate normal distribution with mean $\theta_0$ and covariance matrix
$$\frac{1}{n} I(\theta_0)^{-1}$$
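For the mean parameters this approximation is easy to see by simulation (a sketch of my own, assuming NumPy): since the $(\mu,\mu)$ block of $I(\theta_0)^{-1}$ is $\Sigma$, the covariance of $\widehat{\mu}_n$ across repeated samples should be close to $\Sigma / n$:

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, reps = 200, 2, 3000
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])

# Draw many independent samples of size n and record the ML estimate of mu
mu_hats = np.array([
    rng.multivariate_normal(mu, Sigma, size=n).mean(axis=0)
    for _ in range(reps)
])

emp_cov = np.cov(mu_hats, rowvar=False)   # empirical covariance of mu_hat
theo_cov = Sigma / n                      # asymptotic approximation Sigma / n
```

With a few thousand replications the empirical covariance matches $\Sigma / n$ to within Monte Carlo error.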

## References

Pistone, G. and Malagò, L. (2015) "Information Geometry of the Gaussian Distribution in View of Stochastic Optimization", *Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII*, 150-162.
