In order to understand the derivation, you need to be familiar with the concept of trace of a matrix.
Suppose we observe the first $n$ terms of an IID sequence $\{X_j\}$ of $K$-dimensional multivariate normal random vectors.
The joint probability density function of the $j$-th term of the sequence is
$$f_X(x_j) = (2\pi)^{-K/2} \det(\Sigma)^{-1/2} \exp\left( -\frac{1}{2} (x_j - \mu)^\top \Sigma^{-1} (x_j - \mu) \right)$$
where:
$\mu$ is the $K \times 1$ mean vector;
$\Sigma$ is the $K \times K$ covariance matrix.
The covariance matrix $\Sigma$ is assumed to be positive definite, so that its determinant $\det(\Sigma)$ is strictly positive.
The likelihood function is
$$L(\mu, \Sigma ; x_1, \ldots, x_n) = (2\pi)^{-nK/2} \det(\Sigma)^{-n/2} \exp\left( -\frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^\top \Sigma^{-1} (x_j - \mu) \right)$$
Since the terms in the sequence are independent, their joint density is equal to the product of their marginal densities. As a consequence, the likelihood function can be written as
$$L(\mu, \Sigma ; x_1, \ldots, x_n) = \prod_{j=1}^{n} f_X(x_j)$$
and collecting the factors $(2\pi)^{-K/2}$ and $\det(\Sigma)^{-1/2}$ and combining the exponentials yields the expression above.
The log-likelihood function is
$$\ell(\mu, \Sigma ; x_1, \ldots, x_n) = -\frac{nK}{2} \ln(2\pi) - \frac{n}{2} \ln\det(\Sigma) - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)^\top \Sigma^{-1} (x_j - \mu)$$
It is obtained by taking the natural logarithm of the likelihood function:
$$\ell(\mu, \Sigma ; x_1, \ldots, x_n) = \ln L(\mu, \Sigma ; x_1, \ldots, x_n)$$
Note that the likelihood function is well-defined only if $\det(\Sigma)$ is strictly positive. This reflects the assumption made above that the true parameter $\Sigma$ is positive definite, which implies that the search for a maximum likelihood estimator of $\Sigma$ is restricted to the space of positive definite matrices.
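As a numerical sanity check, the log-likelihood can be evaluated directly from its formula and compared against SciPy's implementation (a sketch, assuming NumPy and SciPy are available; the function name `mvn_log_likelihood` and the example parameter values are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_log_likelihood(X, mu, Sigma):
    """Log-likelihood of an IID sample X (n rows, K columns) under N(mu, Sigma)."""
    n, K = X.shape
    diff = X - mu
    sign, logdet = np.linalg.slogdet(Sigma)
    assert sign > 0, "Sigma must be positive definite"
    Sigma_inv = np.linalg.inv(Sigma)
    # sum of quadratic forms (x_j - mu)' Sigma^{-1} (x_j - mu)
    quad = np.einsum("ji,ik,jk->", diff, Sigma_inv, diff)
    return -0.5 * n * K * np.log(2 * np.pi) - 0.5 * n * logdet - 0.5 * quad

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=50)
ll = mvn_log_likelihood(X, mu, Sigma)
ll_ref = multivariate_normal(mean=mu, cov=Sigma).logpdf(X).sum()
```

Because the observations are independent, summing SciPy's per-observation log-densities gives the same number as the closed-form expression.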
Before deriving the maximum likelihood estimators, we need to state some facts about matrices, their trace and their derivatives:
if $a$ is a scalar, then it is equal to its trace: $a = \operatorname{tr}(a)$;
if two matrices $A$ and $B$ are such that the products $AB$ and $BA$ are both well defined, then $\operatorname{tr}(AB) = \operatorname{tr}(BA)$;
the trace is a linear operator: if $A$ and $B$ are two matrices and $a$ and $b$ are two scalars, then $\operatorname{tr}(aA + bB) = a \operatorname{tr}(A) + b \operatorname{tr}(B)$;
the gradient of the trace of the product of two matrices $A$ and $B$ with respect to $A$ is $\nabla_A \operatorname{tr}(AB) = B^\top$;
the gradient of the natural logarithm of the determinant of $A$ is $\nabla_A \ln\det(A) = (A^{-1})^\top$;
if $x$ is a $K \times 1$ vector and $A$ is a $K \times K$ symmetric matrix, then $\nabla_x (x^\top A x) = 2Ax$.
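These facts can be checked numerically (a quick NumPy sketch; the random matrices and the finite-difference step `eps` are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# tr(AB) = tr(BA)
assert np.isclose(np.trace(A @ B), np.trace(B @ A))

# linearity: tr(aA + bB) = a tr(A) + b tr(B)
a, b = 2.0, -3.0
assert np.isclose(np.trace(a * A + b * B), a * np.trace(A) + b * np.trace(B))

# gradient of tr(AB) with respect to A is B': finite-difference check on entry (0, 1)
eps = 1e-6
E = np.zeros_like(A)
E[0, 1] = eps
fd_trace = (np.trace((A + E) @ B) - np.trace(A @ B)) / eps
assert np.isclose(fd_trace, B.T[0, 1], atol=1e-4)

# gradient of ln det(P) with respect to P is (P^{-1})', for P with positive determinant
P = A @ A.T + 3 * np.eye(3)  # positive definite by construction
fd_logdet = (np.log(np.linalg.det(P + E)) - np.log(np.linalg.det(P))) / eps
assert np.isclose(fd_logdet, np.linalg.inv(P).T[0, 1], atol=1e-4)
```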
The maximum likelihood estimators of the mean $\mu$ and the covariance matrix $\Sigma$ are
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \widehat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \widehat{\mu})(x_j - \widehat{\mu})^\top$$
We need to solve the following maximization problem:
$$\max_{\mu, \Sigma} \, \ell(\mu, \Sigma ; x_1, \ldots, x_n)$$
The first order conditions for a maximum are
$$\nabla_\mu \, \ell(\mu, \Sigma ; x_1, \ldots, x_n) = 0 \qquad \nabla_\Sigma \, \ell(\mu, \Sigma ; x_1, \ldots, x_n) = 0$$
The gradient of the log-likelihood with respect to the mean vector is
$$\nabla_\mu \, \ell(\mu, \Sigma ; x_1, \ldots, x_n) = \Sigma^{-1} \sum_{j=1}^{n} (x_j - \mu)$$
which is equal to zero only if
$$\sum_{j=1}^{n} (x_j - \mu) = 0$$
Therefore, the first of the two first-order conditions implies
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j$$
To deal with the second condition, it is convenient to rewrite the log-likelihood in terms of the precision matrix $V = \Sigma^{-1}$. Since a scalar equals its trace,
$$(x_j - \mu)^\top V (x_j - \mu) = \operatorname{tr}\left( V (x_j - \mu)(x_j - \mu)^\top \right)$$
and, using $\ln\det(\Sigma) = -\ln\det(V)$, the log-likelihood becomes
$$\ell = -\frac{nK}{2} \ln(2\pi) + \frac{n}{2} \ln\det(V) - \frac{1}{2} \operatorname{tr}\left( V \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^\top \right)$$
The gradient of the log-likelihood with respect to the precision matrix is
$$\nabla_V \, \ell = \frac{n}{2} (V^{-1})^\top - \frac{1}{2} \left( \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^\top \right)^\top$$
By transposing the whole expression and setting it equal to zero, we get
$$\frac{n}{2} V^{-1} - \frac{1}{2} \sum_{j=1}^{n} (x_j - \mu)(x_j - \mu)^\top = 0$$
that is, since $V^{-1} = \Sigma$,
$$\widehat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \widehat{\mu})(x_j - \widehat{\mu})^\top$$
Thus, the system of first order conditions is solved by
$$\widehat{\mu} = \frac{1}{n} \sum_{j=1}^{n} x_j \qquad \widehat{\Sigma} = \frac{1}{n} \sum_{j=1}^{n} (x_j - \widehat{\mu})(x_j - \widehat{\mu})^\top$$
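In code, the two estimators are just the sample mean and the sample covariance computed with divisor $n$ (a short NumPy sketch; the true parameter values and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true = np.array([0.5, -1.0, 2.0])
A = rng.standard_normal((3, 3))
Sigma_true = A @ A.T + np.eye(3)  # a positive definite covariance matrix
X = rng.multivariate_normal(mu_true, Sigma_true, size=5000)
n = X.shape[0]

mu_hat = X.mean(axis=0)            # MLE of the mean: the sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / n      # MLE of the covariance: divisor n, not n - 1

# identical to np.cov with ddof=0 (the maximum likelihood divisor)
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, ddof=0))
```

Note the divisor $n$: the MLE of the covariance matrix is biased in finite samples, unlike the usual unbiased estimator with divisor $n - 1$.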
We are now going to give a formula for the information matrix of the multivariate normal distribution, which will be used to derive the asymptotic covariance matrix of the maximum likelihood estimators.
Denote by $\theta$ the column vector of all parameters:
$$\theta = \begin{bmatrix} \mu \\ \operatorname{vec}(\Sigma) \end{bmatrix}$$
where $\operatorname{vec}(\Sigma)$ converts the matrix $\Sigma$ into a $K^2 \times 1$ column vector whose entries are taken from the first column of $\Sigma$, then from the second, and so on.
The log-likelihood of one observation from the sample can be written as
$$\ell(\theta ; x_j) = -\frac{K}{2} \ln(2\pi) - \frac{1}{2} \ln\det(\Sigma) - \frac{1}{2} (x_j - \mu)^\top \Sigma^{-1} (x_j - \mu)$$
The information matrix is
$$I(\theta) = \operatorname{E}\left[ \nabla_\theta \, \ell(\theta ; X_j) \, \nabla_\theta \, \ell(\theta ; X_j)^\top \right]$$
Define the vector
$$\mu_i = \frac{\partial \mu}{\partial \theta_i}$$
as follows:
if $\theta_i$ is an element of $\mu$, say the $k$-th, then the $k$-th entry of the vector $\mu_i$ is equal to $1$ and all the other entries are equal to $0$;
if $\theta_i$ is not an element of $\mu$, then all the entries of the vector $\mu_i$ are equal to $0$.
Define the matrix
$$\Sigma_i = \frac{\partial \Sigma}{\partial \theta_i}$$
as follows:
if $\theta_i$ is an element of $\operatorname{vec}(\Sigma)$, say the entry $\Sigma_{kl}$, then the $(k,l)$-th entry of the matrix $\Sigma_i$ is equal to $1$ and all the other entries are equal to $0$;
if $\theta_i$ is not an element of $\operatorname{vec}(\Sigma)$, then all the entries of the matrix $\Sigma_i$ are equal to $0$.
It can be proved (see, e.g., Pistone and Malagò 2015) that the $(i,j)$-th element of the information matrix is
$$I_{ij}(\theta) = \mu_i^\top \Sigma^{-1} \mu_j + \frac{1}{2} \operatorname{tr}\left( \Sigma^{-1} \Sigma_i \Sigma^{-1} \Sigma_j \right)$$
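The formula can be assembled programmatically. The sketch below builds $I(\theta)$ for $K = 2$ with arbitrary parameter values, treating all $K^2$ entries of $\operatorname{vec}(\Sigma)$ as separate parameters (this makes the matrix singular because of the symmetry of $\Sigma$, but it is still symmetric and exhibits the block structure of the formula):

```python
import numpy as np

K = 2
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
P = K + K * K  # theta stacks mu (K entries) on top of vec(Sigma) (K^2 entries)

def dmu(i):
    """Derivative of mu with respect to theta_i: a unit vector or zero."""
    v = np.zeros(K)
    if i < K:
        v[i] = 1.0
    return v

def dSigma(i):
    """Derivative of Sigma with respect to theta_i: a single-entry matrix or zero."""
    M = np.zeros((K, K))
    if i >= K:
        m = i - K
        M[m % K, m // K] = 1.0  # vec stacks columns: entry m is row m % K, column m // K
    return M

info = np.empty((P, P))
for i in range(P):
    for j in range(P):
        info[i, j] = dmu(i) @ Sigma_inv @ dmu(j) \
                     + 0.5 * np.trace(Sigma_inv @ dSigma(i) @ Sigma_inv @ dSigma(j))

# the information matrix is symmetric, and its mean block is Sigma^{-1}
assert np.allclose(info, info.T)
assert np.allclose(info[:K, :K], Sigma_inv)
```

The cross blocks between $\mu$ and $\operatorname{vec}(\Sigma)$ are zero, reflecting the fact that the mean and covariance parameters are information-orthogonal for the multivariate normal.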
The vector
$$\widehat{\theta}_n = \begin{bmatrix} \widehat{\mu}_n \\ \operatorname{vec}(\widehat{\Sigma}_n) \end{bmatrix}$$
is asymptotically normal with asymptotic mean equal to the true parameter value $\theta_0$ and asymptotic covariance matrix equal to
$$V = I(\theta_0)^{-1}$$
In more formal terms, $\sqrt{n} \left( \widehat{\theta}_n - \theta_0 \right)$ converges in distribution to a multivariate normal distribution with zero mean and covariance matrix $V$.
In other words, the distribution of the vector $\widehat{\theta}_n$ can be approximated by a multivariate normal distribution with mean $\theta_0$ and covariance matrix $\frac{1}{n} I(\theta_0)^{-1}$.
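For the mean block, the approximation can be checked by simulation: across repeated samples, $\sqrt{n} \left( \widehat{\mu}_n - \mu \right)$ should have covariance close to $\Sigma$ (a Monte Carlo sketch with arbitrary parameter values, sample size, and number of replications):

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
n, reps = 200, 2000

# distribution of sqrt(n) * (mu_hat - mu) across repeated samples of size n
devs = np.empty((reps, 2))
for r in range(reps):
    X = rng.multivariate_normal(mu, Sigma, size=n)
    devs[r] = np.sqrt(n) * (X.mean(axis=0) - mu)

emp_cov = np.cov(devs, rowvar=False)  # should be close to Sigma
```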
Pistone, G. and Malagò, L. (2015) "Information Geometry of the Gaussian Distribution in View of Stochastic Optimization", Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII, 150-162.