# Gaussian mixture - Maximum likelihood estimation

In this lecture we show how to perform maximum likelihood estimation of a Gaussian mixture model with the Expectation-Maximization (EM) algorithm.

At the end of the lecture we discuss practically relevant aspects of the algorithm such as the initialization of parameters and the stopping criterion.

## The sample

The sample is made of independently and identically distributed draws from a mixture of -dimensional multivariate normal distributions.

## The mixture

The joint probability density function of the -th observation iswhere:

• is the probability of the -th component of the mixture;

• the vector is the mean vector of the -th component;

• the matrix is the covariance matrix of the -th component.

The probabilities of the components of the mixture are non-negative and sum up to :

The covariance matrices are assumed to be positive definite, so that their determinants are strictly positive.

## The vector of parameters

We will denote by the vector that gathers all the parameters of the model, that is,

## The mixture as a latent variable model

We can write the Gaussian mixture model as a latent-variable model:where:

• the observable variables are conditionally multivariate normal with mean and variance :

• the latent variables have the discrete distribution for .

In the formulae above we have explicitly written the value of the latent variable on which we are conditioning (). However, in what follows we will also use the notations and if the value taken by is implicitly given by the context.

## The EM algorithm

Since we are able to write the Gaussian mixture model as a latent-variable model, we can use the EM algorithm to find the maximum likelihood estimators of its parameters.

Starting from an initial guess of the parameter vector , the algorithm produces a new estimate of the parameter vector at each iteration .

The -th iteration consists of two steps:

1. the Expectation step, where we compute the conditional probabilities of the latent variables using the vector from the previous iteration;

2. the Maximization step, where we maximize the expectation of the complete-data log-likelihood, computed with respect to the conditional probabilities found in the Expectation step. The result of the maximization is a new parameter vector .

The iterations end when a stopping criterion is met (e.g., when the increases in the log-likelihood or the changes in the parameter vector become smaller than a certain threshold). See below for more details.

## Some notation

We denote by and the vectors obtained by stacking all the observed and latent variables and respectively.

Moreover, we use double subscripts for the various parameters. The first subscript denotes the mixture component , and the second one the iteration number .

For example, at the -th iteration, the parameter vector contains an estimate of the covariance matrix of the -th component of the mixture, which we denote by

## The E step

In the E step, the conditional probabilities of the components of the mixture are calculated separately for each observation:where

Proof

The general formula for the calculation of conditional probabilities in the E step iswhere is the set of all possible values that can take. In our case, . Since the observations are independent, we haveandTherefore,where

We then use the conditional probabilities to compute the expected value of the complete log-likelihood:

Proof

The expected value can be computed as follows:where: in step we have used the standard E-step formula for computing the expectation of the complete log-likelihood (as above, is the set of all possible values that the vector of unobservable variables can take); in step we have exploited the independence of the observations; in step we have used the fact that is the expected value of under the joint distribution of given and ; therefore, we can use the marginal distribution of given and to compute the expected value of each term in the sum. Moreover, we can write

## The M step

In the M step, we solve the maximization problem

The solution is

Proof

In order to solve the problem, we need to equate to zero the gradients of with respect to the various components of the vector . We start with the probabilities of the components of the mixture, which need to satisfy the constraintWe take care of the constraint by performing the substitutionTherefore,for . We can writewhich is solved byThe first-order conditions for the mean vectors arewhich are solved byThe first-order conditions for the inverse covariance matrices (see this lecture for more details) arewhich are solved by

## Convergence and multiple starts

As we explained in the lecture on the EM algorithm, while the likelihood is guaranteed to increase at each iteration, there is no guarantee that the algorithm converges to a global maximum of the likelihood.

For this reason, we often use the multiple-starts approach:

• we run the EM algorithm several times with different random initializations of the parameter vector ;

• our final estimate of the parameter vector is that which achieves the highest value of the incomplete log-likelihood

## Initialization

Several papers discuss how to choose the starting value (see the references below).

According to our experience, when we use the multiple-starts approach, the best way to draw random initializations of is as follows.

At the beginning of each run of the EM algorithm, we repeat the following steps for :

• we draw a small random sub-sample from our sample of observations (the numerosity of the sub-sampe could be, e.g., );

• we set the starting values and equal to the sample means and variances of on the sub-sample;

• we set .

## Stopping criterion

There is a vast literature on the stopping criteria used to decide when to end the EM iterations (see below).

However, in our experience, the most robust method is to stop the iterations whenwhere is a pre-specified threshold ( generally works very well).

In other words, we stop the algorithm when none of the conditional probabilities computed in the E step changes significantly with respect to the previous iteration (note the iteration subscripts and ).

If you check the formulae in the M step, you will realize that the parameter estimates are a function of the observations and of the conditional probabilities . Therefore, insignificant changes in the latter imply that also parameter estimates are stable.

## Degenerate solutions and numerical problems

The EM algorithm can sometimes converge to degenerate solutions in which the covariance matrix of one of the components of the mixture is singular and the log-likelihood is infinite (most likely resulting in a NaN on computers).

In our experience, imposing constraints in the M step to avoid such singularities can seriously harm the convergence properties of the EM algorithm.

Our advice is to simply terminate any run of the algorithm that gives rise to singularities or NaNs and proceed to next run (with a new random initialization, according to the multiple-starts approach described above).

If we find that we often incur in singularities, we simply increase the number of runs of the algorithm.

## References

We have provided our own view about the best initialization method and stopping criterion. However, you are invited to consult the references below about these two topics.

### Initialization methods

Biernacki, C., Celeux, G. and Govaert, G., 2003. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics & Data Analysis, 41(3-4), pp.561-575.

Blömer, J. and Bujna, K., 2013. Simple methods for initializing the em algorithm for gaussian mixture models. CoRR.

Kwedlo, W., 2013. A new method for random initialization of the EM algorithm for multivariate Gaussian mixture learning. In Proceedings of the 8th International Conference on Computer Recognition Systems CORES 2013 (pp. 81-90). Springer, Heidelberg.

McKenzie, P. and Alder, M., 1994. Initializing the EM algorithm for use in Gaussian mixture modelling. Pattern recognition in practice IV, pp.91-105.

Paclík, P. and Novovičová, J., 2001. Number of components and initialization in Gaussian mixture model for pattern recognition. In Artificial Neural Nets and Genetic Algorithms (pp. 406-409). Springer, Vienna.

Shireman, E., Steinley, D. and Brusco, M.J., 2017. Examining the effect of initialization strategies on the performance of Gaussian mixture modeling. Behavior research methods, 49(1), pp.282-293.

### Stopping criteria

Abbi, R., El-Darzi, E., Vasilakis, C. and Millard, P., 2008, September. Analysis of stopping criteria for the EM algorithm in the context of patient grouping according to length of stay. In 2008 4th International IEEE Conference Intelligent Systems (Vol. 1, pp. 3-9). IEEE.

Kontaxakis, G. and Tzanakos, G., 1992, October. Study of the convergence properties of the EM algorithm-a new stopping rule. In IEEE Conference on Nuclear Science Symposium and Medical Imaging (pp. 1163-1165). IEEE.

Kontaxakis, G. and Tzanakos, G., 1993, March. Further study of a stopping rule for the EM algorithm. In 1993 IEEE Annual Northeast Bioengineering Conference (pp. 52-53). IEEE.