Classification models belong to the class of conditional models, that is, probabilistic models that specify the conditional probability distributions of the output variables given the inputs. The peculiarity of classification models is that in these models the output has a discrete probability distribution (as opposed to regression models, where the output variable is continuous).
There are two different flavors of classification models:
binary classification models, where the output variable has a Bernoulli distribution conditional on the inputs;
multinomial classification models, where the output has a Multinoulli distribution conditional on the inputs.
Remember that a Bernoulli random variable can take only two values, either 1 or 0. So, a binary model is used when the output can take only two values.
The Multinoulli distribution is more general. It can be used to model outputs that can take two or more values. If the output variable can take different values, then it is represented as a Multinoulli random vector, that is, a random vector whose realizations have all entries equal to 0, except for the entry corresponding to the realized output value, which is equal to 1
Example If the output variable is sex (male of female), then it can be represented as a Bernoulli random variable that takes value 1 for males and 0 for females. It can also be represented as a Multinoulli random vector that takes value for males andfor females.
The previous example also shows that a binary classification model (Bernoulli distribution) can always be written as a multinomial model (Multinoulli distribution).
Example If the output variable can belong to one of three classes (red, green or blue), then it can be represented as a Multinoulli random vector whose realizations are
We now introduce the main assumptions, the notation and the terminology we are going to use to present the basics of classification models.
We assume that a sample of data for is observed by the statistician. The output variables are denoted by , and the associated inputs, which are vectors, are denoted by .
The output can take values . In the case of a binary model, , and . In the case of a multinomial model, and, for , is a vector whose entries are all equal to zero except for the -th entry, which is equal to .
We assume that there are functions , ..., such thatfor and . The conditional probability depends not only on the observed output but also on a vector of parameters .
Probabilities need to be non-negative and sum up to 1 (see Probability and its properties). As a consequence, the functions must be defined in such a way thatfor any couple .
Example The logistic classification model is a binary model in which the conditional probability mass function of the output is a non-linear function of the inputs :where is a vector of coefficients and is the logistic function defined byThus, conditional on , the output has a Bernoulli distribution with probability . Using the general notation proposed above and defining , we have:It can easily be checked that the probabilities sum up to 1 for any and any .
Example The multinomial logistic classification model (also called softmax model) is a multinomial model in which the conditional probabilities of the outputs are defined for aswhere to each class corresponds a vector of coefficients . The vector of parameters isThus, conditional on , the output has a Multinoulli distribution with probabilities
The parameters of a multinomial classification model can be estimated by maximum likelihood. The likelihood of an observation can be written aswhere is the -th component of the Multinoulli vector . Note that takes value 1 when the output variable belongs to the -th class and 0 otherwise. As a consequence, only one term in the product (the term corresponding to the observed class) can be different from 1. The latter fact is illustrated by the following example.
Example When there are two classes () and the output variable belongs to the second class, we have that the realization of the Multinoulli random vector isThe two components of the vector areand the likelihood is
Denote the vector of all outputs by and the matrix of all inputs by . If we assume that the observations in the sample are IID, then the likelihood of the entire sample is equal to the product of the likelihoods of the single observations:and the log-likelihood is
The maximum likelihood estimator of the parameter solves
In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for a detailed explanation of how this can be done). Often, derivatives based algorithms are used (see the aforementioned lecture for an explanation). For several classification models (e.g., the multinomial logistic model introduced in the example above) the use of derivatives based algorithms is facilitated by the fact that the gradient (i.e., the vector of derivatives) of the functions with respect to can be computed analytically, which allows to compute analytically also the gradient of the log-likelihood function by using the chain rule:
Most of the learning materials found on this website are now available in a traditional textbook format.