Classification models

by Marco Taboga, PhD

Classification models belong to the class of conditional models, that is, probabilistic models that specify the conditional probability distributions of the output variables given the inputs. What distinguishes classification models is that the output has a discrete probability distribution (as opposed to regression models, where the output variable is continuous).

Types of classification models

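There are two different flavors of classification models: binary classification models, in which the output has a Bernoulli distribution conditional on the inputs, and multinomial classification models, in which the output has a Multinoulli distribution conditional on the inputs.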

Remember that a Bernoulli random variable can take only two values, either 1 or 0. So, a binary model is used when the output can take only two values.

The Multinoulli distribution is more general. It can be used to model outputs that can take two or more values. If the output variable can take $J$ different values, then it is represented as a $1\times J$ Multinoulli random vector, that is, a random vector whose realizations have all entries equal to 0, except for the entry corresponding to the realized output value, which is equal to 1.

Example If the output variable is gender (male or female), then it can be represented as a Bernoulli random variable that takes value 1 for males and 0 for females. It can also be represented as a $1\times 2$ Multinoulli random vector that takes value $\begin{bmatrix} 1 & 0 \end{bmatrix}$ for males and $\begin{bmatrix} 0 & 1 \end{bmatrix}$ for females.

The previous example also shows that a binary classification model (Bernoulli distribution) can always be written as a multinomial model (Multinoulli distribution).

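Example If the output variable can belong to one of three classes (red, green or blue), then it can be represented as a Multinoulli random vector whose realizations are $\begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$, $\begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$ and $\begin{bmatrix} 0 & 0 & 1 \end{bmatrix}$.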
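The encoding used in the example can be sketched in a few lines of Python. This is only an illustration: the helper name `one_hot` and the ordering of the classes (red first, then green, then blue) are assumptions made here, since any fixed ordering of the classes would work equally well.

```python
def one_hot(label, classes):
    """Return the 1 x J Multinoulli realization associated with `label`.

    `classes` fixes the (arbitrary) ordering of the J possible values.
    """
    if label not in classes:
        raise ValueError(f"unknown class: {label!r}")
    # All entries are 0 except the one corresponding to the realized value.
    return [1 if c == label else 0 for c in classes]


classes = ["red", "green", "blue"]  # assumed ordering of the three classes
for c in classes:
    print(c, one_hot(c, classes))
```

Each encoded vector has exactly one entry equal to 1, so its entries always sum to 1.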

Main assumptions and notation

We now introduce the main assumptions, the notation and the terminology we are going to use to present the basics of classification models.

We assume that a sample of data $(y_{i},x_{i})$ for $i=1,\ldots ,N$ is observed by the statistician. The output variables are denoted by $y_{i}$, and the associated inputs, which are $1\times K$ vectors, are denoted by $x_{i}$.

The output can take $J$ values $c_{1},\ldots ,c_{J}$. In the case of a binary model, $J=2$, $c_{1}=1$ and $c_{2}=0$. In the case of a multinomial model, $J\geq 2$ and, for $j=1,\ldots ,J$, $c_{j}$ is a $1\times J$ vector whose entries are all equal to zero except for the $j$-th entry, which is equal to 1.

We assume that there are $J$ functions $f_{1},\ldots ,f_{J}$ such that $P(y_{i}=c_{j}\mid x_{i})=f_{j}(x_{i};\theta )$ for $i=1,\ldots ,N$ and $j=1,\ldots ,J$. The conditional probability depends not only on the observed inputs $x_{i}$ but also on a vector of parameters $\theta $.

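Probabilities need to be non-negative and sum up to 1 (see Probability and its properties). As a consequence, the functions $f_{j}$ must be defined in such a way that $f_{j}(x_{i};\theta )\geq 0$ for $j=1,\ldots ,J$ and $\sum_{j=1}^{J}f_{j}(x_{i};\theta )=1$ for any couple $(x_{i},\theta )$.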

Example The logistic classification model is a binary model in which the conditional probability mass function of the output $y_{i}$ is a non-linear function of the inputs $x_{i}$: $P(y_{i}=1\mid x_{i})=S(x_{i}\beta )$, where $\beta $ is a $K\times 1$ vector of coefficients and $S(t)$ is the logistic function defined by $S(t)=\frac{1}{1+\exp (-t)}$. Thus, conditional on $x_{i}$, the output $y_{i}$ has a Bernoulli distribution with probability $S(x_{i}\beta )$. Using the general notation proposed above and defining $\theta =\beta $, we have: $f_{1}(x_{i};\theta )=S(x_{i}\beta )$ and $f_{2}(x_{i};\theta )=1-S(x_{i}\beta )$. It can easily be checked that the probabilities sum up to 1 for any $x_{i}$ and any $\theta $.
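As a minimal sketch of the logistic model, the snippet below computes the two class probabilities for a single observation, taking the logistic function to be the standard one, S(t) = 1 / (1 + exp(-t)). The function names and the numerical values of the input and coefficient vectors are made up for illustration. The direct formula for the logistic function can overflow for very large negative arguments, which is ignored here for simplicity.

```python
import math


def logistic(t):
    """Standard logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def binary_probabilities(x_i, beta):
    """Return (f_1, f_2) = (P(y_i = 1 | x_i), P(y_i = 0 | x_i))."""
    score = sum(x * b for x, b in zip(x_i, beta))  # inner product x_i * beta
    p = logistic(score)
    return p, 1.0 - p


x_i = [1.0, 0.5, -2.0]   # hypothetical 1 x K input vector (K = 3)
beta = [0.2, -1.0, 0.3]  # hypothetical K x 1 coefficient vector
f1, f2 = binary_probabilities(x_i, beta)
print(f1, f2, f1 + f2)
```

Because the second probability is defined as one minus the first, the two values are non-negative and sum up to 1 whatever the inputs and coefficients are.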

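Example The multinomial logistic classification model (also called softmax model) is a multinomial model in which the conditional probabilities of the outputs are defined for $j=1,\ldots ,J$ as $f_{j}(x_{i};\theta )=\frac{\exp (x_{i}\beta _{j})}{\sum_{k=1}^{J}\exp (x_{i}\beta _{k})}$, where to each class $j$ corresponds a $K\times 1$ vector of coefficients $\beta _{j}$. The vector of parameters $\theta $ is $\theta =\begin{bmatrix} \beta _{1} & \ldots & \beta _{J} \end{bmatrix}$. Thus, conditional on $x_{i}$, the output $y_{i}$ has a Multinoulli distribution with probabilities $f_{1}(x_{i};\theta ),\ldots ,f_{J}(x_{i};\theta )$.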
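The softmax probabilities can be sketched as follows, using the usual definition in which the probability of class j is exp(x_i beta_j) divided by the sum of the same exponentials over all classes. The function name and the numbers are illustrative assumptions. Subtracting the largest score before exponentiating is a common numerical safeguard that is not part of the model itself: it leaves the ratios unchanged.

```python
import math


def softmax_probabilities(x_i, betas):
    """Return [f_1, ..., f_J] for one input vector x_i.

    `betas` is a list of J coefficient vectors, each of length K.
    """
    scores = [sum(x * b for x, b in zip(x_i, beta_j)) for beta_j in betas]
    m = max(scores)  # shifting by the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x_i = [1.0, 2.0]                               # hypothetical input (K = 2)
betas = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical coefficients (J = 3)
probs = softmax_probabilities(x_i, betas)
print(probs, sum(probs))
```

With these made-up values the three scores are 0, 1 and 2, so the third class receives the largest probability and the probabilities sum up to 1.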

Estimation by maximum likelihood

The parameters of a multinomial classification model can be estimated by maximum likelihood. The likelihood of an observation $(y_{i},x_{i})$ can be written as $L(\theta ;y_{i},x_{i})=\prod_{j=1}^{J}[f_{j}(x_{i};\theta )]^{y_{ij}}$, where $y_{ij}$ is the $j$-th component of the Multinoulli vector $y_{i}$. Note that $y_{ij}$ takes value 1 when the output variable belongs to the $j$-th class and 0 otherwise. As a consequence, only one term in the product (the term corresponding to the observed class) can be different from 1. The latter fact is illustrated by the following example.

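Example When there are two classes ($J=2$) and the output variable belongs to the second class, we have that the realization of the Multinoulli random vector is $y_{i}=\begin{bmatrix} 0 & 1 \end{bmatrix}$. The two components of the vector are $y_{i1}=0$ and $y_{i2}=1$, and the likelihood is $L(\theta ;y_{i},x_{i})=[f_{1}(x_{i};\theta )]^{0}[f_{2}(x_{i};\theta )]^{1}=f_{2}(x_{i};\theta )$.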
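The same calculation can be checked numerically. In the sketch below, the likelihood of one observation is computed as the product of the class probabilities, each raised to the corresponding component of the Multinoulli vector. The probabilities 0.3 and 0.7 are made-up values standing in for the two class probabilities, and the function name is hypothetical.

```python
def observation_likelihood(y_i, probs):
    """Product over j of probs[j] ** y_i[j], where y_i is a one-hot vector."""
    likelihood = 1.0
    for y_ij, f_j in zip(y_i, probs):
        # Factors with y_ij = 0 are equal to 1 and drop out of the product.
        likelihood *= f_j ** y_ij
    return likelihood


probs = [0.3, 0.7]  # hypothetical values of the two class probabilities
y_i = [0, 1]        # the output belongs to the second class
print(observation_likelihood(y_i, probs))  # prints 0.7
```

Only the probability of the observed class survives in the product, as stated in the example.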

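Denote the $N\times J$ matrix of all outputs (whose $i$-th row is $y_{i}$) by $y$ and the $N\times K$ matrix of all inputs by $x$. If we assume that the observations $(y_{i},x_{i})$ in the sample are IID, then the likelihood of the entire sample is equal to the product of the likelihoods of the single observations: $L(\theta ;y,x)=\prod_{i=1}^{N}\prod_{j=1}^{J}[f_{j}(x_{i};\theta )]^{y_{ij}}$, and the log-likelihood is $l(\theta ;y,x)=\ln L(\theta ;y,x)=\sum_{i=1}^{N}\sum_{j=1}^{J}y_{ij}\ln f_{j}(x_{i};\theta )$.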
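A minimal sketch of the sample log-likelihood follows, assuming the softmax model from the earlier example for the class probabilities. The log-likelihood is computed as the sum, over observations and classes, of each Multinoulli component times the logarithm of the corresponding class probability. The tiny data set and the function names are invented for illustration only.

```python
import math


def softmax_probabilities(x_i, betas):
    """Class probabilities of the softmax model for one input vector."""
    scores = [sum(x * b for x, b in zip(x_i, beta_j)) for beta_j in betas]
    m = max(scores)  # numerical safeguard, does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(betas, ys, xs):
    """Sum over i and j of y_ij * ln f_j(x_i; theta)."""
    total = 0.0
    for y_i, x_i in zip(ys, xs):
        probs = softmax_probabilities(x_i, betas)
        total += sum(y_ij * math.log(f_j) for y_ij, f_j in zip(y_i, probs))
    return total


# Hypothetical sample: N = 4 observations, K = 2 inputs, J = 3 classes.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ys = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
zero_betas = [[0.0, 0.0] for _ in range(3)]
print(log_likelihood(zero_betas, ys, xs))
```

When all coefficients are zero, every class has probability 1/J, so the log-likelihood equals N times the logarithm of 1/J. This gives a quick sanity check of the implementation.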

The maximum likelihood estimator $\widehat{\theta }$ of the parameter $\theta $ solves $\widehat{\theta }=\arg \max_{\theta }l(\theta ;y,x)$.

In general, there is no analytical solution of this maximization problem and a solution must be found numerically (see the lecture entitled Maximum likelihood algorithm for a detailed explanation of how this can be done). Often, derivative-based algorithms are used (see the aforementioned lecture for an explanation). For several classification models (e.g., the multinomial logistic model introduced in the example above) the use of derivative-based algorithms is facilitated by the fact that the gradient (i.e., the vector of derivatives) of the functions $f_{j}(x_{i};\theta )$ with respect to $\theta $ can be computed analytically, which also allows us to compute the gradient of the log-likelihood function analytically by using the chain rule: $\nabla _{\theta }l(\theta ;y,x)=\sum_{i=1}^{N}\sum_{j=1}^{J}y_{ij}\frac{1}{f_{j}(x_{i};\theta )}\nabla _{\theta }f_{j}(x_{i};\theta )$.
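To illustrate a derivative-based approach, the sketch below maximizes the log-likelihood of a softmax model by plain gradient ascent with a fixed step size. Several things here are assumptions rather than part of the lecture: the choice of gradient ascent (other algorithms could be used), the step size and number of iterations, the made-up data, and all function names. For the softmax model, applying the chain rule yields the well-known closed form in which the derivative with respect to the coefficients of class j is the sum over observations of (y_ij - f_j) times x_i, and that is what the code uses.

```python
import math


def softmax_probabilities(x_i, betas):
    scores = [sum(x * b for x, b in zip(x_i, beta_j)) for beta_j in betas]
    m = max(scores)  # numerical safeguard
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(betas, ys, xs):
    return sum(
        y_ij * math.log(f_j)
        for y_i, x_i in zip(ys, xs)
        for y_ij, f_j in zip(y_i, softmax_probabilities(x_i, betas))
    )


def gradient(betas, ys, xs):
    """Analytical gradient: d l / d beta_j = sum_i (y_ij - f_j(x_i)) * x_i."""
    grad = [[0.0] * len(beta_j) for beta_j in betas]
    for y_i, x_i in zip(ys, xs):
        probs = softmax_probabilities(x_i, betas)
        for j, (y_ij, f_j) in enumerate(zip(y_i, probs)):
            for k, x_ik in enumerate(x_i):
                grad[j][k] += (y_ij - f_j) * x_ik
    return grad


def gradient_ascent(ys, xs, n_classes, step=0.05, n_iter=500):
    """Fixed-step gradient ascent starting from all-zero coefficients."""
    betas = [[0.0] * len(xs[0]) for _ in range(n_classes)]
    for _ in range(n_iter):
        grad = gradient(betas, ys, xs)
        betas = [
            [b + step * g for b, g in zip(beta_j, grad_j)]
            for beta_j, grad_j in zip(betas, grad)
        ]
    return betas


# Hypothetical sample with overlapping classes (N = 6, K = 2, J = 2).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 1.5], [1.0, 0.5]]
ys = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]]

start = [[0.0, 0.0], [0.0, 0.0]]
fitted = gradient_ascent(ys, xs, n_classes=2)
print(log_likelihood(start, ys, xs), log_likelihood(fitted, ys, xs))
```

The log-likelihood at the fitted coefficients is higher than at the starting point. In practice, a library optimizer with a line search or second-order information would usually converge faster than this fixed-step scheme.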

How to cite

Please cite as:

Taboga, Marco (2021). "Classification models", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
