The logistic classification model (or logit model) is a binary classification model in which the conditional probability of one of the two possible realizations of the output variable is assumed to be equal to a linear combination of the input variables, transformed by the logistic function.
A logit model is often called a logistic regression model. However, in these lecture notes we prefer to stick to the convention (widespread in the machine learning community) of using the term regression only for conditional models in which the output variable is continuous. So we use the term classification here because in a logit model the output is discrete.
Suppose that we observe a sample of data $(y_i, x_i)$ for $i = 1, \ldots, N$.

Each observation in the sample is made up of:

an output variable denoted by $y_i$;

a $1 \times K$ vector of inputs, denoted by $x_i$.
It is assumed that the output $y_i$ can take only two values, either 1 or 0 (it is a Bernoulli random variable).
The probability that the output $y_i$ is equal to 1, conditional on the inputs $x_i$, is assumed to be
$$P(y_i = 1 \mid x_i) = S(x_i \beta)$$
where
$$S(t) = \frac{1}{1 + \exp(-t)}$$
is the logistic function and $\beta$ is a $K \times 1$ vector of coefficients.
It is immediate to see that the logistic function $S(t)$ is always positive. Furthermore, it is increasing and
$$\lim_{t \to -\infty} S(t) = 0, \qquad \lim_{t \to +\infty} S(t) = 1$$
so that it satisfies
$$0 < S(x_i \beta) < 1.$$
Thus, $S(x_i \beta)$ is a well-defined probability because it lies between 0 and 1.
Since probabilities need to sum up to 1, the probability that the output $y_i$ is equal to 0 (the only other possible realization of $y_i$) is
$$P(y_i = 0 \mid x_i) = 1 - S(x_i \beta).$$
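As a minimal numerical sketch (the inputs and coefficients below are hypothetical, chosen only for illustration), the following Python snippet shows how the logistic transform maps a linear combination of inputs into a well-defined pair of probabilities:

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical example: one observation with K = 3 inputs
# (intercept plus two regressors) and an assumed coefficient vector.
x_i = np.array([1.0, 2.5, -0.7])    # 1 x K vector of inputs
beta = np.array([0.2, 0.4, 1.1])    # K x 1 vector of coefficients

linear_combination = x_i @ beta     # can be any real number
p1 = logistic(linear_combination)   # P(y_i = 1 | x_i), always in (0, 1)
p0 = 1.0 - p1                       # P(y_i = 0 | x_i)

print(f"x_i * beta = {linear_combination:.4f}")
print(f"P(y=1|x)   = {p1:.4f}")
print(f"P(y=0|x)   = {p0:.4f}")    # the two probabilities sum to 1
```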
Why is the logistic classification model specified in this manner? Why is the logistic function used to transform the linear combination of inputs $x_i \beta$?
The simple answer is that we would like to do something similar to what we do in a linear regression model: use a linear combination of the inputs as our prediction of the output. However, our prediction needs to be a probability, and there is no guarantee that the linear combination $x_i \beta$ is between 0 and 1. Thus, we use the logistic function because it provides a convenient way of transforming $x_i \beta$ and forcing it to lie in the interval between 0 and 1.
We could have used other functions that enjoy properties similar to those of the logistic function. As a matter of fact, other popular classification models can be obtained by simply substituting the logistic function with another function and leaving everything else in the model unchanged. For example, by substituting the logistic function with the cumulative distribution function of a standard normal distribution, we obtain the so-called probit model.
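The following sketch illustrates the point, assuming SciPy is available: the logit and probit models differ only in the cumulative distribution function used to map the linear combination into a probability.

```python
import numpy as np
from scipy.stats import norm, logistic

# Hypothetical grid of values for the linear combination x_i * beta.
t = np.linspace(-4, 4, 9)

logit_probs = logistic.cdf(t)    # logistic cdf: the logit model's link
probit_probs = norm.cdf(t)       # standard normal cdf: the probit model's link

for ti, pl, pp in zip(t, logit_probs, probit_probs):
    print(f"t = {ti:5.1f}   logit: {pl:.4f}   probit: {pp:.4f}")
```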
Another way of thinking about the logit model is to define a latent variable (i.e., an unobserved variable)
$$y_i^\ast = x_i \beta + \varepsilon_i \tag{1}$$
where $\varepsilon_i$ is a random error term that adds noise to the relationship between the inputs $x_i$ and the variable $y_i^\ast$.

The latent variable $y_i^\ast$ is then assumed to determine the output $y_i$ as follows:
$$y_i = \begin{cases} 1 & \text{if } y_i^\ast > 0 \\ 0 & \text{if } y_i^\ast \leq 0. \end{cases} \tag{2}$$
From these assumptions and the additional assumption that $\varepsilon_i$ has a symmetric distribution around $0$, it follows that
$$P(y_i = 1 \mid x_i) = P(y_i^\ast > 0 \mid x_i) = P(\varepsilon_i > -x_i \beta \mid x_i) = P(\varepsilon_i < x_i \beta \mid x_i) = F(x_i \beta)$$
where $F$ is the cumulative distribution function of the error $\varepsilon_i$ (the third equality follows from the symmetry of the distribution of $\varepsilon_i$ around $0$).
It turns out that the logistic function used to define the logit model is the cumulative distribution function of a symmetric probability distribution called the standard logistic distribution. Therefore, the logit model can be written as a latent variable model, specified by equations (1) and (2) above, in which the error $\varepsilon_i$ has a standard logistic distribution.
By choosing different distributions for the error $\varepsilon_i$, we obtain other binary classification models. For example, if we assume that $\varepsilon_i$ has a standard normal distribution, then we obtain the so-called probit model.
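The latent-variable formulation can be checked by simulation. The sketch below (with a hypothetical coefficient vector) draws errors from a standard logistic distribution, generates outputs according to equations (1) and (2), and verifies that the empirical frequency of $y_i = 1$ matches the average of $S(x_i \beta)$ over the sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process following equations (1) and (2):
# y_star = x * beta + epsilon,  y = 1 if y_star > 0 else 0,
# with epsilon drawn from a standard logistic distribution.
N = 100_000
beta = np.array([0.5, -1.0])
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept + one regressor

epsilon = rng.logistic(loc=0.0, scale=1.0, size=N)     # standard logistic errors
y_star = X @ beta + epsilon                            # latent variable
y = (y_star > 0).astype(int)                           # observed binary output

# Since P(y=1|x) = S(x * beta), the empirical frequency of y = 1 should be
# close to the sample average of S(x * beta).
S = lambda t: 1.0 / (1.0 + np.exp(-t))
print("empirical P(y=1):   ", y.mean())
print("mean of S(X @ beta):", S(X @ beta).mean())
```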
The vector of coefficients $\beta$ is often estimated by maximum likelihood methods.
Assume that the observations $(y_i, x_i)$ in the sample are IID and denote the $N \times 1$ vector of all outputs by $y$ and the $N \times K$ matrix of all inputs by $X$ (its $i$-th row is $x_i$). The latter is assumed to have full rank.
It is possible to prove (see the lecture on Maximum likelihood estimation of the logit model) that the maximum likelihood estimator $\widehat{\beta}$ (when it exists) can be obtained by performing simple Newton-Raphson iterations as follows:

start from a guess $\beta_0$ (e.g., $\beta_0 = 0$);

recursively update the guess:
$$\beta_{t+1} = \beta_t + \left(X^\top W_t X\right)^{-1} X^\top \left(y - p_t\right)$$
where:
$$p_t = \begin{bmatrix} S(x_1 \beta_t) \\ \vdots \\ S(x_N \beta_t) \end{bmatrix}$$
and $W_t$ is an $N \times N$ diagonal matrix (i.e., having all off-diagonal elements equal to $0$) such that the elements on its diagonal are $S(x_1 \beta_t)\left[1 - S(x_1 \beta_t)\right], \ldots, S(x_N \beta_t)\left[1 - S(x_N \beta_t)\right]$;

stop when numerical convergence is achieved, that is, when the difference between $\beta_{t+1}$ and $\beta_t$ is so small as to be negligible;

set the maximum likelihood estimator $\widehat{\beta}$ equal to the last update (denote the last iteration by $T$, so that $\widehat{\beta} = \beta_T$).
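A minimal Python sketch of these iterations follows. The function name and the simulated data are hypothetical; the update rule is the one stated above (the weight matrix $W_t$ is applied element-wise rather than formed explicitly, which does not change the result):

```python
import numpy as np

def logistic(t):
    """Logistic function S(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit_newton_raphson(X, y, tol=1e-10, max_iter=100):
    """Newton-Raphson iterations for the logit MLE, as described above.

    X : (N, K) matrix of inputs, assumed to have full rank
    y : (N,) vector of 0/1 outputs
    """
    N, K = X.shape
    beta = np.zeros(K)                    # starting guess: beta_0 = 0
    for _ in range(max_iter):
        p = logistic(X @ beta)            # i-th entry: S(x_i * beta_t)
        w = p * (1.0 - p)                 # diagonal entries of W_t
        XtWX = X.T @ (w[:, None] * X)     # X' W_t X without forming W_t
        update = np.linalg.solve(XtWX, X.T @ (y - p))
        beta = beta + update              # beta_{t+1}
        if np.max(np.abs(update)) < tol:  # numerical convergence
            break
    return beta

# Hypothetical usage with simulated data:
rng = np.random.default_rng(0)
N = 5_000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
true_beta = np.array([0.5, 1.0, -2.0])
y = (rng.uniform(size=N) < logistic(X @ true_beta)).astype(int)

beta_hat = fit_logit_newton_raphson(X, y)
print("estimated coefficients:", beta_hat)
```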
The asymptotic covariance matrix of the maximum likelihood estimator $\widehat{\beta}$ can be consistently estimated by
$$\widehat{V} = N \left(X^\top W_T X\right)^{-1}$$
so that the distribution of the estimator $\widehat{\beta}$ is approximately normal with mean equal to $\beta$ and covariance matrix
$$\frac{\widehat{V}}{N} = \left(X^\top W_T X\right)^{-1}.$$
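As a sketch, this covariance matrix can be computed from the final iterate (the function name is hypothetical; it reuses the conventions of the previous snippet):

```python
import numpy as np

def logit_covariance(X, beta_hat):
    """Approximate covariance matrix of the logit MLE: (X' W_T X)^(-1),
    with W_T evaluated at the final iterate beta_hat."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))  # S(x_i * beta_hat)
    w = p * (1.0 - p)                          # diagonal entries of W_T
    return np.linalg.inv(X.T @ (w[:, None] * X))

# Continuing the hypothetical example above:
# cov = logit_covariance(X, beta_hat)
# standard_errors = np.sqrt(np.diag(cov))
```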
If the logit model is estimated with the maximum likelihood procedure illustrated above, any one of the classical tests based on maximum likelihood procedures (e.g., Wald, Likelihood Ratio, Lagrange Multiplier) can be used to test a hypothesis about the vector of coefficients $\beta$.
Other tests can be constructed by exploiting the asymptotic normality of the maximum likelihood estimator. For example, we can perform a z test to test the null hypothesis
$$H_0: \beta_k = q$$
where $\beta_k$ is the $k$-th entry of the vector of coefficients $\beta$ and $q \in \mathbb{R}$.
The test statistic is
$$z = \frac{\widehat{\beta}_k - q}{\sqrt{\left[\left(X^\top W_T X\right)^{-1}\right]_{kk}}}$$
where $\widehat{\beta}_k$ is the $k$-th entry of $\widehat{\beta}$ and $\left[\left(X^\top W_T X\right)^{-1}\right]_{kk}$ is the $k$-th entry on the diagonal of the matrix $\left(X^\top W_T X\right)^{-1}$.
As the sample size $N$ increases, $z$ converges in distribution to a standard normal distribution. The latter distribution can be used to derive critical values and perform the test.
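A sketch of the test, assuming SciPy is available and reusing the quantities from the previous snippets (the helper name is hypothetical):

```python
import numpy as np
from scipy.stats import norm

def z_test(beta_hat, cov, k, q=0.0):
    """z statistic for H0: beta_k = q, using the k-th diagonal entry of the
    estimated covariance matrix (X' W_T X)^(-1)."""
    z = (beta_hat[k] - q) / np.sqrt(cov[k, k])
    p_value = 2.0 * norm.sf(abs(z))  # two-sided p-value from N(0, 1)
    return z, p_value

# Continuing the hypothetical example above:
# z, p = z_test(beta_hat, logit_covariance(X, beta_hat), k=1, q=0.0)
```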
We have
$$z = \frac{\widehat{\beta}_k - q}{\sqrt{\left[\left(X^\top W_T X\right)^{-1}\right]_{kk}}} = \frac{\sqrt{N}\left(\widehat{\beta}_k - q\right)}{\sqrt{\widehat{V}_{kk}}}$$
where $\widehat{V}_{kk}$ is the $k$-th diagonal entry of $\widehat{V} = N\left(X^\top W_T X\right)^{-1}$. By the asymptotic normality of the maximum likelihood estimator, the numerator $\sqrt{N}\left(\widehat{\beta}_k - q\right)$ converges in distribution to a normal random variable with mean $0$ and variance $V_{kk}$ under the null hypothesis. Furthermore, the consistency of our estimator of the asymptotic covariance matrix implies that
$$\widehat{V}_{kk} \overset{p}{\longrightarrow} V_{kk}$$
where $\overset{p}{\longrightarrow}$ denotes convergence in probability. By the Continuous Mapping theorem,
$$\sqrt{\widehat{V}_{kk}} \overset{p}{\longrightarrow} \sqrt{V_{kk}}$$
and, by Slutsky's theorem, $z$ converges in distribution to a standard normal random variable.
Please cite as:
Taboga, Marco (2021). "Logistic classification model (logit or logistic regression)", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/logistic-classification-model.