Overfitting

by Marco Taboga, PhD

Machine learning methods are very good at dealing with overfitting, that is, the tendency of statistical models to accurately fit previously seen data and to poorly predict previously unseen data.

Parametric models

Remember that we have defined a predictive model as a function $f$ that takes an input $x_{t}$ as argument and returns a prediction $\widetilde{y}_{t}$ of the true output $y_{t}$.

In what follows, we are going to deal with parametric models, that is, families of models $f(x_{t};\theta)$ indexed by a parameter vector $\theta$.

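In a parametric model, predicted outputs depend on the parameter, $\widetilde{y}_{t}=f(x_{t};\theta)$, and so do the losses $L(y_{t},f(x_{t};\theta))$, the risk $R(\theta)=\mathrm{E}[L(y_{t},f(x_{t};\theta))]$ and the empirical risk $\widehat{R}(\theta)=\frac{1}{T}\sum_{t=1}^{T}L(y_{t},f(x_{t};\theta))$, where $T$ is the sample size.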
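The dependence of predictions, losses and empirical risk on the parameter can be made concrete with a minimal sketch. It assumes a linear model and the squared error loss; the function names, the simulated data and the parameter values are hypothetical and serve only as an illustration.

```python
import numpy as np

def f(x, theta):
    """Linear parametric model: the prediction is the inner product of x and theta."""
    return x @ theta

def squared_loss(y, y_pred):
    """Squared error loss for each observation."""
    return (y - y_pred) ** 2

def empirical_risk(theta, X, y):
    """Average loss over the sample (the rows of X are the inputs x_t)."""
    return float(np.mean(squared_loss(y, f(X, theta))))

# Simulated data (assumed data-generating process, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + rng.normal(scale=0.1, size=100)

# The empirical risk is a function of theta: different parameters, different values.
risk_at_truth = empirical_risk(theta_true, X, y)
risk_at_zero = empirical_risk(np.zeros(3), X, y)
print(risk_at_truth, risk_at_zero)
```

Changing `theta` changes every prediction and therefore the average loss, which is what makes the empirical risk a natural objective to minimize.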

Typical examples of parametric models are:

Mathematics of overfitting

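When we estimate the parameter $\theta$ naively by empirical risk minimization, we search for a solution of the problem $\min_{\theta}\widehat{R}(\theta)$, where $\widehat{R}(\theta)$ denotes the empirical risk.

But this is equivalent to $\min_{\theta}\left\{ R(\theta)+\left[\widehat{R}(\theta)-R(\theta)\right]\right\}$, where $R(\theta)$ denotes the true risk.

Thus, we are minimizing the sum of two terms:

  1. the true risk of the predictive model $R(\theta)$;

  2. the term $\widehat{R}(\theta)-R(\theta)$, which is a measure of the reliability of the empirical risk $\widehat{R}(\theta)$ as an estimate of the true risk $R(\theta)$. The smaller this term is, the more $\widehat{R}(\theta)$ under-estimates the true risk (the less its reliability).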

Clearly, we would like to minimize only the first term, while minimizing the second one (reliability) is detrimental. Unfortunately, we are minimizing both!

There is no general way to tell how much the second term contributes to the minimum we attain, but in certain cases its contribution can be substantial. In other words, our estimate of the expected loss, performed with the same sample used to estimate $\theta$, can be over-optimistic! This is called overfitting.

In the above formulae, we can replace the true risk with the average loss on a sample (the so-called test sample) different from the one used to estimate $\theta$ (the so-called estimation sample). By doing so, we can see that, when we perform empirical risk minimization naively, we are also maximizing the disappointment experienced on the test sample, that is, the amount by which the average loss on the test sample exceeds the average loss on the estimation sample.

Another way to see the overfitting problem is that the empirical risk provides a biased estimate of the true risk when it is computed with the same sample used to train our models (in special cases the bias can be derived analytically).

Important: when the predictive model is a linear regression model and the loss function is the squared error, then naive empirical risk minimization is the same as OLS (ordinary least squares) estimation. This is one of the special cases in which the bias of the empirical risk can be derived analytically.
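The over-optimism of the empirical risk in the OLS case can be checked with a small simulation. The sketch below assumes Gaussian regressors and homoskedastic Gaussian errors; the sample size `T`, the number of regressors `K` and the noise level `sigma` are arbitrary illustrative choices. Under these assumptions, a standard result is that the expected in-sample mean squared error equals `sigma**2 * (1 - K / T)`, which is below the error variance.

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, sigma = 30, 5, 1.0          # sample size, regressors, noise std (assumed values)
theta_true = np.ones(K)
n_reps = 2000

train_mse, test_mse = [], []
for _ in range(n_reps):
    # Estimation sample.
    X = rng.normal(size=(T, K))
    y = X @ theta_true + rng.normal(scale=sigma, size=T)
    # Naive empirical risk minimization with squared error = OLS.
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Test sample drawn from the same distribution.
    X_new = rng.normal(size=(T, K))
    y_new = X_new @ theta_true + rng.normal(scale=sigma, size=T)
    train_mse.append(np.mean((y - X @ theta_hat) ** 2))
    test_mse.append(np.mean((y_new - X_new @ theta_hat) ** 2))

mean_train = float(np.mean(train_mse))
mean_test = float(np.mean(test_mse))
expected_train = sigma**2 * (1 - K / T)   # analytical expected in-sample MSE
print(mean_train, expected_train, mean_test)
```

On average, the loss computed on the estimation sample falls below the error variance, while the loss on the test sample rises above it. The difference between the two is the disappointment discussed above.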

What machine learning does: in the typical machine learning workflow, overfitting is constantly kept under control during the minimization process, so as to avoid it as much as possible.

Overfitting and model complexity

Typically, but not necessarily, overfitting tends to be more severe when models are highly complex (i.e., the dimension of the parameter vector $\theta$ is large).

A classical example is provided by a simple linear regression of a few data points on time versus a more complex linear regression on time and several of its powers (polynomial terms).

A more complex model typically has better in-sample performance, but worse out-of-sample performance: overfitting in-sample leads to inaccurate out-of-sample predictions.
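The comparison between a simple and a polynomial regression on time can be reproduced with a short simulation. This is only a sketch: the linear trend, the noise level, the ten equally spaced time points and the polynomial degree 6 are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)   # ten equally spaced "time" points (assumed)
n_reps = 200

def true_mean(x):
    """Assumed true relation between time and the output: a linear trend."""
    return 1.0 + 2.0 * x

# Degree 1 = simple regression on time; degree 6 = regression on time and its powers.
results = {1: {"train": [], "test": []}, 6: {"train": [], "test": []}}
for _ in range(n_reps):
    y_train = true_mean(x_train) + rng.normal(scale=0.5, size=x_train.size)
    x_test = rng.uniform(0.0, 1.0, size=200)
    y_test = true_mean(x_test) + rng.normal(scale=0.5, size=x_test.size)
    for degree in results:
        p = Polynomial.fit(x_train, y_train, degree)   # least-squares polynomial fit
        results[degree]["train"].append(np.mean((y_train - p(x_train)) ** 2))
        results[degree]["test"].append(np.mean((y_test - p(x_test)) ** 2))

# Average in-sample and out-of-sample mean squared errors for each degree.
summary = {d: {k: float(np.mean(v)) for k, v in r.items()} for d, r in results.items()}
print(summary)
```

In this setup the polynomial model attains the lower average in-sample error and the higher average out-of-sample error, even though the test points lie in the same range as the estimation points.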

Curse of dimensionality

Another reason why models with many parameters tend to overfit is that they are affected by the so-called curse of dimensionality.

For concreteness, suppose that the number of parameters is the same as the number of input variables, as in linear regression (where the number of regressors is equal to the number of regression coefficients, which are the parameters to be estimated).

The percentage of regions of the space of inputs covered by our data set decreases exponentially in the number of inputs (parameters).

Coverage of the space decreases exponentially in the number of dimensions.

As a consequence, as we increase the dimensionality of our model, we increase the probability that new data points (out-of-sample data) will belong to regions of space that were not covered by our estimation data and where things may work very differently than in the regions we covered.

This exponential decrease in coverage is called the curse of dimensionality.

A related phenomenon is that the average distance between new points and the points covered by the estimation sample tends to increase with the dimension of the model.
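Both phenomena can be illustrated numerically. The sketch below assumes points drawn uniformly from the unit hypercube and defines a "region" as one of the `2**d` cells obtained by splitting each axis in two halves; the sample sizes and the list of dimensions are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_new = 500, 200          # estimation sample and new points (assumed sizes)
dims = [1, 2, 5, 10, 15]

coverage, mean_nn_dist = [], []
for d in dims:
    train = rng.uniform(size=(n_train, d))
    new = rng.uniform(size=(n_new, d))
    # Splitting each axis in two halves cuts the unit cube into 2**d cells.
    # A cell is covered if at least one estimation point falls inside it.
    cells = {tuple(row) for row in (train >= 0.5).astype(int)}
    coverage.append(len(cells) / 2**d)
    # Euclidean distance from each new point to its nearest estimation point.
    diffs = new[:, None, :] - train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    mean_nn_dist.append(float(dists.min(axis=1).mean()))

for d, c, dist in zip(dims, coverage, mean_nn_dist):
    print(f"d={d:2d}  coverage={c:.4f}  mean nearest distance={dist:.4f}")
```

With a fixed number of points, the fraction of covered cells can be at most `n_train / 2**d`, so it must collapse as `d` grows, while new points end up farther and farther from anything seen during estimation.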

Bias-variance decomposition

Further insights about the possible pitfalls of highly parametrized models can be derived from the so-called bias-variance decomposition.

Suppose that the loss function is the squared error.

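Then, the best possible prediction of $y_{t}$ is the conditional expectation $\mathrm{E}[y_{t}\mid x_{t}]$.

It can be proved that $\mathrm{E}[(y_{t}-\widetilde{y}_{t})^{2}\mid x_{t}]=\mathrm{Var}[y_{t}\mid x_{t}]+(\mathrm{E}[\widetilde{y}_{t}\mid x_{t}]-\mathrm{E}[y_{t}\mid x_{t}])^{2}+\mathrm{Var}[\widetilde{y}_{t}\mid x_{t}]$, that is, the expected squared error is the sum of an irreducible error, the squared bias of the prediction and the variance of the prediction. This is called bias-variance decomposition.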

Note that the variance of the prediction is caused by sampling variability (i.e., the variance is obtained by integrating with respect to the probability distribution of the sample used for training the predictive model).

Illustration of bias and variance.

It turns out that in many settings variance is an increasing function of model complexity (number of parameters), while bias is a decreasing function. As a consequence, there is a trade-off between the two.

Beyond the point of optimal balance between the two (i.e., if complexity is increased too much), performance degrades.

Illustration of the trade-off between bias and variance.
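The trade-off can be estimated by simulation: draw many estimation samples, refit the model each time, and measure how the predictions vary around their mean (variance) and how far that mean is from the conditional expectation (squared bias). The sketch below assumes a sine-shaped conditional expectation, a fixed design of 20 points and polynomial models of degree 1, 3 and 9; all of these are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 20)   # fixed inputs of the estimation sample (assumed)
x_grid = np.linspace(0.0, 1.0, 50)    # inputs at which predictions are evaluated
sigma, n_reps = 0.3, 500

def cond_mean(x):
    """Assumed conditional expectation of the output given the input."""
    return np.sin(2 * np.pi * x)

bias_sq, variance = {}, {}
for degree in (1, 3, 9):
    preds = np.empty((n_reps, x_grid.size))
    for r in range(n_reps):
        # A new estimation sample: same inputs, fresh noise (sampling variability).
        y_train = cond_mean(x_train) + rng.normal(scale=sigma, size=x_train.size)
        preds[r] = Polynomial.fit(x_train, y_train, degree)(x_grid)
    # Squared bias and variance of the predictions, averaged over the grid.
    bias_sq[degree] = float(np.mean((preds.mean(axis=0) - cond_mean(x_grid)) ** 2))
    variance[degree] = float(np.mean(preds.var(axis=0)))

for degree in bias_sq:
    print(degree, bias_sq[degree], variance[degree])
```

In this example the estimated squared bias shrinks and the estimated variance grows as the degree increases, which is the pattern described above. The sum of the two, plus the noise variance `sigma**2`, approximates the expected squared error.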

Overfitting in linear regressions estimated by OLS

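In linear regression with $K$ regressors (and zero-mean variables), the true vector of regression coefficients is $\theta=\Sigma_{xx}^{-1}\sigma_{xy}$, where $\Sigma_{xx}$ is the $K\times K$ covariance matrix of the inputs and $\sigma_{xy}$ is the $K\times 1$ vector of covariances between the inputs and the output.

The OLS estimator is $\widehat{\theta}=(X^{\top}X)^{-1}X^{\top}y$, where the matrix $X$ and the vector $y$ are obtained by stacking inputs and outputs.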

Roughly speaking, $X^{\top}X$ estimates $\Sigma_{xx}$ and $X^{\top}y$ estimates $\sigma_{xy}$.

The number of covariances between regressors to estimate, that is, the entries of $\Sigma_{xx}$ (and potential sources of error in computing $\widehat{\theta}$), grows with $K^{2}$.

The fact that the sources of error grow with the square of the number of parameters is one of the reasons why OLS regressions with many regressors overfit data and work poorly (unless you have tons of data).
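The growth of estimation error with the number of regressors can be sketched as follows. The code assumes zero-mean Gaussian regressors whose true covariance matrix is the identity and a fixed sample size `T`; the helper names `ols` and `n_distinct_covariances` are hypothetical. Under these assumptions the expected total squared error of the sample covariance matrix is `K * (K + 1) / T`, so it grows with the square of `K`.

```python
import numpy as np

def ols(X, y):
    """OLS estimator obtained by solving the normal equations (X'X) theta = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def n_distinct_covariances(K):
    """Number of distinct entries of the symmetric K x K covariance matrix."""
    return K * (K + 1) // 2

rng = np.random.default_rng(0)
T = 100                                # fixed sample size (assumed)
cov_error = {}
for K in (5, 20, 40):
    errs = []
    for _ in range(200):
        X = rng.normal(size=(T, K))    # zero-mean regressors, true covariance = identity
        S = X.T @ X / T                # sample estimate of the covariance matrix
        errs.append(np.sum((S - np.eye(K)) ** 2))   # total squared estimation error
    cov_error[K] = float(np.mean(errs))
    print(K, n_distinct_covariances(K), cov_error[K])

# Sanity check of the estimator against NumPy's least-squares solver.
X = rng.normal(size=(T, 5))
y = X @ np.arange(1.0, 6.0) + rng.normal(size=T)
theta_hat = ols(X, y)
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the normal equations directly mirrors the formula for the OLS estimator; in practice a least-squares routine such as `np.linalg.lstsq` is usually preferred for numerical stability.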

Sample size

An important thing to note is that all of the aforementioned problems are very severe with small sample sizes, but they tend to become less severe when the sample size increases.
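The effect of the sample size can be checked with a simulation that keeps the model fixed and varies only the number of observations. The sketch assumes a linear regression with `K = 10` Gaussian regressors estimated by OLS; the sample sizes and the other values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma, n_reps = 10, 1.0, 300       # regressors, noise std, replications (assumed)
theta_true = np.ones(K)

gap = {}
for T in (20, 100, 1000):
    gaps = []
    for _ in range(n_reps):
        # Estimation sample of size T.
        X = rng.normal(size=(T, K))
        y = X @ theta_true + rng.normal(scale=sigma, size=T)
        theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
        # Large test sample from the same distribution.
        X_new = rng.normal(size=(1000, K))
        y_new = X_new @ theta_true + rng.normal(scale=sigma, size=1000)
        train = np.mean((y - X @ theta_hat) ** 2)
        test = np.mean((y_new - X_new @ theta_hat) ** 2)
        gaps.append(test - train)
    # Average difference between out-of-sample and in-sample mean squared error.
    gap[T] = float(np.mean(gaps))

print(gap)
```

The average gap between out-of-sample and in-sample error is large when the sample size is close to the number of parameters and shrinks as the sample grows.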

How to cite

Please cite as:

Taboga, Marco (2021). "Overfitting", Lectures on machine learning. https://www.statlect.com/machine-learning/overfitting.
