
Dummy variable

by Marco Taboga, PhD

In regression analysis, a dummy variable is a regressor that can take only two values: either 1 or 0.

Dummy variables are typically used to encode categorical features.


Suppose that we want to analyze how personal income is affected by the possession of a postgraduate degree.

To do so, we can specify a linear regression model as follows:
$$y_{i}=\beta_{1}+\beta_{2}d_{i}+\varepsilon_{i}$$
where:

  1. $y_{i}$ is the income of individual $i$;

  2. $d_{i}$ is a dummy variable equal to 1 if individual $i$ has a postgraduate degree and to 0 otherwise;

  3. $\varepsilon_{i}$ is an error term;

  4. $\beta_{1}$ and $\beta_{2}$ are regression coefficients.


In the previous example, $\beta_{2}$ is the regression coefficient of the dummy variable. It measures by how much, on average, a postgraduate degree raises income.

In general, the regression coefficient on a dummy variable measures the average difference in $y_{i}$ between the observations where the dummy is equal to 1 and the base case in which the dummy is equal to 0.
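To illustrate, here is a minimal NumPy sketch with made-up income figures: when we regress income on an intercept and a single dummy, the dummy coefficient comes out as the difference between the two group means.

```python
import numpy as np

# Hypothetical incomes (in thousands): the first four individuals have
# no postgraduate degree, the last three do.
y = np.array([30.0, 35.0, 32.0, 33.0, 50.0, 55.0, 51.0])
d = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # 1 = postgraduate degree

# Design matrix: a column of 1s (intercept) plus the dummy.
X = np.column_stack([np.ones_like(y), d])

# OLS fit: beta[0] is the intercept, beta[1] the dummy coefficient.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta[1])                                 # dummy coefficient: 19.5
print(y[d == 1].mean() - y[d == 0].mean())     # difference of group means: 19.5
```

The intercept `beta[0]` is the mean income of the base group (dummy equal to 0), here 32.5.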

Matrix form

Let us continue with the previous example to see what a dummy variable looks like when the data are gathered in a matrix or table.

Suppose that we have observed a sample of $n$ individuals.


After encoding the categorical variable with a dummy, the vector $y$ of dependent variables and the matrix of regressors $X$ (the so-called design matrix) will be
$$y=\begin{bmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{bmatrix},\qquad X=\begin{bmatrix}1 & d_{1}\\ 1 & d_{2}\\ \vdots & \vdots\\ 1 & d_{n}\end{bmatrix}$$

Note that the first column contains all 1s because we have included an intercept in the regression.
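As a sketch of how such a design matrix can be built in code (the data here are hypothetical), we can turn the categorical feature into a 0/1 column and prepend the intercept:

```python
import numpy as np

# Hypothetical sample: does each individual hold a postgraduate degree?
postgrad = np.array(["no", "yes", "no", "yes", "yes"])

# Encode the categorical feature as a 0/1 dummy.
d = (postgrad == "yes").astype(float)

# Design matrix: a column of 1s for the intercept, then the dummy.
X = np.column_stack([np.ones(len(d)), d])
print(X)
```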


We might be tempted to include two dummies in our regression:

  1. a first dummy $d_{1,i}$ that is equal to 1 if individual $i$ has a higher degree and 0 otherwise;

  2. a second dummy $d_{2,i}$ that is equal to 1 if individual $i$ does not have a higher degree and 0 otherwise.

In our previous example, the design matrix would become
$$X=\begin{bmatrix}1 & d_{1,1} & d_{2,1}\\ 1 & d_{1,2} & d_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & d_{1,n} & d_{2,n}\end{bmatrix}$$

The problem with this double encoding is that our regressors become perfectly multicollinear, that is, one of the columns of X is equal to a linear combination of the other columns.

In our example, the column of 1s is the sum of the two dummy columns:
$$d_{1,i}+d_{2,i}=1\quad\text{for every }i.$$

With perfect multicollinearity, the matrix $X^{\top}X$ becomes singular, which implies that we cannot estimate the regression coefficients with Ordinary Least Squares (OLS).

In fact, the OLS estimator $\widehat{\beta}=\left(X^{\top}X\right)^{-1}X^{\top}y$ can be computed only if $X$ is full-rank. We can still compute estimators, such as the Ridge estimator, that do not require $X$ to be full-rank.
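A quick NumPy sketch (with made-up numbers) of why the double encoding breaks OLS but not Ridge: the intercept column equals the sum of the two dummy columns, so $X$ is rank-deficient and $X^{\top}X$ is singular, yet the ridge system remains solvable.

```python
import numpy as np

# Double encoding: intercept column plus two complementary dummies (d2 = 1 - d1).
d1 = np.array([1.0, 1.0, 0.0, 0.0, 1.0])
d2 = 1.0 - d1
X = np.column_stack([np.ones(5), d1, d2])

# The column of 1s equals d1 + d2, so X is rank-deficient.
rank = np.linalg.matrix_rank(X)          # 2 instead of 3

# X'X is singular: the OLS normal equations have no unique solution.
XtX = X.T @ X
det = np.linalg.det(XtX)                 # numerically zero

# A ridge estimator (X'X + lambda*I)^{-1} X'y is still computable.
y = np.array([30.0, 35.0, 50.0, 55.0, 31.0])
lam = 0.1
beta_ridge = np.linalg.solve(XtX + lam * np.eye(3), X.T @ y)
```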

More than two categories

Thus, when we have an intercept in the regression model and we want to avoid perfect multicollinearity, we create only one dummy to encode a categorical variable that has two categories.

Similarly, we create $D-1$ dummies to encode a categorical variable that has $D$ categories.

The category that is not encoded into a dummy becomes the base category.

Example with three categories

Let us work through an example with three categories.

Suppose that our sample is similar to the previous one, but individuals have been divided into three groups (H, M and L) based on their education.


If we choose L as the base category, then we create two dummies:

  1. the first dummy $d_{1,i}$ is 1 if the category is M and 0 otherwise;

  2. the second dummy $d_{2,i}$ is 1 if the category is H and 0 otherwise.

The vector $y$ of dependent variables and the matrix of regressors $X$ (the so-called design matrix) are
$$y=\begin{bmatrix}y_{1}\\ y_{2}\\ \vdots\\ y_{n}\end{bmatrix},\qquad X=\begin{bmatrix}1 & d_{1,1} & d_{2,1}\\ 1 & d_{1,2} & d_{2,2}\\ \vdots & \vdots & \vdots\\ 1 & d_{1,n} & d_{2,n}\end{bmatrix}$$

The regression equation is
$$y_{i}=\beta_{1}+\beta_{2}d_{1,i}+\beta_{3}d_{2,i}+\varepsilon_{i}$$

The regression coefficients of the two dummies are interpreted as follows:

  1. $\beta_{2}$ is the average difference in income between category M and the base category L;

  2. $\beta_{3}$ is the average difference in income between category H and the base category L.

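A minimal NumPy sketch of the three-category regression (income figures are made up): the intercept recovers the mean of the base category L, and each dummy coefficient recovers the gap between its category and L.

```python
import numpy as np

# Hypothetical incomes by education category (base category: L).
y   = np.array([30.0, 32.0, 40.0, 42.0, 55.0, 57.0])
edu = np.array(["L", "L", "M", "M", "H", "H"])

d1 = (edu == "M").astype(float)  # 1 if M, 0 otherwise
d2 = (edu == "H").astype(float)  # 1 if H, 0 otherwise

# Intercept plus D - 1 = 2 dummies: no perfect multicollinearity,
# because the base category L is absorbed by the intercept.
X = np.column_stack([np.ones(len(y)), d1, d2])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0]: mean income in base category L   -> 31.0
# beta[1]: mean(M) - mean(L)                -> 10.0
# beta[2]: mean(H) - mean(L)                -> 25.0
```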
Dropping the intercept

An alternative to encoding only $D-1$ of the $D$ categories as dummies is to drop the intercept and encode all the $D$ categories.

With the data in the previous example, we could have used the design matrix
$$X=\begin{bmatrix}d_{1,1} & d_{2,1} & d_{3,1}\\ d_{1,2} & d_{2,2} & d_{3,2}\\ \vdots & \vdots & \vdots\\ d_{1,n} & d_{2,n} & d_{3,n}\end{bmatrix}$$
where all three categories are encoded into dummies: $d_{1,i}$ and $d_{2,i}$ are as before, and $d_{3,i}$ is equal to 1 if the category is L and 0 otherwise.

The regression equation is
$$y_{i}=\beta_{1}d_{1,i}+\beta_{2}d_{2,i}+\beta_{3}d_{3,i}+\varepsilon_{i}$$

The interpretation of the regression coefficients changes:

  1. $\beta_{1}$ is the average income in category M;

  2. $\beta_{2}$ is the average income in category H;

  3. $\beta_{3}$ is the average income in category L.

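A minimal NumPy sketch (made-up figures) of the no-intercept variant: with one dummy per category and no column of 1s, each coefficient comes out as the mean income of its category.

```python
import numpy as np

# Hypothetical incomes and education categories.
y   = np.array([30.0, 32.0, 40.0, 42.0, 55.0, 57.0])
edu = np.array(["L", "L", "M", "M", "H", "H"])

# One dummy per category (ordered M, H, L), no intercept column.
X = np.column_stack([(edu == c).astype(float) for c in ["M", "H", "L"]])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# Each coefficient is the mean income of its category:
# beta = [41.0, 56.0, 31.0]  (means of M, H, L respectively)
```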
One-hot encoding

In machine learning, the practice of encoding $D$ categories into $D$ dummies is often called one-hot encoding.
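As a sketch, one-hot encoding can be done in plain NumPy (category labels here are hypothetical; libraries such as pandas and scikit-learn also provide ready-made encoders):

```python
import numpy as np

# Hypothetical categorical feature with D = 3 categories.
cats = np.array(["L", "M", "H", "L", "M"])

# Map each observation to a category index (np.unique sorts labels: H, L, M).
labels, idx = np.unique(cats, return_inverse=True)

# One row per observation, one 0/1 column per category.
one_hot = np.eye(len(labels))[idx]
```

Each row of `one_hot` contains exactly one 1, in the column of that observation's category.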

How to cite

Please cite as:

Taboga, Marco (2021). "Dummy variable", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix.
