Design matrix

by Marco Taboga, PhD

A design matrix is a matrix containing data about multiple characteristics of several individuals or objects. Each row corresponds to an individual and each column to a characteristic.

The design matrix is a fundamental mathematical object in regression analysis, for example, in linear regression models and in logit models. It is often denoted by the capital letter $X$.

Examples

We provide here some examples of design matrices.

Example. If we measure the height and weight of five individuals, we can collect the measurements in a design matrix having five rows and two columns. Each row corresponds to one of the five individuals, the first column contains the height measurements and the second one reports the weights: $X=\begin{bmatrix} h_{1} & w_{1} \\ h_{2} & w_{2} \\ h_{3} & w_{3} \\ h_{4} & w_{4} \\ h_{5} & w_{5} \end{bmatrix}$ where $h_{i}$ denotes the height of the $i$-th individual and $w_{i}$ her weight.
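
As a minimal sketch of the example above, the matrix can be assembled in Python with NumPy. The measurement values and the names `heights` and `weights` are made up for illustration; they do not come from the text.

```python
import numpy as np

# Hypothetical measurements for five individuals (illustrative values only).
heights = np.array([170.0, 165.0, 180.0, 175.0, 160.0])  # h_1, ..., h_5
weights = np.array([65.0, 59.0, 82.0, 77.0, 54.0])       # w_1, ..., w_5

# Each individual becomes a row; each characteristic becomes a column.
X = np.column_stack([heights, weights])

print(X.shape)  # (5, 2): five rows, two columns
print(X[2])     # height and weight of the third individual
```

Note that NumPy indexes from zero, so the third individual is row `X[2]`.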

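Example. If we collect the data about the gross domestic product (GDP) of four countries in three consecutive years, then the design matrix is the $4\times 3$ matrix $X=\begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ X_{31} & X_{32} & X_{33} \\ X_{41} & X_{42} & X_{43} \end{bmatrix}$ where, for example, $X_{32}$ is the GDP of the third country in the second year.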

How the design matrix is defined in linear regressions

Consider the linear regression $y_{i}=x_{i}\beta +\varepsilon _{i}$ where $y_{i}$ is the dependent variable, $x_{i}$ is a $1\times K$ vector containing the $K$ explanatory variables (regressors), $\beta$ is a $K\times 1$ vector of regression coefficients, $\varepsilon _{i}$ is the error term and there are $N$ observations ($i=1,\ldots ,N$).

Thus, we observe $K$ characteristics, contained in the vector of regressors $x_{i}$, for each of the $N$ observations.

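All the observations can be collected in the $N\times K$ design matrix $X=\begin{bmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{N} \end{bmatrix}=\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1K} \\ x_{21} & x_{22} & \cdots & x_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{NK} \end{bmatrix}$ where $x_{ij}$ denotes the $j$-th entry of the vector $x_{i}$, that is, the value of the $j$-th regressor for the $i$-th observation.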

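We can similarly stack the observations of the dependent variable and the error terms into two $N\times 1$ vectors: $y=\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{bmatrix},\quad \varepsilon =\begin{bmatrix} \varepsilon _{1} \\ \varepsilon _{2} \\ \vdots \\ \varepsilon _{N} \end{bmatrix}$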

Having defined the design matrix $X$ and the two vectors $y$ and $\varepsilon$, we can write the regression equations in matrix form: $y=X\beta +\varepsilon$
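
The following sketch checks numerically that the matrix form is just the $N$ individual regression equations stacked on top of each other. The sizes, coefficient values, and simulated data are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3                          # hypothetical number of observations and regressors

X = rng.normal(size=(N, K))          # design matrix: row i is the regressor vector x_i
beta = np.array([1.0, -2.0, 0.5])    # K coefficients (illustrative values)
eps = rng.normal(size=N)             # N error terms

# One equation per observation: y_i = x_i beta + eps_i
y_rowwise = np.array([X[i] @ beta + eps[i] for i in range(N)])

# All equations at once, in matrix form: y = X beta + eps
y_matrix = X @ beta + eps

print(np.allclose(y_rowwise, y_matrix))  # True
```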

This allows us to use matrix algebra to find an estimator of the regression coefficients $\beta$ (see the lecture on linear regression to see how).
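
The text above does not name a specific estimator. One common choice is ordinary least squares (OLS), $\widehat{\beta }=(X^{\top }X)^{-1}X^{\top }y$, which is well defined when $X$ has full column rank. The sketch below assumes OLS and uses simulated data; `beta_true` and the noise scale are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3                                   # hypothetical sizes
X = rng.normal(size=(N, K))                     # simulated design matrix
beta_true = np.array([1.0, -2.0, 0.5])          # illustrative coefficients
y = X @ beta_true + 0.1 * rng.normal(size=N)    # y = X beta + eps

# Solve the normal equations (X'X) b = X'y without forming an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same estimate from NumPy's least-squares solver, which is usually more robust.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(np.allclose(beta_normal, beta_lstsq))
```

Solving the linear system rather than inverting $X^{\top }X$ is a design choice: it gives the same estimate while avoiding an unnecessary and less stable matrix inversion.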

Rank of the design matrix

In most statistical models the design matrix is required to have full rank, that is, its columns must be linearly independent (see, e.g., the normal linear regression model). When this requirement is not met, we say that the design matrix suffers from multicollinearity (see the lecture on multicollinearity for details).
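
A quick way to check the full-rank requirement in practice is to compare the numerical rank of the matrix with its number of columns. The matrices and the helper `has_full_column_rank` below are hypothetical, introduced only for this sketch.

```python
import numpy as np

# Hypothetical 4x3 design matrix whose columns are linearly independent.
X_full = np.array([
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 3.0],
    [1.0, 4.0, 2.0],
])

# Same matrix, but the third column is the sum of the first two (multicollinearity).
X_deficient = X_full.copy()
X_deficient[:, 2] = X_full[:, 0] + X_full[:, 1]

def has_full_column_rank(X):
    """Return True if the columns of X are linearly independent (numerically)."""
    return bool(np.linalg.matrix_rank(X) == X.shape[1])

print(has_full_column_rank(X_full))       # True
print(has_full_column_rank(X_deficient))  # False
```

`np.linalg.matrix_rank` relies on a tolerance applied to the singular values, so nearly collinear columns can also be reported as rank-deficient.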

However, there are also regression models where the design matrix can be rank-deficient (i.e., not full-rank), for example the ridge regression model.
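
To illustrate why a rank-deficient design matrix is not a problem for ridge regression, recall that the ridge estimator has the form $(X^{\top }X+\lambda I)^{-1}X^{\top }y$ with $\lambda >0$: adding $\lambda I$ makes the matrix to be inverted positive definite even when $X^{\top }X$ is singular. The sketch below uses simulated data, and the penalty value `lam` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50                                   # hypothetical number of observations
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
X = np.column_stack([x1, x2, x1 + x2])   # third column is redundant, so X is rank-deficient
y = x1 - x2 + 0.1 * rng.normal(size=N)   # simulated dependent variable

K = X.shape[1]
print(np.linalg.matrix_rank(X), "<", K)  # the rank is smaller than the number of columns

lam = 1.0                                # hypothetical penalty, must be positive
A = X.T @ X + lam * np.eye(K)            # positive definite even though X'X is singular
beta_ridge = np.linalg.solve(A, X.T @ y)

print(beta_ridge)
```

With `lam = 0` the system reduces to the OLS normal equations, whose matrix is singular for this `X`, so no unique solution exists.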

More details

See the lecture on linear regression models for more details.

How to cite

Please cite as:

Taboga, Marco (2021). "Design matrix", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/glossary/design-matrix.