
Linear regression with standardized variables

by Marco Taboga, PhD

This lecture deals with standardized linear regressions, that is, regression models in which the variables are standardized.

A variable is standardized by subtracting from it its sample mean and by dividing it by its standard deviation. After being standardized, the variable has zero mean and unit standard deviation.


Standardization

We are going to deal with linear regressions of the form $$y_{i}=\beta _{1}x_{i1}+\beta _{2}x_{i2}+\ldots +\beta _{K}x_{iK}+\varepsilon _{i}$$ where $i=1,\ldots ,N$ are the observations in the sample, there are K regressors $x_{i1},\ldots ,x_{iK}$ and K regression coefficients $\beta _{1},\ldots ,\beta _{K}$, $y_{i}$ is the dependent variable and $\varepsilon _{i}$ is the error term.

In a standardized regression all the variables have zero mean and unit standard deviation or, equivalently, unit variance. More precisely, $$\frac{1}{N}\sum_{i=1}^{N}x_{ik}=0\quad \text{and}\quad \frac{1}{N}\sum_{i=1}^{N}x_{ik}^{2}=1$$ for $k=1,\ldots ,K$.

Furthermore, we assume that the dependent variable is also standardized: $$\frac{1}{N}\sum_{i=1}^{N}y_{i}=0\quad \text{and}\quad \frac{1}{N}\sum_{i=1}^{N}y_{i}^{2}=1$$

How to obtain standardized variables

In general, a variable to be included in a regression model does not have zero mean and unit variance. Denote such a variable by $x_{ik}^{u}$ (where the superscript $u$ indicates that the variable is unstandardized). Then, we standardize it before including it in the regression.

We compute the sample mean and variance of $x_{ik}^{u}$: $$\overline{x}_{k}^{u}=\frac{1}{N}\sum_{i=1}^{N}x_{ik}^{u}\quad \text{and}\quad s_{k}^{2}=\frac{1}{N}\sum_{i=1}^{N}\left( x_{ik}^{u}-\overline{x}_{k}^{u}\right) ^{2}$$

Then, we compute the standardized variable $x_{ik}$ to be used in the regression: $$x_{ik}=\frac{x_{ik}^{u}-\overline{x}_{k}^{u}}{s_{k}}$$ for $i=1,\ldots ,N$ and $k=1,\ldots ,K$.

The same process is performed on the dependent variable $y_{i}^{u}$ if it does not have zero mean and unit variance.
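As a quick illustration (not part of the original lecture), here is a minimal Python/NumPy sketch that standardizes hypothetical unstandardized data X_u and y_u, dividing by N as in the formulas above.

```python
# A minimal sketch (not from the lecture): standardizing hypothetical data
# X_u and y_u before running the regression.
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 3
X_u = rng.normal(loc=5.0, scale=2.0, size=(N, K))            # unstandardized regressors
y_u = X_u @ np.array([1.0, -0.5, 0.3]) + rng.normal(size=N)  # unstandardized dependent variable

# Subtract the sample mean and divide by the sample standard deviation
# (the lecture divides by N, hence ddof=0).
X = (X_u - X_u.mean(axis=0)) / X_u.std(axis=0, ddof=0)
y = (y_u - y_u.mean()) / y_u.std(ddof=0)

# Each standardized variable now has zero mean and unit variance.
assert np.allclose(X.mean(axis=0), 0.0)
assert np.allclose(X.std(axis=0, ddof=0), 1.0)
```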

No intercept

Particular care needs to be taken if the regression includes an intercept, that is, if one of the regressors is constant and equal to 1.

Clearly, the constant cannot be standardized because it has zero variance and division by zero is not allowed.

We have two possibilities:

  1. we leave the constant as it is, that is, we do not standardize it;

  2. we drop the constant from the regression.

If all the variables, including the dependent variable $y_{i}$, are standardized, as we have assumed above, then there is no need to include a constant in the regression because the OLS estimate of its coefficient would be equal to zero anyway (proof below). Therefore, in what follows we are always going to drop the constant.

Proof

Write the regression in matrix form $$y=X\beta +\varepsilon$$ where $y$ is the $N\times 1$ vector of observations of the dependent variable, X is the $N\times K$ matrix of regressors, $\beta $ is the $K\times 1$ vector of regression coefficients and $\varepsilon $ is the $N\times 1$ vector of error terms.

The OLS estimator of $\beta $ is $$\widehat{\beta }=\left( X^{\top }X\right) ^{-1}X^{\top }y$$

Suppose the first regressor is constant and equal to 1, and all the other regressors are standardized. Denote by $\iota $ the $N\times 1$ vector of ones (the first column of X) and by $X_{-1}$ the matrix obtained by deleting the first column of X. Then, $X^{\top }X$ is block diagonal: $$X^{\top }X=\begin{bmatrix} \iota ^{\top }\iota & \iota ^{\top }X_{-1}\\ X_{-1}^{\top }\iota & X_{-1}^{\top }X_{-1}\end{bmatrix}=\begin{bmatrix} N & 0\\ 0 & X_{-1}^{\top }X_{-1}\end{bmatrix}$$ where the off-diagonal blocks are zero because the regressors in $X_{-1}$ are standardized and hence sum to zero.

As a consequence, $\left( X^{\top }X\right) ^{-1}$ is also block diagonal: $$\left( X^{\top }X\right) ^{-1}=\begin{bmatrix} 1/N & 0\\ 0 & \left( X_{-1}^{\top }X_{-1}\right) ^{-1}\end{bmatrix}$$

Furthermore, $$X^{\top }y=\begin{bmatrix} \iota ^{\top }y\\ X_{-1}^{\top }y\end{bmatrix}=\begin{bmatrix} N\overline{y}\\ X_{-1}^{\top }y\end{bmatrix}=\begin{bmatrix} 0\\ X_{-1}^{\top }y\end{bmatrix}$$ where $N\overline{y}=0$ because $y_{i}$ is standardized.

Thus, by carrying out the multiplication of the two block matrices $\left( X^{\top }X\right) ^{-1}$ and $X^{\top }y$, we get $$\widehat{\beta }=\left( X^{\top }X\right) ^{-1}X^{\top }y=\begin{bmatrix} 0\\ \left( X_{-1}^{\top }X_{-1}\right) ^{-1}X_{-1}^{\top }y\end{bmatrix}$$

In other words, when we add an intercept to a regression in which all the variables are standardized, the OLS estimates of the coefficients on the other regressors do not change and the estimated intercept is equal to zero.
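The result can also be checked numerically. The following Python sketch, with made-up standardized data, estimates the regression with and without a column of ones and confirms that the intercept estimate is zero and the remaining coefficients are unchanged.

```python
# A numerical check of the claim above (hypothetical data): with all variables
# standardized, adding a column of ones yields a zero intercept and leaves the
# other OLS coefficients unchanged.
import numpy as np

rng = np.random.default_rng(1)
N, K = 200, 3
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized regressors
y = rng.normal(size=N)
y = (y - y.mean()) / y.std()               # standardized dependent variable

beta_no_const = np.linalg.lstsq(X, y, rcond=None)[0]
X_with_const = np.column_stack([np.ones(N), X])
beta_with_const = np.linalg.lstsq(X_with_const, y, rcond=None)[0]

print(beta_with_const[0])                               # intercept: zero up to rounding
print(np.allclose(beta_with_const[1:], beta_no_const))  # True
```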

Sample covariances

Standardizing the variables in the regression greatly simplifies the computation of their sample covariances and correlations.

The sample covariance between two regressors $x_{ik}$ and $x_{il}$ is $$s_{kl}=\frac{1}{N}\sum_{i=1}^{N}\left( x_{ik}-\overline{x}_{k}\right) \left( x_{il}-\overline{x}_{l}\right) =\frac{1}{N}\sum_{i=1}^{N}x_{ik}x_{il}$$ where the sample means $\overline{x}_{k}$ and $\overline{x}_{l}$ are zero because the two regressors are standardized.

For the same reason, the sample covariance between $y_{i}$ and $x_{ik}$ is $$s_{ky}=\frac{1}{N}\sum_{i=1}^{N}\left( x_{ik}-\overline{x}_{k}\right) \left( y_{i}-\overline{y}\right) =\frac{1}{N}\sum_{i=1}^{N}x_{ik}y_{i}$$

Sample correlations

The sample correlation between $x_{ik}$ and $x_{il}$ is $$r_{kl}=\frac{s_{kl}}{\sqrt{s_{k}^{2}}\sqrt{s_{l}^{2}}}=s_{kl}=\frac{1}{N}\sum_{i=1}^{N}x_{ik}x_{il}$$ where the sample variances $s_{k}^{2}$ and $s_{l}^{2}$ are equal to 1 because the two regressors are standardized.

By the same token, the sample correlation between $y_{i}$ and $x_{ik}$ is $$r_{ky}=\frac{s_{ky}}{\sqrt{s_{k}^{2}}\sqrt{s_{y}^{2}}}=s_{ky}=\frac{1}{N}\sum_{i=1}^{N}x_{ik}y_{i}$$

Thus, in a standardized regression, sample correlations and sample covariances coincide.
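As a quick numerical illustration (again with hypothetical data), the cross-product matrix divided by N, the sample covariance matrix, and the sample correlation matrix of standardized regressors all coincide:

```python
# A quick illustration (hypothetical data): for standardized regressors, the
# cross-product matrix divided by N, the sample covariance matrix and the
# sample correlation matrix are all identical.
import numpy as np

rng = np.random.default_rng(2)
N, K = 500, 4
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)

cross_products = X.T @ X / N                       # (1/N) * sum_i x_ik * x_il
covariances = np.cov(X, rowvar=False, ddof=0)
correlations = np.corrcoef(X, rowvar=False)

print(np.allclose(cross_products, covariances))    # True
print(np.allclose(cross_products, correlations))   # True
```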

OLS estimator

Denote by $y$ the $N\times 1$ vector of observations of the dependent variable and by X the $N\times K$ matrix of regressors, so that the regression equation can be written in matrix form as $$y=X\beta +\varepsilon$$ where $\beta $ is the $K\times 1$ vector of regression coefficients and $\varepsilon $ is the $N\times 1$ vector of error terms.

The OLS estimator of $\beta $ is $$\widehat{\beta }=\left( X^{\top }X\right) ^{-1}X^{\top }y$$

When all the variables are standardized, the OLS estimator can be written as a function of their sample correlations.

Denote by $x_{i\bullet }$ the i-th row of X. Note that the $\left( k,l\right) $-th element of $X^{\top }X$ is $$\left[ X^{\top }X\right] _{kl}=\sum_{i=1}^{N}x_{ik}x_{il}=Nr_{kl}$$

Furthermore, the k-th element of $X^{\top }y$ is $$\left[ X^{\top }y\right] _{k}=\sum_{i=1}^{N}x_{ik}y_{i}=Nr_{ky}$$

Denote by $r_{xx}$ the sample correlation matrix of X, that is, the $K\times K$ matrix whose $\left( k,l\right) $-th entry is equal to $r_{kl}$. Then, $$X^{\top }X=Nr_{xx}$$

Similarly, denote by $r_{xy}$ the $K\times 1$ vector whose k-th entry is equal to $r_{ky}$, so that $$X^{\top }y=Nr_{xy}$$

Thus, we can write the OLS estimator as a function of the sample correlation matrices: $$\widehat{\beta }=\left( X^{\top }X\right) ^{-1}X^{\top }y=\left( Nr_{xx}\right) ^{-1}Nr_{xy}=r_{xx}^{-1}r_{xy}$$
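For instance, the following sketch (with hypothetical data) computes the coefficients both from the sample correlations and with a standard least-squares solver, and verifies that the two results agree:

```python
# A minimal sketch of the formula above (hypothetical data): with standardized
# variables, the OLS coefficients equal r_xx^{-1} r_xy.
import numpy as np

rng = np.random.default_rng(3)
N, K = 300, 3
X = rng.normal(size=(N, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([0.8, -0.2, 0.5]) + rng.normal(size=N)
y = (y - y.mean()) / y.std()

r_xx = X.T @ X / N   # sample correlation matrix of the regressors
r_xy = X.T @ y / N   # sample correlations between the regressors and y

beta_from_correlations = np.linalg.solve(r_xx, r_xy)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(beta_from_correlations, beta_ols))   # True
```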

Standardized coefficients

The estimated coefficients of a linear regression model with standardized variables are called standardized coefficients. They are sometimes deemed easier to interpret than the coefficients of an unstandardized regression.

Interpretation

In general, a regression coefficient $\beta _{k}$ is interpreted as the effect produced on the dependent variable when the k-th regressor is increased by one unit.

Sometimes, for example when we read the output of a regression estimated by someone else, we are unable to tell whether a unit increase in the regressor is a lot or a little, or we are uncertain about the relevance of the effect $\beta _{k}$ on the dependent variable. In these situations, standardized coefficients are easier to interpret.

In a standardized regression, a unit increase in a variable is equal to its standard deviation. Roughly speaking, the standard deviation is the average deviation of a random variable from its mean. So, when a variable differs from its mean by one standard deviation, that is in a sense a "typical" deviation. A standardized coefficient $\beta _{k}$ then tells you what multiple or fraction of a typical deviation in $y_{i}$ is caused by a typical deviation in the k-th regressor.

Comparisons among standardized coefficients

Another benefit of standardization is that it is easier to make comparisons among regressors. In particular, if we ask what regressor has the largest impact on the dependent variable, then we have an easy answer: it is the regressor whose coefficient is the highest in absolute value. In fact, a typical deviation of that regressor from its mean will produce the largest effect, as compared to the effects produced by typical deviations of the other regressors from their mean.
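As a small illustration with made-up coefficient estimates, finding the regressor with the largest impact amounts to taking the largest absolute value:

```python
# A small illustration with made-up standardized coefficient estimates: the
# regressor with the largest impact has the largest absolute coefficient.
import numpy as np

beta_standardized = np.array([0.15, -0.62, 0.40])   # hypothetical estimates
names = ["x1", "x2", "x3"]
print(names[np.argmax(np.abs(beta_standardized))])  # "x2"
```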

How to cite

Please cite as:

Taboga, Marco (2021). "Linear regression with standardized variables", Lectures on probability theory and mathematical statistics. Kindle Direct Publishing. Online appendix. https://www.statlect.com/fundamentals-of-statistics/linear-regression-with-standardized-variables.
