A statistical model is a set of assumptions about the probability distribution that generated some observed data. In mathematical terms, the assumptions are formulated as restrictions on the set of probability distributions that could have generated the data.
We provide here some examples of statistical models.
Example Suppose that we randomly draw individuals from a certain population and measure their height. The measurements can be regarded as realizations of random variables X_1, ..., X_n. In principle, these random variables could have any probability distribution. If we assume that they have a normal distribution, as is often done for height measurements, then we are formulating a statistical model: we are placing a restriction on the set of probability distributions that could have generated the data.
Example In the previous example, the random variables could in principle have some form of dependence. If we assume that they are statistically independent, then we are placing a further restriction on their joint distribution, that is, we are adding an assumption to our statistical model.
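The two assumptions so far (normality and independence) can be sketched in Python by drawing independent values from a single normal distribution. The mean of 175 cm and standard deviation of 10 cm below are illustrative values, not taken from the text:

```python
import random
import statistics

random.seed(0)  # for reproducibility

# Model assumption: heights are i.i.d. draws from one normal distribution.
# The parameter values (mean 175 cm, sd 10 cm) are purely illustrative.
heights = [random.gauss(175, 10) for _ in range(1000)]

# With many independent draws, the sample mean is close to the model mean.
sample_mean = statistics.mean(heights)
```

Independence is reflected in the fact that each draw is generated without reference to the others.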
Example Suppose that for the same individuals we also collect weight measurements Y_1, ..., Y_n, and we assume that there is a linear relation between weight and height, described by the regression equation Y_i = α + β X_i + ε_i, where α and β are regression coefficients and ε_i is an error term. This is a statistical model because we have placed a restriction on the set of joint distributions that could have generated the pairs (X_i, Y_i): we have ruled out all the joint distributions in which the two variables have a non-linear relation (e.g., quadratic).
Example If we assume that all the errors in the previous regression equation have the same variance (i.e., the errors are not heteroskedastic), then we are placing a further restriction on the set of data-generating distributions. Thus, we have yet another statistical model.
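The regression model with homoskedastic errors can be simulated and fitted with closed-form ordinary least squares. A minimal sketch, assuming illustrative coefficient values (intercept −100, slope 1) and a constant error standard deviation of 5, which is what homoskedasticity requires:

```python
import random

random.seed(1)
n = 500
alpha_true, beta_true = -100.0, 1.0  # hypothetical coefficients, for illustration

heights = [random.gauss(175, 10) for _ in range(n)]
# Homoskedastic errors: every error term has the same standard deviation (5).
weights = [alpha_true + beta_true * h + random.gauss(0, 5) for h in heights]

# Closed-form ordinary least squares estimates of the two coefficients.
mean_h = sum(heights) / n
mean_w = sum(weights) / n
beta_hat = (sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))
            / sum((h - mean_h) ** 2 for h in heights))
alpha_hat = mean_w - beta_hat * mean_h
```

With enough observations, the estimated coefficients land close to the values used to generate the data.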
The previous examples have illustrated that a model is just a set of probability distributions that might have generated the observed data. Denote such a set by Φ.
When the set Φ is put into correspondence with a set Θ of real vectors, we have a parametric model.
The set Θ is called the parameter space and any one of its members is called a parameter.
Example If we assume, as we did in the first example above, that the height measurements come from a normal distribution, then the set of admissible distributions is the set of all normal distributions. But a normal distribution is completely characterized by its mean μ and its variance σ². As a consequence, each normal distribution is put in correspondence with a vector of parameters (μ, σ²). The mean μ can take any real value and the variance σ² needs to be positive. Therefore, the parameter space is ℝ × (0, ∞).
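The correspondence between a normal distribution and its parameter vector can be illustrated by estimating the mean and variance from simulated data; the true values used below (mean 170, variance 64) are illustrative:

```python
import random
import statistics

random.seed(2)
# Simulated heights from a normal distribution with mean 170 and variance 64
# (standard deviation 8); these parameter values are purely illustrative.
sample = [random.gauss(170, 8) for _ in range(2000)]

# The estimated parameter vector is one point in the parameter space:
# the estimated mean can be any real number, while the estimated
# variance is necessarily positive.
mu_hat = statistics.mean(sample)
sigma2_hat = statistics.pvariance(sample)  # maximum-likelihood variance estimate
```

Note that the estimate automatically satisfies the parameter-space restriction: the variance estimate cannot be negative.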
What do we do after selecting a statistical model, that is, after restricting our attention to a set of probability distributions that could have generated the data (and to a parameter space put into correspondence with that set)?
The typical things we do are:
parameter estimation, that is, producing a guess of the parameter associated with the true distribution (the one that generated the data);
hypothesis testing, that is, checking that our statistical model is reasonable, in the sense that the observed data are indeed compatible with at least one of the distributions belonging to the model.
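As a toy illustration of hypothesis testing, the sketch below runs a two-sided z-test of the hypothesis that the mean height is 170 cm, assuming for simplicity that the standard deviation (8 cm) is known; all numerical values are illustrative:

```python
import random
from statistics import NormalDist, mean

random.seed(3)
sigma, mu0, n = 8.0, 170.0, 400
# Simulated data; in practice these would be the observed measurements.
data = [random.gauss(mu0, sigma) for _ in range(n)]

# z-statistic: standardized distance between sample mean and hypothesized mean.
z = (mean(data) - mu0) / (sigma / n ** 0.5)
# Two-sided p-value under the standard normal distribution.
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
# A small p-value would suggest the data are not compatible with the
# hypothesized mean; here the data were in fact generated under the hypothesis.
```

In practice the test statistic and its distribution depend on the model; the z-test above is just one of the simplest cases.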
There are countless statistical models as their number is limited only by statisticians' imagination. However, you might want to familiarize yourself with two of the most popular models:
the normal linear regression model, used to model the linear relation between a dependent variable and other explanatory variables;
the logistic classification model (or logit model), used to model how explanatory variables influence the probability of a binary outcome.
More details about statistical modelling can be found in the lecture on statistical inference.