Hypothesis testing is a method of making statistical inferences by establishing a hypothesis, called the null hypothesis, and using some data to decide whether or not to reject it.
As we have discussed in the lecture entitled Statistical inference, a statistical inference is a statement about the probability distribution from which a sample $\xi$ has been drawn. The sample $\xi$ can be regarded as a realization of a random vector $\Xi$, whose unknown joint distribution function $F_\Xi(\xi)$ is assumed to belong to a set of distribution functions $\Phi$, called the statistical model.
In hypothesis testing we make a statement about a model restriction involving a subset $\Phi_R \subset \Phi$ of the original model. The statement we make is chosen between two possible statements:
reject the restriction $F_\Xi \in \Phi_R$;
do not reject the restriction $F_\Xi \in \Phi_R$.
Roughly speaking, we start from a large set of distributions $\Phi$ that might possibly have generated the sample $\xi$ and we would like to restrict our attention to a smaller set $\Phi_R$. In a test of hypothesis, we use the sample $\xi$ to decide whether or not to indeed restrict our attention to the smaller set $\Phi_R$.
If we have a parametric model, we can also carry out parametric tests of hypothesis.
Remember that in a parametric model the set of distribution functions $\Phi$ is put into correspondence with a set $\Theta \subseteq \mathbb{R}^p$ of $p$-dimensional real vectors called the parameter space. The elements of $\Theta$ are called parameters and the true parameter is denoted by $\theta_0$. The true parameter is the parameter associated with the unknown distribution function from which the sample was actually drawn. For simplicity, $\theta_0$ is assumed to be unique.
In parametric hypothesis testing we have a restriction $\Theta_R \subset \Theta$ on the parameter space and we choose one of the following two statements about the restriction:
reject the restriction $\theta_0 \in \Theta_R$;
do not reject the restriction $\theta_0 \in \Theta_R$.
For concreteness, we will focus on parametric hypothesis testing in this lecture, but most of the things we will say apply with straightforward modifications to hypothesis testing in general.
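The hypothesis that the restriction is true is called the null hypothesis and it is usually denoted by $H_0$:
$$H_0 : \theta_0 \in \Theta_R$$
The restriction $\theta_0 \in \Theta_R^c$ (where $\Theta_R^c$ is the complement of $\Theta_R$) is often called the alternative hypothesis and it is denoted by $H_1$:
$$H_1 : \theta_0 \in \Theta_R^c$$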
For some authors, "rejecting the null hypothesis $H_0$" and "accepting the alternative hypothesis $H_1$" are synonyms. For other authors, however, "rejecting the null hypothesis $H_0$" does not necessarily imply "accepting the alternative hypothesis $H_1$". Although this is mostly a matter of language, it is possible to envision situations in which, after rejecting $H_0$, a second test of hypothesis is performed whereby $H_1$ becomes the new null hypothesis and it is rejected (this may happen for example if the model is mis-specified). In these situations, if "rejecting the null hypothesis $H_0$" and "accepting the alternative hypothesis $H_1$" are treated as synonyms, then some confusion arises, because the first test leads to "accept $H_1$" and the second test leads to "reject $H_1$".
Also note that some statisticians sometimes take into consideration as an alternative hypothesis a set smaller than $\Theta_R^c$. In these cases, the null hypothesis and the alternative hypothesis do not cover all the possibilities contemplated by the parameter space $\Theta$.
When we decide whether or not to reject a restriction, we can commit two types of errors:
reject the restriction when the restriction is true; this is called an error of the first kind or a Type I error;
do not reject the restriction when the restriction is false; this is called an error of the second kind or a Type II error.
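The two kinds of error can be summarized as a small decision table. The following Python sketch is only illustrative: the function name `classify_outcome` and its arguments are hypothetical and are not part of the lecture.

```python
def classify_outcome(restriction_is_true: bool, rejected: bool) -> str:
    """Label the outcome of a test given the (unknown) truth and the decision taken."""
    if restriction_is_true and rejected:
        # The restriction holds but we rejected it.
        return "Type I error"
    if not restriction_is_true and not rejected:
        # The restriction does not hold but we failed to reject it.
        return "Type II error"
    # In the two remaining cases the decision agrees with the truth.
    return "correct decision"


outcomes = {
    (truth, decision): classify_outcome(truth, decision)
    for truth in (True, False)
    for decision in (True, False)
}
```

In practice the first argument is never observed, which is why the two errors are controlled only in terms of their probabilities.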
Remember that the sample $\xi$ is regarded as a realization of a random vector $\Xi$ having support $R_\Xi$.
A test of hypothesis is usually carried out by explicitly or implicitly subdividing the support $R_\Xi$ into two disjoint subsets. One of the two subsets, denoted by $C_\Xi$, is called the critical region (or rejection region) and it is the set of all values of $\xi$ for which the null hypothesis is rejected:
$$C_\Xi = \{\xi \in R_\Xi : H_0 \text{ is rejected whenever } \xi \text{ is observed}\}$$
The other subset is just the complement of the critical region:
$$C_\Xi^c = \{\xi \in R_\Xi : \xi \notin C_\Xi\}$$
and it is, of course, such that
$$C_\Xi \cup C_\Xi^c = R_\Xi$$
The critical region is often implicitly defined in terms of a test statistic and a critical region for the test statistic. A test statistic is a random variable $S$ whose realization is a function of the sample $\xi$. In symbols,
$$S = s(\Xi)$$
A critical region for $S$ is a subset $C_S \subset \mathbb{R}$ of the set of real numbers and the test is performed based on the test statistic, as follows:
$$s(\xi) \in C_S \implies \text{reject } H_0, \qquad s(\xi) \notin C_S \implies \text{do not reject } H_0$$
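To make the roles of the test statistic and of its critical region concrete, here is a minimal Python sketch. It assumes a setting that the lecture does not specify: an i.i.d. normal sample with known standard deviation `sigma`, the null hypothesis that the mean equals `mu0`, a standardized sample mean as test statistic, and a critical region made of the values whose absolute value exceeds a cutoff. The data, the function names and the 5% default are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist, fmean


def z_statistic(sample, mu0, sigma):
    """Realization of the test statistic: standardized distance of the sample mean from mu0."""
    n = len(sample)
    return sqrt(n) * (fmean(sample) - mu0) / sigma


def in_critical_region(z, alpha=0.05):
    """True when z belongs to the critical region {z : |z| > c} of the test statistic."""
    # c is the (1 - alpha/2) quantile of the standard normal distribution.
    c = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > c


# Hypothetical data, used only for illustration.
sample = [5.1, 4.9, 5.6, 5.8, 5.3, 5.4, 5.0, 5.7]
z = z_statistic(sample, mu0=5.0, sigma=0.4)
reject = in_critical_region(z)
print(f"z = {z:.4f}, reject the null hypothesis: {reject}")
```

The critical region for the sample is defined only implicitly here: it is the set of all samples whose statistic lands in the critical region of the statistic.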
If the complement of the critical region $C_S$ is an interval, then its extremes are called critical values of the test.
The power function of a test of hypothesis is the function that associates the probability of rejecting $H_0$ to each parameter $\theta \in \Theta$. Denoting the critical region by $C_\Xi$, the power function $\pi(\theta)$ is defined as follows:
$$\pi(\theta) = P_\theta(\Xi \in C_\Xi)$$
where the notation $P_\theta$ is used to indicate the fact that the probability is calculated using the distribution function associated to the parameter $\theta$.
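As an illustration of how a power function can be computed, consider a case that is assumed here and not taken from the lecture: a two-sided test on the mean of a normal sample with known standard deviation, which rejects when the absolute value of the standardized sample mean exceeds a cutoff `c`. Under a mean `mu`, the standardized sample mean is normal with mean `delta = sqrt(n) * (mu - mu0) / sigma` and unit variance, so the rejection probability has a closed form. All names below are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power(mu, mu0, sigma, n, alpha=0.05):
    """Probability of rejecting the null hypothesis when the true mean is mu."""
    std = NormalDist()
    c = std.inv_cdf(1 - alpha / 2)        # cutoff of the two-sided test
    delta = sqrt(n) * (mu - mu0) / sigma  # mean of the test statistic under mu
    # P(Z < -c) + P(Z > c) for Z normal with mean delta and unit variance.
    return std.cdf(-c - delta) + 1 - std.cdf(c - delta)


# Evaluate the power function on a few hypothetical parameter values.
for mu in (5.0, 5.1, 5.2, 5.35, 5.5):
    print(f"mu = {mu:.2f}  power = {power(mu, mu0=5.0, sigma=0.4, n=8):.4f}")
```

At `mu = mu0` the function returns `alpha`, and it increases as `mu` moves away from `mu0` in either direction.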
When $\theta \in \Theta_R$, the power function $\pi(\theta)$ tells us the probability of committing a Type I error, i.e. the probability of rejecting the null hypothesis when the null hypothesis is true. The maximum probability of committing a Type I error is, therefore,
$$\sup_{\theta \in \Theta_R} \pi(\theta)$$
This maximum probability is called the size of the test. The size of the test is also called by some authors the level of significance of the test. However, according to other authors, who assign a slightly different meaning to the term, the level of significance of a test is an upper bound on the size of the test, i.e. a constant $\alpha$ that, to the statistician's knowledge, satisfies
$$\alpha \ge \sup_{\theta \in \Theta_R} \pi(\theta)$$
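When the null hypothesis contains more than one parameter value, the size is a supremum over all of them. The Python sketch below approximates it on a grid for an assumed example that is not in the lecture: a one-sided test on a normal mean with known standard deviation, where the null hypothesis is that the mean does not exceed `mu0` and the test rejects when the standardized sample mean exceeds a cutoff. The grid width and the number of steps are arbitrary choices.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_power(mu, mu0, sigma, n, alpha=0.05):
    """Probability that sqrt(n) * (sample mean - mu0) / sigma exceeds c when the mean is mu."""
    std = NormalDist()
    c = std.inv_cdf(1 - alpha)
    delta = sqrt(n) * (mu - mu0) / sigma
    return 1 - std.cdf(c - delta)


def approximate_size(mu0, sigma, n, alpha=0.05, steps=200, width=2.0):
    """Largest rejection probability over a grid of means in [mu0 - width, mu0]."""
    grid = [mu0 - width * k / steps for k in range(steps + 1)]
    return max(one_sided_power(mu, mu0, sigma, n, alpha) for mu in grid)


size = approximate_size(mu0=5.0, sigma=0.4, n=8)
print(f"approximate size = {size:.4f}")
```

In this example the power function is increasing in the mean, so the maximum over the null set is attained at the boundary value `mu0`, where it equals `alpha`.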
Tests of hypothesis are most commonly evaluated based on their size and power. An ideal test should have size equal to $0$ (i.e., the probability of rejecting the null hypothesis when the null hypothesis is true should be $0$) and power equal to $1$ when $\theta \in \Theta_R^c$ (i.e. the probability of rejecting the null hypothesis when the null hypothesis is false should be $1$). Of course, such an ideal test is never found in practice: the best we can hope for is a test with a very small size and a very high probability of rejecting a false hypothesis. Nevertheless, this ideal is routinely used to choose among different tests: for example, when choosing between two tests having the same size, we prefer the test that has the higher power when $\theta \in \Theta_R^c$; also, when choosing between two tests that have the same power when $\theta \in \Theta_R^c$, we prefer the test that has the smaller size.
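The comparison of two tests having the same size can be sketched numerically. The example below is an assumption made for illustration only: both tests are two-sided tests on a normal mean with known standard deviation and the same nominal size, but the first uses 16 observations while the second discards half of them and uses only 8. The parameter values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_power(mu, mu0, sigma, n, alpha=0.05):
    """Rejection probability of the two-sided test based on n observations."""
    std = NormalDist()
    c = std.inv_cdf(1 - alpha / 2)
    delta = sqrt(n) * (mu - mu0) / sigma
    return std.cdf(-c - delta) + 1 - std.cdf(c - delta)


mu0, sigma = 5.0, 0.4
alternatives = [4.6, 4.8, 5.2, 5.4]  # parameter values outside the null hypothesis

# Both tests have the same size by construction.
size_full = two_sided_power(mu0, mu0, sigma, n=16)
size_half = two_sided_power(mu0, mu0, sigma, n=8)

# Power of each test at every alternative considered.
power_full = [two_sided_power(mu, mu0, sigma, n=16) for mu in alternatives]
power_half = [two_sided_power(mu, mu0, sigma, n=8) for mu in alternatives]

full_is_better = all(f > h for f, h in zip(power_full, power_half))
print(f"sizes: {size_full:.4f} vs {size_half:.4f}; full-sample test more powerful: {full_is_better}")
```

Since the two sizes coincide and the full-sample test has higher power at every alternative considered, the criterion described above selects the full-sample test.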
Several other criteria, beyond power and size, are used to evaluate tests of hypothesis. We do not discuss them here, but we refer the reader to the very nice exposition in Berger and Casella (2002).
Examples of hypothesis testing can be found in the following lectures:
Hypothesis tests about the mean (examples of tests of hypothesis about the mean of an unknown distribution);
Hypothesis tests about the variance (examples of tests of hypothesis about the variance of an unknown distribution).
Berger, R. L. and G. Casella (2002) "Statistical inference", Duxbury Advanced Series.