
# Predictive model

In this lecture we introduce the concept of a predictive model, which lies at the heart of machine learning (ML).

## Observed input-output mapping

To begin with, we observe some outputs $y_i$ and the corresponding input vectors $x_i$ that may help to predict the outputs before they are observed.

Examples:

• $y_i$ is the total amount of purchases made by a customer while visiting an online shop; $x_i$ are some characteristics of the landing page that was first seen by the customer;

• $y_t$ is inflation observed in month $t$ and $x_t$ is a vector of macro-economic variables known before $t$;

• $y_i$ is 1 if firm $i$ defaults within a year and 0 otherwise; $x_i$ is a vector of firm $i$'s characteristics that may help to predict the default;

• $y_i$ is a measure of economic activity in province $i$; $x_i$ is a vector of pixel values from a satellite image of the province.

Note: the subscript $i$ used to index the observations does not necessarily denote time.

## Prediction

We use the observed inputs and outputs to build a predictive model, that is, a function $f$ that takes new inputs $x$ as arguments and returns predictions $\hat{y} = f(x)$ of previously unseen outputs $y$.

## Some machine learning jargon

Before diving into predictive modelling, let us learn some machine learning jargon.

### Supervised vs unsupervised learning

The problem of learning an input-output mapping is called a supervised learning problem.

The data used for learning is called labelled data and the outputs are called labels or targets.

Basically, in a supervised learning problem, the task is to learn the conditional distribution of the outputs given the inputs.

On the contrary, in an unsupervised learning problem, there are no labels and the task is to learn something about the unconditional distribution of the inputs $x$.

The typical example is photos of cats and dogs: $x_i$ is a vector of pixel values; in supervised learning, you have labels $y_i$ (1 if dog, 0 if cat); in unsupervised learning, you have no labels, but you typically do something like clustering in the hope that the algorithm autonomously separates cats from dogs.

### Regression vs classification

A supervised learning problem is called:

• a classification problem if the output variable is discrete / categorical (e.g., cat vs dog);

• a regression problem if the output variable is continuous (e.g., income earned).

### Features

The inputs $x_i$ are often called features and the vector $x_i$ is called a feature vector.

### Training

The act of using data to find the best predictive model (e.g., by optimizing the parameters of a parametric model) is called model training.

## Loss function

How do we assess the quality of a predictive model?

How do we compare predicted outputs $\hat{y}_i$ with observed outputs $y_i$?

We do these things by specifying a loss function, which is always required in a machine learning problem.

A loss function quantifies the losses that we incur when we make inaccurate predictions.

Examples:

• Squared Error (SE): $L(y, \hat{y}) = (y - \hat{y})^2$;

• Absolute Error (AE): $L(y, \hat{y}) = |y - \hat{y}|$;

• Log-loss (or cross-entropy): $L(y, \hat{y}) = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$ when $y$ is binary (i.e., it can take only two values, either 0 or 1); the multivariate generalization is $L(y, \hat{y}) = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)$ when $y$ is a multinoulli vector (i.e., we have a categorical variable that can take only $K$ values; when it takes the $k$-th, then $y_k = 1$ and all the other entries of the vector are zero).
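The three losses above can be sketched in a few lines of pure Python (the function names are ours, chosen for illustration, not a standard API):

```python
import math

def squared_error(y, y_hat):
    """Squared Error: (y - y_hat)^2."""
    return (y - y_hat) ** 2

def absolute_error(y, y_hat):
    """Absolute Error: |y - y_hat|."""
    return abs(y - y_hat)

def log_loss(y, p_hat):
    """Log-loss (cross-entropy) for a binary output y in {0, 1},
    where p_hat is the predicted probability that y = 1."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))
```

For instance, `squared_error(1.0, 0.5)` returns `0.25`, and `log_loss(1, 0.9)` returns $-\log(0.9) \approx 0.105$: the closer the predicted probability is to the observed label, the smaller the loss.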

## Risk

Ideally, the best predictive model is the one having the smallest statistical risk (or expected loss)
$$R(f) = \operatorname{E}\left[ L(y, f(x)) \right],$$
where the expected value is with respect to the joint distribution of $x$ and $y$.

## Empirical risk

Since the true joint distribution of $x$ and $y$ is usually unknown, the risk is approximated by the empirical risk
$$\hat{R}(f) = \frac{1}{|S|} \sum_{(x_i, y_i) \in S} L\left(y_i, f(x_i)\right),$$
where $S$ is a set of input-output pairs used for calculating the empirical risk and $|S|$ is its cardinality (the number of input-output pairs contained in $S$).

Thus, the empirical risk is the sample average of the losses over a set of observed data $S$. This is the reason why we sometimes call it the average loss.
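A minimal sketch of the sample average just described, using a made-up data set $S$ and a toy model $f(x) = 2x$ (both are illustrative assumptions, not taken from the lecture):

```python
def empirical_risk(loss, pairs, predict):
    """Average loss of a predictive model over a set S of (x, y) pairs."""
    return sum(loss(y, predict(x)) for x, y in pairs) / len(pairs)

# hypothetical observed input-output pairs S
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]

# toy predictive model: y_hat = f(x) = 2x
predict = lambda x: 2.0 * x

# empirical risk under the Squared Error loss (this is the MSE)
mse = empirical_risk(lambda y, y_hat: (y - y_hat) ** 2, S, predict)
```

With these numbers the MSE is $(0.1^2 + 0.1^2 + 0.2^2)/3 = 0.02$; swapping in the Absolute Error loss would give the MAE instead.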

How to choose $S$ is one of the most important decisions in machine learning and we will discuss it at length.

For specific choices of the loss function, empirical risk has names that are well-known to statisticians:

• if the loss is the Squared Error, then the empirical risk is the Mean Squared Error (MSE), and its square root is the Root Mean Squared Error (RMSE);

• if the loss is the Absolute Error, then the empirical risk is the Mean Absolute Error (MAE);

• if the loss is the Cross-Entropy, it can easily be proved that the empirical risk is equal to the negative average log-likelihood.
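The last equality can be checked numerically: for binary outputs, the mean cross-entropy coincides with the negative average Bernoulli log-likelihood of the predicted probabilities (the labels and probabilities below are made up for illustration):

```python
import math

y = [1, 0, 1, 1]           # observed binary labels
p = [0.8, 0.3, 0.9, 0.6]   # predicted probabilities that y = 1

# empirical risk under the log-loss (mean cross-entropy)
ce = sum(-(yi * math.log(pi) + (1 - yi) * math.log(1 - pi))
         for yi, pi in zip(y, p)) / len(y)

# negative average Bernoulli log-likelihood of the same predictions
nll = -sum(math.log(pi if yi == 1 else 1 - pi)
           for yi, pi in zip(y, p)) / len(y)

assert abs(ce - nll) < 1e-12  # the two quantities coincide
```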

## Empirical risk minimization

The criterion generally followed in machine learning is that of empirical risk minimization:

• if we are setting the parameters of a model, we choose the parameters that minimize the empirical risk;

• if we are choosing the best model in a set of models, we pick the one that has the lowest empirical risk.

Statistically speaking, it is a sound criterion because empirical risk minimizers are extremum estimators.
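As a concrete illustration of empirical risk minimization, consider the simplest possible model: always predict the same constant $c$. Minimizing the empirical risk over a grid of candidate values of $c$ (the data and the grid below are made up) recovers two familiar statistics: the sample mean under the Squared Error loss and the sample median under the Absolute Error loss.

```python
def empirical_risk(loss, outputs, c):
    """Average loss when the model always predicts the constant c."""
    return sum(loss(y, c) for y in outputs) / len(outputs)

ys = [1.0, 2.0, 2.0, 7.0]                       # hypothetical observed outputs
candidates = [i / 100 for i in range(0, 801)]   # grid of constants over [0, 8]

# empirical risk minimization under the Squared Error loss -> sample mean
best_se = min(candidates,
              key=lambda c: empirical_risk(lambda y, pred: (y - pred) ** 2, ys, c))

# empirical risk minimization under the Absolute Error loss -> sample median
best_ae = min(candidates,
              key=lambda c: empirical_risk(lambda y, pred: abs(y - pred), ys, c))
```

Here `best_se` is 3.0 (the mean of the data) and `best_ae` is 2.0 (the median), showing how the choice of loss function determines which model the minimization selects.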