Home









                   Convexity of empirical risk




     



     Empirical risk

learning rules:

ERM, SRM, MDL

Empirical  risk minimization, Structural risk minimization, Mini‐
mum description length



     Fundamental questions of learning

What is learning?

A generalization  of  Valiant’s  Probably  Approximately  Correct
learning model gives the answer.

Framework:  Domain, Label set, Training set S, The learner’s out‐
put (a prediction rule h: X‐>Y, also called a  predictor,  a  hy‐
pothesis, or a classifier), a learning algorithm A, h=A(S), error
of a classifier: the probability that the predictor fails h(x) !=
f(x)

Notation: D(A) determines how likely it is to observe a point x ∈
A, here A is a subset of X and D is the probability distribution



The error of a prediction rule, h:X‐>Y



The  subscript  (D,  f) indicates that the error is measured with
respect to the probability distribution D and the correct  label‐
ing  function f. We omit this subscript when it is clear from the
context.

It has several synonymous names: generalization error, the  risk,
or the true error of h. L means the loss of the learner.

For  ERM,  the  goal of the algorithm is to find h that minimizes









                               ‐2‐


the error with respect to the unknown D and f.

But we don’t know the D and f, the true error  is  also  unknown.
What  we can do is to minimize the training error, not the gener‐
alization error, and hope by doing this, also minimizes the  gen‐
eralization  error.  The  training error is also called empirical
error or empirical risk.



Therefore, ERM is to minimize:



overfitting: perhaps like the everyday experience that  a  person
who  provides a perfect detailed explanation for each of his sin‐
gle actions may raise suspicion.

How to rectify it?

Common solution: apply ERM learning rule over a restricted search
space. That is, ERM  with  inductive  bias.  The  learner  should
choose  in  advance (before seeing the data) a set of predictors.
This set is called a hypothesis class and is denoted  by  H.  ERM
chooses from H a h such that gives the minimum loss:






The method of introducing an inductive bias sounds promising. But
how come it enforces non‐overfitting?




A fundamental question in learning theory is, over which hypothe‐
sis classes ERM learning will not result in overfitting.

Intuitively,  choosing  a more restricted hypothesis class better
protects us against overfitting but at the same time might  cause
us a stronger inductive bias.

Now we learn about H. How do we decide the size of H?

Why?:  If  H is a finite class the ERM will not overfit, provided
it is based on a sufficiently large training  sample  (this  size
requirement will depend on the size of H).

The realizability assumption: the target function is within H.












                               ‐3‐


We  are discussing about overfitting, but why do we introduce the
realizability assumption?

Relationship between S and D: the i.i.d. assumption.

Intuitively, the training set S is a  window  through  which  the
learner  gets  partial  information about the distribution D over
the world and the labeling function, f.

Since L(h) depends on the training set, S, and that training  set
is  picked by a random process, there is randomness in the choice
of the predictor h and, consequently, in the risk L(h). Formally,
we say that it is a random variable.

It is not realistic toe expect that with full  certainty  S  will
suffice  to  direct the learner toward a good classifer (from the
point of view of D), as there is always some probability that the
sampled training data happens to be very nonrepresentative of the
underlying D.

We will therefore address the probability to  sample  a  training
set for which



is not too large.

Usually,  we  denote the probability of getting a nonorepresenta‐
tive sample by δ, and call (1‐δ) the confidence parameter of  our
prediction.

By nonrepresentative sample, what do we exactly mean?

How  bad is a sample when we say it is nonrepresentative? We need
another parameter: the accuracy parameter.

Since we cannot guarantee perfect label prediction, we  introduce
another parameter for the quality of prediction, the accuracy pa‐
rameter, commonly denoted by




h=A(S),  L_D(H_S)  is  the  loss. The probability of fail for the
generalized prediction. However, this probability is  unknown  to
us, because D and f are unknown to us. In spite of that, we still
compare it to the accuracy parameter, because we want to evaluate
the quality of our classifier. Let’s say the following event hap‐
pens:



What  is the probability of that? If the probability is low, then
we still accept that we have trained a good classifier.









                               ‐4‐


Intuitively, the larger the sample, the  better  chance  we  will
train a better classifier. But how large and what chance?

We have a learner algorithm, given S, outputs an h. We don’t know
about  D  and f, but suppose there is God who knows these, and he
computes L(h) for us. We want L(h) to be lower than the  accuracy
parameter.  But  this  is  out  of  our control. Now, suppose our
learner algorithm is fixed, and we know about D  and  f,  we  can
compute L(h) by ourselves. We also set the goal with the accuracy
parameter. The only changing variable is the sample.

What  is  the probability that the sample will lead to failure of
the learner? We are interested in upper bounding  this  probabil‐
ity. Also, we choose the size of sample to be fixed to m.




We would like to upper bound



Bad hypotheses:



Misleading samples:



Namely,  for every misleading sample, there is a "bad" hypothesis
that looks like a "good" hypothesis.

We would like to bound the probability of the event



But, since the realizability assumption implies that



It follows that the event



can only happen if for some bad hypothesis h we have L_S(h) = 0.

In other words, this event will only happen if our sample  is  in
the set of misleading samples, M.

Many  students. One never loses any point in any exam. We want to
find this student out of the group of students. Many exams.  Some
easy,  some  difficult. We choose an easy one, then many students
perform well. This easy exam is a misleading sample.  A  moderate









                               ‐5‐


student  could  make  full  score in the exam. The possibility of
mistaking a moderate student as the best one is high.

However, if the exam is difficult,  and  only  the  best  student
could make full score, then such an exam is not misleading.

If  there  is  only  one student in the group, and he is the best
student never loses points in any exam. Then it won’t matter what
sample we choose.

Therefore the size of the hypothesis class matters.

The mistake happens if the sample is in  the  set  of  misleading
samples.







Samples  that  guide our learner algorithm to output a hypothesis
that fails, that is,



belong to the set of misleading samples.










Fail upper bound, with union bound:




Dˆm is a joint probability distribution. ({}), this specifies the
event. Dˆm({}) is the probability of such an event.

Fix a bad hypothesis:

















                               ‐6‐


A well‐known inequality:









We don’t know the number of bad hypotheses, but it must be  lower
than |H|.

For each of the bad hypotheses, their loss is larger than the ac‐
curacy  parameter.  With  such  accuracy  parameter,  in order to
choose a sample to mislead our learner to pick this bad  hypothe‐
sis, we must carefully choose from the distribution. But we don’t
construct  a  sample. So for large accuracy parameter, the proba‐
bility of receiving a misleading sample is low.




For a sufficiently large m, the ERM rule over a finite hypothesis
class will be probably approximately correct:




Reasoning:




The probability of receiving a nonrepresentative sample  is  less
than  right‐side. Then the probability of receiving a representa‐
tive sample is greather than (1 ‐ rightside).   (1‐rightside)  is
monotonically  increasing,  therefore,  the greater m, the higher
the probability that we receive a representative sample. Then  we
choose m as:



We  get the probability of 1 ‐ δ. If m is larger, the probability
will be higher than 1 ‐ δ.



PAC learnability



     










                               ‐7‐


Since the training set is randomly generated, there may always be
a small chance that it will happen to be noninformative (for  ex‐
ample,  there  is  always  some chance that the training set will
contain only one domain point, sampled over and over again).

Even if we get a training sample that is informative, because  it
is  finite,  there  will  always be some details that it fails to
capture. Our accuracy parameter, allows "forgiving" the learner’s
classifier for making minor errors.


Sample complexity: the number of examples required to guarantee a
probably approximately correct solution.

There are infinite classes that are learnable as well.

What determines the PAC  learnability  of  a  class  is  not  its
finiteness but rather a combinatorial measure: VC dimension.



Learning problems beyond binary classification


     By  allowing  a variety of loss functions, we can do regres‐
sion, multiclass classification.


Agnostic PAC learning


     The realizability assumption is waived.


Various learning algorithms


     The general learning principle behind the algorithms.