The running example
The underlying data is generated by sin(2πx) + Gaussian noise.
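A minimal sketch of how such a synthetic data set could be generated (the number of points N = 10 and the noise standard deviation 0.3 are illustrative assumptions):

    import numpy as np

    # Draw N inputs uniformly in [0, 1] and generate targets as sin(2*pi*x)
    # plus zero-mean Gaussian noise (noise level 0.3 is an assumption).
    rng = np.random.default_rng(0)
    N = 10
    x = rng.uniform(0.0, 1.0, size=N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=N)

The later sketches in these notes reuse this x and t.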
Probability theory provides a framework for expressing uncertainty in a precise and quantitative manner.
Decision theory allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria.
Polynomial fitting:
The polynomial y(x, w) is linear in the unknown parameters. Such models are called linear models.
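For reference, the model is the M-th order polynomial

    y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j,

which is a nonlinear function of x but a linear function of the coefficients w_0, ..., w_M.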
Now, how do we train the model, i.e. find the values of the unknown parameters?
The values of the coefficients will be determined by fitting the
polynomial to the training data.
This can be done by minimizing an error function that measures
the misfit between the function and the training set data points.
One simple choice of error function is the sum of the squares of
the errors between the predictions for each data point and the
corresponding target values.
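Written out for a training set {(x_n, t_n)}, n = 1, ..., N, this error function is

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2,

where the factor of 1/2 is included for later convenience.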
Minimize the error function:
We can solve the curve fitting problem by choosing the value of w
for which E(w) is as small as possible.
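Because the model is linear in w, E(w) is a quadratic function of the coefficients and its minimizer w* can be found in closed form. A minimal sketch, continuing from the data-generation code above (the choice M = 3 is illustrative; np.polynomial.polynomial.polyfit performs an ordinary least-squares fit):

    import numpy as np

    M = 3  # polynomial order (illustrative choice)

    # Least-squares fit: returns the coefficients w_0, ..., w_M minimizing E(w)
    # for the x and t generated in the earlier sketch.
    w_star = np.polynomial.polynomial.polyfit(x, t, deg=M)

    # Evaluate the fitted curve on a dense grid for plotting or prediction.
    x_grid = np.linspace(0.0, 1.0, 100)
    y_grid = np.polynomial.polynomial.polyval(x_grid, w_star)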
There remains the problem of choosing the order M of the polynomial.
Choosing too large an order M leads to an over-fitted model.
RMS: root mean square
The division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t.
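Concretely, the RMS error is defined as

    E_{RMS} = \sqrt{ 2 E(w^*) / N },

where w^* denotes the coefficients that minimize E(w) on the training set.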
We might suppose that the best predictor for new data would be the function sin(2πx) from which the data was generated (and this is indeed the case).
A power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.
However, as M increases we observe over-fitting instead, which is counterintuitive.
We can gain some insight into the problem by examining the values of the coefficients w* obtained from polynomials of various orders.
As M increases, the magnitude of the coefficients typically gets
larger.
What is happening is that the more flexible polynomials with
larger values of M are becoming increasingly tuned to the random
noise on the target values.
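A quick way to observe this, reusing x, t and the least-squares fit from the sketches above (the set of orders compared is an illustrative choice):

    # Compare coefficient magnitudes for different polynomial orders: as M grows,
    # the fitted coefficients typically blow up as they start to chase the noise.
    for M in (0, 1, 3, 9):
        w_star = np.polynomial.polynomial.polyfit(x, t, deg=M)
        print(f"M={M}: max |w_j| = {np.abs(w_star).max():.2f}")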
For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases.
But why is least squares a good choice for an error function?
We shall see that the least squares approach to finding the model
parameters represents a specific case of maximum likelihood, and
that the over‐fitting problem can be understood as a general
property of maximum likelihood.
By adopting a Bayesian approach, the over‐fitting problem can be
avoided.
We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.
One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values.
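With the simplest (quadratic) penalty, the regularized error function takes the form

    \tilde{E}(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{\lambda}{2} \lVert w \rVert^2,

where \lVert w \rVert^2 = w_0^2 + w_1^2 + \cdots + w_M^2 and the coefficient λ governs the relative importance of the penalty term compared with the sum-of-squares error.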
Different penalties:
Larger penalties yield smoother prediction curves.
Note that often the coefficient w_0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable.
The particular case of a quadratic regularizer is called ridge
regression. In the context of neural networks, this approach is
known as weight decay.
The regularization has the desired effect of reducing the magnitude of the coefficients.
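A minimal sketch of ridge regression for this problem via the regularized normal equations, reusing x and t from above (the order M = 9, the value of λ, and the plain Vandermonde design matrix are illustrative assumptions; note that, unlike the remark above, this version also penalizes w_0):

    import numpy as np

    def ridge_polyfit(x, t, M, lam):
        # Minimize (1/2)*||Phi w - t||^2 + (lam/2)*||w||^2 in closed form.
        Phi = np.vander(x, M + 1, increasing=True)     # columns are x^0, ..., x^M
        A = Phi.T @ Phi + lam * np.eye(M + 1)          # regularized normal equations
        return np.linalg.solve(A, Phi.T @ t)

    # Example: a 9th-order polynomial with ln(lambda) = -18 (illustrative value).
    w_ridge = ridge_polyfit(x, t, M=9, lam=np.exp(-18))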
The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error for both training and test sets against ln λ.
We see that in effect λ now controls the effective complexity of
the model and hence determines the degree of over‐fitting.
The issue of model complexity is an important one.
How do we optimize the model complexity (either M or λ)?
One approach is to partition the available data into a training set, used to determine the coefficients w, and a separate validation set (also called a hold-out set), used to optimize the model complexity. In many cases, however, this will prove to be too wasteful of valuable training data.
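A minimal sketch of such a hold-out procedure for choosing λ (the data-set size, the 80/20 split, and the candidate grid are illustrative assumptions; ridge_polyfit is the helper defined in the ridge-regression sketch above):

    import numpy as np

    rng = np.random.default_rng(1)
    N = 30
    x_all = rng.uniform(0.0, 1.0, size=N)
    t_all = np.sin(2 * np.pi * x_all) + rng.normal(0.0, 0.3, size=N)

    # Partition into a training set and a hold-out validation set.
    n_train = int(0.8 * N)
    x_tr, t_tr = x_all[:n_train], t_all[:n_train]
    x_val, t_val = x_all[n_train:], t_all[n_train:]

    def rms_error(w, x, t):
        # Root-mean-square error of the polynomial with coefficients w on (x, t).
        y = np.polynomial.polynomial.polyval(x, w)
        return np.sqrt(np.mean((y - t) ** 2))

    # Choose the lambda whose fit on the training set has the lowest validation RMS error.
    candidates = [np.exp(k) for k in range(-20, 1, 2)]
    best_lam = min(candidates,
                   key=lambda lam: rms_error(ridge_polyfit(x_tr, t_tr, 9, lam), x_val, t_val))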
Returning to the earlier question: why is least squares a good choice for an error function?
With probability theory, we can express our uncertainty over the value of the target variable using a probability distribution.
We shall assume that, given the value of x, the corresponding
value of t has a Gaussian distribution with a mean equal to the
value y(x, w) of the polynomial curve.
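In symbols, with β denoting the precision (inverse variance) of the Gaussian noise,

    p(t \mid x, w, \beta) = \mathcal{N}\left( t \mid y(x, w), \beta^{-1} \right).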
Conclusion: the sum‐of‐squares error function has arisen as a
consequence of maximizing likelihood under the assumption of a
Gaussian noise distribution.
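To see this, take the log likelihood of the i.i.d. training data under the Gaussian noise model:

    \ln p(t \mid x, w, \beta) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi).

Only the first term depends on w, so maximizing the likelihood with respect to w is equivalent to minimizing the sum-of-squares error.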
What is the Bayesian approach?