Support Vector Machines
The support vector machine (SVM) is a theoretically well-motivated and practically effective classification algorithm. It is also called the maximum-margin classifier.
The theoretical foundation for SVMs is the notion of margin.
Separable datasets
For a linearly separable dataset, there are infinitely many separating hyperplanes. Which hyperplane should a learning algorithm select?
SVM chooses the hyperplane with the maximum margin, known as the maximum-margin hyperplane.
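As a small illustration (a minimal sketch in Python; the toy data, w, and b below are made-up values), the margin of a given separating hyperplane <w,x> + b = 0 on a labeled sample is the smallest distance from a training point to the hyperplane, min over i of y_i(<w, x_i> + b) / ||w||:

import numpy as np

# Toy linearly separable data (assumed for illustration only).
X = np.array([[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A candidate separating hyperplane <w, x> + b = 0 (also assumed).
w = np.array([1.0, 1.0])
b = 0.0

# Geometric margin: the smallest signed distance of a training point to the
# hyperplane; it is positive exactly when every point is classified correctly.
margin = np.min(y * (X @ w + b)) / np.linalg.norm(w)
print(margin)   # sqrt(2) ~ 1.414 for this toy data

SVM searches over (w, b) for the hyperplane that maximizes this quantity.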
The SVM solution can be viewed as the "safest" choice in the following sense: a test point is classified correctly by a separating hyperplane with margin ρ even when it falls within a distance ρ of the training samples sharing the same label. (Here "training samples" means the individual training points.)
For the SVM solution, ρ is the maximum margin and thus the
"safest" value.
Primal optimization problem
The general equation of a hyperplane is <w,x> + b = 0, where w is a nonzero vector and b is a scalar.
w is normal to the hyperplane.
(The set of solutions of Ax=b is obtained by translating the solution set of Ax=0, using any particular solution p of Ax=b.)
When A is the standard matrix of the transformation f, this says
that
[f:d]=[f:0]+p for any p in [f:d]
When A is a 1 x n matrix, the equation Ax=d may be written with an inner product as <n,x> = d, where n is the vector in R^n with the same entries as A. Then [f:0] = {x: <n,x>=0}, which shows
that [f:0] is the orthogonal complement of the subspace spanned
by n. In the terminology of calculus and geometry, n is called a
normal vector to [f:0]. (A "normal" vector in this sense need not
have unit length.) Also, n is said to be normal to each parallel
hyperplane [f:d], even though <n,x> is not zero when d != 0.
Example hyperplanes:
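A small made-up example in R^2: take n = (1, 2) and d = 4, so the hyperplane [f:4] = {x: x_1 + 2x_2 = 4} is the line through (4, 0) and (0, 2). It equals [f:0] + p for the particular solution p = (4, 0), where [f:0] = {x: x_1 + 2x_2 = 0} is the line through the origin spanned by (2, -1), and n = (1, 2) is perpendicular to both lines. In the <w,x> + b = 0 notation this is w = (1, 2), b = -4.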
Note that this definition of a hyperplane is invariant to multiplication by a nonzero scalar: <w,x> + b = 0 and <cw,x> + cb = 0 (with c != 0) define exactly the same set of points.
Hence, for a hyperplane that does not pass through any sample point, we can scale w and b appropriately such that min |<w,x> + b| = 1, the minimum being taken over the sample points. (w is normal to the hyperplane in the geometric sense of "perpendicular"; the rescaling here normalizes the minimum of |<w,x> + b| to 1 and has nothing to do with that sense of "normal". The hyperplane represented this way is called the canonical hyperplane, and |<w,x> + b| = 1 defines the two marginal hyperplanes.)
The sample points x achieving min |<w,x> + b| = 1 are the support vectors. Support vectors determine the marginal hyperplanes.
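A minimal numerical sketch of this rescaling (the sample points and the initial (w, b) are assumed values): dividing w and b by c = min over i of |<w, x_i> + b| puts the closest points exactly on the marginal hyperplanes |<w,x> + b| = 1.

import numpy as np

X = np.array([[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])   # toy sample
w = np.array([1.0, 1.0])                                             # assumed hyperplane
b = 0.0

c = np.min(np.abs(X @ w + b))        # smallest |<w, x_i> + b| over the sample
w_c, b_c = w / c, b / c              # canonical representation of the same hyperplane

print(np.min(np.abs(X @ w_c + b_c)))               # 1.0 by construction
print(X[np.isclose(np.abs(X @ w_c + b_c), 1.0)])   # the support vectors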
The logic sequence:
1. We decided to use <w,x> + b = 0 as the classifier. First we choose a w, which determines the orientation of the hyperplane. Then, by choosing b, we place the hyperplane in the middle, so that the closest points on both sides are at the same distance; this is the natural thing to do.
2. On both sides of the hyperplane, there are training points
that are closest to the hyperplane <w,x> + b = 0. These are the
support vectors.
3. The support vectors determine two marginal hyperplanes: <w, x> + b = c through a support vector x_1 on one side, and <w, x> + b = -c through a support vector x_2 on the other side. Since hyperplanes are invariant to nonzero scaling, we can rescale these to <w/c, x> + b/c = 1 and <w/c, x> + b/c = -1. In this way we simplify the problem and save ourselves the trouble of dealing with c: optimizing over (w,b) is no different from optimizing over (w/c, b/c), so from now on we treat (w/c, b/c) as the new (w,b) we optimize over.
4. Now we have three hyperplanes: the separating hyperplane and the two marginal hyperplanes. Our goal is to optimize w and b according to some criterion, so we need to be able to measure distance. The criterion is to choose the hyperplane that maximizes that distance, that is, the margin.
5. We measure the margin in the direction of the normal vector, that is, w. The margin is the distance between the separating hyperplane and its marginal hyperplane.
6. Suppose x_0 is not on the hyperplane <w, x> + b = 0. We add cw to x_0 so that x_0 + cw is on the hyperplane, where c is a scalar. Then <w, x_0 + cw> + b = 0, i.e. <w, x_0> + c||w||^2 + b = 0, which gives c = -(<w, x_0> + b) / ||w||^2. The distance is ||cw|| = |c| ||w|| = |<w, x_0> + b| / ||w||.
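A quick numerical check of this derivation (w, b, and x_0 are arbitrary assumed values): step from x_0 along w until the hyperplane is reached and compare that step's length with |<w, x_0> + b| / ||w||.

import numpy as np

w = np.array([3.0, 4.0])
b = -5.0
x0 = np.array([2.0, 1.0])

c = -(w @ x0 + b) / (w @ w)     # solve <w, x0 + c w> + b = 0 for c, as in step 6
foot = x0 + c * w               # the closest point to x0 on the hyperplane

print(np.linalg.norm(c * w))                    # distance via the construction: 1.0
print(abs(w @ x0 + b) / np.linalg.norm(w))      # distance via the formula: 1.0
print(w @ foot + b)                             # ~0, so 'foot' lies on the hyperplane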
The distance of any point x_0 to the hyperplane <w,x> + b = 0 is therefore |<w, x_0> + b| / ||w||.
For a canonical hyperplane, the margin ρ is given by ρ = min |<w,x> + b| / ||w|| = 1 / ||w||, the minimum again taken over the sample points.
So far, we have only discussed margins. To classify correctly, there is one constraint to keep: every training point (x_i, y_i) must lie on the correct side of its marginal hyperplane, that is, y_i(<w, x_i> + b) >= 1.
Finally, we have transformed the original problem into a convex optimization problem. The original problem was: find a hyperplane maximizing the margin while correctly classifying all training points.
What is the difference between | | and || || ?
| | is the absolute value. || || is the norm.
Dot product and norm: the two are related by <x,x> = ||x||^2, i.e. ||x|| = sqrt(<x,x>).
A canonical hyperplane is a separating hyperplane scaled so that its marginal hyperplanes satisfy |<w,x> + b| = 1.
Therefore, maximizing the margin ρ = 1/||w|| of a canonical hyperplane is equivalent to minimizing ||w||. In the optimization problem we usually minimize (1/2)||w||^2 instead, which has the same minimizer and is more convenient to differentiate.
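Putting the pieces together, the primal problem reads: minimize (1/2)||w||^2 over (w, b) subject to y_i(<w, x_i> + b) >= 1 for every training point (x_i, y_i). A minimal sketch of solving it with a generic convex solver (this assumes the cvxpy package and made-up toy data; a dedicated SVM solver would normally be used instead):

import numpy as np
import cvxpy as cp

X = np.array([[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])   # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])                                 # labels in {-1, +1}

w = cp.Variable(2)
b = cp.Variable()

objective = cp.Minimize(0.5 * cp.sum_squares(w))      # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (<w, x_i> + b) >= 1
cp.Problem(objective, constraints).solve()

print(w.value, b.value)
print(y * (X @ w.value + b.value))   # all >= 1 up to solver tolerance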
The objective function F: w -> (1/2)||w||^2 is infinitely differentiable. Its gradient is w. Its Hessian is the identity matrix I, whose eigenvalues are strictly positive. Therefore F is strictly convex.
The constraints are all defined by affine functions and are thus
qualified.
Thus, in view of the results known for convex optimization, the optimization problem admits a unique solution, an important and favorable property that does not hold for all learning algorithms.
Constrained optimization problem: minimize f(x) subject to constraints g_i(x) <= 0, i = 1, ..., m (and possibly equality constraints); the SVM primal above has exactly this form, with affine constraint functions.
Gradient vectors: the gradient of f(x_1, ..., x_n) is the vector of its partial derivatives, grad f = (∂f/∂x_1, ..., ∂f/∂x_n); it points in the direction in which f increases fastest, and -grad f points in the direction of steepest descent.
If you look at the map showing contours within Yosemite National Park in California, you will notice that the streams flow perpendicular to the contours. The streams are following paths of steepest descent so the waters reach lower elevations as quickly as possible. Therefore, the fastest instantaneous rate of change in a stream's elevation above sea level has a particular direction.
Directional derivative: the rate of change of f at a point x in the direction of a unit vector u, defined as the limit D_u f(x) = lim as h -> 0 of [f(x + h u) - f(x)] / h.
An example of computing directional derivative:
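The worked example itself is not preserved in these notes. A sketch consistent with the stated result, assuming f(x, y) = x^2 + 3y, the point (1, 1), and the unit direction u = (1/sqrt(2), 1/sqrt(2)): from the limit definition,
f(1 + h/sqrt(2), 1 + h/sqrt(2)) = (1 + h/sqrt(2))^2 + 3(1 + h/sqrt(2)) = 4 + 5h/sqrt(2) + h^2/2,
so [f((1,1) + h u) - f(1,1)] / h = 5/sqrt(2) + h/2, which tends to 5/sqrt(2) as h -> 0.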
This means that, moving (x,y) one unit in the direction of u, the value of f increases by approximately 5/sqrt(2) (to first order).
Why do we need the directional derivative? When we study derivatives, we try to understand the relationship between the value of a function and its variables, that is, how they change with respect to each other. At its core, a derivative is a limit.
But how is the directional derivative related to the partial derivative?
At its core, a partial derivative is also a limit, but this time only one variable moves and all the others are held fixed. In the case of the directional derivative, the point moves in an arbitrary direction, so all coordinates can change at once. Putting the two definitions side by side makes the difference clear.
So when we talk about the directional derivative, we need to specify a unit vector, and the rate of change reflects all coordinates changing at the same time. When we talk about a partial derivative, the change is along a single coordinate direction only. It looks like there is some connection between the two.
Now that we are interested in gradients and Hessians, which of these derivatives do we need?
As the example above shows, computing a directional derivative from the limit definition is a lot of work; is there a better way? This is where the gradient comes to help: the directional derivative can be computed as the dot product D_u f = <grad f, u>.
This makes sense, since the directional derivative is the combined effect of all the partial derivatives acting at the same time. That is how the concepts of partial derivative, directional derivative, and gradient relate. The gradient itself has nothing to do with the unit vector; the unit vector only comes into play when the directional derivative is computed. Therefore, the directional derivative is a dot product.
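A short sketch of this relationship in Python, reusing the assumed example function from above: the dot product <grad f, u> agrees with a finite-difference estimate of the directional derivative.

import numpy as np

def f(p):
    x, y = p
    return x**2 + 3*y                  # the assumed example function

p = np.array([1.0, 1.0])
u = np.array([1.0, 1.0]) / np.sqrt(2)  # unit direction

grad = np.array([2*p[0], 3.0])         # analytic gradient (2x, 3) at p
print(grad @ u)                        # 5/sqrt(2) ~ 3.5355

h = 1e-6
print((f(p + h*u) - f(p)) / h)         # ~ the same value, from the limit definition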
We study directional derivatives because we want to find the extreme values and saddle points of a function.
Continuous functions of two variables assume extreme values on
closed, bounded domains.
A function of two variables can assume extreme values only at domain boundary points or at interior domain points where both first partial derivatives are zero or where one or both of the first partial derivatives fail to exist. These conditions identify candidates for local maxima and local minima only; moreover, the vanishing of the derivatives at an interior point (a,b) does not always signal the presence of an extreme value. The surface that is the graph of the function might be shaped like a saddle right above (a,b) and cross its tangent plane there.
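For example, f(x, y) = y^2 - x^2 has f_x = -2x = 0 and f_y = 2y = 0 at the origin, yet (0, 0) is not an extreme value: f increases along the y-axis and decreases along the x-axis, so the graph is a saddle that crosses its tangent plane z = 0 there.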
As with functions of a single variable, the key to identifying the local extrema is the First Derivative Test. To find the local extreme values of a function of a single variable, we look for points where the graph has a horizontal tangent line. At such points, we then look for local maxima, local minima, and points of inflection. For a function f(x,y) of two variables, we look for points where the surface z=f(x,y) has a horizontal tangent plane. At such points, we then look for local maxima, local minima, and saddle points.
Local extrema are also called relative extrema.
First Derivative Test: if f(x,y) has a local maximum or minimum value at an interior point (a,b) of its domain and if the first partial derivatives exist there, then f_x(a,b) = 0 and f_y(a,b) = 0.
To solve an unconstrained convex optimization problem, we take the gradient of the objective and set it to 0. Then why do we need to know about the Hessian? Compare:
First and second derivative tests (for finding extreme values)
First and second order conditions (for determining convexity)
Partial derivative:
The calculus of several variables is similar to single-variable calculus applied to several variables one at a time. When we hold all but one of the independent variables of a function constant and differentiate with respect to that one variable, we get a "partial" derivative.
An equivalent expression for the partial derivative is the ordinary derivative of the single-variable function obtained by freezing the other variable: (d/dx) f(x, y_0), evaluated at x = x_0.
To distinguish partial derivatives from ordinary derivatives, we use the symbol ∂ rather than the d previously used.
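For example, if f(x, y) = x^2 y + y, then ∂f/∂x = 2xy (treating y as a constant) and ∂f/∂y = x^2 + 1 (treating x as a constant).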
Partial derivative geometrically: ∂f/∂x at (x_0, y_0) is the slope of the curve cut from the surface z = f(x,y) by the vertical plane y = y_0, at the point (x_0, y_0, f(x_0, y_0)).
Hessian matrix: the square matrix of all second-order partial derivatives, with entry (i,j) equal to ∂^2 f / ∂x_i ∂x_j. The Hessian plays the role of the second derivative for a function of several variables.
Why is the Hessian important? Answer: second-order conditions. With these conditions we determine whether f is convex: a twice-differentiable f is convex if and only if its Hessian is positive semidefinite everywhere on its (convex) domain.
For a function on R, this reduces to the simple condition f’’(x)
>= 0, which means that the derivative is nondecreasing.
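For example, f(x) = x^4 has f''(x) = 12x^2 >= 0 everywhere and is convex, while f(x) = x^3 has f''(x) = 6x, which is negative for x < 0, so it is not convex on R.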
Question: if we already know the gradient vector, can we compute the Hessian matrix from it? Answer: yes; differentiating each component of the gradient once more gives the Hessian (it is the Jacobian of the gradient).
Taking a second derivative means differentiating once and then differentiating the result another time; the two differentiations do not happen at the same time. The first differentiation of (1/2)||w||^2 with respect to w_i already filters out all the terms unrelated to w_i, leaving just w_i. Differentiating w_i again, with respect to w_j, then gives 1 when j = i and 0 otherwise. Hence the final result is the identity matrix I, rather than something like w_i + w_j (which would only arise if we differentiated with respect to w_i and w_j at the same time, which is not how differentiation works: by definition, the operation acts on one variable at a time, holding the others constant).
How do we compute the gradient of the squared norm? Write (1/2)||w||^2 = (1/2)(w_1^2 + ... + w_n^2). Taking the partial derivative in each direction gives ∂/∂w_i of (1/2)||w||^2 = w_i, so the gradient vector, the vector of partial derivatives, is w itself. (Here we see the reason for the factor 1/2 in the optimization problem: it cancels the 2 produced by differentiating the square.)
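A small numerical sketch confirming both claims (the test point w is an arbitrary assumed value): central finite differences of F(w) = (1/2)||w||^2 recover the gradient w and an identity Hessian.

import numpy as np

def F(w):
    return 0.5 * np.dot(w, w)          # F(w) = (1/2) ||w||^2

w = np.array([1.0, -2.0, 0.5])         # arbitrary test point
n, h = w.size, 1e-3
E = np.eye(n)

grad = np.array([(F(w + h*E[i]) - F(w - h*E[i])) / (2*h) for i in range(n)])
hess = np.array([[(F(w + h*E[i] + h*E[j]) - F(w + h*E[i] - h*E[j])
                   - F(w - h*E[i] + h*E[j]) + F(w - h*E[i] - h*E[j])) / (4*h*h)
                  for j in range(n)] for i in range(n)])

print(np.round(grad, 6))   # ~ w itself
print(np.round(hess, 6))   # ~ the identity matrix I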
Hyperplanes will be important for understanding the multidimensional nature of linear programming problems.
Dual optimization problem
Non-separable datasets
The problem of linear classification
Input space X
Output/Target space Y = {‐1, +1}
Target function f: X‐>Y
Given a hypothesis set H of functions mapping X to Y
Training sample S of size m drawn i.i.d. from X (according to an unknown distribution D)
The problem consists of determining a hypothesis h from H, a binary classifier, with small generalization error.
Question: How do we choose H? Answer: follow Occam's razor principle.
Margin theory
What do we want? Generalization bounds.
Occam’s razor principle
Hypothesis sets with smaller complexity -- e.g., smaller VC-dimension or Rademacher complexity -- provide better learning guarantees.
A natural hypothesis set with relatively small complexity is that of linear classifiers, or hyperplanes.
A hypothesis of the form x -> sign(<w,x> + b) thus labels positively all points falling on one side of the hyperplane <w,x> + b = 0 and negatively all others.
The problem is referred to as a linear classification problem.
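A minimal sketch of such a hypothesis (w, b, and the query points are assumed values): the classifier just reports which side of the hyperplane each point falls on.

import numpy as np

def h(X, w, b):
    # Linear classifier x -> sign(<w, x> + b), with labels in {-1, +1}.
    return np.where(X @ w + b >= 0, 1, -1)

w = np.array([1.0, -2.0])                 # assumed weight vector
b = 0.5                                   # assumed bias
X = np.array([[3.0, 1.0], [0.0, 2.0]])    # assumed query points
print(h(X, w, b))                         # [ 1 -1]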
Hyperplane
Check geometry_of_vector_spaces.m