The VC-dimension of the set of all closed balls in R^n: d <= n+2
Tool:
Theorem: The VC dimension of the class of nonhomogeneous halfspaces in R^d is d+1 (for homogeneous halfspaces: d).
(Shalev-Shwartz & Ben-David, Theorems 9.2 and 9.3)
Def Halfspace
Linear predictors are a family of hypothesis classes. Halfspaces are one hypothesis class in this family. Other members of the family: linear regression predictors, logistic regression predictors.
See the definition of the class of affine functions, denoted Ld.
All linear predictors use Ld as the basic building block. For binary classification, we compose the sign function with it; for regression, the identity function.
To absorb b: w' = (b, w1, ..., wd), x' = (1, x1, ..., xd). See (Shalev-Shwartz & Ben-David 9.1).
The hypothesis class of halfspaces is obtained by composing the sign function with Ld (Shalev-Shwartz & Ben-David 9.1).
That is: x -> sign(<w, x> + b)
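A minimal numpy sketch of the halfspace predictor and the bias-absorption trick above (function names are mine, not from the book):

```python
import numpy as np

def predict(w, b, x):
    """Halfspace predictor: x -> sign(<w, x> + b)."""
    return np.sign(np.dot(w, x) + b)

def homogenize(w, b, x):
    """Absorb the bias: w' = (b, w1, ..., wd), x' = (1, x1, ..., xd)."""
    w_prime = np.concatenate(([b], w))
    x_prime = np.concatenate(([1.0], x))
    return w_prime, x_prime

w, b = np.array([1.0, -2.0]), 0.5
x = np.array([3.0, 1.0])
w_p, x_p = homogenize(w, b, x)
# sign(<w, x> + b) agrees with sign(<w', x'>) on every input
assert predict(w, b, x) == np.sign(np.dot(w_p, x_p))
```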
In the case of d=2, each halfspace is bounded by a hyperplane (here, a line) perpendicular to the normal vector w.
Instances above the hyperplane form an acute angle with w (labeled positively); instances below it form an obtuse angle with w (labeled negatively).
Given a sufficiently large sample, we can learn halfspaces in the ERM paradigm.
We need to consider the realizable case and the agnostic case. For the realizable case, there are two solutions: linear programming and the Perceptron.
For the agnostic case: logistic regression.
See (Shalev-Shwartz & Ben-David 9.1) for the concept of linear programs: maximize a linear function subject to linear inequality constraints.
Linear programs can be solved efficiently. The key point is how to turn the ERM problem for halfspaces in the realizable case into a linear program.
Constraints: yi * <w, xi> > 0 for every example (xi, yi).
Suppose w* is one solution that satisfies the constraints. Since the sample is finite, scaling w* (by 1 / mini yi<w*, xi>) yields a w such that yi * <w, xi> >= 1 for all i.
yi * <w, xi> >= 1 can be rewritten as Aw >= v, where v = (1, ..., 1) and Ai,j = yi * xi,j.
Since we just need a feasible w, there is nothing to optimize: in the realizable case, the empirical risk of any feasible w is 0. We set a "dummy" (e.g. zero) objective. See (Shalev-Shwartz & Ben-David 9.1.1).
Once we have constructed A and v, an LP solver finds the w.
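The construction above can be sketched with scipy's LP solver (a sketch under my own naming; note that `linprog` minimizes subject to "<=" constraints and defaults to non-negative variables, so we negate Aw >= v and explicitly unbound w):

```python
import numpy as np
from scipy.optimize import linprog

def erm_halfspace_lp(X, y):
    """Realizable ERM for homogeneous halfspaces as a linear program:
    find any w with y_i * <w, x_i> >= 1 for all i, via a zero
    ("dummy") objective."""
    A = y[:, None] * X                      # A[i, j] = y_i * x_{i,j}
    d = X.shape[1]
    res = linprog(c=np.zeros(d),            # dummy objective
                  A_ub=-A,                  # A w >= 1  <=>  -A w <= -1
                  b_ub=-np.ones(len(y)),
                  bounds=[(None, None)] * d)  # w is unconstrained
    return res.x

# Tiny linearly separable sample (e.g. w = (1, 1) separates it):
X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])
w = erm_halfspace_lp(X, y)
assert np.all(y * (X @ w) >= 1 - 1e-8)      # zero empirical risk
```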
Perceptron for Halfspaces
w is initialized to the zero vector. At each iteration, w is updated with a misclassified example.
At iteration t, pick an example with y<w, x> <= 0 and update w' = w + yx. Then y<w', x> = y<w, x> + ||x||^2 (since y^2 = 1), so the value increases; after enough such updates, y<w', x> > 0.
But will it converge? If so, what is the convergence rate?
See (Shalev-Shwartz & Ben-David Theorem 9.1): the Perceptron algorithm stops after at most (RB)^2 iterations, where R = maxi ||xi|| and B = min{||w|| : yi<w, xi> >= 1 for all i}. ("Stops" means all examples are correctly classified; this holds only in the realizable case.)
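The update rule above, as a short sketch (my own function names; the mistake bound guards the loop):

```python
import numpy as np

def perceptron(X, y, max_iters=10_000):
    """Batch Perceptron: start from w = 0 and, while some example is
    misclassified (y_i <w, x_i> <= 0), update w <- w + y_i x_i.
    In the realizable case it stops within (R*B)^2 updates."""
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        mistakes = np.nonzero(y * (X @ w) <= 0)[0]
        if len(mistakes) == 0:
            return w                      # all examples classified
        i = mistakes[0]
        w = w + y[i] * X[i]               # y_i<w', x_i> grows by ||x_i||^2
    raise RuntimeError("no convergence: data may not be separable")

X = np.array([[2., 1.], [1., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])
w = perceptron(X, y)
assert np.all(y * (X @ w) > 0)
```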
Can we represent the balls as halfspaces?
||x - x0|| <= r
<=> ||x||^2 - 2<x0, x> + ||x0||^2 - r^2 <= 0   (squaring both sides and expanding)
Let x' = (x, ||x||^2, 1), a vector in R^(n+2),
and w = (-2x0, 1, b), also in R^(n+2), where b = ||x0||^2 - r^2.
=> the ball condition becomes <w, x'> <= 0, a homogeneous halfspace in R^(n+2).
=> VCdim(balls in R^n) <= VCdim(homogeneous halfspaces in R^(n+2)) = n+2.
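The embedding can be checked numerically (a sketch; the helper names are mine):

```python
import numpy as np

def ball_to_halfspace(x0, r):
    """Map the ball {x : ||x - x0|| <= r} in R^n to a homogeneous
    halfspace in R^(n+2): w = (-2*x0, 1, b), b = ||x0||^2 - r^2."""
    b = np.dot(x0, x0) - r**2
    return np.concatenate((-2 * x0, [1.0, b]))

def lift(x):
    """x' = (x, ||x||^2, 1) in R^(n+2)."""
    return np.concatenate((x, [np.dot(x, x), 1.0]))

x0, r = np.array([1.0, 2.0]), 1.5
w = ball_to_halfspace(x0, r)
inside  = np.array([1.0, 1.0])   # ||x - x0|| = 1   <= 1.5
outside = np.array([3.0, 2.0])   # ||x - x0|| = 2   >  1.5
assert np.dot(w, lift(inside)) <= 0    # in the ball  <=> <w, x'> <= 0
assert np.dot(w, lift(outside)) > 0
```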
See (Shalev-Shwartz & Ben-David Theorem 9.2) for the proof that the VC dimension of homogeneous halfspaces equals d.
The proof uses the fact that any n+1 vectors in R^n are linearly dependent.
That is, there exists a linear combination of the n+1 vectors that equals 0, with at least one non-zero weight.
Then we separate the weights into positive and negative (zero weights have no influence).
Suppose the (homogeneous) halfspace class shatters the n+1 points. Then in particular there exists w such that <w, xi> > 0 and <w, xj> < 0, where the xi correspond to the examples with positive weights and the xj to those with negative weights.
This leads to a contradiction (see Shalev-Shwartz & Ben-David Theorem 9.2).
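Writing out the contradiction (notation follows the sketch above; I and J denote the index sets of the positive and negative weights a_i):

```latex
% linear dependence, weights split by sign:
\sum_{i \in I} a_i x_i \;=\; \sum_{j \in J} |a_j|\, x_j
% take the inner product of both sides with w:
0 \;<\; \sum_{i \in I} a_i \langle x_i, w \rangle
  \;=\; \sum_{j \in J} |a_j| \langle x_j, w \rangle \;<\; 0
```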
Therefore, halfspaces in R^n can't shatter n+1 points.
To show that halfspaces in R^n can shatter n points, we just need to give an example:
e1 = (1, 0, 0, ...)
e2 = (0, 1, 0, ...)
...
For any labeling (y1, ..., yn), set w = (y1, y2, ..., yn).
<w, ei> = yi, so every labeling is realized.
The halfspace class shatters these n points. VCdim >= n.
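The lower-bound example can be verified exhaustively for small n (a sketch; the function name is mine):

```python
import numpy as np
from itertools import product

def shatters_basis(n):
    """Check that homogeneous halfspaces in R^n realize every labeling
    of the standard basis e_1, ..., e_n by taking w = (y_1, ..., y_n)."""
    E = np.eye(n)                       # rows are e_1, ..., e_n
    for y in product([-1.0, 1.0], repeat=n):
        w = np.array(y)                 # <w, e_i> = y_i
        if not np.all(np.sign(E @ w) == y):
            return False
    return True                         # all 2^n labelings realized

assert shatters_basis(3)                # so VCdim >= 3 in R^3
```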