                        Gradient Descent




     The gradient of a multivariable function is the vector of
its partial derivatives.

Gradient descent is an iterative algorithm: the computation
starts from an initial guess and repeatedly updates it.

(See Shalev-Shwartz & Ben-David 14.1)

For an iterative algorithm, the crucial point is the update rule.
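The update rule can be sketched in a few lines. As an illustration (the
quadratic objective f(w) = w^2, the step size, and the iteration count
below are assumptions for the sketch, not from the notes):

```python
# Minimal gradient-descent sketch.  The objective f(w) = w**2 (whose
# gradient is 2*w), the step size eta, and the number of steps are all
# illustrative assumptions.

def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply the update rule w <- w - eta * grad(w)."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)  # step against the gradient
    return w

w_min = gradient_descent(lambda w: 2.0 * w, w0=5.0)
# w_min is driven toward the minimizer w = 0
```

For this quadratic, each iteration shrinks w by the constant factor
(1 - 2*eta), so the iterates contract geometrically toward the minimizer.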


     Why do we need gradient descent rather than computing the
optimal value analytically by setting the gradient to the zero
vector?

A point with zero gradient might not exist. Even if one exists,
solving the resulting system of equations can require a huge
amount of computation.



     Why do we have to learn about gradient descent for machine
learning?

The goal of learning is to minimize the risk function, and
gradient descent is a method for minimizing a function. But we
do not know the underlying distribution, so we know neither the
risk function nor its gradient; we only know the gradient of the
empirical risk. How can this work?

The key point is that even though we do not know the gradient of
the risk function, we can still pick a random direction whose
expectation matches that gradient. This is stochastic gradient
descent (SGD).

(See Shalev-Shwartz & Ben-David 14.3)
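A hedged sketch of that idea for squared loss: the empirical-risk
gradient is the average of the per-example gradients, so the gradient at
a uniformly sampled example is an unbiased estimate of it. The data,
step size, and iteration count below are made up for illustration.

```python
import random

# Per-example squared loss l_i(w) = (w * xs[i] - ys[i])**2 for a 1-D
# linear model; its gradient in w is 2 * (w * xs[i] - ys[i]) * xs[i].
# The data points are fabricated for this sketch.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

def example_grad(w, i):
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]

def empirical_grad(w):
    # Gradient of the empirical risk = average of per-example gradients,
    # i.e. the expectation of example_grad(w, I) for a uniform index I.
    return sum(example_grad(w, i) for i in range(len(xs))) / len(xs)

# SGD: follow the gradient of one uniformly sampled example per step.
random.seed(0)
w = 0.0
for _ in range(2000):
    i = random.randrange(len(xs))
    w -= 0.01 * example_grad(w, i)
# w should hover near the least-squares solution sum(x*y)/sum(x*x)
```

Because the sampled index is uniform, the expectation of the stochastic
step is exactly the empirical-risk gradient step.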


     We take it for granted that in this algorithm we will take a
step in the direction of the negative of the gradient at the cur‐
rent point, but what is the principle behind such an algorithm?



     When  we apply an iterative method, we must know its conver‐
gence rate.

     What is the advantage  of  preconditioned  gradient  descent
over gradient descent?

Gradient  descent  is a first‐order algorithm for finding a local
minimum of a differentiable function.

Preconditioning is typically used to accelerate first-order
optimization algorithms: rescaling the problem (multiplying the
gradient by a preconditioner matrix) improves its conditioning
and hence the convergence rate.



     Def: Gradient

The gradient of f is the vector whose value at a point p gives
the "direction and rate of fastest increase" of f. (See wiki)

"at  a point p, the direction of the gradient is the direction in
which the function increases most quickly from p, and the  magni‐
tude  of  the gradient is the rate of increase in that direction,
the greatest absolute directional derivative." (See wiki)

To maximize a function: gradient ascent.

To minimize a function: gradient descent.

The relationship between the directional derivative and the
gradient: the directional derivative is the dot product of the
gradient and the unit direction vector. (See Thomas Calculus: 14)
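A quick numeric check of that relation. The function
f(x, y) = x^2 + y^2, the point, and the direction below are
illustrative choices, not from the text:

```python
# Check D_u f(P) = grad f(P) . u against the limit definition, using
# f(x, y) = x**2 + y**2 with gradient (2x, 2y); the point P and the
# unit direction u are illustrative assumptions.

def f(x, y):
    return x * x + y * y

P = (1.0, 2.0)
grad = (2.0 * P[0], 2.0 * P[1])           # gradient at P: (2, 4)
u = (0.6, 0.8)                            # a unit vector (3-4-5 triangle)

dd_dot = grad[0] * u[0] + grad[1] * u[1]  # dot-product formula

s = 1e-6                                  # small step for the limit
dd_fd = (f(P[0] + s * u[0], P[1] + s * u[1]) - f(*P)) / s
# dd_fd approximates dd_dot
```

The finite-difference value approaches the dot-product value as the
step s shrinks, which is exactly the content of the theorem.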



     Def: directional derivative

(See Thomas Calculus: 14)

It measures the rate at which a function changes in a particular
direction at a given point. (See wiki)

Notation for the directional derivative: D_u f(P), the derivative
of f at P in the direction of u.

To talk about a directional derivative, we need three things: a
function, a point, and a direction vector.

A derivative is defined as a limit. When we choose the direction
vector u to be a unit vector, the divisor in that limit is easy
to express: moving s units from the point changes the argument by
s*u, and we divide by s. (See Thomas Calculus: 14.5 Directional
Derivatives and Gradient Vectors)

Therefore, for the same function f and point P, choosing  a  dif‐
ferent direction vector means a different directional derivative.


Dot product: <u, v> = |u| * |v| * cos(theta), where theta is the
angle between u and v.

If we choose the direction vector to be a unit vector pointing in
the same direction as the gradient, then the angle between them
is zero and cos(theta) = 1, so the directional derivative, which
is the dot product of the gradient and the direction vector,
equals |gradient|.
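A hedged numeric check that the gradient direction gives the largest
directional derivative and that this maximum equals |gradient|. The
gradient vector (2, 4) below is an illustrative choice (it is the
gradient of x^2 + y^2 at the point (1, 2)):

```python
import math

# Scan many unit directions u = (cos t, sin t) and compare the largest
# directional derivative grad . u with |grad|.  The gradient value is
# an illustrative assumption.
grad = (2.0, 4.0)
grad_norm = math.hypot(grad[0], grad[1])   # |grad| = sqrt(20)

best = max(
    grad[0] * math.cos(t) + grad[1] * math.sin(t)  # D_u f for this u
    for t in (k * 2.0 * math.pi / 3600 for k in range(3600))
)
# best stays at or below |grad| and gets very close to it
```

No scanned direction beats the gradient direction, matching the claim
that cos(theta) = 1 is the best case.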

At point P, f increases most rapidly in the direction of the
gradient vector at P: the directional derivative measures the
rate of change in a given direction, and among all unit
directions that rate is largest along the gradient.

But why does f increase rather than decrease in that direction?
See the definition of the directional derivative in terms of a
limit: D_u f(x0) = lim as s -> 0 of (f(x0 + s*u) - f(x0)) / s. A
positive value means f grows as we move from x0 along u.

For the proof of the relationship between the directional deriva‐
tive  and  the  gradient vector, see Thomas Calculus 14.5: Direc‐
tional derivatives and gradient vectors.

Hence, when we do gradient descent, we take a step in the
direction of the negative gradient.

Intuitively, the directional derivative tells us the rate at
which the value of the function increases per unit of distance
moved in the direction of u.