Stochastic Gradient Descent
The underlying distribution D is unknown, so it is not possible to minimize the risk function directly.
The ERM rule minimizes the empirical risk over the hypothesis
class H.
Regularized risk minimization jointly minimizes the empirical
risk and a regularization function over h.
A different learning approach: SGD.
In SGD, we minimize L(w), where the hypothesis is parameterized by a vector w.
GD is an iterative optimization procedure: we improve the solution at each step. We want to minimize the risk function, but we don't know D, so we cannot compute the gradient of L(w).
SGD circumvents this problem: it takes a step along a random direction, as long as the expected value of that direction is the negative of the gradient.
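As an added illustration (not part of the notes), here is a minimal SGD sketch for a linear predictor with the square loss; the step size eta, the number of steps, and the function names are assumptions chosen for the example. The gradient computed on a single random example is an unbiased estimate of the gradient of the risk.

import numpy as np

def sgd_square_loss(samples, dim, eta=0.1, steps=1000):
    """Minimal SGD sketch: step along the negative of a single-example
    gradient, whose expectation is the gradient of the risk."""
    rng = np.random.default_rng(0)
    w = np.zeros(dim)                                  # start at w = 0
    for _ in range(steps):
        x, y = samples[rng.integers(len(samples))]     # draw one example at random
        grad = 2.0 * (np.dot(w, x) - y) * x            # gradient of (<w, x> - y)^2 at w
        w -= eta * grad                                # step in the (expected) negative gradient direction
    return w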
The goal is to find a predictor with a small risk on unseen data,

    L_D(h) = E_{(x,y)∼D}[ ℓ(y, h(x)) ],

where ℓ is a loss function. The loss is typically convex in its second argument, i.e., in the prediction (e.g., the square loss or the logistic loss).
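For concreteness, here are those two losses as code (an added sketch; the names are illustrative). Both are convex in the prediction, the second argument.

import numpy as np

def square_loss(y, y_hat):
    """ℓ(y, ŷ) = (ŷ - y)^2, convex in ŷ."""
    return (y_hat - y) ** 2

def logistic_loss(y, y_hat):
    """ℓ(y, ŷ) = log(1 + exp(-y·ŷ)) for labels y in {-1, +1}, convex in ŷ."""
    return np.log1p(np.exp(-y * y_hat))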
In the ERM approach, we choose the predictor by minimizing the empirical risk over a parameterized set of predictors, potentially with regularization.
In optimization terms, we try to minimize the following objective function,

    min_θ  (1/m) Σ_{i=1}^m ℓ(y_i, f_θ(x_i)) + Ω(θ),

where Ω is the regularizer and f_θ(x) is a parameterized hypothesis.
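A minimal sketch of this objective for a linear predictor with the square loss and an L2 regularizer (the concrete choices of loss, Ω, and the tradeoff parameter lam are assumptions for illustration):

import numpy as np

def regularized_empirical_risk(theta, X, y, lam=0.1):
    """Empirical risk (mean square loss) plus the regularizer Ω(θ) = lam * ||θ||^2."""
    preds = X @ theta                            # f_θ(x_i) = <θ, x_i> for each row x_i of X
    empirical_risk = np.mean((preds - y) ** 2)
    return empirical_risk + lam * np.dot(theta, theta)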
Why do we use GD rather than directly minimizing this objective function?
In general, the minimizer has no closed form. Even when it does (e.g., a linear predictor with the square loss), it can be expensive to compute for large problems. We therefore resort to iterative algorithms.
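For example, for a linear predictor with the square loss, the empirical-risk minimizer is the least-squares solution w = (XᵀX)⁻¹ Xᵀ y (assuming XᵀX is invertible); computing it directly costs on the order of m·d² + d³ operations for m examples in d dimensions, which is what motivates iterative methods. A rough sketch (added illustration):

import numpy as np

def least_squares_closed_form(X, y):
    """Direct minimizer of the empirical square loss for a linear predictor.
    Solving the normal equations costs roughly O(m d^2 + d^3), which is
    expensive when the dimension d is large."""
    return np.linalg.solve(X.T @ X, X.T @ y)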
Basic gradient descent algorithm
Convergence rate for convex‐Lipschitz functions.
The standard gradient descent approach for minimizing a differentiable convex function. The gradient of a differentiable function f at w, denoted ∇f(w), is the vector of its partial derivatives, ∇f(w) = (∂f(w)/∂w_1, …, ∂f(w)/∂w_d).
We start with an initial value of w (for example, w = 0).
At each iteration, take a step in the direction of the negative
of the gradient at the current point.
Intuitively, since the gradient points in the direction of the
greatest rate of increase of f around w, the algorithm makes a
small step in the opposite direction.
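In other words, the update is w_{t+1} = w_t − η ∇f(w_t) for a step size η > 0. A minimal sketch in code (the fixed step size, stopping rule, and function names are assumptions for illustration):

import numpy as np

def gradient_descent(grad_f, dim, eta=0.1, steps=100):
    """Basic GD: start at w = 0 and repeatedly step against the gradient."""
    w = np.zeros(dim)
    for _ in range(steps):
        w = w - eta * grad_f(w)    # small step opposite the direction of steepest increase
    return w

# Usage: minimize f(w) = ||w - c||^2, whose gradient is 2 (w - c); GD converges to c.
c = np.array([1.0, -2.0])
w_star = gradient_descent(lambda w: 2.0 * (w - c), dim=2)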
Subgradient
GD can be applied to nondifferentiable functions as well, by replacing the gradient with a subgradient: a vector v such that f(u) ≥ f(w) + ⟨u − w, v⟩ for all u.
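As an added illustration, a subgradient of the hinge loss w ↦ max(0, 1 − y⟨w, x⟩), which is convex but not differentiable at the hinge point (the function name is illustrative):

import numpy as np

def hinge_subgradient(w, x, y):
    """Returns a subgradient of w ↦ max(0, 1 - y<w, x>) (w, x numpy arrays).
    Where the loss is positive the function is differentiable with gradient -y·x;
    at and beyond the kink, the zero vector is a valid subgradient."""
    if 1.0 - y * np.dot(w, x) > 0.0:
        return -y * x
    return np.zeros_like(w)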