By 苏剑林 | January 08, 2019
Lately, I've been increasingly excited about combining optimization algorithms with dynamics. This is the third installment in the series on optimization algorithms and dynamics, and I have a premonition there will be a fourth—stay tuned! A brief recap of the story so far: In the first article, we pointed out that SGD is essentially a numerical method for solving ordinary differential equations (ODEs): the Euler method. In the second article, from the perspective of error analysis in numerical methods, we analyzed why the gradient can be used to adjust the learning rate, thereby explaining the principle behind using gradients to regulate learning rates in algorithms like RMSprop and Adam.
This article will provide a more unified viewpoint on these two matters and attempt to answer a more fundamental question: Why gradient descent?
(Note: The discussion in this article does not involve the momentum acceleration component.)
The previous two articles discussed the viewpoint that "gradient descent is equivalent to solving an ODE," but we haven't yet addressed why we use gradient descent in the first place. Where does it come from? In other words, so far we have only been explaining gradient descent after it was already handed to us; we haven't yet confronted the question of its origin.
The basic explanation is this: the direction opposite to the gradient is the direction in which the loss decreases fastest, so we use gradient descent. People often draw diagrams like contour maps to explain why the negative gradient is the direction of steepest descent. Consequently, many critics of adaptive learning rate optimizers like RMSprop cite a simple reason: these optimizers change the direction of parameter descent, meaning the optimization no longer follows the gradient direction, which allegedly leads to poorer results.
But is this explanation sufficiently reasonable?
Before the formal discussion, let's define the problem simply:

1. We have a loss function $L(\boldsymbol{\theta})$ of the parameters $\boldsymbol{\theta}$, and we want to find a $\boldsymbol{\theta}$ that makes $L(\boldsymbol{\theta})$ as small as possible;
2. We assume the minimum value of $L(\boldsymbol{\theta})$ is 0.
It's worth mentioning that point 2 is not strictly necessary, but it simplifies the discussion that follows. In other words, point 2 is really just an assumption: for an arbitrary function, we generally don't know its minimum value beforehand. In deep learning, however, it basically holds, because we usually design the loss to be non-negative, and thanks to the powerful fitting capability of neural networks, the loss can usually get close enough to 0.
Now, to the main point. Suppose that during the optimization process, the parameter $\boldsymbol{\theta}$ changes according to some trajectory $\boldsymbol{\theta}(t)$, making $L(\boldsymbol{\theta})$ a function of $t$, $L(\boldsymbol{\theta}(t))$. Note that $t$ here is not real time; it is a parameter used to describe change, equivalent to the number of iterations.
Let's consider the rate of change of $L(\boldsymbol{\theta}(t))$:
\begin{equation}\frac{d}{dt}L(\boldsymbol{\theta}(t))=\left\langle\nabla_{\boldsymbol{\theta}}L,\, \dot{\boldsymbol{\theta}}\right\rangle\label{eq:ld}\end{equation}Here $\dot{\boldsymbol{\theta}}$ is $d\boldsymbol{\theta}/dt$, and $\langle\cdot,\cdot\rangle$ denotes the standard inner product. We want $L$ to be as small as possible, so we naturally want the right side of the equation to be negative, with an absolute value as large as possible. If we fix the magnitude of $\dot{\boldsymbol{\theta}}$, then to minimize the right side, the angle between $\nabla_{\boldsymbol{\theta}}L$ and $\dot{\boldsymbol{\theta}}$ should be 180 degrees, which means:
\begin{equation}\dot{\boldsymbol{\theta}} = -\lambda \nabla_{\boldsymbol{\theta}}L\quad (\lambda > 0)\label{eq:gd}\end{equation}This demonstrates that the direction opposite to the gradient is indeed the direction in which the loss falls fastest. According to the first article, isn't this equation exactly gradient descent? Thus, we have directly derived gradient descent. Substituting $\eqref{eq:gd}$ into $\eqref{eq:ld}$, we get:
\begin{equation}\frac{d}{dt}L(\boldsymbol{\theta}(t))=-\lambda\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert^2\label{eq:ld-1}\end{equation}This shows that as long as the learning rate is small enough (the ODE simulation is accurate enough) and $\nabla_{\boldsymbol{\theta}}L\neq \boldsymbol{0}$, then $L$ will definitely decrease until $\nabla_{\boldsymbol{\theta}}L = 0$. The position where it stops is a local minimum or a saddle point; theoretically, it cannot be a local maximum.
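To make the link with the first article concrete, here is a minimal sketch of the Euler discretization of $\eqref{eq:gd}$, which is exactly ordinary gradient descent; the quadratic toy loss, the learning rate, and the iteration count below are arbitrary illustrative choices, not anything prescribed by the discussion above.

import numpy as np

def L(theta):
    # an illustrative non-negative loss with minimum value 0 at theta = (1, 1)
    return np.sum((theta - 1.0) ** 2)

def grad_L(theta):
    # its gradient
    return 2.0 * (theta - 1.0)

theta = np.array([4.0, -2.0])
lr = 0.1  # plays the role of lambda times the Euler step size

for _ in range(100):
    # Euler step for theta_dot = -lambda * grad L(theta), i.e. plain gradient descent
    theta = theta - lr * grad_L(theta)

print(L(theta))  # decreases steadily towards 0, in line with eq. (ld-1)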
Furthermore, we often use stochastic gradient descent. The mini-batch approach introduces a certain amount of noise, and noise can reduce the probability of getting stuck at a saddle point (as saddle points may not be robust to perturbations). Therefore, stochastic gradient descent usually performs better than full-batch gradient descent.
In fact, if one truly understands the above derivation process, they can "cook up" many different optimization algorithms themselves.
For instance, although we've proven that the negative gradient is the direction of steepest descent, why must we always go in the direction of steepest descent? Although the negative gradient is the "righteous path," one can also "take an unconventional path." Theoretically, as long as I can guarantee a decrease, it's fine. For example, I could choose:
\begin{equation}\dot{\boldsymbol{\theta}} = -\lambda\,\text{sign}\left(\nabla_{\boldsymbol{\theta}}L\right)\quad (\lambda > 0)\label{eq:rmsprop-1}\end{equation}Note that $\nabla_{\boldsymbol{\theta}}L$ is a vector, and $\text{sign}\left(\nabla_{\boldsymbol{\theta}}L\right)$ refers to taking the sign function of each component, resulting in a vector whose elements are -1, 0, or 1. Consequently, equation $\eqref{eq:ld}$ becomes:
\begin{equation}\frac{d}{dt}L(\boldsymbol{\theta}(t))=-\lambda\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert_1\label{eq:ld-2}\end{equation}where $\Vert\boldsymbol{x}\Vert_1=\sum\limits_{i=1}^n |x_i|$ represents the L1 norm of the vector. This selection also guarantees that the loss is decreasing, and theoretically, it converges where $\nabla_{\boldsymbol{\theta}}L = 0$.
Furthermore, we also have (assuming non-zero gradient components):
\begin{equation}\text{sign}\left(\nabla_{\boldsymbol{\theta}}L\right)=\frac{\nabla_{\boldsymbol{\theta}}L}{\sqrt{\nabla_{\boldsymbol{\theta}}L\otimes \nabla_{\boldsymbol{\theta}}L}}\label{eq:rmsprop-2}\end{equation}where $\otimes$ denotes the element-wise product, and the square root and division are likewise taken element-wise. Combining $\eqref{eq:rmsprop-1}$ with the second article, and incorporating a moving average, we find that this section is describing the RMSprop algorithm.
In other words, the fact that in adaptive learning rate optimizers "the learning rate becomes a vector, making the optimization direction no longer the gradient direction" is not a flaw and should not be a point for which adaptive learning rate optimizers are criticized.
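As a rough sketch of this reading (the decay rate beta, the epsilon, and the toy gradient below are illustrative placeholders rather than recommended settings), replacing $\nabla_{\boldsymbol{\theta}}L\otimes \nabla_{\boldsymbol{\theta}}L$ in $\eqref{eq:rmsprop-2}$ by a moving average of it yields the familiar RMSprop-style update:

import numpy as np

def grad_L(theta):
    # gradient of the illustrative loss L(theta) = sum(theta**2) / 2
    return theta

theta = np.array([3.0, -2.0])
lr, beta, epsilon = 0.01, 0.9, 1e-8
v = np.zeros_like(theta)  # moving average of the element-wise squared gradient

for _ in range(1000):
    g = grad_L(theta)
    v = beta * v + (1 - beta) * g * g                # moving average of the element-wise g * g
    theta = theta - lr * g / (np.sqrt(v) + epsilon)  # roughly -lr * sign(g) when v tracks g * g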
However, the reality is that with careful tuning, the final performance of SGD is usually better than that of adaptive learning rate optimizers, which suggests they do have some issues. That is, if you take an unconventional path, even if you start faster than others, you might not do as well in the later stages.
What is the problem? Actually, for Adagrad, the problem is obviously that it "stops too early" because it sums historical gradients (rather than averaging them), causing the learning rate to approach 0 too quickly. For the RMSprop mentioned above, the problem is—"it simply cannot stop."
Combining $\eqref{eq:rmsprop-1}$ and $\eqref{eq:rmsprop-2}$, we get:
\begin{equation}\dot{\boldsymbol{\theta}} = -\lambda\frac{\nabla_{\boldsymbol{\theta}}L}{\sqrt{\nabla_{\boldsymbol{\theta}}L\otimes \nabla_{\boldsymbol{\theta}}L}}\label{eq:rmsprop-3}\end{equation}When does this algorithm stop? In fact, it won't stop, because as long as a gradient component is non-zero, the corresponding component of $\frac{\nabla_{\boldsymbol{\theta}}L}{\sqrt{\nabla_{\boldsymbol{\theta}}L\otimes \nabla_{\boldsymbol{\theta}}L}}$ is also non-zero (either 1 or -1). Theoretically, this algorithm has no fixed point, so it will never stop. To mitigate this, practical RMSprop implementations use two tricks: taking a moving average inside the square root in the denominator, and adding an epsilon (to prevent division by zero).
But these are only mitigations. In ODE terms, "this ODE is not asymptotically stable," so the iterate will inevitably keep moving back and forth around the local minimum rather than settling into it. This is the real problem with adaptive learning rate algorithms.
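A tiny one-dimensional sketch makes this "cannot stop" behaviour visible (the quadratic loss, the step size, and the iteration count are arbitrary illustrative choices): discretizing $\eqref{eq:rmsprop-3}$ without the moving average and the epsilon, every step has the same fixed length, so the iterate ends up bouncing around the minimum instead of settling onto it.

import numpy as np

def grad_L(theta):
    # gradient of L(theta) = theta**2 / 2, whose minimum sits at theta = 0
    return theta

theta, lr = 1.0, 0.03

for i in range(300):
    # discretized eq. (rmsprop-3) in one dimension: a fixed-size step against the sign of the gradient
    theta = theta - lr * np.sign(grad_L(theta))

print(theta)  # left oscillating within about one step size of the minimum, never converging to it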
As mentioned before, if you really understand this process, you can tinker with some "original" optimization algorithms yourself and analyze their convergence. Below, I'll introduce my own tinkering process, which mistakenly led me to believe I had found an optimizer that could absolutely find the global optimum~
(Before reading the following content, please ensure you have understood the previous material; otherwise, it may be misleading~)
The starting point of this tinkering is that whether it is $\eqref{eq:gd}$ (with the corresponding convergence rate $\eqref{eq:ld-1}$) or $\eqref{eq:rmsprop-3}$ (with the corresponding convergence rate $\eqref{eq:ld-2}$), even if they converge, they only guarantee $\nabla_{\boldsymbol{\theta}}L = 0$; they cannot guarantee a global minimum (i.e., they don't necessarily achieve $L(\boldsymbol{\theta})=0$). So, a simple idea is: since we already know the minimum value is zero, why not include this information?
Analogous to our previous thinking, we can consider:
\begin{equation}\dot{\boldsymbol{\theta}} = -\frac{\nabla_{\boldsymbol{\theta}}L}{\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert^2}L\label{eq:me-gd}\end{equation}In this case, equation $\eqref{eq:ld}$ becomes very simple:
\begin{equation}\frac{d}{dt}L=-L\end{equation}This is just a simple linear differential equation, whose solution is $L(t)=L(0)e^{-t}$. As $t\to+\infty$, $L(t)\to 0$, meaning the loss is guaranteed to converge to zero, i.e., to the global minimum. Does this mean we have found an optimizer that is guaranteed to reach the global optimum?
Of course not~ Let's look at equation $\eqref{eq:me-gd}$. If we reach a local minimum where $\nabla_{\boldsymbol{\theta}}L=\boldsymbol{0}$ but $L > 0$, the right side of $\eqref{eq:me-gd}$ blows up (we would be dividing by zero). Theoretically, this is not an issue, but numerically it is impossible to implement. At first I thought this could easily be fixed by adding an epsilon to the denominator, but further analysis revealed that the problem is fatal.
To see why, let's rewrite $\eqref{eq:me-gd}$ as:
\begin{equation}\dot{\boldsymbol{\theta}} = -\frac{\nabla_{\boldsymbol{\theta}}L}{\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert}L\times\frac{1}{\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert}\label{eq:me-gd-2}\end{equation}The problem is that $1/\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert$ will become infinitely large (a singularity appears). Can we use a truncation? For example, consider:
\begin{equation}\dot{\boldsymbol{\theta}} = -\frac{\nabla_{\boldsymbol{\theta}}L}{\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert}L\times\min\left(\frac{1}{\left\Vert\nabla_{\boldsymbol{\theta}}L\right\Vert},M\right)\label{eq:me-gd-3}\end{equation}where $M \gg 0$ is a constant, bypassing the singularity. Doing this does indeed allow it to bypass some local minima, like in the following example:

This function, $f(x)=x(x-1)(x-3)^2+1.62276$, has a global minimum near $x\approx 0.41$, where the function value reaches approximately 0, but it also has a sub-optimal local minimum at $x=3$. If we start from the initial value $x_0=4$, standard gradient descent would essentially always converge to $x=3$. However, using $\eqref{eq:me-gd-3}$ and starting from $x_0=4$, after some oscillation it eventually converges near $x=0.41$:

We can see that it initially lingers near $x=3$, and after oscillating for a while, it jumps out and arrives near $x=0.41$. The plotting code:
import numpy as np
import matplotlib.pyplot as plt

def f(x):
    # the example loss: global minimum ~0 near x = 0.41, sub-optimal local minimum at x = 3
    return x * (x - 1) * (x - 3) * (x - 3) + 1.62276

def g(x):
    # derivative f'(x)
    return -9 + 30 * x - 21 * x**2 + 4 * x**3

ts = [0]   # "time" axis
xs = [4]   # trajectory, starting from x0 = 4
h = 0.01   # step size
H = 2500   # number of iterations

for i in range(H):
    x = xs[-1]
    # Euler step of eq. (me-gd-3): -sign(f') * min(1/|f'|, M) * f, with M = 1000
    delta = -np.sign(g(x)) * min(abs(g(x)) / g(x)**2, 1000) * f(x)
    x += delta * h
    xs.append(x)
    ts.append(ts[-1] + h)

print(f(xs[-1]))

plt.figure()
plt.plot(ts, xs, 'g')
plt.xlabel('$t$')
plt.ylabel('$\\theta(t)$')
plt.show()
However, while it looks beautiful, it has little practical value. To guarantee jumping out of all local minima, $M$ must be sufficiently large (to stay close to the original $\eqref{eq:me-gd-2}$), and the number of iterations must be high. If these conditions could be met, it would actually be better to add Gaussian noise to gradient descent. As we showed in the first article, if we assume the gradient noise is Gaussian, from a probabilistic standpoint, it will always eventually reach the global minimum (also requiring many iterations). So, this pretty-looking thing doesn't have much practical utility.
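For comparison, here is a minimal sketch of the "gradient descent plus Gaussian noise" alternative mentioned above, applied to the same example function; the noise scale sigma, the iteration count, and the random seed are arbitrary illustrative choices, and a careful treatment would decay the noise over time rather than keep it fixed.

import numpy as np

def f(x):
    # same example function as above: global minimum ~0 near x = 0.41, local minimum at x = 3
    return x * (x - 1) * (x - 3) * (x - 3) + 1.62276

def g(x):
    # its derivative
    return -9 + 30 * x - 21 * x**2 + 4 * x**3

np.random.seed(0)
x, h, sigma = 4.0, 0.01, 1.0

for i in range(100000):
    # Euler gradient step plus Gaussian noise (in the spirit of Langevin dynamics);
    # the noise can occasionally kick the iterate over the barrier that separates
    # the local minimum at x = 3 from the global minimum near x = 0.41
    x += -h * g(x) + sigma * np.sqrt(h) * np.random.randn()

print(x, f(x))  # where a single run ends up still depends on the noise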
(Note: I later learned that Polyak had studied step sizes of the form $\eqref{eq:me-gd}$ much earlier. Searching for "Polyak step size" will yield many related works, such as "Revisiting the Polyak step size" and "Generalized Polyak Step Size for First Order Optimization with Momentum".)
Well, after quite a bit of rambling and tinkering, I've managed to put together another article. Personally, I feel that analyzing optimization algorithms from a dynamical-systems perspective is a very interesting endeavor: it lets you appreciate the charm of optimization algorithms from a relatively relaxed angle, and it can even connect many different fields of knowledge.
The usual approach to understanding optimization algorithms starts with convex optimization and then non-rigorously carries the results over to non-convex settings. We study convex optimization because "convexity" is a powerful condition for many theoretical proofs. However, deep learning is almost entirely non-convex, so the advantage of complete proofs from convex optimization no longer applies, and I think it's better to look at things from a more relaxed angle. That more relaxed angle is dynamical systems, or systems of ordinary differential equations.
In fact, this perspective has great potential, including the convergence analysis of GANs and the popular "Neural ODEs," both of which eventually fall into this framework. But those are topics for another time~