By 苏剑林 | December 20, 2018
In "Optimizing Algorithms from a Dynamical Perspective (I): From SGD to Momentum Acceleration," we proposed that the SGD optimization algorithm actually corresponds to the numerical solution of Ordinary Differential Equations (ODEs). From this perspective, we can naturally analyze the convergence properties of the SGD algorithm, the principles of momentum acceleration, and other content.
In this article, we continue along this line of thought to understand adaptive learning rate algorithms among optimization methods.
RMSprop
First, let's look at a classic adaptive learning rate optimization algorithm: RMSprop. Although RMSprop was not the first adaptive learning rate algorithm to be proposed, it is a highly practical one, and it serves as the foundation for more comprehensive algorithms like Adam. Through it, we can observe how adaptive learning rate optimization algorithms work.
Algorithm Overview
General gradient descent is as follows:
\begin{equation}\boldsymbol{\theta}_{n+1}=\boldsymbol{\theta}_{n} - \gamma \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_{n})\end{equation}
Obviously, $\gamma$ here is a hyperparameter, the learning rate, which may need different adjustments at different stages.
RMSprop, on the other hand, is:
\begin{equation}\begin{aligned}\boldsymbol{g}_{n+1} =& \nabla_{\boldsymbol{\theta}} L(\boldsymbol{\theta}_{n})\\
\boldsymbol{G}_{n+1}=&\lambda \boldsymbol{G}_{n} + (1 - \lambda) \boldsymbol{g}_{n+1}\otimes \boldsymbol{g}_{n+1}\\
\boldsymbol{\theta}_{n+1}=&\boldsymbol{\theta}_{n} - \frac{\tilde{\gamma}}{\sqrt{\boldsymbol{G}_{n+1} + \epsilon}}\otimes \boldsymbol{g}_{n+1}
\end{aligned}\end{equation}
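To make the notation concrete, here is a minimal NumPy sketch of a single RMSprop update following the equations above, with $\otimes$ understood as the element-wise product; the function name and the default hyperparameter values are illustrative assumptions rather than prescriptions.

```python
import numpy as np

def rmsprop_step(theta, grad, G, lr=1e-3, lam=0.9, eps=1e-8):
    """One RMSprop update; G is the sliding average of squared gradients."""
    G = lam * G + (1 - lam) * grad ** 2           # element-wise squared gradient
    theta = theta - lr / np.sqrt(G + eps) * grad  # per-component learning rate
    return theta, G
```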
Algorithm Analysis
Comparing it with naive SGD, we can see that in the update of $\boldsymbol{\theta}$, RMSprop replaces the original scalar learning rate $\gamma$ with a vector:
\begin{equation}\boldsymbol{\gamma}=\frac{\tilde{\gamma}}{\sqrt{\boldsymbol{G}_{n+1} + \epsilon}}\end{equation}
If we also view this vector as a learning rate, then RMSprop has found a scheme to allocate different learning rates to each component of the parameters.
This adjustment of the learning rate is achieved through the factor $\frac{1}{\sqrt{\boldsymbol{G}_{n+1} + \epsilon}}$, where $\boldsymbol{G}_{n+1}$ is a sliding average of the squared gradients. Essentially, the "sliding average" only makes the training process smoother; it is not the fundamental cause of the adjustment. What actually does the work is the "gradient" itself. That is to say, the magnitude of the gradient can be used to adjust the learning rate.
Adaptive Learning Rate
Why can the magnitude of the gradient be used to adjust the learning rate? Actually, this idea is very simple.
Minima and ODEs
Without further ado, for simplicity, let's start with a one-dimensional example: suppose we want to find a minimum of $L(\theta)$. We introduce a virtual time parameter $t$ and transform it into an ODE:
\begin{equation}\frac{d\theta}{dt}=\dot{\theta} = - L'(\theta)\end{equation}
It is not difficult to judge that a minimum of $L(\theta)$ is a stable fixed point of this equation. Starting from any $\theta_0$, by numerically solving this ODE, we can expect it to eventually converge to this fixed point, thereby obtaining a minimum.
The simplest Euler method is to use $\frac{\theta_{t+\gamma}-\theta_t}{\gamma}$ to approximate $\dot{\theta}$, resulting in:
\begin{equation}\frac{\theta_{t+\gamma}-\theta_t}{\gamma} = - L'(\theta_t)\end{equation}
which is:
\begin{equation}\theta_{t+\gamma} = \theta_t - \gamma L'(\theta_t)\end{equation}
This is gradient descent. $\theta_{t+\gamma}$ corresponds to $\theta_{n+1}$, and $\theta_t$ corresponds to $\theta_n$, meaning we move by $\gamma$ in each step.
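As a toy illustration (the quadratic loss below is an assumption chosen for demonstration, not from the original text), iterating the Euler step really does drive $\theta$ to the minimum:

```python
# Euler method for d(theta)/dt = -L'(theta) with L(theta) = (theta - 2)**2,
# i.e. L'(theta) = 2 * (theta - 2); the iteration is exactly gradient descent.
theta, gamma = 10.0, 0.1
for _ in range(100):
    theta = theta - gamma * 2 * (theta - 2)
print(theta)  # converges towards the fixed point theta = 2
```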
Variable Learning Rate Idea
The question is: what value of $\gamma$ is best? Of course, from the perspective of "approximating $\dot{\theta}$ with $\frac{\theta_{t+\gamma}-\theta_t}{\gamma}$," a smaller $\gamma$ is more accurate. However, the smaller $\gamma$ is, the more iterations are required and the greater the computational cost. So "the smaller the better" is ideal but unrealistic.
Therefore, the most appropriate scheme is: each step should be "just enough." But how do we know if it's enough?
Because we use $\frac{\theta_{t+\gamma}-\theta_t}{\gamma}$ to approximate $\dot{\theta}$, we must analyze the degree of approximation. According to Taylor series, we have:
\begin{equation}\theta_{t+\gamma} = \theta_t + \gamma \dot{\theta}_t + \mathcal{O}(\gamma^2)\end{equation}
In our case, $\dot{\theta} = - L'(\theta)$, so we have:
\begin{equation}\theta_{t+\gamma} = \theta_t - \gamma L'(\theta_t) + \mathcal{O}(\gamma^2)\end{equation}
It can be expected that when $\gamma$ is relatively small, the error term satisfies $\mathcal{O}(\gamma^2) < \gamma \left\|L'(\theta_t)\right\|$. That is to say, under certain conditions, $\gamma \left\|L'(\theta_t)\right\|$ itself bounds the error term: if we keep $\gamma \left\|L'(\theta_t)\right\|$ within a certain range, the error is also kept under control. Namely:
\begin{equation}\gamma \left\|L'(\theta_t)\right\|\leq \tilde{\gamma}\end{equation}
where $\tilde{\gamma}$ is a constant. We can even simply set $\gamma \left\|L'(\theta_t)\right\|=\tilde{\gamma}$ (temporarily ignoring the possibility of $L'(\theta_t)=0$ to observe the core idea), which means:
\begin{equation}\gamma = \frac{\tilde{\gamma}}{\left\|L'(\theta_t)\right\|}\end{equation}
In this way, we have adjusted the learning rate through the gradient.
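A minimal sketch of this rule (the small `eps` guard is my own addition to handle the $L'(\theta_t)=0$ case the text sets aside): dividing by the gradient magnitude makes every displacement have length $\tilde{\gamma}$.

```python
def normalized_gd_step(theta, grad, lr_tilde=0.01, eps=1e-12):
    """Step with gamma = lr_tilde / |grad|: the displacement always has length lr_tilde."""
    gamma = lr_tilde / (abs(grad) + eps)  # eps guards against a zero gradient
    return theta - gamma * grad           # equivalent to theta - lr_tilde * sign(grad)
```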
Sliding Average Processing
Readers might complain that substituting $\gamma = \tilde{\gamma} / \left\|L'(\theta_t)\right\|$ into the original iteration results in:
\begin{equation}\theta_{t+\tilde{\gamma} / \left\|L'(\theta_t)\right\|} = \theta_t - \tilde{\gamma}\cdot\text{sign}\big[L'(\theta_t)\big]\end{equation}
You've only used the sign information of the entire gradient; isn't that too wasteful? (Too trivial: it means regardless of the gradient magnitude, $\theta$ just moves a fixed length in each iteration.)
Note that from the perspective of solving ODEs, this is actually fine. Because the solution of an ODE is a trajectory $(t,\theta(t))$, the processing above, although making $\theta$ trivial, makes $t$ non-trivial. It is equivalent to $t$ and $\theta$ exchanging roles, so it is still reasonable. However, if the concern is the optimization problem—seeking the minimum of $L(\theta)$—then the above formula is indeed a bit trivial. If $\theta$ moves only a fixed length in each iteration, it's somewhat like a grid search, which is too inefficient.
So, to improve this trivial situation while retaining the feature of using the gradient to adjust the learning rate, we can apply a "sliding average" to the gradient. The result is:
\begin{equation}\begin{aligned}G_{t+\tilde{\gamma}}=&\lambda G_{t} + (1 - \lambda) \|L'(\theta_t)\|^2\\
\gamma =& \frac{\tilde{\gamma}}{\sqrt{G_{t+\tilde{\gamma}} + \epsilon}}\\
\theta_{t+\gamma} =& \theta_t - \gamma L'(\theta_t)
\end{aligned}\end{equation}
This $\lambda$ is a constant close to 1 but less than 1. In this way, $G_t$ stays relatively stable within a certain range, while to some extent retaining the characteristics of the gradient $L'(\theta_t)$ itself. Therefore, using it to adjust the learning rate is considered a "clever" move. To avoid confusion caused by notation like $t+\tilde{\gamma}, t+\gamma$, we unify them with subscripts $n, n+1$:
\begin{equation}\begin{aligned}G_{n+1}=&\lambda G_{n} + (1 - \lambda) \|L'(\theta_n)\|^2\\
\gamma =& \frac{\tilde{\gamma}}{\sqrt{G_{n+1} + \epsilon}}\\
\theta_{n+1} =& \theta_n - \gamma L'(\theta_n)
\end{aligned}\end{equation}\label{eq:rmsprop-1}
This is the RMSprop algorithm mentioned at the beginning.
In addition, there is another important reason for the sliding average: what we actually want to estimate is the mean of the squared gradient across the whole sample. But we can only calculate one batch at a time (so-called mini-batch gradient descent). Consequently, the result calculated each time is actually biased, and the sliding average can correct this bias to some extent.
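As a small numerical illustration of this smoothing effect (the noise model below is an assumption made up for the demonstration), an exponential sliding average of noisy per-batch squared gradients fluctuates far less than the raw per-batch estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
true_sq_grad = 4.0                                         # "whole-sample" mean of the squared gradient
batch_sq = true_sq_grad + rng.normal(0.0, 2.0, size=1000)  # noisy per-batch estimates

G, lam, history = 0.0, 0.99, []
for g2 in batch_sq:
    G = lam * G + (1 - lam) * g2   # sliding average of the squared gradient
    history.append(G)

print(np.std(batch_sq), np.std(history[200:]))  # the sliding average is much less noisy
```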
Higher-Dimensional Case Analysis
The discussions above are for the one-dimensional case. If it's multi-dimensional, how should it be generalized?
Perhaps readers think it's simple: just replace scalars with vectors, right? It's not that simple, because generalizing $\eqref{eq:rmsprop-1}$ to higher dimensions allows at least two reasonable choices:
\begin{equation}\begin{aligned}G_{n+1}=&\lambda G_{n} + (1 - \lambda) \Vert \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_n)\Vert^2\\
\gamma =& \frac{\tilde{\gamma}}{\sqrt{G_{n+1} + \epsilon}}\\
\boldsymbol{\theta}_{n+1} =& \boldsymbol{\theta}_n - \gamma \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_n)
\end{aligned}\end{equation}
or
\begin{equation}\begin{aligned}\boldsymbol{G}_{n+1}=&\lambda \boldsymbol{G}_{n} + (1 - \lambda)\big(\nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_n)\otimes \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_n)\big)\\
\boldsymbol{\gamma} =& \frac{\tilde{\gamma}}{\sqrt{\boldsymbol{G}_{n+1} + \epsilon}}\\
\boldsymbol{\theta}_{n+1} =& \boldsymbol{\theta}_n - \boldsymbol{\gamma}\otimes \nabla_{\boldsymbol{\theta}}L(\boldsymbol{\theta}_n)
\end{aligned}\end{equation}\label{eq:rmsprop-n}
The former accumulates the squared norm of the whole gradient, so the adjusted learning rate remains a scalar. The latter accumulates each component of the gradient separately; in this case, the adjusted learning rate becomes a vector, which is equivalent to assigning a different learning rate to each parameter. From the perspective of strict theoretical analysis, the first approach is more rigorous, but experimentally the second one is more effective.
The RMSprop algorithm we usually speak of is the latter, $\eqref{eq:rmsprop-n}$. However, many enthusiasts of pure SGD "alchemy" criticize this vectorized learning rate for actually changing the direction of the gradient, resulting in an inaccurate update direction and ultimately suboptimal results. Therefore, readers who do not like vectorized learning rates might want to experiment with the former.
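For readers who want to try the former, here is a minimal sketch of that scalar-learning-rate variant (names and defaults are again illustrative); it differs from the usual per-component RMSprop only in how $G$ is accumulated, and the update direction stays parallel to the gradient.

```python
import numpy as np

def rmsprop_scalar_step(theta, grad, G, lr=1e-3, lam=0.9, eps=1e-8):
    """Variant 1 above: accumulate the squared norm of the whole gradient (scalar G),
    so every component shares one adjusted learning rate."""
    G = lam * G + (1 - lam) * float(np.sum(grad ** 2))
    return theta - lr / np.sqrt(G + eps) * grad, G
```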
Summary of Conclusions
This article has once again analyzed optimization algorithms from the perspective of ODEs. This time, we provided an understanding of an adaptive learning rate algorithm (RMSprop) from the perspective of error control. As for Adam, which we use more commonly, it is a combination of RMSprop and momentum acceleration, so I won't repeat it here.
Viewing the optimization problem as a problem of solving an Ordinary Differential Equation actually transforms optimization into a dynamical problem. This allows us to understand optimization algorithms from a more physical perspective (even if it's just an intuitive rather than rigorous understanding), and we can even import some theoretical results from ODEs. I will try to provide more such examples later.