Why Gradient Clipping Accelerates Training: A Concise Analysis

By 苏剑林 | June 05, 2020

This article introduces a perfect-score paper from MIT presented at ICLR 2020, titled "Why gradient clipping accelerates training: A theoretical justification for adaptivity". As the name suggests, the paper analyzes why gradient clipping can accelerate training in deep learning. The original paper is long, dense with formulas, and draws on many concepts from complexity analysis. To be honest, much of it confused me too, but I did manage to capture its core idea: it introduces a constraint more relaxed than the commonly used L-constraint and, under this new condition, demonstrates the necessity of gradient clipping. This article provides a brief analysis of that argument for the readers' reference.

Gradient Clipping

Assume the function to be minimized is $f(\theta)$, where $\theta$ represents the optimization parameters. The update formula for gradient descent is:

\begin{equation}\theta \leftarrow \theta-\eta \nabla_{\theta} f(\theta)\end{equation}

where $\eta$ is the learning rate. So-called gradient clipping involves scaling the update amount based on the norm of the gradient, for example:

\begin{equation}\theta \leftarrow \theta- \eta \nabla_{\theta} f(\theta)\times \min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}\label{eq:clip-1}\end{equation}

or

\begin{equation}\theta \leftarrow \theta- \eta \nabla_{\theta} f(\theta)\times \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert+\gamma}\label{eq:clip-2}\end{equation}

where $\gamma > 0$ is a constant. Both of these forms count as gradient clipping; roughly speaking, each caps the norm of the update so that it cannot exceed the constant $\eta\gamma$. The second form is also closely related to adaptive learning-rate optimizers such as RMSProp. More precisely, we have the following inequality:

\begin{equation}\frac{1}{2}\min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}\leq \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert+\gamma}\leq \min\left\{1, \frac{\gamma}{\Vert \nabla_{\theta} f(\theta)\Vert}\right\}\end{equation}

In other words, the two scaling factors bound each other up to a factor of two, so the two forms are essentially equivalent.
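To make the two forms concrete, here is a minimal NumPy sketch (my own illustration, not from the paper; the function names and test values are arbitrary) that implements both update rules and numerically spot-checks the sandwich inequality above:

```python
import numpy as np

def clip_hard(grad, gamma):
    """Hard clipping (first form): rescale the gradient so that
    its norm never exceeds gamma."""
    norm = np.linalg.norm(grad)
    return grad * min(1.0, gamma / norm) if norm > 0 else grad

def clip_soft(grad, gamma):
    """Soft clipping (second form): the scaling factor
    gamma / (norm + gamma) is always strictly below 1."""
    norm = np.linalg.norm(grad)
    return grad * gamma / (norm + gamma)

# Spot-check the sandwich inequality between the two scaling factors.
gamma = 1.0
for norm in [0.1, 0.5, 1.0, 5.0, 100.0]:
    hard = min(1.0, gamma / norm)
    soft = gamma / (norm + gamma)
    assert 0.5 * hard <= soft <= hard
```

In either case, the parameter update is then $\theta \leftarrow \theta - \eta$ times the clipped gradient.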

L-Constraint

Many theoretical results related to optimizers assume in their proofs that the gradient of the function to be optimized $f(\theta)$ satisfies the following L-constraint:

\begin{equation}\Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert\leq L\Vert \Delta\theta\Vert\label{eq:l-cond}\end{equation}

Since $\frac{\Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert}{\Vert \Delta\theta\Vert}$ measures how quickly the gradient can change, it essentially quantifies the smoothness of $f(\theta)$. For this reason, the constraint above is also called the "L-smoothness condition" (L-smooth).

Regarding the L-constraint, it has appeared many times in this blog. You can refer to articles such as "Lipschitz Continuity in Deep Learning: Generalization and Generative Models" and "What Exactly Does BN Do? A 'Behind-Closed-Doors' Analysis". It is worth noting that different scenarios may require different L-constraints. For example, sometimes we assume the model output with respect to the input satisfies the L-constraint; sometimes we assume the model output with respect to parameters satisfies the L-constraint. What is assumed above is that the gradient of the model loss with respect to parameters satisfies the L-constraint.

If condition $\eqref{eq:l-cond}$ holds, many optimization problems are greatly simplified, because we can then prove the following standard result (often called the descent lemma):

\begin{equation}f(\theta+\Delta\theta) \leq f(\theta) + \left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle + \frac{1}{2}L \Vert \Delta\theta\Vert^2\label{eq:neq-1}\end{equation}

For gradient descent, $\Delta\theta = -\eta \nabla_{\theta} f(\theta)$. Substituting this into the above equation gives:

\begin{equation}f(\theta+\Delta\theta) \leq f(\theta) + \left(\frac{1}{2}L\eta^2 - \eta\right) \Vert \nabla_{\theta}f(\theta)\Vert^2\end{equation}

Therefore, to ensure that every optimization step decreases $f(\theta)$, a sufficient condition is $\frac{1}{2}L\eta^2 - \eta < 0$, i.e., $\eta < \frac{2}{L}$. Moreover, $\frac{1}{2}L\eta^2 - \eta$ attains its minimum at $\eta^* = \frac{1}{L} < \frac{2}{L}$. So, setting the learning rate to $\frac{1}{L}$ guarantees that every iteration decreases $f(\theta)$, and at the fastest rate this bound allows.
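As a quick sanity check (my own sketch, not from the paper), take the quadratic $f(\theta) = \frac{1}{2}\theta^{\top}A\theta$: its gradient $A\theta$ satisfies the L-constraint with $L$ equal to the largest eigenvalue of $A$, and gradient descent with $\eta = 1/L$ decreases the loss at every step, exactly as the bound predicts:

```python
import numpy as np

# An L-smooth objective: f(theta) = 0.5 * theta^T A theta, whose
# gradient A @ theta satisfies the L-constraint with L = lambda_max(A).
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5))
A = M @ M.T + np.eye(5)              # symmetric positive definite
L = np.linalg.eigvalsh(A).max()      # largest eigenvalue

f = lambda t: 0.5 * t @ A @ t
grad = lambda t: A @ t

theta = rng.normal(size=5)
eta = 1.0 / L                        # the optimal step size from the analysis
losses = [f(theta)]
for _ in range(50):
    theta = theta - eta * grad(theta)
    losses.append(f(theta))

# Every iteration decreases f(theta), as guaranteed.
assert all(b < a for a, b in zip(losses, losses[1:]))
```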

Relaxing the Constraint

Condition $\eqref{eq:l-cond}$ leads to many beautiful results. The problem is that in many practical optimization problems it simply does not hold; for example, the quartic function $f(\theta)=\theta^4$ violates it, since its second derivative $12\theta^2$ is unbounded. This creates a gap between theory and practice. The paper discussed here therefore introduces a new, more relaxed constraint:

\begin{equation}\Vert \nabla_{\theta} f(\theta + \Delta \theta) - \nabla_{\theta} f(\theta)\Vert\leq \left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right)\Vert \Delta\theta\Vert\end{equation}

That is, the constant $L$ is replaced by the dynamic term $L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert$. The original paper calls this "$(L_0, L_1)$-smoothness," which we will also refer to as the "$(L_0, L_1)$ constraint." Clearly, this condition is much more relaxed. For instance, $\theta^4$ satisfies it in its local (Hessian) form: $|f''(\theta)| = 12\theta^2 \leq 12 + 12|\theta|^3 = 12 + 3|f'(\theta)|$, since $\theta^2 \leq 1 + |\theta|^3$, so $L_0 = 12$ and $L_1 = 3$ suffice. Theoretical results derived under this condition therefore apply much more broadly.
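A quick numerical spot-check of this bound (my own sketch; the constants $L_0 = 12$, $L_1 = 3$ come from the inequality just given):

```python
import numpy as np

# For f(theta) = theta^4: f'(theta) = 4*theta^3, f''(theta) = 12*theta^2.
# The (L0, L1) condition, in its local Hessian form, asks for
# |f''(theta)| <= L0 + L1 * |f'(theta)| at every theta.
L0, L1 = 12.0, 3.0
theta = np.linspace(-100, 100, 200001)
lhs = 12 * theta**2                       # |f''(theta)|
rhs = L0 + L1 * np.abs(4 * theta**3)      # L0 + L1 * |f'(theta)|
assert np.all(lhs <= rhs)                 # holds on the whole grid
```

No fixed constant $L$ can play the same role here, since $12\theta^2$ is unbounded.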

How did the authors come up with this condition? The paper says it was observed experimentally: the local smoothness of the loss function turns out to be roughly "linearly correlated" with the gradient norm. But I suspect there was some reverse engineering from the desired result involved; otherwise, who would be bored enough to study the relationship between those two quantities?

Under the new constraint, inequality $\eqref{eq:neq-1}$ still holds, but $L$ is replaced by the corresponding dynamic term:

\begin{equation}f(\theta+\Delta\theta) \leq f(\theta) + \left\langle \nabla_{\theta}f(\theta), \Delta\theta\right\rangle + \frac{1}{2}\left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right) \Vert \Delta\theta\Vert^2\end{equation}

Substituting $\Delta\theta = -\eta \nabla_{\theta} f(\theta)$, we get:

\begin{equation}f(\theta+\Delta\theta) \leq f(\theta) + \left(\frac{1}{2}\left(L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert\right)\eta^2 - \eta\right) \Vert \nabla_{\theta}f(\theta)\Vert^2\end{equation}

So it becomes very clear: to ensure a decrease at every step, we now require:

\begin{equation}\eta < \frac{2}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}\end{equation}

And the optimal learning rate is:

\begin{equation}\eta^* = \frac{1}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}\end{equation}

This is exactly gradient clipping in the form of $\eqref{eq:clip-2}$: taking $\eta = 1/L_0$ and $\gamma = L_0/L_1$ there yields the update $\frac{\nabla_{\theta} f(\theta)}{L_0 + L_1\Vert \nabla_{\theta} f(\theta)\Vert}$, which is precisely the optimal step just derived. Since every step is then guaranteed to decrease $f(\theta)$, no step is wasted during optimization, and this is the sense in which gradient clipping accelerates training.
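To see this concretely, here is a small sketch (my own illustration; the step size, $\gamma$, and iteration counts are arbitrary) comparing plain gradient descent with the clipped update $\eqref{eq:clip-2}$ on $f(\theta) = \theta^4$, which is $(L_0, L_1)$-smooth but not L-smooth:

```python
def grad(t):
    return 4.0 * t**3                # gradient of f(t) = t^4

def run(theta0, eta, gamma=None, steps=100):
    """Plain gradient descent if gamma is None, otherwise the
    soft-clipped update (second clipping form above)."""
    theta = theta0
    for _ in range(steps):
        g = grad(theta)
        scale = 1.0 if gamma is None else gamma / (abs(g) + gamma)
        theta = theta - eta * g * scale
    return theta

# No fixed step size works for every starting point on t^4,
# whereas the clipped update remains stable.
print(run(3.0, eta=0.1, steps=6))               # plain GD: |theta| explodes
print(run(3.0, eta=0.1, gamma=1.0, steps=200))  # clipped GD: theta -> 0
```

From the same starting point, the fixed step size overshoots more violently at every iteration, while clipping automatically shrinks the step wherever the gradient (and hence the local smoothness constant) is large.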

Summary

This article briefly introduced a perfect-score ICLR 2020 paper that analyzes gradient clipping. Its main idea is to replace the traditional smoothness assumption with a more relaxed, more universal one; under this new condition, the necessity of gradient clipping emerges naturally. Because the traditional constraint is relaxed, the resulting theory applies to a broader class of problems, which suggests that gradient clipping is indeed a useful technique in many scenarios.


Original author: 苏剑林 (Su Jianlin)
Original publication: 科学空间 (Scientific Spaces)

Translated using Gemini 3 Flash. Please refer to the original for authoritative content.