Why is the Default Norm for Gradient Clipping 1?

By 苏剑林 | January 02, 2025

As we know, Gradient Clipping is a common technique used to make model training more stable. Generally, gradient clipping is performed based on the total norm of the gradients of all parameters. This operation can be expressed as:

\begin{equation} \text{clip}(\boldsymbol{g},\tau)=\left\{\begin{aligned}&\boldsymbol{g}, &\Vert\boldsymbol{g}\Vert\leq \tau \\ &\frac{\tau}{\Vert\boldsymbol{g}\Vert}\boldsymbol{g},&\Vert\boldsymbol{g}\Vert > \tau \end{aligned}\right. \end{equation}

In this way, $\text{clip}(\boldsymbol{g},\tau)$ maintains the same direction as $\boldsymbol{g}$, but its norm does not exceed $\tau$. Note that $\Vert\boldsymbol{g}\Vert$ here is the "Global Gradient Norm," calculated by treating all parameters of the entire model as a single vector. Have you ever noticed a specific detail? Whether a model has millions of parameters or tens of billions, the value of $\tau$ is often set to 1. What does this mean? Is it simply a matter of reusing a default value, or is there a profound principle hidden behind it?
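For concreteness, here is a minimal PyTorch-style sketch of this operation (the function name is my own; PyTorch's built-in torch.nn.utils.clip_grad_norm_ performs essentially the same computation):

```python
import torch

def clip_global_grad_norm(parameters, tau=1.0):
    """Rescale all gradients so that their global L2 norm does not exceed tau.

    A minimal sketch of clip(g, tau) above; PyTorch's built-in
    torch.nn.utils.clip_grad_norm_ performs essentially the same computation.
    """
    grads = [p.grad for p in parameters if p.grad is not None]
    # "Global" norm: treat every parameter's gradient as one long concatenated vector.
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > tau:
        # Shrink all gradients by the same factor; the direction is unchanged.
        for g in grads:
            g.mul_(tau / total_norm)
    return total_norm
```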

The "What"

Some readers might object: the default value is not necessarily the optimal one, so why bother overthinking it? It is true that $\tau=1$ is not always the optimal choice, but the fact that it serves as the default for so many models and performs reasonably well suggests that $\tau=1$ possesses a universal "rationality."

What does "rationality" mean here? Let's return to the $\text{clip}$ operation. If $\Vert\boldsymbol{g}\Vert$ is always smaller than $\tau$, then $\text{clip}$ degrades into an identity transform; if $\Vert\boldsymbol{g}\Vert$ is always larger than $\tau$, then $\text{clip}$ degrades into L2 normalization. In other words, for $\text{clip}$ to function as intended, $\tau$ must provide an appropriate differentiation, such that most of the $\Vert\boldsymbol{g}\Vert$ values are smaller than $\tau$, and only a small portion are larger than $\tau$. This defines the rationality of $\tau$.

Of course, one could find plenty of counterexamples, but the emphasis here is on the ubiquity of this phenomenon and the general applicability of this default setting. Therefore, meticulous readers need not get hung up on specific edge cases.

Thus, we believe that the universal rationality of $\tau=1$ implies that, regardless of the model's parameter count, its initialization, or the loss function, the threshold separating typical total gradient norms from "outliers" happens to sit right around $1$. This is undoubtedly a remarkable property; at least, that was exactly how the author felt upon first realizing this conclusion.

The "Why"

Why such a "coincidence"? The author's answer might be a bit surprising: because only under this condition is stable model training possible.

Let's consider the loss function $\mathcal{L}(\boldsymbol{\theta})$ and the optimizer update rule $\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\, \boldsymbol{u}_t$. The change in the loss function can be approximated as:

\begin{equation} \Delta \mathcal{L} = \mathcal{L}(\boldsymbol{\theta}_{t+1}) - \mathcal{L}(\boldsymbol{\theta}_t) \approx (\boldsymbol{\theta}_{t+1} - \boldsymbol{\theta}_t)\cdot\nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t) = -\eta\, \boldsymbol{u}_t\cdot \boldsymbol{g}_t \end{equation}

where $\boldsymbol{g}_t = \nabla_{\boldsymbol{\theta}_t}\mathcal{L}(\boldsymbol{\theta}_t)$ denotes the gradient at step $t$.

First, consider the simplest case, SGD, where $\boldsymbol{u}_t = \boldsymbol{g}_t$ and $\Delta \mathcal{L}=-\eta\Vert\boldsymbol{g}_t\Vert^2$. That is, the amount of change in the loss function is proportional to the square of the gradient norm. We know that in both CV and NLP, pure SGD (without momentum) is a very inefficient optimizer. In the middle and late stages of training, the average loss reduction per step for most tasks is far less than the learning rate itself, meaning $|\Delta \mathcal{L}| < \eta$. From this, we can derive $\Vert\boldsymbol{g}_t\Vert < 1$. This indicates that $\Vert\boldsymbol{g}_t\Vert < 1$ is a long-term characteristic of a model that converges normally.
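Spelling the step out: combining $|\Delta \mathcal{L}| = \eta\Vert\boldsymbol{g}_t\Vert^2$ with the empirical observation $|\Delta \mathcal{L}| < \eta$ gives

\begin{equation} \eta\Vert\boldsymbol{g}_t\Vert^2 < \eta \quad\Rightarrow\quad \Vert\boldsymbol{g}_t\Vert^2 < 1 \quad\Rightarrow\quad \Vert\boldsymbol{g}_t\Vert < 1 \end{equation}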

Of course, it is normal for $\Vert\boldsymbol{g}_t\Vert > 1$ to occur in the early stages of training, but it is rare for $\Vert\boldsymbol{g}_t\Vert \gg 1$ to occur—or rather, a good initialization should avoid the occurrence of $\Vert\boldsymbol{g}_t\Vert \gg 1$. This is the theoretical basis for techniques like DeepNorm. The reason is similar: if the gradient norm is too large, learning in the early stages will be too "aggressive," leading to premature convergence to a poor local solution. Another approach is to reduce $\eta$, which also reduces $|\Delta \mathcal{L}|$; this is why we typically use a Warmup at the beginning of training.

Incidentally, regarding the understanding of Warmup, readers can refer to the paper "Optimal Linear Decay Learning Rate Schedules and Further Refinements", which provides what the author considers to be the most rational analysis of Warmup.

What to Do

Simply put, because the change in the loss function is proportional to the square of the gradient norm, training stability dictates that the gradient norm cannot be too large, and over the long run it stays below 1. If a gradient norm significantly larger than 1 appears at the start of training, the usual remedy is Warmup. Alternatively, one could consider a more universal strategy: set another threshold $\mathcal{T}$ and clip $\eta$ according to the value of $\boldsymbol{u}_t\cdot \boldsymbol{g}_t$:

\begin{equation} \eta_t = \left\{\begin{aligned}&\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t\leq \mathcal{T} \\ &\frac{\mathcal{T}}{\boldsymbol{u}_t\cdot \boldsymbol{g}_t}\eta,& \boldsymbol{u}_t\cdot \boldsymbol{g}_t > \mathcal{T} \end{aligned}\right. \end{equation}
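As an illustration, here is a minimal sketch of this rule (the helper name and the default threshold value are mine, purely for illustration; for a full model the dot product would be summed over all parameter tensors):

```python
import torch

def clipped_lr(eta, update, grad, T=1.0):
    """Clip the learning rate so that the first-order loss decrease
    eta_t * (u_t . g_t) never exceeds eta * T, following the rule above.

    update (u_t) and grad (g_t) are flattened vectors; T is the threshold.
    """
    dot = torch.dot(update.flatten(), grad.flatten()).item()
    if dot > T:
        return eta * T / dot
    return eta
```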

Clipping the learning rate in this way eliminates the need for a separate Warmup schedule and offers more adaptivity. For optimizers like Adam, we can perform an approximate analysis via $\boldsymbol{u}_t=\text{sign}(\boldsymbol{g}_t)$, similar to "How Should the Learning Rate Change When the Batch Size Increases?". In this case:

\begin{equation} \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \Vert\boldsymbol{g}_t\Vert_1 \end{equation}

Here, $\Vert\cdot\Vert_1$ is the L1 norm, i.e., the sum of the absolute values of the components. Since a model has a very large number of components, each generally less than 1 in magnitude, $\Vert\boldsymbol{g}_t\Vert_1 \gg \Vert\boldsymbol{g}_t\Vert$. Therefore, also out of the need for stable training, Adam's learning rate is usually significantly smaller than SGD's. Furthermore, the above equation can be rewritten as:

\begin{equation} \Delta \mathcal{L} = -\eta\, \text{sign}(\boldsymbol{g}_t)\cdot \boldsymbol{g}_t = -\eta\, \sqrt{N}\Vert\boldsymbol{g}_t\Vert \cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t) \end{equation}

Here, we assume that $\boldsymbol{g}_t$ has no zero components, so $\Vert\text{sign}(\boldsymbol{g}_t)\Vert=\sqrt{N}$, where $N$ is the total number of model parameters. In practice, it is observed that $\Vert\boldsymbol{g}_t\Vert$ and $\cos(\text{sign}(\boldsymbol{g}_t), \boldsymbol{g}_t)$ are roughly constant across different model scales. Therefore, to keep $\Delta \mathcal{L}$ constant, $\eta$ should be inversely proportional to $\sqrt{N}$: if the model's parameter count quadruples, the learning rate can roughly be halved.
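A quick arithmetic check of this scaling rule (the reference learning rate and parameter counts below are made up, purely to illustrate the square-root dependence):

```python
import math

def scaled_lr(ref_lr, ref_params, new_params):
    """Keep eta proportional to 1/sqrt(N): eta_new = eta_ref * sqrt(N_ref / N_new)."""
    return ref_lr * math.sqrt(ref_params / new_params)

# Quadrupling the parameter count halves the learning rate.
print(scaled_lr(3e-4, 1e9, 4e9))  # -> 1.5e-04
```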

Fin

This article has provided some of my views and reflections on the phenomenon that "the default norm for gradient clipping is 1."