By 苏剑林 | November 18, 2024
In the previous article "When Batch Size increases, how should the learning rate change?" we discussed the scaling laws between learning rate and Batch Size from multiple perspectives. For the Adam optimizer, we used the SignSGD approximation, which is a common technique for analyzing Adam. A natural question follows: how scientifically sound is it to approximate Adam using SignSGD?
We know that the denominator of the Adam optimizer's update includes an $\epsilon$. Its original purpose was to prevent division-by-zero errors, so it is usually set very close to zero, to the point where we typically ignore it in theoretical analysis. However, in current LLM (Large Language Model) training, especially low-precision training, we often choose a relatively large $\epsilon$. As a result, $\epsilon$ often exceeds the magnitude of the squared gradients in the middle and late stages of training, making its presence effectively non-negligible.
Therefore, in this article, we attempt to explore how $\epsilon$ affects the Scaling Law of Adam's learning rate and Batch Size, providing a reference calculation scheme for related issues.
SoftSign
Since this is a follow-up to the previous article, I won't repeat the background here. To investigate the role of $\epsilon$, we switch from SignSGD to SoftSignSGD, replacing $\tilde{\boldsymbol{u}}_B = \text{sign}(\tilde{\boldsymbol{g}}_B)$ with $\tilde{\boldsymbol{u}}_B = \text{softsign}(\tilde{\boldsymbol{g}}_B, \epsilon)$, where
\begin{equation}\text{sign}(x)=\frac{x}{\sqrt{x^2}}\quad\to\quad\text{softsign}(x, \epsilon)=\frac{x}{\sqrt{x^2+\epsilon^2}}\end{equation}
This form is undoubtedly closer to Adam. Before going further, though, we should confirm whether $\epsilon$ is truly non-negligible, to make sure there is value in further research.
In Keras's implementation of Adam, the default value of $\epsilon$ is $10^{-7}$, and in PyTorch it is $10^{-8}$. At these values, the probability that the absolute value of a gradient component falls below $\epsilon$ is not very high. However, in LLMs a common choice is $\epsilon = 10^{-5}$ (as in LLAMA2). Once training is on track, components whose absolute gradient is smaller than $\epsilon$ become very common, so the impact of $\epsilon$ is significant.
This is also related to the number of parameters of the LLM. For a model that can be trained stably, the order of magnitude of its gradient norm is roughly the same regardless of parameter count, since it is determined by the stability of backpropagation (refer to "What exactly is the difficulty in training a 1000-layer Transformer?"). Therefore, the more parameters a model has, the smaller the average absolute gradient of each individual parameter, and the more prominent the role of $\epsilon$.
It is worth noting that the introduction of $\epsilon$ actually provides an interpolation between Adam and SGD. This is because when $x\neq 0$
\begin{equation}\lim_{\epsilon\to \infty}\epsilon\,\text{softsign}(x, \epsilon)=\lim_{\epsilon\to \infty}\frac{x \epsilon}{\sqrt{x^2+\epsilon^2}} = x\end{equation}
So, the larger the $\epsilon$, the more Adam behaves like SGD.
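To make this interpolation concrete, here is a minimal NumPy sketch (the values are toy ones of my own choosing): with a tiny $\epsilon$, $\text{softsign}$ behaves like $\text{sign}$, while for a large $\epsilon$, $\epsilon\,\text{softsign}(x, \epsilon)$ essentially returns $x$ itself.

```python
# Minimal sketch of how softsign interpolates between sign (epsilon -> 0)
# and SGD-like behaviour, epsilon * softsign(x, epsilon) -> x (epsilon -> infinity).
import numpy as np

def softsign(x, eps):
    """softsign(x, eps) = x / sqrt(x^2 + eps^2)."""
    return x / np.sqrt(x**2 + eps**2)

x = np.array([-3e-4, -1e-5, 2e-6, 5e-4])   # toy gradient components

# Tiny epsilon: softsign is essentially sign(x).
print(softsign(x, 1e-12))                  # ~ [-1, -1, 1, 1]

# Large epsilon: eps * softsign(x, eps) is essentially x itself.
print(1.0 * softsign(x, 1.0), x)           # the two arrays nearly coincide
```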
(Note: The concept of SoftSign in this article originates from an ongoing collaboration between the author and Liyuan Liu from MSR and Chengyu Dong. After mutual agreement, we decided to share this part of the results first. Please stay tuned for more follow-up conclusions.)
S-shaped Approximation
After confirming the necessity of introducing $\epsilon$, we begin our analysis. During the analysis, we will repeatedly encounter S-shaped functions, so another preparatory task is to explore simple approximations for S-shaped functions.
S-shaped functions are hardly rare; the $\text{softsign}$ function introduced in the previous section is one, the $\text{erf}$ function used in the previous article's analysis is another, and there are also $\tanh$, $\text{sigmoid}$, etc. Next, we deal with an S-shaped function $S(x)$ that satisfies the following properties:
1. Globally smooth and monotonically increasing;
2. Odd function, with range $(-1, 1)$;
3. Slope at the origin is $k > 0$.
For such functions, we consider two approximations. The first approximation is similar to $\text{softsign}$:
\begin{equation}S(x)\approx \frac{x}{\sqrt{x^2 + 1/k^2}}\end{equation}
It is probably the simplest function that retains the three properties mentioned above. The second approximation is based on the $\text{clip}$ function:
\begin{equation}S(x)\approx \text{clip}(kx, -1, 1) \triangleq \left\{\begin{aligned}&1, &kx\geq 1 \\ &kx, &-1 < kx < 1\\ &-1, &kx \leq -1\end{aligned}\right.\end{equation}
This is essentially a piecewise linear function. Although it sacrifices global smoothness, piecewise linearity makes integration much easier, as we will soon see.
[Figure: the erf function and its two approximations]
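For a quick numerical feel, the sketch below (my own check, not code from the article) compares $\text{erf}$ with its two approximations, using the fact that the slope of $\text{erf}$ at the origin is $k = 2/\sqrt{\pi}$:

```python
# Compare erf with the softsign-style and clip-style approximations
# of an S-shaped function, matching the slope k = erf'(0) = 2/sqrt(pi).
import numpy as np
from scipy.special import erf

k = 2.0 / np.sqrt(np.pi)                     # slope of erf at the origin

def approx_softsign(x, k):
    return x / np.sqrt(x**2 + 1.0 / k**2)    # first (smooth) approximation

def approx_clip(x, k):
    return np.clip(k * x, -1.0, 1.0)         # second (piecewise-linear) approximation

x = np.linspace(-2, 2, 9)
print(np.round(erf(x), 3))
print(np.round(approx_softsign(x, k), 3))
print(np.round(approx_clip(x, k), 3))
```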
Mean Estimation
Without further ado, following the method of the previous article, the starting point is still
\begin{equation}\mathbb{E}[\mathcal{L}(\boldsymbol{\theta} - \eta\tilde{\boldsymbol{u}}_B)] \approx \mathcal{L}(\boldsymbol{\theta}) - \eta\mathbb{E}[\tilde{\boldsymbol{u}}_B]^{\top}\boldsymbol{g} + \frac{1}{2}\eta^2 \text{Tr}(\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]\boldsymbol{H})\end{equation}
What we need to do is estimate $\mathbb{E}[\tilde{\boldsymbol{u}}_B]$ and $\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]$.
In this section, we calculate $\mathbb{E}[\tilde{\boldsymbol{u}}_B]$. To do this, we use the $\text{clip}$ function to approximate the $\text{softsign}$ function:
\begin{equation}\text{softsign}(x, \epsilon)\approx \text{clip}(x/\epsilon, -1, 1) = \left\{\begin{aligned}&1, & x/\epsilon \geq 1 \\ & x / \epsilon, & -1 < x/\epsilon < 1 \\ &-1, & x/\epsilon \leq -1 \\ \end{aligned}\right.\end{equation}
Then we have
\begin{equation}\begin{aligned}
\mathbb{E}[\tilde{u}_B] =&\, \mathbb{E}[\text{softsign}(g + \sigma z/\sqrt{B}, \epsilon)] \approx \mathbb{E}[\text{clip}(g/\epsilon + \sigma z/\epsilon\sqrt{B},-1, 1)] \\[5pt]
=&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \text{clip}(g/\epsilon + \sigma z/\epsilon\sqrt{B},-1, 1) e^{-z^2/2}dz \\[5pt]
=&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-(g+\epsilon)\sqrt{B}/\sigma} (-1)\times e^{-z^2/2}dz + \frac{1}{\sqrt{2\pi}}\int_{-(g-\epsilon)\sqrt{B}/\sigma}^{\infty} 1\times e^{-z^2/2}dz \\[5pt]
&\, \qquad\qquad + \frac{1}{\sqrt{2\pi}}\int_{-(g+\epsilon)\sqrt{B}/\sigma}^{-(g-\epsilon)\sqrt{B}/\sigma} (g/\epsilon + \sigma z/\epsilon\sqrt{B})\times e^{-z^2/2}dz
\end{aligned}\end{equation}
The integral form is complex, but calculating it with Mathematica is not difficult. The result can be expressed using the $\text{erf}$ function:
\begin{equation}\frac{1}{2}\left[\text{erf}\left(\frac{a+b}{\sqrt{2}}\right)+\text{erf}\left(\frac{a-b}{\sqrt{2}}\right)\right]+\frac{a}{2b}\left[\text{erf}\left(\frac{a+b}{\sqrt{2}}\right)-\text{erf}\left(\frac{a-b}{\sqrt{2}}\right)\right]+\frac{e^{-(a+b)^2/2} - e^{-(a-b)^2/2}}{b\sqrt{2\pi}}\end{equation}
where $a = g\sqrt{B}/\sigma, b=\epsilon \sqrt{B}/\sigma$. This function looks complicated, but it happens to be an S-shaped function of $a$, with a range of $(-1, 1)$ and a slope at $a=0$ of $\text{erf}(b/\sqrt{2})/b$. Thus, using the first approximation form:
\begin{equation}\mathbb{E}[\tilde{u}_B]\approx\frac{a}{\sqrt{a^2 + b^2 / \text{erf}(b/\sqrt{2})^2}}\approx \frac{a}{\sqrt{a^2 + b^2 + \pi / 2}}=\frac{g/\sigma}{\sqrt{(g^2+\epsilon^2)/\sigma^2 + \pi / 2B}}\label{eq:E-u-approx}\end{equation}
The second approximate equality uses $\text{erf}(x)\approx x / \sqrt{x^2 + \pi / 4}$ to handle the $\text{erf}(b/\sqrt{2})$ in the denominator. One could say we are quite lucky: the final form is not too complex. Next, we have:
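As a sanity check of this chain of approximations, the sketch below (with arbitrary toy values for $g$, $\sigma$, $\epsilon$, $B$ that I chose myself) compares a Monte Carlo estimate of $\mathbb{E}[\text{softsign}(g + \sigma z/\sqrt{B}, \epsilon)]$ with the exact erf-based expression for the clip surrogate and with the final approximation $a/\sqrt{a^2+b^2+\pi/2}$:

```python
# Monte Carlo sanity check of the mean estimate, with toy scalar settings.
import numpy as np
from scipy.special import erf

g, sigma, eps, B = 2e-5, 1e-3, 1e-5, 256          # hypothetical values
a, b = g * np.sqrt(B) / sigma, eps * np.sqrt(B) / sigma

# Direct simulation of E[softsign(g + sigma*z/sqrt(B), eps)].
z = np.random.default_rng(0).standard_normal(2_000_000)
x = g + sigma * z / np.sqrt(B)
mc = np.mean(x / np.sqrt(x**2 + eps**2))

# Exact expectation of the clip surrogate (the erf formula above).
exact_clip = (0.5 * (erf((a + b) / np.sqrt(2)) + erf((a - b) / np.sqrt(2)))
              + a / (2 * b) * (erf((a + b) / np.sqrt(2)) - erf((a - b) / np.sqrt(2)))
              + (np.exp(-(a + b)**2 / 2) - np.exp(-(a - b)**2 / 2)) / (b * np.sqrt(2 * np.pi)))

# Final simplified approximation a / sqrt(a^2 + b^2 + pi/2).
approx = a / np.sqrt(a**2 + b**2 + np.pi / 2)

print(mc, exact_clip, approx)   # for these values, all three agree to within a few percent
```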
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{u}}_B]_i \approx \frac{g_i/\sigma_i}{\sqrt{(g_i^2+\epsilon^2)/\sigma_i^2 + \pi / 2B}} = \frac{\text{softsign}(g_i, \epsilon)}{\sqrt{1 + \pi \sigma_i^2 /(g_i^2+\epsilon^2)/2B}}\approx \frac{\text{softsign}(g_i, \epsilon)}{\sqrt{1 + \pi \kappa^2/2B}} = u_i \beta\end{equation}
Just like in the previous article, the last approximation uses a mean-field approximation, where $\kappa^2$ is some average of all $\sigma_i^2 /(g_i^2+\epsilon^2)$, while $u_i = \text{softsign}(g_i, \epsilon)$ and $\beta = (1 + \pi\kappa^2 / 2B)^{-1/2}$.
Variance Estimation
The mean (first moment) is resolved; now it is the second moment's turn:
\begin{equation}\begin{aligned}
\mathbb{E}[\tilde{u}_B^2] =&\, \mathbb{E}[\text{softsign}(g + \sigma z/\sqrt{B}, \epsilon)^2] \approx \mathbb{E}[\text{clip}(g/\epsilon + \sigma z/\epsilon\sqrt{B},-1, 1)^2] \\[5pt]
=&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \text{clip}(g/\epsilon + \sigma z/\epsilon\sqrt{B},-1, 1)^2 e^{-z^2/2}dz \\[5pt]
=&\, \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{-(g+\epsilon)\sqrt{B}/\sigma} (-1)^2\times e^{-z^2/2}dz + \frac{1}{\sqrt{2\pi}}\int_{-(g-\epsilon)\sqrt{B}/\sigma}^{\infty} 1^2\times e^{-z^2/2}dz \\[5pt]
&\, \qquad\qquad + \frac{1}{\sqrt{2\pi}}\int_{-(g+\epsilon)\sqrt{B}/\sigma}^{-(g-\epsilon)\sqrt{B}/\sigma} (g/\epsilon + \sigma z/\epsilon\sqrt{B})^2\times e^{-z^2/2}dz
\end{aligned}\end{equation}
The result can also be expressed with the $\text{erf}$ function but is even more lengthy, so I won't write it out here. As I said, for Mathematica, this is no trouble. When viewed as a function of $a$, the result is an inverted bell-shaped curve, symmetric about the $y$-axis, with an upper bound of $1$ and a minimum value occurring within $(0, 1)$. Referring to the approximate formula for $\mathbb{E}[\tilde{u}_B]$ in $\eqref{eq:E-u-approx}$, we choose the following approximation:
\begin{equation}\mathbb{E}[\tilde{u}_B^2] \approx 1 - \frac{b^2}{a^2 + b^2 + \pi / 2} = 1 - \frac{\epsilon^2/(g^2+\epsilon^2)}{1 + \pi\sigma^2/(g^2+\epsilon^2) / 2B}\end{equation}
To be honest, the accuracy of this approximation isn't very high; it's mainly for computational convenience, but it retains key properties: inverted bell shape, $y$-axis symmetry, upper bound of 1, result of 1 when $b=0$, and result of 0 when $b \to \infty$. We continue applying the mean-field approximation:
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]_{i,i} \approx 1 - \frac{\epsilon^2/(g_i^2+\epsilon^2)}{1 + \pi\sigma_i^2/(g_i^2+\epsilon^2) / 2B}\approx 1 - \frac{\epsilon^2/(g_i^2+\epsilon^2)}{1 + \pi\kappa^2 / 2B} = u_i^2 \beta^2 + (1 - \beta^2)\end{equation}
So $\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]_{i,j}\approx u_i u_j \beta^2 + \delta_{i,j}(1-\beta^2)$. The term $\delta_{i,j}(1-\beta^2)$ corresponds to the covariance matrix of $\tilde{\boldsymbol{u}}_B$, namely $(1-\beta^2)\boldsymbol{I}$, a diagonal matrix. This is expected, since one of our assumptions is that the components of $\tilde{\boldsymbol{u}}_B$ are independent, so the covariance matrix must be diagonal.
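To see what this approximation does and does not capture, here is a rough Monte Carlo comparison with toy values of $a$ and $b$ that I chose myself. As conceded above, the numerical match is loose, especially near $a=0$; what survives is the qualitative behaviour: a minimum at $a=0$, an upper bound of $1$, and convergence to $1$ as $|a|$ grows.

```python
# Rough comparison of the true second moment with 1 - b^2/(a^2 + b^2 + pi/2).
# Both quantities are even in a, so only a >= 0 is shown.
import numpy as np

z = np.random.default_rng(0).standard_normal(2_000_000)
b = 0.5                                        # b = eps*sqrt(B)/sigma, hypothetical

for a in [0.0, 0.5, 1.0, 2.0, 4.0]:            # a = g*sqrt(B)/sigma
    x = a + z                                  # (g + sigma*z/sqrt(B)) in units of sigma/sqrt(B)
    mc = np.mean(x**2 / (x**2 + b**2))         # Monte Carlo E[softsign^2]
    approx = 1 - b**2 / (a**2 + b**2 + np.pi / 2)
    print(a, round(mc, 3), round(approx, 3))   # loose match, same qualitative shape
```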
Initial Exploration of Results
From this we obtain:
\begin{equation}\eta^* \approx \frac{\mathbb{E}[\tilde{\boldsymbol{u}}_B]^{\top}\boldsymbol{g}}{\text{Tr}(\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]\boldsymbol{H})} \approx \frac{\beta\sum_i u_i g_i}{\beta^2\sum_{i,j} u_i u_j H_{i,j} + (1-\beta^2)\sum_i H_{i,i} }\end{equation}
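The denominator comes from plugging the diagonal-plus-rank-one structure of $\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]$ into the trace: $\text{Tr}(\mathbb{E}[\tilde{\boldsymbol{u}}_B\tilde{\boldsymbol{u}}_B^{\top}]\boldsymbol{H}) \approx \beta^2\sum_{i,j}u_iu_jH_{i,j} + (1-\beta^2)\sum_i H_{i,i}$. A quick numerical check of this identity, with toy $\boldsymbol{u}$, $\beta$ and $\boldsymbol{H}$ of my own choosing:

```python
# Check: Tr((beta^2 u u^T + (1-beta^2) I) H) == beta^2 u^T H u + (1-beta^2) Tr(H).
import numpy as np

rng = np.random.default_rng(2)
n, beta = 8, 0.7
u = rng.uniform(-1, 1, size=n)                 # stands in for softsign(g_i, eps)
A = rng.normal(size=(n, n))
H = A @ A.T + np.eye(n)                        # toy positive-definite Hessian

E_uuT = beta**2 * np.outer(u, u) + (1 - beta**2) * np.eye(n)
lhs = np.trace(E_uuT @ H)
rhs = beta**2 * (u @ H @ u) + (1 - beta**2) * np.trace(H)
print(lhs, rhs)                                # identical up to floating-point error
```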
Note that, apart from $\beta$, none of the other symbols depend on $B$, so the above formula already gives the dependence of $\eta^*$ on $B$. Also, to ensure the existence of a minimum, we assume the matrix $\boldsymbol{H}$ is positive definite, under which assumption $\sum_{i,j} u_i u_j H_{i,j} > 0$ and $\sum_i H_{i,i} > 0$ must hold.
In the previous article, we said Adam's most important characteristic is the possible occurrence of the "Surge phenomenon," where $\eta^*$ is no longer a globally monotonically increasing function of $B$. Next, we will prove that the introduction of $\epsilon > 0$ reduces the probability of the Surge phenomenon occurring, and it disappears completely when $\epsilon \to \infty$. The proof is not difficult; obviously, a necessary condition for the Surge phenomenon is:
\begin{equation}\sum_{i,j} u_i u_j H_{i,j} - \sum_i H_{i,i} > 0\end{equation}
If this inequality does not hold, $\eta^*$ is monotonically increasing in $\beta$; since $\beta$ is monotonically increasing in $B$, $\eta^*$ is then monotonically increasing in $B$, and no Surge phenomenon exists. Don't forget that $u_i = \text{softsign}(g_i, \epsilon)$ shrinks in magnitude as $\epsilon$ increases ($|u_i|$ decreases). Therefore, the larger $\epsilon$ is, the smaller $\sum_{i,j} u_i u_j H_{i,j}$ becomes, and the less likely the above inequality is to hold. Furthermore, as $\epsilon \to \infty$, $u_i$ approaches zero, the inequality cannot possibly hold, and the Surge phenomenon disappears.
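To illustrate, write $P=\sum_i u_i g_i$, $Q=\sum_{i,j}u_iu_jH_{i,j}$ and $R=\sum_i H_{i,i}$, so that $\eta^*\approx \beta P/(\beta^2 Q + (1-\beta^2)R)$. The toy sketch below (with hypothetical constants $P, Q, R, \kappa^2$, not taken from any real model) shows the two regimes: monotone growth in $B$ when $Q\leq R$, and a rise followed by a fall, i.e. a Surge, when $Q$ is sufficiently larger than $R$.

```python
# Toy illustration of the Surge phenomenon; P, Q, R, kappa^2 are hypothetical
# constants (they stand for sums that do not depend on B).
import numpy as np

def eta_star(B, P, Q, R, kappa2=100.0):
    beta = 1.0 / np.sqrt(1.0 + np.pi * kappa2 / (2.0 * B))
    return beta * P / (beta**2 * Q + (1.0 - beta**2) * R)

B = np.array([16, 64, 256, 1024, 4096, 65536], dtype=float)

# Q <= R: eta* grows monotonically with B (no Surge).
print(np.round(eta_star(B, P=1.0, Q=0.5, R=1.0), 4))
# Q sufficiently larger than R (here Q = 3R): eta* first rises, then falls (Surge).
print(np.round(eta_star(B, P=1.0, Q=3.0, R=1.0), 4))
```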
Further, we can prove that as $\epsilon \to \infty$, the result is consistent with SGD. We only need to notice that
\begin{equation}\frac{\eta^*}{\epsilon} \approx \frac{\beta\sum_i (\epsilon u_i) g_i}{\beta^2\sum_{i,j} (\epsilon u_i)(\epsilon u_j) H_{i,j} + \epsilon^2(1-\beta^2)\sum_i H_{i,i} }\end{equation}
We have the limits:
\begin{equation}\lim_{\epsilon\to\infty} \beta = 1,\quad\lim_{\epsilon\to\infty} \epsilon u_i = g_i, \quad \lim_{\epsilon\to\infty} \epsilon^2(1-\beta^2) = \pi \sigma^2 / 2B\end{equation}
Here $\sigma^2$ is some average of all the $\sigma_i^2$. Thus, when $\epsilon$ is large enough, we have the approximation:
\begin{equation}\frac{\eta^*}{\epsilon} \approx \frac{\sum_i g_i^2}{\sum_{i,j} g_i g_j H_{i,j} + \left(\pi \sigma^2\sum_i H_{i,i}\right)/2B }\end{equation}
The right side is exactly the result for SGD assuming the gradient covariance matrix is $(\pi\sigma^2/2B)\boldsymbol{I}$.
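A toy numerical check (random $\boldsymbol{g}$, $\boldsymbol{\sigma}$ and $\boldsymbol{H}$ of my own choosing, with a deliberately large $\epsilon$) confirms that $\eta^*/\epsilon$ computed from the softsign formula indeed lands on the SGD-style expression above:

```python
# For large epsilon, eta*/epsilon from the softsign analysis should match
# sum_i g_i^2 / (g^T H g + (pi*sigma^2/2B) * Tr(H)).  All values are toy ones.
import numpy as np

rng = np.random.default_rng(1)
n, B, eps = 50, 128, 10.0                        # eps much larger than |g_i| and sigma_i

g = rng.normal(scale=1e-3, size=n)               # toy gradient
sigma = np.abs(rng.normal(scale=1e-2, size=n))   # toy per-coordinate noise scales
A = rng.normal(size=(n, n))
H = A @ A.T / n + np.eye(n)                      # toy positive-definite Hessian

u = g / np.sqrt(g**2 + eps**2)                   # softsign(g, eps)
kappa2 = np.mean(sigma**2 / (g**2 + eps**2))     # "some average" of sigma_i^2/(g_i^2+eps^2)
beta = 1.0 / np.sqrt(1.0 + np.pi * kappa2 / (2 * B))

eta_star = (beta * u @ g) / (beta**2 * u @ H @ u + (1 - beta**2) * np.trace(H))

sigma2 = np.mean(sigma**2)                       # "some average" of sigma_i^2
sgd_like = (g @ g) / (g @ H @ g + (np.pi * sigma2 / (2 * B)) * np.trace(H))

print(eta_star / eps, sgd_like)                  # the two values essentially coincide
```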
Article Summary
This article continues the method of the previous article, attempting to analyze the impact of Adam's $\epsilon$ on the Scaling Law between learning rate and Batch Size. The result is a form that lies between SGD and SignSGD. The larger the $\epsilon$, the closer the result is to SGD, and the lower the probability of the "Surge phenomenon" occurring. Overall, the calculation results hold no particular surprises, but the process can serve as a reference for analyzing the role of $\epsilon$.