By 苏剑林 | February 04, 2026
Recently, I came across the paper 《Why Adam Works Better with \(\beta_1=\beta_2\): The Missing Gradient Scale Invariance Principle》, which, as its title suggests, claims that Adam performs better when \(\beta_1=\beta_2\). As a colleague pointed out, last year's 《In Search of Adam's Secret Sauce》 expressed the same view, and coincidentally, 《The Effect of Mini-Batch Noise on the Implicit Bias of Adam》, which came out just yesterday, reports similar findings. For reference, the AdamW update rule is:
\begin{equation}\text{AdamW}:=\left\{\begin{aligned}
&\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\
&\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\
&\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\
&\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\
&\boldsymbol{u}_t =\hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right.\\
&\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t + \lambda_t \boldsymbol{\theta}_{t-1})
\end{aligned}\right.\end{equation}
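For concreteness, here is a minimal NumPy sketch of one step of this update (the function name, state dictionary, and default hyperparameters are illustrative choices of mine, not taken from any particular library):

```python
import numpy as np

def adamw_step(theta, g, state, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW step following the equations above (illustrative sketch)."""
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and its elementwise square
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g**2
    # Bias correction
    m_hat = state["m"] / (1 - beta1**t)
    v_hat = state["v"] / (1 - beta2**t)
    # Update direction plus decoupled weight decay
    u = m_hat / (np.sqrt(v_hat) + eps)
    return theta - lr * (u + weight_decay * theta)

# Usage: initialize state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
# and call theta = adamw_step(theta, grad, state) once per training step.
```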
Many papers point towards \(\beta_1=\beta_2\). What are the theoretical benefits of this? In this article, let's explore the related derivations.
Online Estimation
In chronological order, let's first look at 《In Search of Adam's Secret Sauce》. The storyline of this paper is roughly as follows: experiments show that the best result Adam achieves under the constraint \(\beta_1=\beta_2\) is very close to the best result without the constraint, and the paper then builds a theoretical explanation for this observation: when \(\beta_1=\beta_2=\beta\), \(\hat{\boldsymbol{m}}_t\) and \(\hat{\boldsymbol{v}}_t\) can be regarded as online estimates of the first and second moments of the gradient. Specifically, expanding \(\hat{\boldsymbol{m}}_t\) and \(\hat{\boldsymbol{v}}_t\) gives
\begin{equation}\hat{\boldsymbol{m}}_t = \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \boldsymbol{g}_k,\qquad \hat{\boldsymbol{v}}_t = \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \boldsymbol{g}_k^2\end{equation}
It is easy to verify that the sum of the coefficients, \(\frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k}\), is always equal to 1, so \(\hat{\boldsymbol{m}}_t\) and \(\hat{\boldsymbol{v}}_t\) are weighted averages of the \(\boldsymbol{g}_k\) and \(\boldsymbol{g}_k^2\) with the same set of weights, and can therefore be interpreted as the first and second moments of the gradient. Furthermore, we have
\begin{equation}\begin{aligned}
\hat{\boldsymbol{v}}_t =&\, \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\hat{\boldsymbol{m}}_t + \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\
=&\, \underbrace{\frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} \hat{\boldsymbol{m}}_t^2}_{\hat{\boldsymbol{m}}_t^2} + \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 + \underbrace{\frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} 2\hat{\boldsymbol{m}}_t (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)}_{\boldsymbol{0}} \\
=&\, \hat{\boldsymbol{m}}_t^2 + \frac{1-\beta}{1-\beta^t}\sum_{k=1}^t \beta^{t-k} (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\
\end{aligned}\end{equation}
The last term clearly has the form of a variance, so we might as well denote it by \(\hat{\boldsymbol{\sigma}}_t^2\), i.e., \(\hat{\boldsymbol{v}}_t=\hat{\boldsymbol{m}}_t^2+\hat{\boldsymbol{\sigma}}_t^2\), which is exactly the relationship among the second moment, the mean, and the variance. Note that all the vector operations above (multiplication, division, squaring, etc.) are element-wise: they act component by component, and the result is still a vector.
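As a quick sanity check of this decomposition, here is a small NumPy snippet (entirely my own, with arbitrary toy values) that compares the recursive EMA form of \(\hat{\boldsymbol{m}}_t\) and \(\hat{\boldsymbol{v}}_t\) with the explicit weighted sums, and verifies \(\hat{\boldsymbol{v}}_t=\hat{\boldsymbol{m}}_t^2+\hat{\boldsymbol{\sigma}}_t^2\) when \(\beta_1=\beta_2=\beta\):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, t, dim = 0.95, 50, 4
g = rng.normal(size=(t, dim))                  # a toy gradient sequence g_1, ..., g_t

# Recursive EMA form with bias correction (the AdamW equations with beta1 = beta2 = beta)
m = np.zeros(dim)
v = np.zeros(dim)
for k in range(t):
    m = beta * m + (1 - beta) * g[k]
    v = beta * v + (1 - beta) * g[k] ** 2
m_hat = m / (1 - beta**t)
v_hat = v / (1 - beta**t)

# Explicit weighted-sum form: w_k = (1 - beta) / (1 - beta^t) * beta^(t - k), k = 1..t
w = (1 - beta) / (1 - beta**t) * beta ** (t - 1 - np.arange(t))
sigma2_hat = w @ (g - m_hat) ** 2              # the variance term defined above

print(np.isclose(w.sum(), 1.0))                   # the weights sum to 1
print(np.allclose(m_hat, w @ g))                  # EMA form == weighted-sum form
print(np.allclose(v_hat, m_hat**2 + sigma2_hat))  # v_hat = m_hat^2 + sigma_hat^2
```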
Signal-to-Noise Awareness
Under these new notations, the Adam update amount can be written as (for simplicity, assume \(\epsilon=0\)):
\begin{equation}\newcommand{\sign}{\mathop{\text{sign}}}\boldsymbol{u}_t = \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{v}}_t}} = \frac{\hat{\boldsymbol{m}}_t}{\sqrt{\hat{\boldsymbol{m}}_t^2+\hat{\boldsymbol{\sigma}}_t^2}} = \frac{\sign(\hat{\boldsymbol{m}}_t)}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}}\end{equation}
This form of the update has several advantages. The first and most obvious is that each of its components is bounded within \([-1, 1]\), so we need not worry about the update exploding. Secondly, \(\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2\) has exactly the form of the reciprocal of a signal-to-noise ratio, so the update can also be interpreted as "signal-to-noise ratio-aware steepest descent."
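Both the identity and the boundedness can be checked directly. The snippet below (my own toy example, recomputing the moments in their weighted-sum form) uses gradients with an arbitrarily large scale to show that the components still stay within \([-1, 1]\):

```python
import numpy as np

rng = np.random.default_rng(1)
beta, t, dim = 0.95, 50, 8
g = 1e3 * rng.normal(loc=0.3, scale=1.0, size=(t, dim))   # gradients with a large arbitrary scale

w = (1 - beta) / (1 - beta**t) * beta ** (t - 1 - np.arange(t))
m_hat = w @ g                                  # first moment
v_hat = w @ g**2                               # second moment
sigma2_hat = w @ (g - m_hat) ** 2              # variance

u = m_hat / np.sqrt(v_hat)                     # Adam update with epsilon = 0
u_snr = np.sign(m_hat) / np.sqrt(1 + sigma2_hat / m_hat**2)

print(np.allclose(u, u_snr))                   # the two expressions coincide
print(np.abs(u).max() <= 1.0)                  # every component lies in [-1, 1]
```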
According to our derivation in 《Steepest Descent on Manifolds: 1. SGD + Hypersphere》, \(\sign(\hat{\boldsymbol{m}}_t)\) can be regarded as the solution to the following optimization problem:
\begin{equation}\max_{\boldsymbol{u}} \langle\hat{\boldsymbol{m}}_t,\boldsymbol{u}\rangle\qquad \text{s.t.}\qquad \Vert\boldsymbol{u}\Vert_{\infty} = 1\end{equation}
where \(\Vert\boldsymbol{u}\Vert_{\infty} = 1\) means that the maximum absolute value of the components of \(\boldsymbol{u}\) is 1. If we treat \(\hat{\boldsymbol{m}}_t\) as a more accurate gradient, then \(\sign(\hat{\boldsymbol{m}}_t)\) is the steepest descent direction under the infinity norm. But this bound is static; it is reasonable to expect that when the gradient fluctuates little (high signal-to-noise ratio) the local region is relatively flat and the update can be enlarged somewhat, and that it should be shrunk otherwise. Therefore, replacing the static bound with a per-component dynamic bound \(\frac{1}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}}\) based on the signal-to-noise ratio better reflects the idea of adaptive learning rates, and the corresponding steepest descent direction is exactly:
\begin{equation}\max_{\boldsymbol{u}} \langle\hat{\boldsymbol{m}}_t,\boldsymbol{u}\rangle\quad \text{s.t.}\quad |\boldsymbol{u}| \leq \frac{1}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}} \qquad\Rightarrow\qquad
\boldsymbol{u}^* = \frac{\sign(\hat{\boldsymbol{m}}_t)}{\sqrt{1 +\hat{\boldsymbol{\sigma}}_t^2/\hat{\boldsymbol{m}}_t^2}}\end{equation}
Here the absolute value \(| |\) and the inequality \(\leq\) are element-wise.
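One can sanity-check the claimed maximizer on a tiny instance: a linear objective over a box attains its maximum at a vertex, so brute-forcing all sign patterns suffices (a toy check of mine, not taken from either paper):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
m_hat = rng.normal(size=4)                 # plays the role of \hat{m}_t
c = rng.uniform(0.1, 1.0, size=4)          # per-component bound 1 / sqrt(1 + sigma^2 / m^2)

u_star = c * np.sign(m_hat)                # claimed solution: u*_i = c_i * sign(m_i)

# A linear objective over the box |u_i| <= c_i is maximized at one of its vertices
best_vertex = max(np.dot(m_hat, np.array(s) * c) for s in product([-1.0, 1.0], repeat=4))
print(np.isclose(np.dot(m_hat, u_star), best_vertex))   # True
```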
First-Order Expansion
Next, let's look at 《Why Adam Works Better with \(\beta_1=\beta_2\): The Missing Gradient Scale Invariance Principle》, which treats Adam as a continuous-time ODE. However, for mini-batch optimization, gradient noise cannot be ignored: modeling the continuous-time limit as an SDE might be acceptable, but an ODE is not very scientific. Therefore, I believe the starting point of this paper is quite weak.
Following the spirit of the original paper, I have slightly adjusted the proof process. We write each \(\boldsymbol{g}_k\) in \(\hat{\boldsymbol{v}}_t\) as \(\hat{\boldsymbol{m}}_t + (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)\). Treating \(\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t\) as a small quantity, the first-order expansion yields:
\begin{equation}\begin{aligned}
\hat{\boldsymbol{v}}_t =&\, \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} (\hat{\boldsymbol{m}}_t + \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t)^2 \\
\approx &\, \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \hat{\boldsymbol{m}}_t^2 + \frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} 2\hat{\boldsymbol{m}}_t (\boldsymbol{g}_k - \hat{\boldsymbol{m}}_t) \\
\approx &\, \hat{\boldsymbol{m}}_t^2 + 2\hat{\boldsymbol{m}}_t \left(\frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \boldsymbol{g}_k - \hat{\boldsymbol{m}}_t\right) \\
\end{aligned}\end{equation}
We then want \(\hat{\boldsymbol{v}}_t\) to be as close to \(\hat{\boldsymbol{m}}_t^2\) as possible, so the first-order term should vanish. The quantity in parentheses is the \(\beta_2\)-weighted average of the gradients minus the \(\beta_1\)-weighted average \(\hat{\boldsymbol{m}}_t\), and it vanishes for an arbitrary gradient sequence only when \(\beta_2 = \beta_1\). Why make \(\hat{\boldsymbol{v}}_t\) close to \(\hat{\boldsymbol{m}}_t^2\)? Because this makes the update \(\boldsymbol{u}_t = \hat{\boldsymbol{m}}_t/\sqrt{\hat{\boldsymbol{v}}_t}\) closer to \(\sign(\hat{\boldsymbol{m}}_t)\), and since \(\sign\) is bounded, it better withstands disturbances caused by scale changes in \(\boldsymbol{g}_t\), thereby improving training stability.
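Here is a small numerical illustration of the argument (a scalar toy example of my own, with \(g_k = g^* + \varepsilon\cdot\text{noise}\)): when \(\beta_2=\beta_1\) the gap between \(\hat{\boldsymbol{v}}_t\) and \(\hat{\boldsymbol{m}}_t^2\) shrinks quadratically in the noise level, whereas for \(\beta_2\neq\beta_1\) it generally shrinks only linearly:

```python
import numpy as np

rng = np.random.default_rng(3)
t, g_star = 200, 1.0                       # steps and the noise-free "true" gradient (scalar)

def gap(beta1, beta2, eps):
    """|v_hat - m_hat^2| for the gradient sequence g_k = g_star + eps * noise."""
    g = g_star + eps * rng.normal(size=t)
    w1 = (1 - beta1) / (1 - beta1**t) * beta1 ** (t - 1 - np.arange(t))
    w2 = (1 - beta2) / (1 - beta2**t) * beta2 ** (t - 1 - np.arange(t))
    m_hat, v_hat = w1 @ g, w2 @ g**2
    return abs(v_hat - m_hat**2)

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, gap(0.9, 0.9, eps), gap(0.9, 0.999, eps))
# With beta2 = beta1 the gap is sigma_hat^2, i.e. second order in eps;
# with beta2 != beta1 a first-order term generally survives, so the gap
# shrinks only linearly in eps.
```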
It should be noted that the derivation here is greatly simplified compared with the original text, but it captures the core idea while correcting some issues. The original paper first converts the update into an ODE, a step that itself contains some factual errors, and then uses rather loose approximations to arrive at an expansion of the form \(\boldsymbol{u}_t = \sign(\boldsymbol{g}_t)(1 + \cdots)\), which is less accurate than our direct expansion around \(\hat{\boldsymbol{m}}_t\).
Dual Optimization
One idea common to both papers is to keep \(\boldsymbol{u}_t\) bounded to improve training stability. This inspired me to think in reverse: given \(\beta_1\), what value should \(\beta_2\) take to make \(\Vert\boldsymbol{u}_t\Vert\) as small as possible? Since \(\boldsymbol{u}_t = \hat{\boldsymbol{m}}_t/\sqrt{\hat{\boldsymbol{v}}_t}\), intuitively, one should make \(\hat{\boldsymbol{v}}_t\) as large as possible. This led me to consider the following dual optimization problem:
\begin{equation}\max_{\beta_2} \min_{\boldsymbol{g}_1,\cdots,\boldsymbol{g}_t}\underbrace{\frac{1-\beta_2}{1-\beta_2^t}\sum_{k=1}^t \beta_2^{t-k} \boldsymbol{g}_k^2}_{\hat{\boldsymbol{v}}_t},\qquad \text{s.t.}\qquad \frac{1-\beta_1}{1-\beta_1^t}\sum_{k=1}^t \beta_1^{t-k} \boldsymbol{g}_k = \hat{\boldsymbol{m}}_t\end{equation}
This formulation is not written out with full rigor; it should be understood element-wise. The \(\min\) over \(\boldsymbol{g}_1,\cdots,\boldsymbol{g}_t\) in the objective cannot be dropped: it represents the worst case, so we are choosing a \(\beta_2\) that performs as well as possible under any gradient sequence consistent with the constraint. The problem looks complicated but is not hard to solve step by step with the Cauchy-Schwarz inequality. Solving the inner minimization first:
\begin{equation}\sum_{k=1}^t \frac{p_k^2}{q_k}\times \sum_{k=1}^t q_k \boldsymbol{g}_k^2 \geq \left(\sum_{k=1}^t p_k \boldsymbol{g}_k\right)^2 = \hat{\boldsymbol{m}}_t^2\end{equation}
where \(p_k = \frac{1-\beta_1}{1-\beta_1^t}\beta_1^{t-k},q_k = \frac{1-\beta_2}{1-\beta_2^t}\beta_2^{t-k}\). Thus, the result of the inner minimization is \(\frac{\hat{\boldsymbol{m}}_t^2}{\sum_{k=1}^t p_k^2/q_k}\). To maximize this, we need to minimize \(\sum_{k=1}^t p_k^2/q_k\). Using Cauchy-Schwarz again:
\begin{equation}\sum_{k=1}^t p_k^2/q_k = \sum_{k=1}^t q_k \times \sum_{k=1}^t p_k^2/q_k \geq \left(\sum_{k=1}^t p_k\right)^2 = 1\end{equation}
Equality holds when \(q_k=p_k\), which implies \(\beta_2 = \beta_1\). Therefore, setting \(\beta_2=\beta_1\) offers the best worst-case stability: it guarantees \(\hat{\boldsymbol{v}}_t\geq \hat{\boldsymbol{m}}_t^2\), and hence (with \(\epsilon=0\)) every component of \(\boldsymbol{u}_t\) is bounded by 1. This time we derived the conclusion directly from a dual optimization objective, rather than verifying a property after assuming \(\beta_1=\beta_2\).
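As a final check (again a toy scan of my own), for a fixed \(\beta_1\) one can scan \(\beta_2\) and confirm that \(\sum_k p_k^2/q_k\) attains its minimum value 1 at \(\beta_2=\beta_1\), i.e. the worst-case lower bound \(\hat{\boldsymbol{m}}_t^2\big/\sum_k p_k^2/q_k\) on \(\hat{\boldsymbol{v}}_t\) is largest there:

```python
import numpy as np

t, beta1 = 100, 0.9
idx = np.arange(t)
p = (1 - beta1) / (1 - beta1**t) * beta1 ** (t - 1 - idx)      # the weights p_k

def penalty(beta2):
    q = (1 - beta2) / (1 - beta2**t) * beta2 ** (t - 1 - idx)  # the weights q_k
    return np.sum(p**2 / q)          # worst-case v_hat equals m_hat^2 / penalty(beta2)

betas = np.linspace(0.5, 0.999, 500)
vals = np.array([penalty(b) for b in betas])
print(betas[vals.argmin()])          # approximately 0.9, i.e. beta2 = beta1
print(vals.min())                    # approximately 1, the Cauchy-Schwarz lower bound
```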
Related Work
Some readers might be confused. The original Adam paper 《Adam: A Method for Stochastic Optimization》 recommends default values of \(\beta_1=0.9, \beta_2=0.999\). Furthermore, our previous analysis in 《Viewing Adaptive Learning Rate Optimizers from Hessian Approximation》 also suggested \(\beta_2 > \beta_1\). Don't these conclusions contradict each other?
Not necessarily; they are simply conclusions drawn in different contexts. In the early days we trained small models with small batch sizes, so the gradient noise at each step was relatively large. A larger \(\beta_2\) makes \(\hat{\boldsymbol{v}}_t\) change more slowly, so Adam behaves more like SGD, in which the updates of adjacent steps superpose roughly linearly and thus average out more of the noise. The paper 《The Effect of Mini-Batch Noise on the Implicit Bias of Adam》 reports a similar finding: \(\beta_1 < \beta_2\) is preferable for small batch sizes, while larger batch sizes push the optimal values closer together.
As for 《Viewing Adaptive Learning Rate Optimizers from Hessian Approximation》, that was an approximate analysis carried out near an idealized end point of training. The current reality, however, is that models and batch sizes keep growing. After a model has trained for a month the loss may look decent, yet with enough compute and data another month or two of training still drives the loss further down. In other words, whether we train for one month or two, the model remains far from the true optimum.
In this new context, the core logic of our training has become "stability for the sake of speed." Stability is the prerequisite. We believe that as long as we can train stably and continuously, we will inevitably achieve better results. Choosing \(\beta_1=\beta_2\) keeps the update amount bounded, satisfying our expectation for "stability." In fact, in the LLM era, the default parameters for Adam have gradually shifted from \(\beta_1=0.9, \beta_2=0.999\) to \(\beta_1=0.9, \beta_2=0.95\), which is closer to \(\beta_1=\beta_2\), supporting our deduction.
Summary
In this article, we analyzed the \(\beta_1, \beta_2\) hyperparameters of the Adam optimizer. From a stability perspective, \(\beta_1=\beta_2\) is often the better choice, since the resulting update can be understood as signal-to-noise-ratio-aware steepest descent with bounded components.