By 苏剑林 | September 02, 2025
As is well known, we started experimenting with applying Muon to large-scale LLM training quite early. Specifically, in "Muon Sequel: Why We Choose to Try Muon?", we proposed the "Match Adam Update RMS" trick for quickly migrating from Adam to Muon, and this trick was also used in the training of Kimi K2. The trick is to normalize Muon's Update RMS to 0.2, which lets us reuse Adam's learning rate and weight decay rate.
Behind this trick is our observation that Adam's Update RMS is approximately equal to 0.2, and this phenomenon is stable and reproducible. This raises an interesting question: Why is Adam's Update RMS 0.2? Can we explain it theoretically?
First, let's describe the phenomenon: From experiments, we observed that roughly after the warmup ends and the model enters formal training, Adam's Update RMS remains almost entirely between 0.2 and 0.3, and models of different sizes exhibit similar patterns. A common point among these models is that they are all trained with Adam, with parameters $\beta_1=0.9, \beta_2=0.95$. Since the commonality is so obvious, this is likely not a coincidence, so the author attempted to analyze the underlying principle.
Next, let's review the form of the Adam optimizer:
\begin{equation}\text{Adam}\color{skyblue}{\text{W}}:=\left\{\begin{aligned} &\boldsymbol{m}_t = \beta_1 \boldsymbol{m}_{t-1} + \left(1 - \beta_1\right) \boldsymbol{g}_t\\ &\boldsymbol{v}_t = \beta_2 \boldsymbol{v}_{t-1} + \left(1 - \beta_2\right) \boldsymbol{g}_t^2\\ &\hat{\boldsymbol{m}}_t = \boldsymbol{m}_t\left/\left(1 - \beta_1^t\right)\right.\\ &\hat{\boldsymbol{v}}_t = \boldsymbol{v}_t\left/\left(1 - \beta_2^t\right)\right.\\ &\boldsymbol{u}_t =\hat{\boldsymbol{m}}_t\left/\left(\sqrt{\hat{\boldsymbol{v}}_t} + \epsilon\right)\right.\\ &\boldsymbol{\theta}_t = \boldsymbol{\theta}_{t-1} - \eta_t (\boldsymbol{u}_t \color{skyblue}{ + \lambda_t \boldsymbol{\theta}_{t-1}}) \end{aligned}\right.\end{equation}Note: In this article, all vector multiplications and divisions, including squaring, default to the Hadamard product/quotient, i.e., element-wise multiplication/division.
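For concreteness, here is a minimal NumPy sketch of a single AdamW step following the equations above (the function name and default hyperparameter values such as lr and wd are illustrative, not taken from any particular codebase); the vector u it returns is exactly the update whose RMS we will track:
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8, wd=0.1):
    """One AdamW step; all operations are element-wise, as noted above."""
    m = beta1 * m + (1 - beta1) * g          # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2       # second-moment EMA
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)       # the update u_t whose RMS we study
    theta = theta - lr * (u + wd * theta)    # decoupled weight decay (the "W" in AdamW)
    return theta, m, v, u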
What we want to do is prove that $\Vert\boldsymbol{u}_t\Vert_{RMS}\approx 0.2$, at least under the setting of $\beta_1=0.9, \beta_2=0.95$. Since we are concerned with the situation after stable training, we can assume $t$ is large enough such that $\beta_1^t$ and $\beta_2^t$ are sufficiently close to 0, so there is no need to distinguish between $\boldsymbol{m}_t$ and $\hat{\boldsymbol{m}}_t$, or $\boldsymbol{v}_t$ and $\hat{\boldsymbol{v}}_t$. At the same time, we assume $\epsilon$ is small enough to be ignored. Thus, we have $\boldsymbol{u}_t = \boldsymbol{m}_t/\sqrt{\boldsymbol{v}_t}$.
For $\boldsymbol{m}_t$ and $\boldsymbol{v}_t$, we can obtain the expansion formulas:
\begin{equation}\boldsymbol{m}_t = (1 - \beta_1)\sum_{i=1}^t \beta_1^{t-i}\boldsymbol{g}_i,\qquad \boldsymbol{v}_t = (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\boldsymbol{g}_i^2\end{equation}If we assume that $\boldsymbol{g}_1, \boldsymbol{g}_2, \dots, \boldsymbol{g}_t$ are all sampled from the same distribution, then we can use numerical simulation methods to estimate $\Vert\boldsymbol{u}_t\Vert_{RMS}$. Without further ado, let's try the simplest standard normal distribution $\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$. The reference code is as follows:
import numpy as np

N, T = 10000, 2000          # number of coordinates, number of steps
beta1, beta2 = 0.9, 0.95
m, v = 0, 0
for i in range(T):
    g = np.random.randn(N)              # simulated gradient: pure standard Gaussian noise
    m = beta1 * m + (1 - beta1) * g     # first-moment EMA
    v = beta2 * v + (1 - beta2) * g**2  # second-moment EMA
u = m / v**0.5                          # Adam update; bias correction is negligible for large T
rms = (u**2).mean()**0.5                # Update RMS
print(rms)
Guess what the result is? The answer is approximately 0.225, remarkably close to what we observe in actual training! This in turn suggests that our simulation assumptions match the real situation reasonably well. Some readers may object that $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ is pure noise, so how can it be consistent with real training? Of course actual training is not pure noise; rather, the signal-to-noise ratio of a single gradient is so pitiably small that pure noise is a serviceable stand-in.
Readers can play around with the above reference code to observe the variables influencing Update RMS. The general conclusion is: Update RMS is positively correlated with $\beta_1$ and seems unrelated to $\beta_2$. If the distribution of $\boldsymbol{g}$ has a non-zero mean (equivalent to increasing the signal-to-noise ratio of the gradient), then the Update RMS will also increase.
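To make such experiments easier to repeat, here is a small variant of the reference code with the simulation wrapped in a function and a non-zero gradient mean added, so that the signal-to-noise ratio $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ can be dialed up (the function name and parametrization are mine, for illustration only):
import numpy as np

def simulate_update_rms(beta1=0.9, beta2=0.95, snr=0.0, N=10000, T=2000):
    """Adam's Update RMS with simulated gradients g ~ N(mu, I),
    where the per-coordinate mean is chosen so that ||mu||^2/||sigma||^2 = snr."""
    mu = snr**0.5
    m, v = 0.0, 0.0
    for _ in range(T):
        g = mu + np.random.randn(N)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
    u = m / v**0.5
    return (u**2).mean()**0.5

print(simulate_update_rms(snr=0.0))   # roughly 0.22-0.23, as before
print(simulate_update_rms(snr=0.1))   # noticeably larger, consistent with the trend above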
In this section, the author attempts to derive an approximate analytical solution for the above simulation results from a theoretical perspective. First, from the definition of RMS, to find $\Vert\boldsymbol{u}_t\Vert_{RMS}$, we first need to find $\boldsymbol{u}_t^2 = \boldsymbol{m}_t^2/\boldsymbol{v}_t$. The author's idea is to use the expectation of $\boldsymbol{u}_t^2$ as its approximation and further transform it into a mean-field approximation:
\begin{equation}\mathbb{E}[\boldsymbol{u}_t^2] = \mathbb{E}\left[\frac{\boldsymbol{m}_t^2}{\boldsymbol{v}_t}\right] \approx \frac{\mathbb{E}[\boldsymbol{m}_t^2]}{\mathbb{E}[\boldsymbol{v}_t]}\end{equation}Some readers might question whether this last approximation step is justified. The author's suggestion is to set such details aside for now, just as we assumed $\boldsymbol{g}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$ in the previous section: calculate first, and if the result turns out to be reasonable, then the process must be reasonable to some extent. Now we calculate the numerator and denominator separately, this time under the more general setting $\mathbb{E}[\boldsymbol{g}]=\boldsymbol{\mu}$ and $\mathbb{E}[\boldsymbol{g}^2]=\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. The denominator is relatively simple:
\begin{equation}\begin{aligned} \mathbb{E}[\boldsymbol{v}_t] =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}\mathbb{E}[\boldsymbol{g}_i^2] \\ =&\, (1 - \beta_2)\sum_{i=1}^t \beta_2^{t-i}(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[6pt] =&\, (1 - \beta_2^t)(\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2) \\[8pt] =&\, \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2\qquad (t\to\infty) \end{aligned}\end{equation}As for the numerator, we could expand the square directly, but there is a slight shortcut: what we want is the second moment of $\boldsymbol{m}_t$, namely $\mathbb{E}[\boldsymbol{m}_t^2] = \mathbb{E}[\boldsymbol{m}_t]^2 + \mathbb{V}ar[\boldsymbol{m}_t]$. Since $\boldsymbol{m}_t$ is a weighted average of the $\boldsymbol{g}_i$ (the weights sum to $1-\beta_1^t\to 1$), we have $\mathbb{E}[\boldsymbol{m}_t]=\mathbb{E}[\boldsymbol{g}_i]=\boldsymbol{\mu}$. As for the variance, the $\boldsymbol{g}_i$ are independent, so their variances add with squared weights:
\begin{equation}\mathbb{V}ar[\boldsymbol{m}_t] = (1 - \beta_1)^2\sum_{i=1}^t \beta_1^{2(t-i)}\boldsymbol{\sigma}^2 = \frac{(1 - \beta_1)^2 (1 - \beta_1^{2t})}{1 - \beta_1^2}\boldsymbol{\sigma}^2= \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2\qquad (t\to\infty)\end{equation}Therefore:
\begin{equation}\mathbb{E}[\boldsymbol{u}_t^2]\approx \frac{\boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2}{\boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2}\end{equation}Since $\mathbb{E}[\boldsymbol{u}_t^2]$ is already a vector of squared values, to estimate $\Vert\boldsymbol{u}_t\Vert_{RMS}$, we only need to average the components and then take the square root. For the averaging step, let's use the mean-field approximation again (averaging the numerator and denominator separately). We finally obtain:
\begin{equation}\Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}} = \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}}{\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2 + 1}}\label{eq:mean-field}\end{equation}This has two influencing factors: first, $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$, which can be seen as the signal-to-noise ratio (SNR) of the gradient; second, $\beta_1$, one of Adam's hyperparameters. Notably, the result does not depend on $\beta_2$, which matches our earlier simulation results. So, how good is this approximation? Let's consider the simplest special case where $\boldsymbol{\mu}=\boldsymbol{0}$. Then:
\begin{equation}\Vert\boldsymbol{u}_t\Vert_{RMS} \approx \sqrt{\frac{1 - \beta_1}{1 + \beta_1}}\label{eq:special-case}\end{equation}Substitute $\beta_1=0.9$, and the result is $0.2294\dots$, which is surprisingly consistent with both simulation results and practical performance! Furthermore, several comparisons with simulation results are as follows:
[Figure: Update RMS, simulation vs. mean-field approximation, for various $\beta_1$ and $\beta_2$]
The approximation is quite good, especially once $\beta_2 \geq 0.9$, where the simulation results almost coincide with the mean-field prediction. As for the comparison across different SNRs, the results are as follows:
[Figure: Update RMS, simulation vs. mean-field approximation, for various $\beta_1$ and gradient SNR]
When the signal-to-noise ratio increases, the error of the mean-field approximation grows, but it still captures the overall trend. In practice, the gradient signal-to-noise ratio rarely gets anywhere near 1, so mean-field can still be considered a good approximation.
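As a quick check on the formula itself, here is a small sketch that evaluates the mean-field prediction $\eqref{eq:mean-field}$ for a few settings; its output can be compared against the simulate_update_rms helper sketched earlier (the function name is mine):
import numpy as np

def mean_field_rms(beta1, snr):
    """Mean-field prediction for Adam's Update RMS given beta1 and the gradient SNR."""
    return np.sqrt((snr + (1 - beta1) / (1 + beta1)) / (snr + 1))

for beta1 in (0.8, 0.9, 0.95):
    print(beta1, mean_field_rms(beta1, snr=0.0))
# beta1 = 0.9 with snr = 0 gives sqrt(0.1/1.9) ≈ 0.2294, matching the special case above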
If we have accepted the mean-field approximation $\eqref{eq:mean-field}$, we can conversely use it to estimate the signal-to-noise ratio of the gradient:
\begin{equation}\frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{u}_t\Vert_{RMS}^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{u}_t\Vert_{RMS}^2}\end{equation}In actual training, $\beta_1$ is given, and $\Vert\boldsymbol{u}_t\Vert_{RMS}$ (which is Adam's Update RMS) can also be directly estimated, so the above formula is computable. Of course, this formula only applies to Adam. Is there a more general estimation idea? Yes, there is! Don't forget that we previously estimated:
\begin{equation}\mathbb{E}[\boldsymbol{m}_t^2]\approx \boldsymbol{\mu}^2 + \frac{1 - \beta_1}{1 + \beta_1}\boldsymbol{\sigma}^2\end{equation}Then, summing its components and taking the square root, we consider it an approximation of $\Vert\boldsymbol{m}_t\Vert$:
\begin{equation}\Vert\boldsymbol{m}_t\Vert\approx \sqrt{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}\end{equation}As for the second moment, we have $\mathbb{E}[\boldsymbol{v}_t]\approx \boldsymbol{\mu}^2 + \boldsymbol{\sigma}^2$. Optimizers like Muon do not maintain a second moment, but notice that this result is independent of $\beta_2$, so we may as well take the simplest special case $\beta_2=0$, in which $\boldsymbol{v}_t=\boldsymbol{g}_t^2$. This may feel a bit forced, but for estimation purposes we use whatever is convenient. This "approximation" means $\Vert\boldsymbol{g}_t\Vert^2\approx \Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2$, so we have:
\begin{equation}\frac{\Vert\boldsymbol{m}_t\Vert}{\Vert\boldsymbol{g}_t\Vert}\approx \sqrt{\frac{\Vert\boldsymbol{\mu}\Vert^2 + \frac{1 - \beta_1}{1 + \beta_1}\Vert\boldsymbol{\sigma}\Vert^2}{\Vert\boldsymbol{\mu}\Vert^2 + \Vert\boldsymbol{\sigma}\Vert^2}}\end{equation}The right-hand side is exactly the same as in equation $\eqref{eq:mean-field}$, so we can write:
\begin{equation}\frac{\Vert\boldsymbol{\mu}\Vert^2}{\Vert\boldsymbol{\sigma}\Vert^2} \approx \frac{\Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2 - \frac{1 - \beta_1}{1 + \beta_1}}{1 - \Vert\boldsymbol{m}_t\Vert^2/\Vert\boldsymbol{g}_t\Vert^2}\end{equation}In other words, replacing $\Vert\boldsymbol{u}_t\Vert_{RMS}$ with $\Vert\boldsymbol{m}_t\Vert/\Vert\boldsymbol{g}_t\Vert$ gives a general recipe for estimating $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ for any optimizer with momentum. Some readers might ask: what if there is no momentum? Then there is really no way around it, because $\Vert\boldsymbol{\mu}\Vert^2/\Vert\boldsymbol{\sigma}\Vert^2$ is a statistic taken across optimization steps along the trajectory, and we need some form of cross-step statistical information to estimate it.
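To make both estimators concrete, here is a small sketch (assuming one has access to Adam's Update RMS, or to the momentum and gradient norms, logged during training; the function names are illustrative):
def snr_from_update_rms(update_rms, beta1=0.9):
    """Estimate ||mu||^2/||sigma||^2 from Adam's Update RMS (Adam-specific estimator)."""
    c = (1 - beta1) / (1 + beta1)
    return (update_rms**2 - c) / (1 - update_rms**2)

def snr_from_momentum(m_norm, g_norm, beta1=0.9):
    """General momentum-based estimator: replace the Update RMS by ||m_t|| / ||g_t||."""
    r2 = (m_norm / g_norm)**2
    c = (1 - beta1) / (1 + beta1)
    return (r2 - c) / (1 - r2)

print(snr_from_update_rms(0.25))  # an Update RMS of 0.25 implies a rather small SNR (~0.01)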
This article primarily explored Adam's Update RMS from two perspectives: simulation experiments and theoretical approximation. This can serve as one of the theoretical bases for aligning the Update RMS to 0.2 in the Muon optimizer.