By 苏剑林 | September 15, 2025
In the previous two articles, "Rethinking Learning Rate and Batch Size (Part 1): Status Quo" and "Rethinking Learning Rate and Batch Size (Part 2): Mean Field", we proposed the mean-field method to simplify the calculations relating the learning rate to the batch size. There we analyzed the SGD, SignSGD, and SoftSignSGD optimizers; the main goal was simplification, and no essentially new conclusions emerged. Still, in today's feast of optimizers, how could there not be a seat for Muon? In this article, we therefore attempt the corresponding calculation for Muon, to see whether the relationship between its learning rate and batch size exhibits any new patterns.
As is well known, the primary characteristic of Muon is its non-element-wise update rule. Therefore, the element-wise calculation methods used in "How Should the Learning Rate Change as Batch Size Increases?" and "How Does Adam's epsilon Affect the Learning Rate Scaling Law?" will be completely inapplicable. Fortunately, the mean-field method introduced in the previous article remains effective, requiring only a slight adjustment in detail.
First, let us introduce some notation. Let the loss function be $\mathcal{L}(\boldsymbol{W})$, where $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ is a weight matrix (assume $n\geq m$), and let $\boldsymbol{G}$ be its gradient. Denote the gradient of a single sample by $\tilde{\boldsymbol{G}}$; its mean is $\boldsymbol{G}$ and its variance is $\sigma^2$. When the batch size is $B$, the batch gradient is denoted $\tilde{\boldsymbol{G}}_B$; its mean is still $\boldsymbol{G}$, but its variance becomes $\sigma^2/B$. Note that the variance here is just a scalar $\sigma^2$, unlike before, where we considered the full covariance matrix.
The core reason for this simplification is that the random variable here is already a matrix, so its corresponding covariance matrix would actually be a 4th-order tensor, which is cumbersome to discuss. Will simplifying it to a single scalar severely compromise accuracy? Actually, no. In the previous two articles, although we considered the complete covariance matrix $\boldsymbol{\Sigma}$, a closer look revealed that the final results only depended on $\newcommand{tr}{\mathop{\text{tr}}}\tr(\boldsymbol{\Sigma})$, which is equivalent to simplifying it to a scalar from the beginning.
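To make this noise model concrete, here is a minimal numerical sketch (the sizes, $\sigma$, and $B$ are made up purely for illustration): per-sample gradients are modeled as $\boldsymbol{G}$ plus i.i.d. noise with per-entry variance $\sigma^2$, and averaging a batch of size $B$ keeps the mean at $\boldsymbol{G}$ while shrinking the per-entry variance to $\sigma^2/B$.

```python
import numpy as np

# Illustrative noise model (all numbers made up): per-sample gradient = G + noise
# with per-entry variance sigma^2; the batch gradient averages B such samples.
rng = np.random.default_rng(0)
n, m, sigma, B, trials = 8, 4, 0.5, 32, 5000
G = rng.standard_normal((n, m))

per_sample = G + sigma * rng.standard_normal((trials, B, n, m))
G_B = per_sample.mean(axis=1)                        # one batch gradient per trial

print(np.allclose(G_B.mean(axis=0), G, atol=0.02))   # mean stays G
print(G_B.var(axis=0).mean(), sigma**2 / B)          # per-entry variance ~ sigma^2 / B
```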
Similarly, let the update amount be $-\eta\tilde{\boldsymbol{\Phi}}_B$. Consider the second-order expansion of the loss function:
\begin{equation}\mathcal{L}(\boldsymbol{W} - \eta\tilde{\boldsymbol{\Phi}}_B) \approx \mathcal{L}(\boldsymbol{W}) - \eta \tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{G}) + \frac{1}{2}\eta^2\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)\label{eq:loss-2}\end{equation}The first two terms should be easy to understand; the third is harder. As with the covariance, the Hessian $\boldsymbol{H}$ here is really a 4th-order tensor, which is unwieldy to work with. The simplest entry point is the linear-operator perspective: treat $\boldsymbol{H}$ as a linear operator whose input and output are both matrices. We do not need to know what $\boldsymbol{H}$ looks like, nor how $\boldsymbol{H}$ acts on $\tilde{\boldsymbol{\Phi}}_B$; we only need to know that $\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B$ is linear in $\tilde{\boldsymbol{\Phi}}_B$. In this way, the objects we handle remain matrices, with no extra mental burden. Any linear operator satisfying the required conditions can serve as an approximation of the Hessian, without ever writing out an explicit high-order tensor.
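As a side illustration of this operator view (not part of the original derivation), the sketch below uses a toy quadratic loss $\mathcal{L}(\boldsymbol{W})=\frac{1}{2}\tr(\boldsymbol{W}^{\top}\boldsymbol{A}\boldsymbol{W}\boldsymbol{B})$ with symmetric $\boldsymbol{A}$ and $\boldsymbol{B}$, whose Hessian acts on matrices as $\boldsymbol{X}\mapsto\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$; for such a quadratic, the expansion above is exact.

```python
import numpy as np

# Toy quadratic loss whose Hessian is naturally a linear operator on matrices.
# A, B, and the loss itself are made up solely to illustrate the operator view.
rng = np.random.default_rng(1)
n, m = 6, 4
A = rng.standard_normal((n, n)); A = A @ A.T          # symmetric PSD, n x n
B = rng.standard_normal((m, m)); B = B @ B.T          # symmetric PSD, m x m

loss = lambda W: 0.5 * np.trace(W.T @ A @ W @ B)
grad = lambda W: A @ W @ B                            # gradient of the toy loss
H_op = lambda X: A @ X @ B                            # Hessian as an operator X -> A X B

W, Phi, eta = rng.standard_normal((n, m)), rng.standard_normal((n, m)), 0.1
lhs = loss(W - eta * Phi)
rhs = loss(W) - eta * np.trace(Phi.T @ grad(W)) + 0.5 * eta**2 * np.trace(Phi.T @ H_op(Phi))
print(np.isclose(lhs, rhs))                           # exact here because the loss is quadratic
```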
The protagonist of this article is Muon, and for the calculation we take $\tilde{\boldsymbol{\Phi}}_B=\newcommand{msign}{\mathop{\text{msign}}}\msign(\tilde{\boldsymbol{G}}_B)$ as an approximation of its update (ignoring momentum). By definition, we can write $\msign(\tilde{\boldsymbol{G}}_B)=\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}$. From the Newton-method perspective, the update $-\eta_{\max}\msign(\boldsymbol{G}) = -\eta_{\max}\boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$ can be identified with the Newton step $-\boldsymbol{H}^{-1}\boldsymbol{G}$, which amounts to assuming $\boldsymbol{H}^{-1}\boldsymbol{X} = \eta_{\max}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$, and hence $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$; this will be used in the calculations below.
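The following sketch (assuming NumPy; $\boldsymbol{G}$ is just a random matrix) computes $\msign$ via the thin SVD, checks the closed form $\boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G})^{-1/2}$, and verifies that under the assumed inverse-Hessian operator the Newton step $\boldsymbol{H}^{-1}\boldsymbol{G}$ indeed equals $\eta_{\max}\msign(\boldsymbol{G})$.

```python
import numpy as np

# msign via the thin SVD: for G = U diag(s) V^T, msign(G) = U V^T.
rng = np.random.default_rng(2)
n, m = 8, 5
G = rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(G, full_matrices=False)
msign_G = U @ Vt

# Equivalent closed form G (G^T G)^{-1/2}, valid when G has full column rank.
w, Q = np.linalg.eigh(G.T @ G)
GtG_inv_sqrt = Q @ np.diag(w ** -0.5) @ Q.T
print(np.allclose(msign_G, G @ GtG_inv_sqrt))

# Hessian hypothesis H X = eta_max^{-1} X (G^T G)^{1/2}: the Newton step
# H^{-1} G then equals eta_max * msign(G), i.e. Muon's update direction.
eta_max = 0.1
H_inv = lambda X: eta_max * X @ GtG_inv_sqrt
print(np.allclose(H_inv(G), eta_max * msign_G))
```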
Taking the expectation of both sides of Eq. $\eqref{eq:loss-2}$, we get:
\begin{equation}\mathbb{E}[\mathcal{L}(\boldsymbol{W} - \eta\tilde{\boldsymbol{\Phi}}_B)] \approx \mathcal{L}(\boldsymbol{W}) - \eta \tr(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G}) + \frac{1}{2}\eta^2\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]\end{equation}First, determine $\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]$:
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]=\mathbb{E}[\tilde{\boldsymbol{G}}_B(\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B)^{-1/2}]\approx\mathbb{E}[\tilde{\boldsymbol{G}}_B](\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B])^{-1/2} = \boldsymbol{G}(\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B])^{-1/2}\end{equation}We write out $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]$ component-wise and assume independence between different components:
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]_{i,j} = \mathbb{E}\left[\sum_{k=1}^n (\tilde{G}_B)_{k,i}(\tilde{G}_B)_{k,j}\right] = \left\{\begin{aligned} \mathbb{E}\left[\sum_{k=1}^n (\tilde{G}_B)_{k,i}^2\right] = \left(\sum_{k=1}^n G_{k,i}^2\right) + n\sigma^2/B,\quad (i=j) \\[6pt] \sum_{k=1}^n \mathbb{E}[(\tilde{G}_B)_{k,i}] \mathbb{E}[(\tilde{G}_B)_{k,j}] = \sum_{k=1}^n G_{k,i}G_{k,j},\quad (i\neq j) \end{aligned}\right.\end{equation}Combining these gives $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]=\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B) \boldsymbol{I}$, thus:
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]\approx \boldsymbol{G}(\boldsymbol{G}^{\top}\boldsymbol{G} + (n\sigma^2/B) \boldsymbol{I})^{-1/2} = \msign(\boldsymbol{G})(\boldsymbol{I} + (n\sigma^2/B) (\boldsymbol{G}^{\top}\boldsymbol{G})^{-1})^{-1/2}\end{equation}To further simplify the dependency on $B$, we approximate $\boldsymbol{G}^{\top}\boldsymbol{G}$ using $\tr(\boldsymbol{G}^{\top}\boldsymbol{G})\boldsymbol{I}/m$. That is, we only keep the diagonal part of $\boldsymbol{G}^{\top}\boldsymbol{G}$ and then replace those diagonal elements with their average. In this way, we obtain:
\begin{equation}\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]\approx \msign(\boldsymbol{G})(1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}\end{equation}where $\mathcal{B}_{\text{simple}} = mn\sigma^2/\tr(\boldsymbol{G}^{\top}\boldsymbol{G})= mn\sigma^2/\Vert\boldsymbol{G}\Vert_F^2$. This is actually identical to treating $\boldsymbol{G}$ as a vector and calculating $\mathcal{B}_{\text{simple}}$ as in the previous two articles. The form of the above equation is exactly the same as for SignSGD. From this, we can guess that Muon will not exhibit many new results regarding the relationship between learning rate and batch size.
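Both approximations can be probed with a quick Monte Carlo sketch (toy sizes and Gaussian noise with scalar per-entry variance $\sigma^2$; the numbers are illustrative, not from any experiment). The identity for $\mathbb{E}[\tilde{\boldsymbol{G}}{}_B^{\top}\tilde{\boldsymbol{G}}_B]$ should hold up to sampling noise, whereas the shrinkage factor $(1 + \mathcal{B}_{\text{simple}}/B)^{-1/2}$ is only a mean-field estimate, so the last two printed numbers are expected to agree only roughly.

```python
import numpy as np

# Monte Carlo sketch of the two approximations above (all numbers illustrative).
rng = np.random.default_rng(3)
n, m, sigma, B, trials = 64, 32, 1.0, 16, 4000
G = 0.1 * rng.standard_normal((n, m))

def msign(X):
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

E_gtg = np.zeros((m, m))
E_phi = np.zeros((n, m))
for _ in range(trials):
    G_B = G + (sigma / np.sqrt(B)) * rng.standard_normal((n, m))  # batch gradient
    E_gtg += G_B.T @ G_B / trials
    E_phi += msign(G_B) / trials

# E[G_B^T G_B] = G^T G + (n sigma^2 / B) I, up to Monte Carlo noise.
print(np.allclose(E_gtg, G.T @ G + (n * sigma**2 / B) * np.eye(m), atol=0.1))

# Mean-field shrinkage: compare the alignment tr(msign(G)^T E[Phi_B]) / m
# with the predicted factor (1 + B_simple / B)^{-1/2} (rough agreement only).
B_simple = m * n * sigma**2 / np.linalg.norm(G, 'fro')**2
print(np.trace(msign(G).T @ E_phi) / m, 1 / np.sqrt(1 + B_simple / B))
```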
As for $\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]$, we evaluate it only under the Hessian hypothesis for Muon derived earlier, namely $\boldsymbol{H}\boldsymbol{X} = \eta_{\max}^{-1}\boldsymbol{X}(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2}$. Then:
\begin{equation}\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B) = \eta_{\max}^{-1}\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B(\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2})\end{equation}Notice that $\tilde{\boldsymbol{\Phi}}_B$ is the output of $\msign$, so its columns are orthonormal (it is a semi-orthogonal matrix, provided $\tilde{\boldsymbol{G}}_B$ has full column rank, which holds generically when $n \ge m$), and hence $\tilde{\boldsymbol{\Phi}}{}_B^{\top}\tilde{\boldsymbol{\Phi}}_B=\boldsymbol{I}$. Therefore, in this case, $\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)$ is the fixed constant $\eta_{\max}^{-1}\tr((\boldsymbol{G}^{\top}\boldsymbol{G})^{1/2})=\eta_{\max}^{-1}\tr(\msign(\boldsymbol{G})^{\top}\boldsymbol{G})$. Minimizing the quadratic approximation of the expected loss over $\eta$ then gives the optimal learning rate:
\begin{equation}\eta^* \approx \frac{\tr(\mathbb{E}[\tilde{\boldsymbol{\Phi}}_B]^{\top}\boldsymbol{G})}{\mathbb{E}[\tr(\tilde{\boldsymbol{\Phi}}{}_B^{\top}\boldsymbol{H}\tilde{\boldsymbol{\Phi}}_B)]}\approx \frac{\eta_{\max}}{\sqrt{1 + \mathcal{B}_{\text{simple}}/B}}\end{equation}As expected, the result is identical in form to the SignSGD result, with no new patterns discovered.
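To get a feel for what this implies, here is a tiny sketch with made-up values of $\eta_{\max}$ and $\mathcal{B}_{\text{simple}}$: when $B\ll\mathcal{B}_{\text{simple}}$ the formula reduces to square-root scaling $\eta^*\approx\eta_{\max}\sqrt{B/\mathcal{B}_{\text{simple}}}$, and when $B\gg\mathcal{B}_{\text{simple}}$ it saturates at $\eta_{\max}$, just as for SignSGD.

```python
import numpy as np

# eta*(B) = eta_max / sqrt(1 + B_simple / B); eta_max and B_simple are made up.
eta_max, B_simple = 0.02, 4096

def eta_star(B):
    return eta_max / np.sqrt(1.0 + B_simple / B)

for B in [64, 256, 1024, 4096, 16384, 65536]:
    # third column: small-B approximation eta_max * sqrt(B / B_simple)
    print(B, eta_star(B), eta_max * np.sqrt(B / B_simple))
```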
Actually, on reflection, this is quite reasonable. SignSGD applies $\newcommand{sign}{\mathop{\text{sign}}}\sign$ directly to the gradient, whereas Muon's $\msign$ applies $\sign$ to the singular values; intuitively, it is the same $\sign$ operation in a different coordinate system. What this brings is a new matrix update rule, while the learning rate $\eta^*$ and the batch size $B$ are merely scalars. Since both optimizers rest on the same core premise of $\sign$, it is very likely that the asymptotic relationship between these two scalars does not change much.
Of course, we have only calculated for one special $\boldsymbol{H}$. If a more general $\boldsymbol{H}$ is considered, the "Surge" phenomenon might appear as it does with SignSGD, where "as batch size increases, the learning rate should instead decrease." But as we discussed in the "Reflections on Causes" section of the previous article, if a Surge phenomenon is truly observed, it might suggest that the optimizer should be changed rather than adjusting the relationship between $\eta^*$ and $B$.
In this article, we attempted to analyze Muon using a simple mean-field approximation. The conclusion is that the relationship between its learning rate and batch size is consistent with SignSGD, with no new patterns found.