By 苏剑林 | Feb 27, 2025
In this post, I will interpret our latest technical report "Muon is Scalable for LLM Training". It shares a relatively large-scale practical test of the Muon optimizer, which we previously introduced in "Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices". We have open-sourced the corresponding model (which we call "Moonlight", currently a 3B/16B MoE model). We reached a somewhat startling conclusion: under our experimental setup, Muon achieved nearly twice the training efficiency of Adam.

The Scaling Law of Muon and the MMLU performance of Moonlight
There is neither too much nor too little research on optimizers, so why did we pick Muon as a new direction to explore? How can hyperparameters already tuned for Adam be quickly carried over to Muon for experimentation? And once the model is scaled up, how does the performance gap between Muon and Adam manifest? Below, I will share our thought process.
Optimization Principle
Regarding optimizers, I actually gave a brief review in "Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices". Most optimizer improvements are essentially just small patches—not to say they are worthless, but they ultimately fail to give a profound and striking impression.
We need to start from principles closer to the essence to think about what constitutes a good optimizer. Intuitively, an ideal optimizer should have two characteristics: Stability and Speed. Specifically, every update step of an ideal optimizer should satisfy two points: 1. The perturbation to the model should be as small as possible; 2. The contribution to lowering the Loss should be as large as possible. To put it more directly, we don't want to change the model drastically (Stability), but we want to reduce the Loss significantly (Speed)—a classic case of "wanting both."
How do we translate these two characteristics into mathematical language? Stability can be understood as a constraint on the update amount, while Speed can be understood as finding the update amount that makes the loss function decrease fastest. Thus, this becomes a constrained optimization problem. Using the same notation as before, for a matrix parameter $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ with gradient $\boldsymbol{G}\in\mathbb{R}^{n\times m}$, when the parameter changes from $\boldsymbol{W}$ to $\boldsymbol{W}+\Delta\boldsymbol{W}$, the first-order change in the loss function is $\sum_{i,j} G_{i,j}\,\Delta W_{i,j}$, which in matrix notation is:
\begin{equation}\text{Tr}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})\end{equation}
Then, finding the update amount that delivers Speed under the premise of Stability can be expressed as:
\begin{equation}\mathop{\text{argmin}}_{\Delta\boldsymbol{W}}\text{Tr}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})\quad\text{s.t.}\quad \rho(\Delta\boldsymbol{W})\leq \eta\label{eq:least-action}\end{equation}
Here $\rho(\Delta\boldsymbol{W})\geq 0$ is a certain indicator of Stability (smaller values indicate more stability), and $\eta$ is a constant smaller than 1 representing our requirement for stability; we will later see that it is actually the learning rate of the optimizer. If readers don't mind, we can borrow a concept from theoretical physics and call the above principle the "Least Action Principle" for optimizers.
Matrix Norm
The only uncertainty in Eq. $\eqref{eq:least-action}$ is the metric for stability $\rho(\Delta\boldsymbol{W})$. Once $\rho(\Delta\boldsymbol{W})$ is selected, $\Delta\boldsymbol{W}$ can be explicitly solved (at least theoretically). To some extent, we can consider that the essential difference between different optimizers is that their definitions of Stability are different.
Many readers, when first learning SGD, will have seen statements like "the negative gradient direction is the direction of fastest local descent of the function value." Viewed through this framework, that statement amounts to choosing the Frobenius norm $\Vert\Delta\boldsymbol{W}\Vert_F$ as the metric of Stability. In other words, the "fastest descent direction" is not fixed once and for all; it is only determined after a metric is chosen, and under a different norm the negative gradient direction may no longer be the fastest.
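To make this concrete, here is a short derivation (standard, but worth spelling out). By the Cauchy-Schwarz inequality, $\text{Tr}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})\geq -\Vert\boldsymbol{G}\Vert_F\Vert\Delta\boldsymbol{W}\Vert_F\geq -\eta\Vert\boldsymbol{G}\Vert_F$, with equality if and only if $\Delta\boldsymbol{W}$ is antiparallel to $\boldsymbol{G}$. So under the constraint $\Vert\Delta\boldsymbol{W}\Vert_F\leq\eta$, the solution of Eq. $\eqref{eq:least-action}$ is
\begin{equation}\Delta\boldsymbol{W} = -\eta\,\frac{\boldsymbol{G}}{\Vert\boldsymbol{G}\Vert_F}\end{equation}
which is exactly (norm-normalized) SGD.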
The next natural question is: which norm measures Stability most appropriately? If we impose a constraint that is too strong, training will be stable, but the optimizer will struggle and only converge to a suboptimal solution; conversely, if the constraint is too weak and the optimizer is "let loose," the training process becomes extremely uncontrollable. The ideal, therefore, is to find the most accurate indicator of Stability. Considering that neural networks are dominated by matrix multiplications, let's take $\boldsymbol{y}=\boldsymbol{x}\boldsymbol{W}$ as an example:
\begin{equation}\Vert\Delta \boldsymbol{y}\Vert = \Vert\boldsymbol{x}(\boldsymbol{W} + \Delta\boldsymbol{W}) - \boldsymbol{x}\boldsymbol{W}\Vert = \Vert\boldsymbol{x} \Delta\boldsymbol{W}\Vert\leq \rho(\Delta\boldsymbol{W}) \Vert\boldsymbol{x}\Vert\end{equation}
The meaning of the above inequality is this: when parameters change from $\boldsymbol{W}$ to $\boldsymbol{W}+\Delta\boldsymbol{W}$, the model output changes by $\Delta\boldsymbol{y}$, and we want the magnitude of this change to be controlled by $\Vert\boldsymbol{x}\Vert$ times some function $\rho(\Delta\boldsymbol{W})$ of $\Delta\boldsymbol{W}$; that function serves as our indicator of Stability. From linear algebra, we know the tightest such choice of $\rho(\Delta\boldsymbol{W})$ is the spectral norm of $\Delta\boldsymbol{W}$ (its largest singular value), denoted $\Vert\Delta\boldsymbol{W}\Vert_2$. Substituting this into Eq. $\eqref{eq:least-action}$, we get:
\begin{equation}\mathop{\text{argmin}}_{\Delta\boldsymbol{W}}\text{Tr}(\boldsymbol{G}^{\top}\Delta\boldsymbol{W})\quad\text{s.t.}\quad \Vert\Delta\boldsymbol{W}\Vert_2\leq \eta\end{equation}
Solving this optimization problem results in Muon with $\beta=0$:
\begin{equation}\Delta\boldsymbol{W} = -\eta\, \text{msign}(\boldsymbol{G}) = -\eta\,\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}, \quad \boldsymbol{U},\boldsymbol{\Sigma},\boldsymbol{V}^{\top} = \mathop{\text{SVD}}(\boldsymbol{G})\end{equation}
When $\beta > 0$, $\boldsymbol{G}$ is replaced by the momentum $\boldsymbol{M}$. Since $\boldsymbol{M}$ can be seen as a smoother estimate of the gradient, the conclusion of the above equation still holds. Therefore, we can conclude that "Muon is the steepest descent under the spectral norm." As for the Newton-Schulz iteration and the like, they are computational approximations, which we won't detail here. The detailed derivation was already given in "Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices" and won't be repeated.
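For readers who want something concrete, below is a minimal sketch (my own illustration, not the report's code) of computing $\text{msign}$ in two ways: exactly via SVD, and approximately via a quintic Newton-Schulz iteration. The coefficients follow those circulated with the open-source Muon implementation, which is an assumption on my part rather than something taken from our report.

```python
import torch

def msign_svd(g: torch.Tensor) -> torch.Tensor:
    """Exact msign(G) = U V^T from the reduced SVD of G."""
    u, _, vh = torch.linalg.svd(g, full_matrices=False)
    return u @ vh

def msign_newton_schulz(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate msign(G) with a quintic Newton-Schulz iteration.
    Coefficients (a, b, c) follow the widely circulated open-source
    Muon code (an assumption here, not taken from the report)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)    # scale so all singular values are <= 1
    if g.shape[0] > g.shape[1]:  # transpose so x @ x.T is the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

# The two agree roughly on a random matrix; Newton-Schulz only pushes
# singular values near 1, which is enough in practice.
g = torch.randn(256, 512)
err = (msign_svd(g) - msign_newton_schulz(g)).norm() / msign_svd(g).norm()
print(f"relative error: {err:.3f}")
```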
Weight Decay
At this point, we can answer the first question: Why choose to try Muon? Because like SGD, Muon gives the direction of steepest descent, but its spectral norm constraint is more accurate than SGD's $F$-norm, so it has better potential. On the other hand, improving an optimizer from the perspective of "choosing the most appropriate constraint for different parameters" seems more essential than various patch-like modifications.
Of course, potential does not equal capability, and there are "traps" in verifying Muon on larger models. First is the Weight Decay issue. Although we included Weight Decay when introducing Muon in the previous post, the original proposal of Muon did not have it. We initially followed the official implementation, only to find that Muon converged fast in the early stages but was soon caught up by Adam, while various "internal metrics" showed signs of blowing up.
We soon realized this might be a Weight Decay issue, so we added it:
\begin{equation}\Delta\boldsymbol{W} = -\eta\, [\text{msign}(\boldsymbol{M})+ \lambda \boldsymbol{W}]\end{equation}
Continuing the experiment, as expected, Muon consistently maintained a lead over Adam, as shown in Figure 2 of the paper:

Comparison of results with and without Weight Decay
What role does Weight Decay play? Post-hoc analysis suggests that the key may be keeping the parameter norm bounded:
\begin{align}
\Vert\boldsymbol{W}_t\Vert = &\, \Vert\boldsymbol{W}_{t-1} - \eta_t (\boldsymbol{\Phi}_t + \lambda \boldsymbol{W}_{t-1})\Vert \nonumber\\[5pt]
= &\, \Vert(1 - \eta_t \lambda)\boldsymbol{W}_{t-1} - \eta_t \lambda (\boldsymbol{\Phi}_t/\lambda)\Vert \nonumber\\[5pt]
\leq &\,(1 - \eta_t \lambda)\Vert\boldsymbol{W}_{t-1}\Vert + \eta_t \lambda \Vert\boldsymbol{\Phi}_t/\lambda\Vert \nonumber\\[5pt]
\leq &\, \max(\Vert\boldsymbol{W}_{t-1}\Vert,\Vert\boldsymbol{\Phi}_t/\lambda\Vert)
\end{align}
Here $\Vert\cdot\Vert$ can be any matrix norm (the inequality chain holds for all of them; note that the last step uses $0\leq\eta_t\lambda\leq 1$, so the bound is a convex combination), and $\boldsymbol{\Phi}_t$ is the update matrix provided by the optimizer, which for Muon is $\text{msign}(\boldsymbol{M})$. Taking the spectral norm, we have $\Vert\text{msign}(\boldsymbol{M})\Vert_2 = 1$, and thus for Muon:
\begin{equation}
\Vert\boldsymbol{W}_t\Vert_2 \leq \max(\Vert\boldsymbol{W}_{t-1}\Vert_2,1/\lambda)\leq\cdots \leq \max(\Vert\boldsymbol{W}_0\Vert_2,1/\lambda)\end{equation}
This ensures the "internal health" of the model, because $\Vert\boldsymbol{x}\boldsymbol{W}\Vert\leq \Vert\boldsymbol{x}\Vert\Vert\boldsymbol{W}\Vert_2$. By controlling $\Vert\boldsymbol{W}\Vert_2$, $\Vert\boldsymbol{x}\boldsymbol{W}\Vert$ is also controlled, meaning there is no risk of explosion, which is particularly important for issues like Attention Logits explosion. Of course, this upper bound is quite loose in most cases; in practice, the spectral norm of parameters is usually much smaller than this bound. This inequality simply demonstrates that Weight Decay has the property of controlling the norm.
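As a small numerical illustration of this bound (my own sketch, with an idealized update in place of a real training run), the snippet below applies updates of the form $-\eta(\boldsymbol{\Phi}_t + \lambda\boldsymbol{W}_{t-1})$ with $\Vert\boldsymbol{\Phi}_t\Vert_2=1$ and checks that the spectral norm stays below $\max(\Vert\boldsymbol{W}_0\Vert_2, 1/\lambda)$:

```python
import torch

torch.manual_seed(0)
n, m, eta, lam = 64, 64, 0.02, 0.1
w = torch.randn(n, m) * 0.02

def random_unit_spectral(n: int, m: int) -> torch.Tensor:
    """A stand-in for msign(M): an orthogonal-factor matrix whose
    spectral norm is exactly 1."""
    u, _, vh = torch.linalg.svd(torch.randn(n, m), full_matrices=False)
    return u @ vh

bound = max(torch.linalg.matrix_norm(w, 2).item(), 1 / lam)
for _ in range(2000):
    w = w - eta * (random_unit_spectral(n, m) + lam * w)

# The bound holds; in practice the norm sits far below it (the bound is loose).
print(torch.linalg.matrix_norm(w, 2).item(), "<=", bound)
```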
RMS Alignment
When we decide to try a new optimizer, a major headache is how to quickly find near-optimal hyperparameters; Muon has at least two, the learning rate $\eta_t$ and the weight decay rate $\lambda$. Grid search is an option but is time-consuming. Here, we propose a hyperparameter-transfer idea based on aligning the update RMS, which allows hyperparameters tuned for Adam to be reused with other optimizers.
First, for a matrix $\boldsymbol{W}\in\mathbb{R}^{n\times m}$, its RMS (Root Mean Square) is defined as:
\begin{equation}\text{RMS}(\boldsymbol{W}) = \frac{\Vert \boldsymbol{W}\Vert_F}{\sqrt{nm}} = \sqrt{\frac{1}{nm}\sum_{i=1}^n\sum_{j=1}^m W_{i,j}^2}\end{equation}
Simply put, RMS measures the average size of each element in the matrix. We observe that the RMS of Adam's update amount is relatively stable, usually between 0.2 and 0.4, which is why theoretical analyses often use SignSGD as an approximation of Adam. Based on this, we suggest aligning the Update RMS of the new optimizer to 0.2 via RMS Norm:
\begin{gather}
\boldsymbol{W}_t =\boldsymbol{W}_{t-1} - \eta_t (\boldsymbol{\Phi}_t + \lambda \boldsymbol{W}_{t-1}) \nonumber\\[6pt]
\downarrow \notag\\[6pt]
\boldsymbol{W}_t = \boldsymbol{W}_{t-1} - \eta_t (0.2\, \boldsymbol{\Phi}_t/\text{RMS}(\boldsymbol{\Phi}_t) + \lambda \boldsymbol{W}_{t-1})
\end{gather}
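In code, this transformation is a one-liner. A minimal sketch (the function names are my own choices for illustration):

```python
import torch

def rms(w: torch.Tensor) -> torch.Tensor:
    """RMS(W) = ||W||_F / sqrt(n*m), the root mean square of the entries."""
    return w.norm() / w.numel() ** 0.5

def rms_aligned_step(w, phi, lr, weight_decay, target_rms=0.2):
    """One decoupled-weight-decay step, with the optimizer's update phi
    rescaled so its RMS matches Adam's typical ~0.2."""
    phi = phi * (target_rms / (rms(phi) + 1e-12))
    return w - lr * (phi + weight_decay * w)
```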
In this way, we can reuse Adam's $\eta_t$ and $\lambda$ and achieve roughly the same per-step update magnitude for every parameter. Practice shows that migrating from Adam to Muon with this simple strategy yields results significantly better than Adam's, close to what a further fine-grained search over Muon's hyperparameters achieves. Better still, for Muon $\text{RMS}(\boldsymbol{\Phi}_t)=\text{RMS}(\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top})$ can even be calculated analytically:
\begin{equation}nm\,\text{RMS}(\boldsymbol{\Phi}_t)^2 = \big\Vert\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}\big\Vert_F^2 = \sum_{k=1}^r\left(\sum_{i=1}^n U_{i,k}^2\right)\left(\sum_{j=1}^m V_{j,k}^2\right) = \sum_{k=1}^r 1 = r\end{equation}
where the cross terms vanish because the columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ are orthonormal.
That is, $\text{RMS}(\boldsymbol{\Phi}_t) = \sqrt{r/nm}$. In practice, the probability of a matrix being strictly low-rank is small, so we can assume $r = \min(n,m)$, thus $\text{RMS}(\boldsymbol{\Phi}_t) = \sqrt{1/\max(n,m)}$. Therefore, we ended up not using RMS Norm but rather its equivalent analytical version:
\begin{equation}\boldsymbol{W}_t = \boldsymbol{W}_{t-1} - \eta_t (0.2\, \boldsymbol{\Phi}_t\,\sqrt{\max(n,m)} + \lambda \boldsymbol{W}_{t-1})\label{eq:muon-rms-aligned}\end{equation}
The final equation, Eq. $\eqref{eq:muon-rms-aligned}$, indicates that in Muon it is not appropriate for all parameters to share the same raw update scale. For example, Moonlight is an MoE model, and many of its matrix parameters have shapes far from square, so $\max(n,m)$ spans a wide range; a single unrescaled learning rate would inevitably make some parameters learn too fast and others too slow, hurting the final result.
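Putting the pieces together, here is a minimal sketch of one RMS-aligned Muon step per Eq. $\eqref{eq:muon-rms-aligned}$, using an exact SVD-based $\text{msign}$ for clarity (a real implementation would use momentum accumulation and Newton-Schulz; all names here are my own):

```python
import torch

def muon_step(w: torch.Tensor, m: torch.Tensor, lr: float, wd: float) -> torch.Tensor:
    """W <- W - lr * (0.2 * sqrt(max(n, m)) * msign(M) + wd * W)."""
    u, _, vh = torch.linalg.svd(m, full_matrices=False)
    phi = u @ vh                       # msign(M), spectral norm 1
    scale = 0.2 * max(w.shape) ** 0.5  # aligns the update RMS to ~0.2
    return w - lr * (scale * phi + wd * w)

# Sanity check of the analytic RMS: for a full-rank M of shape (n, m),
# RMS(msign(M)) = sqrt(1/max(n, m)), so the scaled update has RMS ~0.2.
m = torch.randn(1024, 256)
u, _, vh = torch.linalg.svd(m, full_matrices=False)
phi = u @ vh
print(phi.norm() / phi.numel() ** 0.5)                              # ~ sqrt(1/1024)
print(0.2 * max(m.shape) ** 0.5 * phi.norm() / phi.numel() ** 0.5)  # ~ 0.2
```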
Experimental Analysis
We compared Adam and Muon on a 2.4B/16B scale MoE and found that Muon has obvious advantages in terms of convergence speed and final results. For detailed comparison results, I recommend reading the original paper; I will only share some excerpts here.
Github: https://github.com/MoonshotAI/Moonlight
First is a relatively objective comparison table. It includes our own controlled-variable comparison of Muon against Adam, as well as a comparison with a model of the same architecture trained externally by DeepSeek using Adam (to make the comparison easy, Moonlight's architecture is identical to DSV3-Small). It demonstrates Muon's distinct advantage:

Comparison of Muon (Moonlight) with Adam (Moonlight-A and DSV3-Small)
What is different about a model trained with Muon? Since we earlier said Muon is the steepest descent under the spectral norm, and the spectral norm is the largest singular value, we thought of monitoring and analyzing singular values. Sure enough, we found some interesting signals. The parameters trained by Muon have a relatively more uniform distribution of singular values. We use singular value entropy to quantitatively describe this phenomenon:
\begin{equation}H(\boldsymbol{\sigma}) = -\frac{1}{\log n}\sum_{i=1}^n \frac{\sigma_i^2}{\sum_{j=1}^n\sigma_j^2}\log \frac{\sigma_i^2}{\sum_{j=1}^n\sigma_j^2}\end{equation}
Here $\boldsymbol{\sigma}=(\sigma_1,\sigma_2,\cdots,\sigma_n)$ is the vector of all singular values of a parameter matrix. Parameters trained with Muon have higher entropy, meaning their singular value distribution is more uniform and the parameter is therefore harder to compress. This suggests that Muon makes better use of the parameters' capacity:

Weights trained by Muon have higher singular value entropy
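For reference, the singular value entropy above is easy to compute; a small sketch (my own code, not the report's):

```python
import math
import torch

def singular_value_entropy(w: torch.Tensor) -> float:
    """Entropy of the squared-singular-value distribution, normalized by
    log(n): 1.0 means perfectly uniform singular values, 0.0 means rank one."""
    s2 = torch.linalg.svdvals(w) ** 2
    p = s2 / s2.sum()
    h = -(p * torch.log(p.clamp_min(1e-12))).sum()
    return (h / math.log(s2.numel())).item()

# A random Gaussian matrix has fairly uniform singular values (entropy near 1),
# while a rank-one matrix has entropy near 0.
print(singular_value_entropy(torch.randn(512, 512)))
print(singular_value_entropy(torch.outer(torch.randn(512), torch.randn(512))))
```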
Another interesting finding is that when we use Muon for Supervised Fine-Tuning (SFT), we might get a suboptimal solution if the pre-training did not use Muon. Specifically, if both pre-training and fine-tuning use Muon, then the performance is best. However, for the other three combinations (Adam+Muon, Muon+Adam, Adam+Adam), the relative advantages did not show a clear pattern.

Combination tests of Muon/Adam for pre-training/fine-tuning

Attempts to fine-tune open-source models with Muon/Adam
This phenomenon suggests that some special initializations are unfavorable for Muon. Conversely, some initializations may be more favorable for Muon. We are still exploring the deeper underlying principles.
Further Thoughts
Overall, in our experiments Muon appears very competitive with Adam. For a new optimizer whose form differs so substantially from Adam's, this performance is not merely "noteworthy"; it suggests that Muon may have captured some essential properties.
Previously, a viewpoint circulated in the community: Adam performs well because mainstream model architecture improvements are "overfitting" Adam. This viewpoint probably originated from "Neural Networks (Maybe) Evolved to Make Adam The Best Optimizer". It sounds a bit absurd, but it is actually profound. Think about it: when we try to improve a model, we train it with Adam to see the result. If the result is good, we keep the change; otherwise, we discard it. But is this "good effect" because it is essentially better, or because it matches Adam better?
This is food for thought. While not all work is like this, certainly at least some changes show better results because they pair better with Adam. Over time, model architectures will evolve in a direction that favors Adam. In this context, an optimizer significantly different from Adam managing to "break through" is especially worth noting and contemplating. Note that neither I nor my company is the proposer of Muon, so these comments are purely "heartfelt words," not self-promotion.
What's next for Muon? There is actually quite a lot of work to be done. For example, the issue mentioned above where "Adam pre-training + Muon fine-tuning" performs poorly—further analysis is necessary and valuable. After all, open-source model weights are currently basically trained with Adam; if Muon doesn't work for fine-tuning, it will inevitably affect its adoption. Of course, we can also take this opportunity to deepen our understanding of Muon (learning through bugs).
There is also an extension to consider: Muon is based on the spectral norm, and the spectral norm is the largest singular value. In fact, a whole family of norms can be built from singular values, such as the Schatten norms. Generalizing Muon to this broader family and tuning within it could, in principle, yield even better results. Furthermore, after the release of Moonlight, some readers asked how to design µP (maximal update parametrization) under Muon; this, too, is a problem that urgently needs solving.
Summary
This post introduced our relatively large-scale practice of the Muon optimizer (Moonlight) and shared our latest thoughts on the Muon optimizer.