Muon Optimizer Guide: Quick Start and Key Details

By 苏剑林 | November 19, 2025

Recently, many readers have likely come across news regarding the Muon optimizer. In fact, Muon was first proposed about a year ago, around October last year, by Keller Jordan on Twitter. However, within just this year, Muon has already undergone the test of training models with tens of billions, hundreds of billions, and even trillions of parameters, indicating it is a highly competitive optimizer.

Nowadays, Muon has been built into training frameworks such as Torch and Keras, and even large-scale frameworks like Megatron are gradually starting to support it, which means it has gained fairly wide industry recognition. However, for readers who are only familiar with Adam, how to switch to Muon quickly and effectively may still be confusing, so this article attempts to provide a quick-start tutorial.

Brief Introduction

The formal proposer of Muon is Keller Jordan, who currently works at OpenAI. As mentioned at the beginning, Muon was first published on Twitter, and even now the author has only written a blog post, "Muon: An optimizer for hidden layers in neural networks", rather than a formal paper. In the author's own words, whether it is written up as a paper has nothing to do with whether the optimizer is effective.

Muon is an optimizer specifically tailored for matrix parameters. There are related works with similar characteristics, such as Shampoo and the earlier Stochastic Spectral Descent. Many works can be more or less associated with Muon, but none of them fully covers it, so in my view Muon is a genuinely new piece of work.

In China, the first article to popularize Muon was likely my blog post "Muon Optimizer Appreciation: A Fundamental Leap from Vectors to Matrices", and the first verification of Muon on a relatively large scale was likely the Moonlight we released in February. The Muon variant proposed in Moonlight was later used in the trillion-parameter K2. Following K2, GLM-4.5 also utilized this Muon variant.

As Jeremy Bernstein, one of Muon's authors, put it in his blog post "Deriving Muon", what makes Muon unique in my eyes is that it can be derived from more fundamental optimization principles and, at the same time, works well in practice. Adam, by contrast, is also very effective, but it feels more like a heuristic solution.

Four Versions

This article does not intend to introduce the mathematical details of Muon, nor its implementation, but rather focuses on the technical details and precautions for switching from Adam to Muon. As stated, Muon is dedicated to matrix parameter optimization and uses a non-element-wise update rule, which can be confusing for new users.

Furthermore, as far as I know, there are currently at least four slightly different versions of Muon, and this multi-version situation contributes to the confusion. If users do not understand the details, they may get poor results because of incorrect hyperparameter tuning (especially the learning rate). Below, I will clarify these points. First, for a matrix parameter $\boldsymbol{W} \in \mathbb{R}^{d_{in} \times d_{out}}$ with gradient $\boldsymbol{G}_t$ at step $t$, the four Muon variants are:

\begin{align}
\newcommand{\msign}{\mathop{\text{msign}}}
\boldsymbol{M}_t =&\, \beta \boldsymbol{M}_{t-1} + \boldsymbol{G}_t \\[7pt]
\boldsymbol{W}_t =&\, \boldsymbol{W}_{t-1} - \eta_t \left(\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{Naive Version})} \\[5pt]
\boldsymbol{W}_t =&\, \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{\max(1, d_{out}/d_{in})}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{KellerJordan Version})} \\[5pt]
\boldsymbol{W}_t =&\, \boldsymbol{W}_{t-1} - \eta_t \left(\sqrt{d_{out}/d_{in}}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{MuP Version})} \\[5pt]
\boldsymbol{W}_t =&\, \boldsymbol{W}_{t-1} - \eta_t \left(0.2\times\sqrt{\max(d_{out},d_{in})}\,\msign(\boldsymbol{M}_t) + \lambda \boldsymbol{W}_{t-1}\right) \quad &\color{skyblue}{(\text{Moonlight Version})}
\end{align}

If Nesterov momentum is used, $\msign(\boldsymbol{M}_t)$ is replaced by $\msign(\beta\boldsymbol{M}_t + \boldsymbol{G}_t)$. The $\msign$ is usually named zeropower_via_newtonschulz in implementations; ordinary users do not need to worry about the specific implementation details.
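For readers who are curious anyway, below is a minimal sketch of what zeropower_via_newtonschulz typically looks like. It is based on the commonly circulated quintic Newton-Schulz iteration (the coefficients follow Keller Jordan's public code); actual framework implementations may differ in dtype handling and other details.

```python
import torch

def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximate msign(G) = U V^T (from the SVD G = U S V^T) with a quintic Newton-Schulz iteration."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)   # quintic iteration coefficients
    X = G.bfloat16()
    X = X / (X.norm() + eps)              # scale so the spectral norm is at most ~1
    transposed = G.size(0) > G.size(1)
    if transposed:                        # iterate on the "wide" orientation for efficiency
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```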

The only difference among the four versions is the scaling factor in front of $\msign$. The "KellerJordan Version" and the "MuP Version" are largely similar, while the "Moonlight Version" is a bit more special. Keras only implements the "KellerJordan Version," while Torch implements both the "KellerJordan Version" and the "Moonlight Version." The Naive version is rarely seen nowadays; as for myself, I usually use my own "MuP Version."
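To make the difference concrete, here is a small illustrative helper (the name muon_scale is made up for this post) that returns the scaling factor in front of $\msign$ for each version, given the linear layer's input and output dimensions:

```python
def muon_scale(d_in: int, d_out: int, version: str = "mup") -> float:
    """Scaling factor in front of msign(M_t) for the four Muon variants listed above."""
    if version == "naive":
        return 1.0
    if version == "keller_jordan":
        return max(1.0, d_out / d_in) ** 0.5
    if version == "mup":
        return (d_out / d_in) ** 0.5
    if version == "moonlight":
        return 0.2 * max(d_out, d_in) ** 0.5
    raise ValueError(f"unknown version: {version}")
```

The update itself is then simply W -= lr * (muon_scale(d_in, d_out, version) * msign(M) + weight_decay * W).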

Two Dimensions

An important detail to note here is that the "KellerJordan Version" and "MuP Version" are sensitive to the order of $d_{in}, d_{out}$. The first task is to clarify the meaning of $d_{in}, d_{out}$; it is not the case that the first dimension of a matrix is always $d_{in}$ and the second is always $d_{out}$.

The meanings of $d_{in}$ and $d_{out}$ are the input and output dimensions of the linear layer, respectively. Which is $d_{in}$ and which is $d_{out}$ depends on the specific implementation of the linear layer. For example, Keras's Dense layer implementation is $\boldsymbol{x}\boldsymbol{W}$, so the first dimension of the matrix $\boldsymbol{W}$ is $d_{in}$ and the second is $d_{out}$. However, Torch's Linear layer implements $\boldsymbol{x}\boldsymbol{W}^{\top}$, so the second dimension of the matrix $\boldsymbol{W}$ is $d_{in}$ and the first dimension is $d_{out}$.

Therefore, if implementing the "KellerJordan Version" of Muon for Torch's Linear layer, the scaling factor should be max(1, W.shape[0]/W.shape[1])**0.5, while for Keras, it should be max(1, W.shape[1]/W.shape[0])**0.5. Consequently, the current Keras implementation of Muon is actually incorrect because it copied the Torch scaling factor implementation verbatim (source code).
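As a concrete illustration (the layer sizes are arbitrary), the following shows how the "KellerJordan Version" factor maps onto the stored weight shapes under the two conventions:

```python
import torch

lin = torch.nn.Linear(1024, 4096)   # d_in = 1024, d_out = 4096
W = lin.weight                       # Torch stores the weight as (d_out, d_in) = (4096, 1024)

# Torch Linear computes x @ W.T, so shape[0] is d_out and shape[1] is d_in:
scale_torch = max(1.0, W.shape[0] / W.shape[1]) ** 0.5   # sqrt(max(1, d_out/d_in)) = 2.0

# A Keras Dense kernel for the same layer has shape (d_in, d_out) = (1024, 4096),
# because Dense computes x @ W, so the roles of the two axes are swapped:
# scale_keras = max(1.0, W.shape[1] / W.shape[0]) ** 0.5
```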

If you have written your own model, you need to judge carefully based on how the layers are written. For instance, in Torch it is possible to mix built-in Linear layers with hand-written x @ W, so there is no universal answer as to whether the factor should use W.shape[0]/W.shape[1] or W.shape[1]/W.shape[0]. Of course, if you find it too troublesome to sort all this out, you can consider the "Moonlight Version," since its scaling factor is symmetric in $d_{in}$ and $d_{out}$.

Hyperparameter Settings

Once $d_{in}, d_{out}$ are clarified, the remaining questions are how to set the learning rate $\eta_t$ and the weight decay coefficient $\lambda$. The assumption here is that the user already has experience tuning Adam and has obtained good results, and wants to quickly migrate to Muon to experience it.

Let's look at the "Moonlight Version" first. Its scaling factor was obtained by aligning with Adam's Update RMS. For more details, you can refer to "Muon Sequel: Why Do We Choose to Try Muon?". Regarding the "Magic Number" $0.2$, you can refer to "Why is Adam's Update RMS 0.2?". In simple terms, the "Moonlight Version" of Muon aligns with Adam's update magnitude, so the simplest approach when migrating from Adam is: change nothing, just use Adam's $\eta_t$ and $\lambda$.

Now look at the remaining three versions. We know that mainstream models usually have a hidden_size (denoted $d$), and most weight matrices do not deviate too far from the shape $d \times d$. We can therefore approximate with $d_{in}=d_{out}=d$, in which case these three versions are identical and, compared to the "Moonlight Version," lack the $0.2\sqrt{d}$ factor. Since the "Moonlight Version" matches Adam's update magnitude and needs no hyperparameter changes, the learning rates of these three versions should be scaled up by $0.2\sqrt{d}$ to reach the same update magnitude; correspondingly, $\lambda$ should be divided by $0.2\sqrt{d}$ so that the effective decay $\eta_t \lambda$ stays unchanged.

Substituting $d=1024, 2048, 4096$ gives $6.4, 9, 12.8$, respectively. If you cannot remember $0.2\sqrt{d}$, a simple rule of thumb is: when using the other three versions of Muon, multiply Adam's learning rate by 10 and use that as Muon's learning rate. If you plug Adam's learning rate into Muon unchanged, you will underfit and conclude that Muon is far inferior to Adam; as far as I know, some negative reviews of Muon stem from exactly this.
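As a worked example, here is what the migration rule looks like in code; the Adam hyperparameter values are purely hypothetical:

```python
# Hypothetical Adam hyperparameters that already work well for your model:
adam_lr, adam_wd = 3e-4, 0.1
d = 4096                        # hidden_size of the model

factor = 0.2 * d ** 0.5         # = 12.8 for d = 4096

# Moonlight version: aligned with Adam's update RMS, so keep everything unchanged.
moonlight_lr, moonlight_wd = adam_lr, adam_wd

# Naive / KellerJordan / MuP versions: scale the learning rate up and the weight decay
# down by the same factor, so both the update magnitude and the effective decay
# (lr * wd) match Adam's.
muon_lr = adam_lr * factor      # ~3.8e-3
muon_wd = adam_wd / factor      # ~7.8e-3
```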

Does this mean the "Moonlight Version" is easier to use? It does indeed work well in practice, but calling it "better" is judging it from Adam's perspective. The benefit of the "MuP Version" and the "KellerJordan Version" is that the learning rate is transferable: after tuning the learning rate on a small model, applying it directly to a large model often still works well. For more on this, see Jeremy Bernstein's blog "Deriving Muon" or my post "Higher-order MuP: Simpler but Smarter Spectral Condition Scaling".

Other Parameters

If Muon only handles matrix parameters, what happens to the rest? For example, the bias terms of linear layers and the gamma parameters of RMSNorm are 1D parameters, and convolutional layers have 3D or 4D array parameters.

First, a correction: Muon does not handle all matrix parameters; it only handles "matrix parameters of linear layers with dense inputs." If this sounds confusing, just remember that the matrix parameters of the Embedding layer and of the final classification layer (including GPT's LM Head) should not use Muon; otherwise the results will be noticeably worse. For these matrix parameters, as well as 1D, 3D, and higher-dimensional parameters, if you do not want to overthink it, just use Adam. In practice, essentially all Muon implementations are hybrids with Adam and let users choose Adam for certain layers, as in the sketch below.
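Here is a rough illustration of how such a split could be written for a Torch model; the name checks for the embedding and the LM Head ("embed", "lm_head") are assumptions that depend on your concrete model:

```python
import torch

def split_param_groups(model: torch.nn.Module):
    """Route 2D weights of ordinary linear layers to Muon, everything else to Adam."""
    muon_params, adam_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_matrix = p.ndim == 2
        is_embedding_or_head = ("embed" in name) or ("lm_head" in name)  # model-dependent check
        if is_matrix and not is_embedding_or_head:
            muon_params.append(p)    # hidden-layer matrices: use Muon
        else:
            adam_params.append(p)    # embeddings, LM Head, 1D/3D/4D parameters: use Adam
    return muon_params, adam_params
```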

If you are willing to experiment, 3D and 4D parameters such as those of convolutional layers can also use Muon. Take Conv2D as an example: the kernel shape is usually $(w, h, d_{in}, d_{out})$. Its equivalent implementation actually flattens each $(w, h, d_{in})$ patch of the input into a vector of length $w \times h \times d_{in}$, reshapes the kernel to $(w \times h \times d_{in}, d_{out})$, and then performs a matrix multiplication. So to apply Muon to it, you first reshape the momentum to $(w \times h \times d_{in}, d_{out})$, compute $\msign$, and then reshape the result back for the update, as sketched below.
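A sketch of this reshape-then-msign procedure, reusing the zeropower_via_newtonschulz5 sketch from earlier (the function name and the Keras-style kernel layout are assumptions for illustration):

```python
import torch

def conv2d_msign_update(momentum: torch.Tensor) -> torch.Tensor:
    """momentum has the Keras-style kernel shape (w, h, d_in, d_out).
    Flatten to (w*h*d_in, d_out), apply msign, then reshape back."""
    w, h, d_in, d_out = momentum.shape
    flat = momentum.reshape(w * h * d_in, d_out)
    out = zeropower_via_newtonschulz5(flat)      # msign of the flattened matrix
    return out.reshape(w, h, d_in, d_out)
```

For a PyTorch nn.Conv2d, whose weight is stored as $(d_{out}, d_{in}, h, w)$, the same logic applies, except that the last three axes are flattened together and the result is treated as a $(d_{out}, d_{in} \times h \times w)$ matrix.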

Similarly, the gamma parameter of RMSNorm can be seen as multiplication by a diagonal matrix, so its momentum can be treated as a diagonal matrix when computing $\msign$, which turns out to be equivalent to SignSGDM. The Embedding layer can be viewed as a stack of $(1, d)$ matrices, and computing $\msign$ row by row yields Normalized SGDM (see "Muon Optimizer Appreciation: A Fundamental Leap from Vectors to Matrices"). If you still want to tinker, consider Multi-Head Attention: could each head's projection matrix be taken out separately to compute $\msign$?
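For these degenerate shapes, $\msign$ has a closed form, so the following sketches (illustrative helpers, not any particular library's API) show what the update direction reduces to:

```python
import torch

# RMSNorm gamma: treating the momentum m as a diagonal matrix diag(m),
# msign(diag(m)) = diag(sign(m)), so the update degenerates to SignSGD with momentum.
def gamma_update(m: torch.Tensor) -> torch.Tensor:
    return torch.sign(m)

# Embedding: treating each row as a (1, d) matrix, msign of a (1, d) matrix is the row
# divided by its L2 norm, so the update degenerates to row-wise normalized SGD with momentum.
def embedding_update(m: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    return m / (m.norm(dim=-1, keepdim=True) + eps)
```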

Life is endless, and so is tinkering~

Expected Results

Finally, if the user correctly sets everything according to the instructions above and runs it, they can begin to pray to the goddess of luck.

What kind of result should we expect? In most cases, if there are no anomalies like gradient explosions, Muon will be slightly better than Adam. Of course, it is not ruled out that in some cases Muon might be slightly worse, but in any case, the gap between them will not be very large. If one side performs much better than the other, you may need to reflect on whether there is an issue with the settings on either side.

However, nothing is absolute. For instance, under some extreme settings, Muon can indeed be much better than Adam, or Adam might fail no matter how you tune it. In short, good luck, and if any interesting phenomena show up, you are welcome to share them with us and analyze them together.