By 苏剑林 | Oct 21, 2025
Have you ever noticed an interesting detail? Muon and MuP both start with "Mu," but the original meanings of the two "Mu"s are completely different. The former stands for "MomentUm Orthogonalized by Newton-Schulz," while the latter stands for "Maximal Update Parametrization." Yet, there is a profound connection between the two. In other words, Muon and MuP have completely different starting points, but they ultimately head in the same direction, even unintentionally adopting similar names—as if it were "predestined by fate."
Let's get down to business. In short, by a series of coincidences, I happened to learn Muon and MuP at the same time. This has greatly deepened my understanding of model optimization and prompted me to think about more fundamental principles. After some trial and error, I have gained some modest insights, which I would like to share with you here.
Preface
In terms of publication order, MuP came before Muon. However, my learning sequence was exactly the opposite: I first studied Muon and then MuP. Looking back, this turned out to be a good learning path.
In articles such as "Appreciation of Muon Optimizer: A Transcendental Leap from Vectors to Matrices" and "Muon Sequel: Why Do We Choose to Try Muon?", we described Muon as "steepest descent under spectral norm constraints." The MuP series of work happens to explain "why spectral norm constraints are necessary." The two complement each other perfectly.
A special clarification here: the MuP we refer to has two meanings. The first is the one introduced in "First Exploration of MuP: Cross-Model Scaling Laws of Hyperparameters", which comes from the Tensor Programs series; we call it "Elementary MuP." The second is introduced in "Higher-Order MuP: Simpler but Smarter Spectral Condition Scaling", which we call "Higher-Order MuP"; it reaches richer conclusions than Elementary MuP in a more concise way. Both are the work of Greg Yang (respect to the master).
Unless otherwise stated, MuP in this article refers to "Higher-Order MuP." In fact, this series of articles, which I call "Beyond MuP," consists of a series of thoughts and extensions based on Higher-Order MuP. However, for some readers who only know the "Elementary MuP" from the Tensor Programs series, it might initially be confusing how MuP can answer "why spectral norm is needed."
Regardless, I will try to make this series self-contained, so while I will mention many related papers or blogs, readers do not need to read every single one in depth.
Seeking Speed in Stability
Let's get back to the topic. As the first article, the task here is to establish the core goal. More specifically, to think clearly about "what kind of model we actually want" and "how to train such a model."
Intuitively, as long as the model shows no signs of collapsing, we can keep training it until it converges to a satisfactory result. On this basis, we try to find methods to make the model converge faster. So, essentially, it boils down to two things: "stability" and "speed," or "seeking speed within stability." How do we judge if a model is stable? Naturally, this involves monitoring various "internal indicators." The more you monitor, the more problems you can expose.
However, I don't plan to list various internal indicators here. Instead, I will try to find the most core or necessary conditions. To this end, let's first define a concept—RMS (Root Mean Square): Given $\boldsymbol{x}=(x_1,x_2,\cdots,x_d)\in\mathbb{R}^d$, we define:
\begin{equation}\Vert\boldsymbol{x}\Vert_{RMS} = \sqrt{\frac{1}{d}\sum_{i=1}^d x_i^2} = \frac{\Vert\boldsymbol{x}\Vert_2}{\sqrt{d}}\end{equation}
It represents the average scale of each element, differing from the vector norm $\Vert\boldsymbol{x}\Vert_2$ by a factor of $\sqrt{d}$.
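As a quick numerical sketch (not code from the original post; the helper name `rms` is my own), the definition above can be checked directly: the RMS is the Euclidean norm divided by $\sqrt{d}$, and a vector of all ones has RMS 1 regardless of its dimension.

```python
import numpy as np

def rms(x):
    """Root Mean Square of a vector: ||x||_2 / sqrt(d)."""
    return np.sqrt(np.mean(np.square(x)))

x = np.array([3.0, -4.0])            # ||x||_2 = 5, d = 2
assert np.isclose(rms(x), 5.0 / np.sqrt(2))

# Unlike the norm, RMS measures per-element scale: a vector of all
# ones has RMS 1 no matter how large d is.
for d in (4, 64, 1024):
    assert np.isclose(rms(np.ones(d)), 1.0)
```

This dimension-independence is exactly why the article prefers RMS over the raw norm when comparing models of different widths.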
Some readers might ask: since the difference is just a constant factor, why not observe the norm directly instead of defining a new concept? There are several considerations. For example, RMSNorm is used frequently, and RMS is easier to interpret than the norm; another significant reason is that most activation functions are element-wise, so we need to inspect and control the scale averaged over each element to ensure that activation functions play a similar role across models of different sizes.
Three Conditions
With the RMS notation, I can write down the three most essential conditions for stably training a good model:
\begin{align}
&\text{Forward Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c1}\\[5pt]
&\text{Dependence Stability:}\quad\max_{\boldsymbol{x}_1,\boldsymbol{x}_2} \Vert \boldsymbol{f}(\boldsymbol{x}_1;\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x}_2;\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c2}\\[5pt]
&\text{Update Stability:}\quad\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega} + \Delta\boldsymbol{\omega}) - \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS} = \mathcal{\Theta}(1) \label{eq:c3}
\end{align}
where $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})$ represents a model family mapping $\mathbb{R}^{d_{in}}\to \mathbb{R}^{d_{out}}$, with input $\boldsymbol{x}\in\mathbb{R}^{d_{in}}$, output $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\in\mathbb{R}^{d_{out}}$, and $\boldsymbol{\omega}$ the model parameters, which can be scalars, vectors, matrices, etc. $\mathcal{\Theta}$ is "Big Theta" notation. Here $\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})$ can be a single layer, a block composed of several layers, or even the entire model. Theoretically, a coarser granularity yields looser but more accurate constraints, while the $\max$ becomes harder to solve; so in practice the choice depends on our ability to compute the $\max$.
Among the three equations, Eq $\eqref{eq:c1}$ is probably the easiest to understand: it represents the stability of the forward pass. After taking the $\max$ over $\boldsymbol{x}$, the only variable left is $\boldsymbol{\omega}$, so this is a constraint on $\boldsymbol{\omega}$. Note that we have not limited the range of $\boldsymbol{x}$, so by default $\boldsymbol{x}\in\mathbb{R}^{d_{in}}$, meaning the maximum might not exist: for a linear layer with any non-zero $\boldsymbol{W}$, we have $\max\limits_{\boldsymbol{x}}\Vert \boldsymbol{x}\boldsymbol{W}\Vert_{RMS}\to\infty$.
To ensure the existence of the maximum value, we usually add Normalization operations, such as:
\begin{align}
&\text{Pre Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x})\boldsymbol{W} \\[5pt]
&\text{Post Norm:}\quad \mathop{\text{RMSNorm}}(\boldsymbol{x}\boldsymbol{W})
\end{align}
where $\mathop{\text{RMSNorm}}(\boldsymbol{x})=\boldsymbol{x}/\Vert\boldsymbol{x}\Vert_{RMS}$. Thus, condition $\eqref{eq:c1}$ also implicitly imposes requirements on the model architecture. Meanwhile, Eq $\eqref{eq:c2}$ requires that the model genuinely depends on its input. For a simple example, consider $f(x;\omega)=x\times\omega\times 0 + 1$: this "model" trivially satisfies forward stability, but it does not depend on $x$ at all, so Eq $\eqref{eq:c2}$ cannot be satisfied, and it is not a good model.
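A quick sanity check of the two normalization placements (a sketch under my own naming, assuming a nonzero input): Post Norm pins the output RMS to exactly 1, while Pre Norm bounds it by the spectral norm of $\boldsymbol{W}$ (here $d_{in}=d_{out}$, so the $\sqrt{d_{in}/d_{out}}$ factor is 1).

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def rms_norm(x):
    # assumes x != 0; production code would add a small epsilon
    return x / rms(x)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) / 4.0

x = 100.0 * rng.standard_normal(16)   # deliberately large input
pre = rms_norm(x) @ W                  # Pre Norm
post = rms_norm(x @ W)                 # Post Norm

assert np.isclose(rms(post), 1.0)               # Post Norm: output RMS is exactly 1
assert rms(pre) <= np.linalg.norm(W, 2) + 1e-9  # Pre Norm: bounded by spectral norm of W
```

In both cases the $\max$ over $\boldsymbol{x}$ in condition $\eqref{eq:c1}$ becomes finite, which is the point of inserting the normalization.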
Finally, Eq $\eqref{eq:c3}$ should also be easy to understand. After taking the $\max$ over $\boldsymbol{x}$, the result is a constraint on $\boldsymbol{\omega}$ and $\Delta \boldsymbol{\omega}$. It focuses mainly on the influence of the increment $\Delta \boldsymbol{\omega}$, representing our expectations for training smoothness. We can use it to guide the setting of optimizer hyperparameters, or even to build new optimizers.
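In practice condition $\eqref{eq:c3}$ can be monitored empirically by maximizing over a batch of inputs rather than all of $\mathbb{R}^{d_{in}}$. Below is a hypothetical monitoring helper (`update_stability` and the toy layer `f` are my own constructions, not from the original post); for a Pre-Norm linear layer the measured change is bounded by the spectral norm of the update.

```python
import numpy as np

def rms(x):
    return np.sqrt(np.mean(np.square(x)))

def update_stability(f, w, dw, xs):
    """Empirical proxy for condition (3): max over a batch of inputs
    of the RMS change in output caused by the parameter update dw."""
    return max(rms(f(x, w + dw) - f(x, w)) for x in xs)

# Toy Pre-Norm linear layer: f(x; W) = RMSNorm(x) @ W
def f(x, W):
    return (x / rms(x)) @ W

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32)) / np.sqrt(32)
dW = 1e-2 * rng.standard_normal((32, 32))
xs = [rng.standard_normal(32) for _ in range(64)]

# For this layer, f(x; W+dW) - f(x; W) = RMSNorm(x) @ dW, whose RMS
# is at most the spectral norm of dW.
assert update_stability(f, W, dW, xs) <= np.linalg.norm(dW, 2) + 1e-9
```

This is exactly the shape of argument used later: controlling the spectral norm of $\Delta\boldsymbol{W}$ controls the update stability, which is the bridge from these conditions to Muon.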
Related Thoughts
In summary, the three conditions in Eq $\eqref{eq:c1}, \eqref{eq:c2}, \eqref{eq:c3}$ synthesize considerations of model architecture, initialization, and optimizers. It is hard to say which condition can be removed, so I believe they are all necessary. Of course, there are some details regarding these three conditions worth discussing, such as the choice between $\max$ and $\mathbb{E}$.
In the current formulas, we "eliminate" $\boldsymbol{x}$ by taking the $\max$, obtaining expressions containing only $\boldsymbol{\omega}$ and $\Delta\boldsymbol{\omega}$. Some readers may find this questionable: a more intuitive approach would be to take the mathematical expectation $\mathbb{E}_{\boldsymbol{x}}$. Why $\max$ instead of $\mathbb{E}$? There are several reasons. First, calculating the $\max$ only requires defining the domain of $\boldsymbol{x}$, while calculating $\mathbb{E}$ requires defining the distribution of $\boldsymbol{x}$. Different distributions yield different results, and accurately specifying this distribution is not a trivial matter.
Secondly, $\max$ has the advantage of being invariant to monotonic transformations, whereas $\mathbb{E}$ does not. For example, for $\max$ we have the identity $(\max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS})^2 = \max_{\boldsymbol{x}} \Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$. That is, whether you take the $\max$ of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ or $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$, the essence is the same. But for $\mathbb{E}$, this is not the case: the expectation of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ and the expectation of $\Vert \boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}^2$ usually differ in calculation difficulty, and their results don't necessarily have a simple relationship.
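The two facts above can be checked numerically in one line each (a sketch with numpy, treating a batch of nonnegative samples as stand-ins for $\Vert\boldsymbol{f}(\boldsymbol{x};\boldsymbol{\omega})\Vert_{RMS}$ values): $\max$ commutes with the monotone map $t\mapsto t^2$, while $\mathbb{E}$ does not.

```python
import numpy as np

rng = np.random.default_rng(0)
# nonnegative samples standing in for values of ||f(x; w)||_RMS
v = np.abs(rng.standard_normal(1000))

# max commutes with the monotone map t -> t^2 on nonnegatives...
assert np.isclose(np.max(v) ** 2, np.max(v ** 2))

# ...but expectation does not: E[v]^2 < E[v^2] for non-constant v
# (Jensen's inequality), so the two choices of objective differ.
assert np.mean(v) ** 2 < np.mean(v ** 2)
```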
Therefore, $\max$ is simpler in both concept and properties. One possible concern is whether $\max$ is too harsh, amounting to a "sufficient but not necessary" condition. In fact, "$\max$" is just the intuitive term; mathematically it is the "supremum ($\sup$)", i.e., the least upper bound, and "least" indicates that the bound is tight rather than a loose overestimate. In practice, the mean and the maximum are often of the same order of magnitude, and our goal is only $\mathcal{\Theta}(1)$, so the difference is not large. Conversely, $\max$ accounts for extreme cases, guaranteeing training stability to the greatest extent, which is particularly important for training large models such as LLMs.
In fact, Elementary MuP, or the Tensor Programs series, is based on a series of analyses using $\mathbb{E}$. Higher-Order MuP, like this article, is based on $\max$. In hindsight, the analysis based on $\mathbb{E}$ is inferior to the $\max$-based Higher-Order MuP in terms of simplicity of calculation and universality of results, which in turn corroborates the effectiveness of $\max$.
Article Summary
Starting from this article, I will share some top-down understandings of model optimization, which are extended thoughts and expansions based on the previous "Higher-Order MuP." As the first article, we mainly described three basic conditions for model stability, or the three characteristics of a good model, which will serve as the foundation for subsequent calculations and analyses.