Muon Implementation Based on Streaming Power Iteration: 5. Extensions

By 苏剑林 | April 17, 2026

The theme of this series is "Streaming Power Iteration." As the name suggests, it consists of two parts: "Streaming" and "Power Iteration." While "Power Iteration" is a classic multi-step iterative scheme for finding the SVD of a matrix, "Streaming" refers to spreading the algorithm—which originally required multiple iterations—across each training step. This makes the computational cost acceptable. The core idea is that instead of completing a complex calculation all at once, it is better to continuously approximate the target during the training process.

As an extension of this series, this article will introduce other applications of the "streaming" philosophy, further demonstrating how relatively expensive operations can be cleverly integrated into the training workflow through streaming transformation.

Orthogonal Projection

In some scenarios, we may want to constrain certain parameter matrices to be orthogonal. Orthogonal matrices possess good numerical stability, which can prevent numerical explosion or vanishing gradient problems, while also providing better theoretical guarantees in certain designs. Of course, which specific places are suitable for constraining parameters as orthogonal matrices requires a case-by-case analysis, which we will not expand on here.

In the articles "Steepest Descent on Manifolds: 2. Muon + Orthogonality" and "Steepest Descent on Manifolds: 3. Muon + Stiefel", we explored orthogonal (Stiefel) manifolds. However, that was focused on combining steepest descent to derive new update rules, which is overall quite complex. Here, we consider a simpler approach: re-projecting (Retracting) the parameters back onto the orthogonal manifold after each update step.

Without loss of generality, let the parameter matrix be $\boldsymbol{W} \in \mathbb{R}^{n \times m}$, where $n \geq m$. The operation we want to perform can be written as:

\begin{equation}\newcommand{orth}{\mathop{\text{orth}}}\boldsymbol{W}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t \qquad\to\qquad \boldsymbol{W}_t = \orth(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t)\end{equation}

Where $\orth$ is defined as:

\begin{equation}\newcommand{argmin}{\mathop{\text{argmin}}}\orth(\boldsymbol{W}) = \argmin_{\boldsymbol{O}^{\top}\boldsymbol{O}=\boldsymbol{I}} \Vert\boldsymbol{W} - \boldsymbol{O}\Vert_F\end{equation}

That is, finding the orthogonal matrix closest to $\boldsymbol{W}$. This operation appeared in our first blog post introducing Muon, "Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices". It is actually the $\newcommand{msign}{\mathop{\text{msign}}}\msign$ in Muon. Thus, the operation we want to implement can also be written as:

\begin{equation}\boldsymbol{W}_t = \msign(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t)\label{eq:msign-u}\end{equation}

Streaming Iteration

In other words, we need to perform an $\msign$ operation after each update step to project the parameters back onto the orthogonal manifold. However, the $\msign$ calculation is not exactly "cheap," even if it isn't prohibitively "expensive." We know that Muon's core operation is $\msign$, but Muon's $\msign$ runs in BF16, while parameters are stored in FP32. Calculating $\msign$ for parameters needs to be done in FP32, making the cost significant.

The efficient calculation of $\msign$ is based on Newton-Schulz iteration, which we discussed in detail in "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)". Different Newton-Schulz iterations have different polynomial degrees and coefficients corresponding to different convergence speeds. A classic 3rd-order format is:

\begin{equation}\boldsymbol{X}_t = \frac{3}{2}\boldsymbol{X}_{t-1} - \frac{1}{2}\boldsymbol{X}_{t-1}\boldsymbol{X}_{t-1}^{\top}\boldsymbol{X}_{t-1},\qquad \boldsymbol{X}_0 = \boldsymbol{W}\end{equation}

It can be proven that $\lim_{t\to\infty} \boldsymbol{X}_t = \msign(\boldsymbol{W})$. Although this iteration is classic, it converges slowly. In Muon, we typically use the faster-converging 5th-order iteration. Regardless of the version, the cost of performing the full iteration is considerable.

However, for the parameters themselves, is it really necessary to perform a full iteration at every step? Assuming $\boldsymbol{W}_{t-1}$ is already orthogonal or near-orthogonal, since the learning rate $\eta$ is small, $\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t$ will not deviate significantly from orthogonality. This is exactly where the streaming philosophy comes into play—perhaps iterating just once per training step is sufficient. For example, consider:

\begin{equation}\boldsymbol{W}_t = \frac{3}{2}\tilde{\boldsymbol{W}}_t - \frac{1}{2}\tilde{\boldsymbol{W}}_t\tilde{\boldsymbol{W}}_t^{\top}\tilde{\boldsymbol{W}}_t,\qquad \tilde{\boldsymbol{W}}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t\end{equation}

From an asymptotic perspective, this also achieves the effect of $\eqref{eq:msign-u}$, and each step only requires calculating one extra term $\tilde{\boldsymbol{W}}_t\tilde{\boldsymbol{W}}_t^{\top}\tilde{\boldsymbol{W}}_t$. The cost is significantly lower than a complete Newton-Schulz iteration.

Spectral Constraints

In general, orthogonal constraints can only be applied to certain specific matrices because they are too strict, reducing the degrees of freedom of the matrix (and essentially the parameter count) by half. Often, we might prefer more relaxed spectral constraints, such as the singular value clipping mentioned in "High-Order MuP: Simpler but Smarter Spectral Condition Scaling".

If $\msign$ turns all singular values of a matrix into 1, then singular value clipping only turns singular values greater than 1 into 1, while leaving the remaining singular values unchanged. In "Calculating Singular Value Clipping mclip via msign (Part 1)" and "Calculating Singular Value Clipping mclip via msign (Part 2)", we called this $\newcommand{mclip}{\mathop{\text{mclip}}}\mclip$. Assuming what we want to do is clip all singular values to within 1 after each parameter update step, it can be written as:

\begin{equation}\boldsymbol{W}_t = \mclip(\boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t)\label{eq:mclip-u}\end{equation}

However, calculating $\mclip$ is much more troublesome than $\msign$. We previously explored schemes to implement $\mclip$ based on $\msign$, such as the identity:

\begin{equation}\mclip(\boldsymbol{M}) = \frac{1}{2}\Big[\boldsymbol{M} + \msign(\boldsymbol{M}) + (\msign(\boldsymbol{M}) - \boldsymbol{M}) \msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})\Big]\end{equation}

This requires two $\msign$ calls to compute $\mclip$, which is quite costly. Of course, we could also perform another streaming power iteration on $\boldsymbol{W}$ and calculate $\mclip$ via SVD, but that would require adding another buffer variable, which also seems like a lot of trouble.

One-by-One Clipping

Let's look at $\mclip$ from another perspective: $\mclip$ turns all singular values greater than 1 into 1. A necessary operation for this is to change the principal singular value to 1 (if it is greater than 1). Once we clip the principal singular value, if there are still singular values greater than 1, the largest among them becomes the new principal singular value. This means we only need to repeatedly "clip the principal singular value to 1" to achieve $\mclip$.

The advantage of this "one-by-one clipping" strategy is that calculating only the principal singular value and principal singular vectors (let's call this $\mathop{\text{SVD1}}$) is much cheaper than performing a full SVD. It only requires a vector-based power iteration:

\begin{equation}\boldsymbol{v}\quad\leftarrow\quad \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}\Vert}\end{equation}

to converge to the right principal singular vector $\boldsymbol{v}_1$ of $\boldsymbol{W}$, then $\sigma_1 = \Vert\boldsymbol{W}\boldsymbol{v}_1\Vert$ and $\boldsymbol{u}_1 = \boldsymbol{W}\boldsymbol{v}_1/\sigma_1$. In practice, we can just use a fixed number of iterations to get an approximate value. Then, continuing to implement the "streaming" idea, we perform only one principal singular value clipping at each update step:

\begin{equation}\boldsymbol{W}_t = \tilde{\boldsymbol{W}}_t - \max(\sigma_1 - 1, 0) \boldsymbol{u}_1 \boldsymbol{v}_1^{\top},\quad\sigma_1, \boldsymbol{u}_1, \boldsymbol{v}_1 = \mathop{\text{SVD1}}(\tilde{\boldsymbol{W}}_t),\quad\tilde{\boldsymbol{W}}_t = \boldsymbol{W}_{t-1} - \eta \boldsymbol{\Phi}_t\end{equation}

Over long-term training, this can control the singular values of $\boldsymbol{W}$ to be around 1 (in practice, slightly larger than 1 due to the approximation of power iteration and the streaming nature). This "Streaming Singular Value Clipping" can be regarded as a variant of the "Spectral Weight Decay" introduced in "Thoughts from Spectral Norm Gradients to New Styles of Weight Decay". In the paper "Training Transformers with Enforced Lipschitz Constants", it is referred to as "Spectral Hammer."

Other Examples

When is the streaming philosophy applicable? A typical scenario is when we can estimate that, over long-term training, a certain variable changes slowly. Then we can try to execute only a small number of iterations per training step, expecting it to be sufficient to correct the changes brought about by parameter updates. In addition to the two new examples above, we have used "streaming" ideas in previous articles. Let's briefly review them.

In "Steepest Descent on Manifolds: 3. Muon + Stiefel" and "Steepest Descent on Manifolds: 4. Muon + Spectral Sphere", when solving for the steepest descent on the corresponding manifolds, a non-linear equation needs to be solved. At that time, we proposed using fixed-point iteration to solve it. Later, in "Steepest Descent on Manifolds: 5. Dual Gradient Descent", we discussed the dual gradient descent solution and proposed the practice of amortizing the solving process over each training step through the streaming idea.

Coincidentally, in "MoE Voyage: 6. Optimal Allocation Promotes Balance", we looked at the MoE load balancing problem from an optimal allocation perspective. When solving the dual problem of optimal allocation, originally one would need to alternately optimize $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ until convergence at each step. But considering that the change in $\boldsymbol{\beta}$ at each step should be small, still based on the streaming idea, we only execute one step of updating $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ at each training step, which achieves good load balancing effects.

In short, when we realize that the change of a variable is slow, we can consider performing fewer iteration steps per update to amortize the computational load. To achieve this, sometimes we need to add extra buffer variables (like $\boldsymbol{V}$ in streaming power iteration), but sometimes we don't. How to better design streaming iterations may require some "ingenuity" and is worth exploring and savoring.

Article Summary

This article mainly introduced two other examples of applying the "streaming" philosophy: implementing the projection of parameters onto the orthogonal (Stiefel) manifold or clipping the parameter spectral norm to not exceed 1 with minimal extra cost. These examples once again demonstrate the wonder of the streaming idea: many complex calculations that seem to require one-time completion can be disassembled into gradual adjustments and integrated into every step of training.