Steepest Descent on Manifolds: 4. Muon + Spectral Sphere

By 苏剑林 | August 21, 2025

Readers who have finished the first three articles should already be familiar with the "routine" of this series—first propose a constraint on the update amount, find the steepest descent direction, and then add constraints to the parameters themselves to find a new steepest descent direction. When solving parameter constraint problems, we adopt the "first-order approximation is enough" principle to simplify the constraint form, which geometrically corresponds to the "tangent space." Then, we use the method of undetermined coefficients to convert the problem into an unconstrained form to write out an analytical solution, and finally numerically solve for the undetermined coefficients.

In this article, we will solve a new example—Muon under the spectral sphere constraint—which is an analogous extension of the first article "Steepest Descent on Manifolds: 1. SGD + Hypersphere." We can consider it when we want the spectral norm of the parameters to remain constant. Of course, it can also be used simply as an exercise to practice your skills.

Problem Description

In "Steepest Descent on Manifolds: 2. Muon + Orthogonal" and "Steepest Descent on Manifolds: 3. Muon + Stiefel," we have already discussed the collision of Muon with orthogonal constraints in detail, so we will not expand on the background and instead directly give the problem form:

\begin{equation}\newcommand{tr}{\mathop{\text{tr}}}\max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\,\, \Vert\boldsymbol{W}\Vert_2 = 1,\,\,\, \Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert_2=1\end{equation}

where $\boldsymbol{W},\boldsymbol{\Phi}\in\mathbb{R}^{n\times m}(n \geq m)$, and $\Vert\cdot\Vert_2$ is the spectral norm. Of course, if we are interested, the latter two spectral norms can be replaced with other norms, such as the $F$-norm representing the "Muon + Hypersphere" combination.

The "first-order approximation is enough" principle requires us to find the gradient of the spectral norm, which we have already introduced in "Reflections from Spectral Norm Gradients to New Types of Weight Decay" and "Derivatives of SVD." The answer is $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_2=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$, where $\boldsymbol{u}_1,\boldsymbol{v}_1$ are the two singular vectors corresponding to the largest singular value of $\boldsymbol{W}$, which can be solved by power iteration. This result also assumes that the largest singular value is unique; we will discuss the non-unique case later.

If it is the $F$-norm, then we have $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert_F=\boldsymbol{W}/\Vert\boldsymbol{W}\Vert_F$. In short, no matter which norm is used, there exists a matrix $\boldsymbol{\Theta}$ that depends only on $\boldsymbol{W}$, such that $\nabla_{\boldsymbol{W}}\Vert\boldsymbol{W}\Vert=\boldsymbol{\Theta}$. Then, from $\Vert\boldsymbol{W}\Vert = 1$ and $\Vert\boldsymbol{W} - \eta \boldsymbol{\Phi}\Vert=1$, we can obtain its first-order approximation version $0 = \langle\boldsymbol{\Theta},\boldsymbol{\Phi}\rangle_F = \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$. Therefore, under first-order approximation, the general formulation of such problems is:

\begin{equation}\max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\,\, \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0\end{equation}

Undetermined Coefficients

The routine is the same; introducing the undetermined coefficient $\lambda$, we have

\begin{equation}\begin{aligned} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) =&\, \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \tr((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}) \\ \leq &\, \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* \end{aligned}\end{equation}

The last inequality is the result of Muon itself, similar to the Hölder inequality for two vectors, where equality holds at

\begin{equation}\boldsymbol{\Phi} = \newcommand{msign}{\mathop{\text{msign}}}\msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\end{equation}

The next task is to solve for a $\lambda$ such that it satisfies the constraint condition $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$, and then the job is done.

Due to the existence of $\msign$, $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is actually a non-linear equation. The author tends to believe that it has no analytical solution, so a numerical solution method is sought. However, after the experience of "Steepest Descent on Manifolds: 3. Muon + Stiefel," we can calmly construct an iterative method for such equations.

Iterative Solution

First, according to the definition $\msign(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}$, we can write $\boldsymbol{\Phi}=(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}$, where $\boldsymbol{Q}=((\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta}))^{1/2}$, then

\begin{equation}\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0\qquad\Rightarrow\qquad \lambda = -\frac{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G}\boldsymbol{Q}^{-1})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta}\boldsymbol{Q}^{-1})}\end{equation}

But note that this is not an analytical solution because $\boldsymbol{Q}$ also depends on $\lambda$. However, based on the above equation, we can construct an iterative format: by substituting an initial $\lambda$, we can solve for $\boldsymbol{Q}= (\boldsymbol{G} + \lambda\boldsymbol{\Theta})^{\top}\boldsymbol{\Phi}$, then substitute it into the above equation to update $\lambda$, and repeat until convergence.

However, although this iterative scheme is theoretically feasible, it requires calculating $\boldsymbol{Q}^{-1}$. Although we mentioned efficient algorithms in "Efficient Calculation of Matrix r-th Power Roots and Inverse r-th Power Roots," from the perspective of "Do not multiply entities," we still hope to avoid iterations other than $\msign$. Therefore, the author tried to find another iterative scheme that does not require calculating $\boldsymbol{Q}^{-1}$. For this, we first write

\begin{equation}\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}\end{equation}

For our goal, the trace of the above equation is equal to zero. We can explicitly subtract $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m$ from the left side of the above equation to ensure it satisfies this condition:

\begin{equation}\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}^{\top}(\boldsymbol{G} + \lambda\boldsymbol{\Theta})\boldsymbol{Q}^{-1}\end{equation}

Now, by multiplying both sides by $\boldsymbol{Q}$ and taking the trace, we can solve for $\lambda$ inversely:

\begin{equation}\lambda = \frac{\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}\boldsymbol{Q}) - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Phi}) \tr(\boldsymbol{Q})/m - \tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})}{\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})}\end{equation}

In this way, the iteration does not require solving for $\boldsymbol{Q}^{-1}$.

Reference Code

We use $\lambda=-\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{G})/\tr(\boldsymbol{\Theta}^{\top}\boldsymbol{\Theta})$ as the initial value. The test code is as follows:

import numpy as np

def msign(g):
    """Accurate calculation of msign using SVD
    """
    u, s, vh = np.linalg.svd(g, full_matrices=False)
    return u @ np.diag(np.sign(s)) @ vh

def dot(a, b):
    """Equates to np.trace(a.T @ b)
    """
    return (a * b).sum()

n, m = 100, 50
w = np.random.randn(n, m) / m**0.5
g = np.random.randn(n, m) / m**0.5
u, s, vh = np.linalg.svd(w, full_matrices=False)
theta = u[:, :1] @ vh[:1]

lamb = - dot(theta, g) / dot(theta, theta)
for i in range(10):
    phi = msign(z := g + lamb * theta)
    print('step:', i, ', inner product:', dot(phi, g), ', tangent error:', dot(theta, phi))
    q, x = z.T @ phi, theta.T @ phi
    lamb = (dot(x, q) - np.trace(x) * np.trace(q) / m - dot(theta, g)) / dot(theta, theta)

Other Details

As in the previous three articles, because the "first-order approximation is enough" principle is used, the spectral norm of $\boldsymbol{W} - \eta\boldsymbol{\Phi}$ is accurate to $1 + \mathcal{O}(\eta^2)$, which usually cannot be exactly 1. Therefore, we still need to perform a Spectral Normalization:

\begin{equation}\boldsymbol{W}\quad\leftarrow\quad \frac{\boldsymbol{W} - \eta\boldsymbol{\Phi}}{\Vert\boldsymbol{W} - \eta\boldsymbol{\Phi}\Vert_2}\end{equation}

Fortunately, the spectral norm can be computed efficiently via power iteration, so this is not a particularly expensive calculation (compared to the $\msign$ iteration itself).

Another point worth analyzing is the case where the maximum singular value is not unique. This special case can be ignored in practical numerical calculations, but for theoretical completeness, it should be included in the analysis. In this case, the corresponding singular vectors are also not unique, which is equivalent to saying that there are multiple different tangent spaces, and the actual feasible space is the intersection of these tangent spaces. Taking two maximum singular values as an example, the problem becomes

Here $\boldsymbol{\Theta}_1=\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}, \boldsymbol{\Theta}_2=\boldsymbol{u}_2 \boldsymbol{v}_2^{\top}$. Introducing two undetermined coefficients $\lambda_1,\lambda_2$, we can solve for

\begin{equation}\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\end{equation}

Next, we need to solve the system of equations $\tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})=0,\tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})=0$. Similarly, introducing

\begin{equation}\boldsymbol{Q}=((\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2))^{1/2} = (\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)^{\top}\boldsymbol{\Phi}\end{equation}

We can write the system of equations

\begin{equation}\begin{gathered} \boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_1^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_1^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi} - \tr(\boldsymbol{\Theta}_2^{\top} \boldsymbol{\Phi})\boldsymbol{I}/m = \boldsymbol{\Theta}_2^{\top}(\boldsymbol{G} + \lambda_1\boldsymbol{\Theta}_1+ \lambda_2\boldsymbol{\Theta}_2)\boldsymbol{Q}^{-1} \\ \end{gathered}\end{equation}

Multiplying both sides by $\boldsymbol{Q}$ and taking the trace, it becomes a system of two linear equations in two variables for $\lambda_1, \lambda_2$ that can be solved. Thus, an iterative solution format can be constructed accordingly. The details will not be discussed here; readers interested in practicing can supplement them independently.

Article Summary

This article mainly considers the Muon form after imposing a spectral norm or a general norm constraint on parameters. Building on the previous three articles, there are no obvious technical difficulties in this article, and readers can simply regard it as a supplementary exercise for practice.