Steepest Descent on Manifolds: 5. Dual Gradient Descent

By 苏剑林 | November 03, 2025

In the first four articles, we solved several specific steepest descent problems with equality constraints on parameters. Among them, the problems in the third and fourth articles could not be solved analytically, so I proposed corresponding fixed-point iteration methods. In particular, the study of "Muon + Stiefel" in the third article, "Steepest Descent on Manifolds: 3. Muon + Stiefel", originated from Jeremy Bernstein's article "Orthogonal manifold".

For this problem, Jeremy Bernstein eventually provided his own solution, which I call "Dual Gradient Descent," and it is also quite worth learning.

Basic Concepts

Jeremy Bernstein's solution was finally published in the Thinking Machines Lab blog post "Modular Manifolds", the lab's second blog post. In that article, they refer to it as "Dual Ascent," but here I will combine it with the content of the previous four articles and call it "Dual Gradient Descent."

In fact, dual gradient descent can be seen as a natural consequence of the Lagrange multiplier method. However, a rigorous treatment of Lagrange multipliers is rather cumbersome, since it requires introducing the Minimax theorem. To avoid these complications, this series has so far used the "undetermined coefficients" approach for its derivations, which makes dual gradient descent look less natural. That is no great obstacle: we can still derive it within our own framework, although it takes a bit more space.

First, let's review the notation. $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ is a matrix parameter; without loss of generality, assume $n\geq m$. $\boldsymbol{G}\in\mathbb{R}^{n\times m}$ is its gradient. $\Vert\boldsymbol{G}\Vert_2$ is the spectral norm of matrix $\boldsymbol{G}$, equal to the largest singular value; $\Vert\boldsymbol{G}\Vert_*$ is the nuclear norm of matrix $\boldsymbol{G}$, equal to the sum of all singular values. Specifically, according to the conclusions in the article "Derivatives of SVD", we have \begin{equation}\nabla_{\boldsymbol{G}}\Vert\boldsymbol{G}\Vert_* = \sum_i \nabla_{\boldsymbol{G}} \sigma_i = \sum_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}\boldsymbol{V}^{\top} = \msign(\boldsymbol{G}) \label{eq:nuclear-grad}\end{equation} where $\boldsymbol{G}=\sum_i \sigma_i \boldsymbol{u}_i \boldsymbol{v}_i^{\top} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$ is the SVD of $\boldsymbol{G}$. That is to say, the gradient of the nuclear norm is exactly the $\msign$ operator, which is an important foundation for the subsequent derivation.
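As a quick sanity check (my own illustration, not part of the original articles), the following NumPy sketch computes $\msign$ via the SVD and verifies the nuclear-norm gradient identity above by finite differences on a random matrix:

```python
import numpy as np

def msign(G):
    # msign(G) = U V^T from the thin SVD G = U diag(sigma) V^T
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

nuclear = lambda A: np.linalg.svd(A, compute_uv=False).sum()  # nuclear norm

rng = np.random.default_rng(0)
n, m = 5, 3
G = rng.standard_normal((n, m))

# entrywise central finite differences of the nuclear norm
eps = 1e-6
num_grad = np.zeros_like(G)
for i in range(n):
    for j in range(m):
        E = np.zeros_like(G)
        E[i, j] = eps
        num_grad[i, j] = (nuclear(G + E) - nuclear(G - E)) / (2 * eps)

print(np.max(np.abs(num_grad - msign(G))))  # close to 0, confirming the identity
```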

Problem Description

We will still follow the previous derivation logic to introduce dual gradient descent, so this section restates the problem and existing results.

In "Steepest Descent on Manifolds: 3. Muon + Stiefel", the problem we wanted to solve was \begin{equation}\max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\,\, \boldsymbol{W}^{\top}\boldsymbol{W}=\boldsymbol{I},\,\,\,\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0} \label{eq:muon-stiefel}\end{equation} The solution is $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})$, where $\boldsymbol{X}\in\mathbb{R}^{m\times m}$ is an undetermined symmetric matrix such that $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} = \boldsymbol{0}$.

In "Steepest Descent on Manifolds: 4. Muon + Spectral Sphere", the problem we wanted to solve was \begin{equation}\max_{\boldsymbol{\Phi}} \tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) \qquad \text{s.t.}\qquad \Vert\boldsymbol{\Phi}\Vert_2 = 1,\,\,\, \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0 \label{eq:muon-spectral}\end{equation} The answer is $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta})$, where $\lambda$ is an undetermined coefficient such that $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$.

As we can see, our final task has become finding undetermined coefficients that satisfy the extra equality constraints. This is essentially solving a (system of) non-linear equation(s). Dual gradient descent transforms the solution of equations into the minimization of a certain objective function, which is then solved using gradient descent.

Dual Objective

The key to the transformation is the nuclear norm gradient equality \eqref{eq:nuclear-grad}. For simplicity, let's first look at the "Muon + Spectral Sphere" problem \eqref{eq:muon-spectral}, where the undetermined coefficient is just a scalar, which makes the idea easier to see. It is not difficult to verify that \begin{equation}\nabla_{\lambda} \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* = \tr(\boldsymbol{\Theta}^{\top}\msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta})) = \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\end{equation} This means that solving the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ is equivalent to finding a point where the gradient of $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ with respect to $\lambda$ is zero; such a point could be a (local) minimum or maximum. Since $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ clearly has no maximum value, we transform it into finding its minimum point: \begin{equation}\lambda^* = \argmin_{\lambda} \Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_* \label{eq:muon-spectral-obj}\end{equation}

Let's recap the steps here:

1. Our goal is to solve the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$, finding any one solution will do;

2. $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$ happens to be the gradient of $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ with respect to $\lambda$;

3. This transforms into finding a (local) minimum/maximum point, as the gradient is usually zero there;

4. One can easily determine there is no maximum, so we must find the minimum.
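Before turning to the descent itself, here is a quick numerical check of step 2 (my own illustration on random matrices): the derivative of $\Vert\boldsymbol{G} + \lambda\boldsymbol{\Theta}\Vert_*$ with respect to $\lambda$ matches $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$ computed via $\msign$:

```python
import numpy as np

def msign(A):
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

nuclear = lambda A: np.linalg.svd(A, compute_uv=False).sum()

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
Theta = rng.standard_normal((5, 3))
lam = 0.7

# analytic derivative: tr(Theta^T msign(G + lam * Theta))
analytic = np.trace(Theta.T @ msign(G + lam * Theta))

# central finite difference in lambda
eps = 1e-6
numeric = (nuclear(G + (lam + eps) * Theta) - nuclear(G + (lam - eps) * Theta)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places
```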

Gradient Descent

Once the objective \eqref{eq:muon-spectral-obj} is determined, we can minimize it by gradient descent. The gradient is already available, namely $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, so the gradient descent update is: \begin{equation}\lambda \quad \leftarrow\quad \lambda - \eta \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\end{equation} Of course, we could also apply $\sign$ to $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, i.e., SignSGD; such variations can be used freely. Judging by the update rule alone, dual gradient descent is much simpler than the fixed-point iteration we proposed earlier. However, in many cases it requires significantly more iterations, and may need a carefully tuned learning rate or added momentum to converge.
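As an illustration of this inner loop (my own sketch on synthetic random matrices; the step size and iteration count are hand-picked for this toy example), one might write:

```python
import numpy as np

def msign(A):
    # msign(A) = U V^T from the thin SVD of A
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
Theta = rng.standard_normal((5, 3))

# dual gradient descent on lambda: minimize ||G + lambda * Theta||_*
lam, eta = 0.0, 0.03
for _ in range(500):
    Phi = msign(G + lam * Theta)
    lam -= eta * np.trace(Theta.T @ Phi)  # gradient of the dual objective w.r.t. lambda

Phi = msign(G + lam * Theta)
print(np.trace(Theta.T @ Phi))  # close to 0 at convergence (how close depends on eta)
```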

So for solving the equation $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$ by itself, dual gradient descent is not necessarily the ideal tool. However, our ultimate goal is not the equation per se but computing $\boldsymbol{\Phi}$ as the update direction for the model, and model optimization is itself an iterative process. We can therefore cache $\lambda$ across steps and adopt an approximation strategy in which $\lambda$ is updated synchronously with the model parameters: \begin{equation}\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \lambda\boldsymbol{\Theta}), \quad \boldsymbol{W}\leftarrow\boldsymbol{W}- \eta_1 \boldsymbol{\Phi},\quad \lambda \leftarrow\lambda - \eta_2 \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\end{equation} In this way, each training step only needs one extra, almost-free update $\lambda \leftarrow \lambda - \eta_2 \tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})$, giving an approximate implementation of the original objective \eqref{eq:muon-spectral}. Formally, it amounts to a kind of adaptive Weight Decay for Muon.
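A schematic of this coupled update (my own sketch; the toy quadratic loss, matrix sizes, and step sizes below are illustrative assumptions, not part of the original derivation):

```python
import numpy as np

def msign(A):
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
n, m = 8, 4
W = rng.standard_normal((n, m))       # model parameter
Theta = rng.standard_normal((n, m))   # constraint direction (a fixed random matrix, purely for illustration)
Target = rng.standard_normal((n, m))

lam, eta1, eta2 = 0.0, 0.05, 0.05
for step in range(100):
    G = W - Target                     # gradient of the toy loss 0.5 * ||W - Target||_F^2
    Phi = msign(G + lam * Theta)       # update direction under the current lambda
    W = W - eta1 * Phi                 # parameter step
    lam = lam - eta2 * np.trace(Theta.T @ Phi)  # one cheap dual step per training step
```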

On Stiefel

Having discussed the relatively simple "Muon + Spectral Sphere," let's look at "Muon + Stiefel," i.e., objective \eqref{eq:muon-stiefel}. Here, the undetermined matrix $\boldsymbol{X}$ carries the constraint $\boldsymbol{X}=\boldsymbol{X}^{\top}$. We remove this constraint by parameterizing $\boldsymbol{X}=\boldsymbol{\Lambda}+\boldsymbol{\Lambda}^{\top}$, where $\boldsymbol{\Lambda}\in\mathbb{R}^{m\times m}$ is an arbitrary matrix. It can then be found that \begin{equation}\nabla_{\boldsymbol{\Lambda}}\Vert\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}\Vert_* = \boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W} \end{equation} where $\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X})$. Therefore, solving the system of equations $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}=\boldsymbol{0}$ can similarly be transformed into finding the minimum point of $\Vert\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}\Vert_*$, solved with gradient descent: \begin{equation}\boldsymbol{\Lambda} \quad\leftarrow\quad \boldsymbol{\Lambda} - \eta(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}) \end{equation} Since $\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W}$ is automatically symmetric, it is also feasible to update $\boldsymbol{X}$ directly via $\boldsymbol{X} \leftarrow\boldsymbol{X} - \eta(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})$, which preserves the symmetry of $\boldsymbol{X}$. Iterating this synchronously with $\boldsymbol{W}$, we get: \begin{equation}\boldsymbol{\Phi} = \msign(\boldsymbol{G} + \boldsymbol{W}\boldsymbol{X}), \quad \boldsymbol{W}\leftarrow\boldsymbol{W}- \eta_1 \boldsymbol{\Phi},\quad \boldsymbol{X} \leftarrow\boldsymbol{X} - \eta_2(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})\end{equation} This realizes an approximation of objective \eqref{eq:muon-stiefel}, and the extra update $\boldsymbol{X} \leftarrow \boldsymbol{X} - \eta_2(\boldsymbol{W}^{\top}\boldsymbol{\Phi}+\boldsymbol{\Phi}^{\top}\boldsymbol{W})$ at each step is likewise virtually free.
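Analogously, a schematic of the coupled Stiefel update (again my own sketch with an illustrative toy loss; the dual gradient is evaluated with the same $\boldsymbol{W}$ that was used to compute $\boldsymbol{\Phi}$):

```python
import numpy as np

def msign(A):
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
n, m = 8, 4
W = msign(rng.standard_normal((n, m)))  # start on the Stiefel manifold: W^T W = I
X = np.zeros((m, m))                    # symmetric undetermined matrix, cached across steps
Target = rng.standard_normal((n, m))

eta1, eta2 = 0.05, 0.05
for step in range(100):
    G = W - Target                       # gradient of the toy loss 0.5 * ||W - Target||_F^2
    Phi = msign(G + W @ X)
    dual_grad = W.T @ Phi + Phi.T @ W    # symmetric, so the update keeps X symmetric
    W = W - eta1 * Phi                   # parameter step (without a separate retraction,
                                         # W gradually drifts off W^T W = I; omitted here)
    X = X - eta2 * dual_grad             # cheap dual step per training step
```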

Lagrange Multipliers

In these two examples, is it a pure coincidence that the equations to be solved exactly equal the gradient of some nuclear norm objective? Of course not. As we mentioned in the "Basic Concepts" section, this is a natural result of the Lagrange multiplier method. In this section, we will expand on this discussion.

For ease of understanding, we'll use the simpler objective \eqref{eq:muon-spectral} as an example. It can be equivalently written as: \begin{equation}\max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1} \min_{\lambda\in\mathbb{R}}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})\end{equation} To see why this transformation holds, note that the maximizing $\boldsymbol{\Phi}$ must satisfy $\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi})=0$; otherwise, the inner $\min$ over $\lambda$ drives the expression to negative infinity, so such a $\boldsymbol{\Phi}$ can never achieve the outer $\max$. As for relaxing $\Vert\boldsymbol{\Phi}\Vert_2 = 1$ to $\Vert\boldsymbol{\Phi}\Vert_2\leq 1$, it does not change the maximum (which is always attained on the boundary), but it makes the feasible region of $\boldsymbol{\Phi}$ a convex set.

With this equivalent form, we can use the Minimax theorem to swap the positions of $\min$ and $\max$: \begin{equation}\begin{aligned} &\,\max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1} \min_{\lambda\in\mathbb{R}}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \min_{\lambda\in\mathbb{R}}\max_{\Vert\boldsymbol{\Phi}\Vert_2\leq 1}\tr(\boldsymbol{G}^{\top}\boldsymbol{\Phi}) + \lambda\tr(\boldsymbol{\Theta}^{\top} \boldsymbol{\Phi}) \\ =&\, \min_{\lambda\in\mathbb{R}} \Vert\boldsymbol{G} + \lambda \boldsymbol{\Theta}\Vert_* \end{aligned}\end{equation} Here, the step taking the $\max$ over $\Vert\boldsymbol{\Phi}\Vert_2\leq 1$ is a basic result of the Muon derivation, so calculating the $\max$ first presents no difficulty. Thus, we have obtained the dual objective $\Vert\boldsymbol{G} + \lambda \boldsymbol{\Theta}\Vert_*$ of the original problem \eqref{eq:muon-spectral}.
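As a small numerical check of the inner $\max$ step (my own illustration): for any fixed $\lambda$, the choice $\boldsymbol{\Phi}=\msign(\boldsymbol{G}+\lambda\boldsymbol{\Theta})$ has spectral norm 1 and attains the value $\Vert\boldsymbol{G}+\lambda\boldsymbol{\Theta}\Vert_*$:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 3))
Theta = rng.standard_normal((5, 3))
lam = 0.7
A = G + lam * Theta

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Phi = U @ Vt                         # msign(A); its spectral norm is 1

print(np.trace(A.T @ Phi), s.sum())  # both equal the nuclear norm ||A||_*
```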

Some readers might wonder how this version of the Lagrange multiplier method differs from the one they learned. The difference is that here it is generalized to optimization over general convex sets, and the interchangeability of $\min$ and $\max$ is treated rigorously to ensure the final result is really what we want. The Lagrange multiplier method as usually taught is essentially a heuristic recipe for constrained optimization in $\mathbb{R}^n$, with little discussion of its theoretical guarantees.

Summary

This article introduced the idea of using dual gradient descent to find the direction of steepest descent on manifolds. This is the same method used by the Thinking Machines Lab blog "Modular Manifolds" to solve for Muon on the Stiefel manifold.