The Derivative of msign

By 苏剑林 | June 13, 2025

In this article, we will derive the derivative formula for the $\msign$ operator. If you are looking to combine TTT (Test-Time Training) with Muon, as seen in "Test-Time Training Done Right", then this article may be helpful to you.

Two Definitions

This article assumes that readers already have some understanding of $\msign$. If not, you can first read "Appreciating the Muon Optimizer: The Essential Leap from Vectors to Matrices" and "Newton-Schulz Iteration for the msign Operator (Part 1)". Let there be a matrix $\boldsymbol{M}\in\mathbb{R}^{n\times m}$, then:

\begin{equation}\boldsymbol{U},\boldsymbol{\Sigma},\boldsymbol{V}^{\top} = \text{SVD}(\boldsymbol{M}) \quad\Rightarrow\quad \msign(\boldsymbol{M}) = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}\end{equation}

where $\boldsymbol{U}\in\mathbb{R}^{n\times n}, \boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}, \boldsymbol{V}\in\mathbb{R}^{m\times m}$, and $r$ is the rank of $\boldsymbol{M}$. Simply put, $\msign$ is the new matrix obtained by changing all non-zero singular values of the matrix to 1. Based on SVD, we can also prove:

\begin{equation}\msign(\boldsymbol{M}) = (\boldsymbol{M}\boldsymbol{M}^{\top})^{-1/2}\boldsymbol{M}= \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}\end{equation}

Here, $^{-1/2}$ denotes the inverse of the matrix square root. Since $\boldsymbol{M}\boldsymbol{M}^{\top}$ and $\boldsymbol{M}^{\top}\boldsymbol{M}$ are symmetric positive semi-definite matrices, the square root always exists, but its inverse may not; when it is not invertible, we use the pseudo-inverse instead. The name $\msign$ comes from the similarity of the above formula to the real-valued sign function $\sign(x) = x/\sqrt{x^2}$. However, as mentioned before, there is another matrix version of the sign function, which we denote here as $\mcsgn$:

\begin{equation}\mcsgn(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^2)^{-1/2}\end{equation}

This is the same as $\msign$ but with $\boldsymbol{M}^{\top}\boldsymbol{M}$ replaced by $\boldsymbol{M}^2$. Since only a square matrix can be squared, this definition applies only to square matrices. Introducing two similar but distinct definitions in one article can be confusing, but unfortunately both are needed in the calculations that follow, so they have to appear side by side.

$\mcsgn$ possesses similarity invariance. If $\boldsymbol{M}=\boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, then $\mcsgn(\boldsymbol{M})=\boldsymbol{P}\mcsgn(\boldsymbol{\Lambda})\boldsymbol{P}^{-1}$. Furthermore, if $\boldsymbol{\Lambda}$ is a diagonal matrix (which can almost always be achieved in the complex domain), then we have:

\begin{equation}\mcsgn(\boldsymbol{M}) = \boldsymbol{P}\csgn(\boldsymbol{\Lambda})\boldsymbol{P}^{-1}\end{equation}

$\csgn(\boldsymbol{\Lambda})$ indicates that the sign function is applied to each element on the diagonal, where $\csgn(z) = z/\sqrt{z^2}$ is the complex version of the sign function; if the real part of $z$ is non-zero, it equals $\sign(\text{Re}[z])$. Thus, the difference between $\msign$ and $\mcsgn$ is that the former applies the sign function to the singular values based on Singular Value Decomposition, while the latter applies it to the eigenvalues based on Eigenvalue Decomposition. When $\boldsymbol{M}$ is a symmetric matrix, they are equal.
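To make the two definitions concrete, here is a minimal NumPy sketch of my own (not from the original post): `msign` via SVD, `mcsgn` via eigendecomposition, together with quick checks of the $(\boldsymbol{M}\boldsymbol{M}^{\top})^{-1/2}\boldsymbol{M}$ formula (using the pseudo-inverse square root) and of the equality of the two operators on a symmetric matrix.

```python
import numpy as np

def msign(M, tol=1e-10):
    # SVD definition: keep the singular vectors, set non-zero singular values to 1
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    r = np.sum(s > tol)                              # numerical rank
    return U[:, :r] @ Vt[:r, :]

def mcsgn(M):
    # eigendecomposition definition; assumes eigenvalues with non-zero real part,
    # so csgn(z) reduces to sign(Re[z])
    lam, P = np.linalg.eig(M)                        # M = P Λ P^{-1}
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))

# (M M^T)^{-1/2} M with a pseudo-inverse square root (M M^T is rank-deficient here)
w, Q = np.linalg.eigh(M @ M.T)
inv_sqrt = Q @ np.diag([x**-0.5 if x > 1e-10 else 0.0 for x in w]) @ Q.T
print(np.allclose(msign(M), inv_sqrt @ M))           # True

S = rng.standard_normal((4, 4)); S = S + S.T         # symmetric test matrix
print(np.allclose(msign(S), mcsgn(S)))               # True
```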

Unified Computation

Currently, the numerical calculation of $\msign$ mainly relies on the following format of "Newton-Schulz iteration":

\begin{equation}\boldsymbol{X}_0 = \frac{\boldsymbol{M}}{\Vert\boldsymbol{M}\Vert_F},\qquad \boldsymbol{X}_{t+1} = a_{t+1}\boldsymbol{X}_t + b_{t+1}\boldsymbol{X}_t(\boldsymbol{X}_t^{\top}\boldsymbol{X}_t) + c_{t+1}\boldsymbol{X}_t(\boldsymbol{X}_t^{\top}\boldsymbol{X}_t)^2\end{equation}

As for the choice of coefficients, we have discussed this in detail in "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)". A newer result from the second part is:

$$
\begin{array}{c|ccc}
\hline
t & a\times 1.01 & b\times 1.01^3 & c\times 1.01^5 \\
\hline
1 & 8.28721 & -23.5959 & 17.3004 \\
2 & 4.10706 & -2.94785 & 0.544843 \\
3 & 3.94869 & -2.9089 & 0.551819 \\
4 & 3.31842 & -2.48849 & 0.510049 \\
5 & 2.30065 & -1.6689 & 0.418807 \\
6 & 1.8913 & -1.268 & 0.376804 \\
7 & 1.875 & -1.25 & 0.375 \\
8 & 1.875 & -1.25 & 0.375 \\
\hline
\end{array}
$$

The advantage of this result is that it can be truncated or extended at any step: keeping only the first 5 rows gives the optimal 5-step iteration; keeping the first 6 rows gives the optimal 6-step iteration, whose approximation is guaranteed to be better than that of the 5-step one; and so on.
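For reference, here is a minimal NumPy sketch of my own of how such an iteration is implemented; `coeffs` is assumed to be a list of $(a_t, b_t, c_t)$ tuples, e.g. transcribed from the table above with its scaling convention already applied.

```python
import numpy as np

def msign_newton_schulz(M, coeffs):
    """Newton-Schulz iteration for msign.

    coeffs: list of (a, b, c) tuples, one per iteration step."""
    X = M / np.linalg.norm(M)                  # X_0 = M / ||M||_F
    for a, b, c in coeffs:
        A = X.T @ X                            # X_t^T X_t
        X = a * X + b * (X @ A) + c * (X @ A @ A)
    return X

# e.g. repeating the fixed-point coefficients from the last rows of the table;
# this converges, though more slowly than the tuned schedule (more steps may be
# needed for ill-conditioned M)
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
O = msign_newton_schulz(M, [(1.875, -1.25, 0.375)] * 15)
```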

As for $\mcsgn$, it simply replaces the $\boldsymbol{M}^{\top}\boldsymbol{M}$ of $\msign$ with $\boldsymbol{M}^2$, so in theory the Newton-Schulz iteration can also be used. However, since the eigenvalues can be complex, convergence in the general case is much harder to guarantee. Nevertheless, if we can confirm that the eigenvalues of the matrix $\boldsymbol{M}$ are all real (as is the case for the block triangular matrix to which we will apply $\mcsgn$ later in this article), then we can reuse the $\msign$ iteration and its coefficients:

\begin{equation}\boldsymbol{X}_0 = \frac{\boldsymbol{M}}{\sqrt{\tr(\boldsymbol{M}^2)}},\qquad \boldsymbol{X}_{t+1} = a_{t+1}\boldsymbol{X}_t + b_{t+1}\boldsymbol{X}_t^3 + c_{t+1}\boldsymbol{X}_t^5\end{equation}
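Under the same assumption (all eigenvalues real), a matching sketch of the $\mcsgn$ iteration, reusing the same `coeffs` format as above (again my own code, not from the post):

```python
import numpy as np

def mcsgn_newton_schulz(M, coeffs):
    """Newton-Schulz iteration for mcsgn; valid only if the eigenvalues of M
    are all real, e.g. the block-triangular matrix used later in the article."""
    X = M / np.sqrt(np.trace(M @ M))           # X_0 = M / sqrt(tr(M^2))
    for a, b, c in coeffs:
        X2 = X @ X
        X = a * X + b * (X @ X2) + c * (X @ X2 @ X2)
    return X
```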

Derivation Process

Now we come to the main topic: computing the derivative of $\boldsymbol{O}=\msign(\boldsymbol{M})$. If you only use Muon as a regular optimizer, you will probably never need this. But if, following TTT, we use the Muon update rule to build an RNN model, then $\msign$ becomes part of the model's forward pass, and backpropagating through the whole model naturally requires the derivative of $\msign$.

Since $\msign$ is calculated through Newton-Schulz iteration, it can actually be backpropagated directly. Therefore, numerical differentiation of $\msign$ itself is not a problem. However, backpropagating through iterations means many intermediate states must be stored, which often leads to memory explosion. Thus, we hope to obtain an analytical solution for the derivative to simplify things. On the other hand, in "Derivative of SVD", we actually derived the derivative of $\msign$, but that was based on the SVD expression, and SVD is not a GPU-efficient algorithm.

Therefore, our goal is to seek a result that does not rely on SVD and can be computed efficiently. We start from the identity:

\begin{equation}\boldsymbol{M} = \boldsymbol{O}\boldsymbol{M}^{\top}\boldsymbol{O}\end{equation}
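To see why this identity holds, write $\boldsymbol{O}=\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$ and $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, and expand (the last step uses the fact that the singular values beyond the $r$-th are zero):

\begin{equation}\boldsymbol{O}\boldsymbol{M}^{\top}\boldsymbol{O} = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}\,\boldsymbol{V}\boldsymbol{\Sigma}^{\top}\boldsymbol{U}^{\top}\,\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top} = \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top} = \boldsymbol{M}\end{equation}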

Differentiating both sides of this identity gives:

\begin{equation}d\boldsymbol{M} = (d\boldsymbol{O})\boldsymbol{M}^{\top}\boldsymbol{O} + \boldsymbol{O}(d\boldsymbol{M}^{\top})\boldsymbol{O} + \boldsymbol{O}\boldsymbol{M}^{\top}(d\boldsymbol{O})\label{eq:dm-do}\end{equation}

The difficulty with this result is that it is not easy to isolate $d\boldsymbol{M}=f(d\boldsymbol{O})$ or $d\boldsymbol{O}=f(d\boldsymbol{M})$, making it hard to see the relationship between $\nabla_{\boldsymbol{O}}\mathcal{L}$ and $\nabla_{\boldsymbol{M}}\mathcal{L}$ ($\mathcal{L}$ being the loss function). In this case, the best way is to return to the fundamental approach of matrix differentiation — the "trace trick":

Trace trick: If we can find a matrix $\boldsymbol{G}$ of the same shape as $\boldsymbol{M}$ that satisfies \begin{equation}d\mathcal{L}=\langle \boldsymbol{G}, d\boldsymbol{M}\rangle_F = \tr(\boldsymbol{G}^{\top} (d\boldsymbol{M}))\end{equation} then $\boldsymbol{G} = \nabla_{\boldsymbol{M}}\mathcal{L}$.

The essence of the trace trick is to convert matrices/vectors into scalars, and then convert scalars into traces, which then allows the use of trace identities:

\begin{equation}\tr(\boldsymbol{A}\boldsymbol{B}) = \tr(\boldsymbol{B}\boldsymbol{A}) = \tr(\boldsymbol{A}^{\top}\boldsymbol{B}^{\top}) = \tr(\boldsymbol{B}^{\top}\boldsymbol{A}^{\top})\end{equation}
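These identities are easy to verify numerically; a tiny NumPy check (my own aside):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((4, 3))
t = np.trace(A @ B)
print(np.allclose(t, np.trace(B @ A)),
      np.allclose(t, np.trace(A.T @ B.T)))    # True True
```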

Now let $\boldsymbol{X}$ be any matrix of the same shape as $\boldsymbol{M}$. Multiply both sides of equation $\eqref{eq:dm-do}$ by $\boldsymbol{X}^{\top}$ and then take the trace:

\begin{equation}\begin{aligned} \tr(\boldsymbol{X}^{\top}(d\boldsymbol{M})) =&\, \tr(\boldsymbol{X}^{\top}(d\boldsymbol{O})\boldsymbol{M}^{\top}\boldsymbol{O}) + \tr(\boldsymbol{X}^{\top}\boldsymbol{O}(d\boldsymbol{M}^{\top})\boldsymbol{O}) + \tr(\boldsymbol{X}^{\top}\boldsymbol{O}\boldsymbol{M}^{\top}(d\boldsymbol{O})) \\[7pt] =&\, \tr(\boldsymbol{M}^{\top}\boldsymbol{O}\boldsymbol{X}^{\top}(d\boldsymbol{O})) + \tr(\boldsymbol{O}\boldsymbol{X}^{\top}\boldsymbol{O}(d\boldsymbol{M}^{\top})) + \tr(\boldsymbol{X}^{\top}\boldsymbol{O}\boldsymbol{M}^{\top}(d\boldsymbol{O})) \\[7pt] =&\, \tr(\boldsymbol{M}^{\top}\boldsymbol{O}\boldsymbol{X}^{\top}(d\boldsymbol{O})) + \tr(\boldsymbol{O}^{\top}\boldsymbol{X}\boldsymbol{O}^{\top}(d\boldsymbol{M})) + \tr(\boldsymbol{X}^{\top}\boldsymbol{O}\boldsymbol{M}^{\top}(d\boldsymbol{O})) \\[7pt] \end{aligned}\end{equation}

From this, we obtain:

\begin{equation}\tr((\boldsymbol{X}^{\top} - \boldsymbol{O}^{\top}\boldsymbol{X}\boldsymbol{O}^{\top})(d\boldsymbol{M})) = \tr((\boldsymbol{M}^{\top}\boldsymbol{O}\boldsymbol{X}^{\top} + \boldsymbol{X}^{\top}\boldsymbol{O}\boldsymbol{M}^{\top})(d\boldsymbol{O}))\end{equation}

If we let $\boldsymbol{M}^{\top}\boldsymbol{O}\boldsymbol{X}^{\top} + \boldsymbol{X}^{\top}\boldsymbol{O}\boldsymbol{M}^{\top}=(\nabla_{\boldsymbol{O}}\mathcal{L})^{\top}$, then the right-hand side of the above equation is exactly $\tr\big((\nabla_{\boldsymbol{O}}\mathcal{L})^{\top}(d\boldsymbol{O})\big) = d\mathcal{L}$, and applying the trace trick to the left-hand side gives $\boldsymbol{X}^{\top} - \boldsymbol{O}^{\top}\boldsymbol{X}\boldsymbol{O}^{\top}=(\nabla_{\boldsymbol{M}}\mathcal{L})^{\top}$. This indicates that the relationship between $\nabla_{\boldsymbol{M}}\mathcal{L}$ and $\nabla_{\boldsymbol{O}}\mathcal{L}$ is described by the following system of equations:

\begin{gather}\boldsymbol{X} - \boldsymbol{O}\boldsymbol{X}^{\top}\boldsymbol{O} = \nabla_{\boldsymbol{M}}\mathcal{L} \label{eq:g-m}\\[7pt] \boldsymbol{X}\boldsymbol{O}^{\top}\boldsymbol{M} + \boldsymbol{M}\boldsymbol{O}^{\top}\boldsymbol{X} = \nabla_{\boldsymbol{O}}\mathcal{L}\label{eq:g-o}\end{gather}

Theoretical Form

So, the problem now becomes solving for $\boldsymbol{X}$ from equation $\eqref{eq:g-o}$, and then substituting it into equation $\eqref{eq:g-m}$ to obtain $\nabla_{\boldsymbol{M}}\mathcal{L}$, which expresses $\nabla_{\boldsymbol{M}}\mathcal{L}$ as a function of $\nabla_{\boldsymbol{O}}\mathcal{L}$, avoiding the direct calculation of $\nabla_{\boldsymbol{M}}\boldsymbol{O}$. Obviously, the only difficulty is solving equation $\eqref{eq:g-o}$.

In this section, we first seek a theoretical solution based on SVD, which is not very practical but helps us understand the properties of equation $\eqref{eq:g-o}$ and aligns with previous results. Let $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{Y}\boldsymbol{V}^{\top}$, and we also have $\boldsymbol{O}^{\top}\boldsymbol{M} = (\boldsymbol{M}^{\top}\boldsymbol{M})^{1/2} = \boldsymbol{V}(\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma})^{1/2}\boldsymbol{V}^{\top}$ and $\boldsymbol{M}\boldsymbol{O}^{\top}=(\boldsymbol{M}\boldsymbol{M}^{\top})^{1/2} = \boldsymbol{U}(\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top})^{1/2}\boldsymbol{U}^{\top}$. Substituting these identities into equation $\eqref{eq:g-o}$ gives:

\begin{equation}\boldsymbol{U}\boldsymbol{Y}(\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma})^{1/2}\boldsymbol{V}^{\top} + \boldsymbol{U}(\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top})^{1/2}\boldsymbol{Y}\boldsymbol{V}^{\top} = \nabla_{\boldsymbol{O}}\mathcal{L}\end{equation}

which is:

\begin{equation}\boldsymbol{Y}(\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma})^{1/2} + (\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top})^{1/2}\boldsymbol{Y} = \boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}\label{eq:g-o-2}\end{equation}

If written in component form, the left side is $\boldsymbol{Y}_{i,j}\sigma_j + \sigma_i \boldsymbol{Y}_{i,j} = (\sigma_i + \sigma_j)\boldsymbol{Y}_{i,j}$, where $\sigma_1, \sigma_2, \dots, \sigma_r$ are the non-zero singular values of $\boldsymbol{M}$, and $0 = \sigma_{r+1} = \sigma_{r+2} = \dots$. Clearly, if $\boldsymbol{M}$ is a full-rank square matrix, we can solve for:

\begin{equation}\boldsymbol{Y} = (\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}) \oslash \boldsymbol{S}\end{equation}

where $\boldsymbol{S}_{i,j} = \sigma_i + \sigma_j$, and $\oslash$ is the Hadamard division (element-wise division). If we then substitute $\boldsymbol{X}=\boldsymbol{U}\boldsymbol{Y}\boldsymbol{V}^{\top}$ into equation $\eqref{eq:g-m}$, we get a result consistent with that in "Derivative of SVD". This convergence from different paths also strengthens our confidence that our derivation so far is correct.
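As a concrete rendering of this section's formulas, here is a minimal NumPy sketch of my own for the full-rank square case; `G_O` stands for $\nabla_{\boldsymbol{O}}\mathcal{L}$ and the returned matrix is $\nabla_{\boldsymbol{M}}\mathcal{L}$.

```python
import numpy as np

def msign_grad_svd(M, G_O):
    """SVD-based gradient: map ∇_O L to ∇_M L (full-rank square M only)."""
    U, s, Vt = np.linalg.svd(M)
    Y = (U.T @ G_O @ Vt.T) / (s[:, None] + s[None, :])   # Y = (U^T G_O V) ⊘ S
    X = U @ Y @ Vt                                        # X = U Y V^T
    O = U @ Vt                                            # msign(M) for full rank
    return X - O @ X.T @ O           # ∇_M L: first equation of the system above
```

For rank-deficient or rectangular $\boldsymbol{M}$, the element-wise division would be replaced by the $\epsilon$-perturbed version discussed next.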

What if $\boldsymbol{M}$ is not full-rank or not square? In this case, if the $\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}$ on the right side doesn't "cooperate," equation $\eqref{eq:g-o-2}$ will have no solution. However, since equation $\eqref{eq:g-o-2}$ is derived from an actual problem, it must have a solution, meaning the right side "must cooperate"! What does cooperating mean? If the rank of $\boldsymbol{M}$ is $r$, then the entries of $\boldsymbol{S}$ vanish exactly on the block $\boldsymbol{S}_{[r:,r:]}$ (an entry $\sigma_i + \sigma_j$ is zero only when both $i > r$ and $j > r$). For equation $\eqref{eq:g-o-2}$ to have a solution, the corresponding block $(\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V})_{[r:,r:]}$ must therefore also be zero. Under this condition, we can write:

\begin{equation}\boldsymbol{Y} = \lim_{\epsilon\to 0}\,\, (\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}) \oslash (\boldsymbol{S} + \epsilon) \end{equation}

This is equivalent to saying that we can add some perturbation to the singular values, converting it to a case where all singular values are non-zero, and then let the perturbation approach zero after the calculation is complete to obtain the correct result.

Efficient Solution

The SVD solution in the previous section often only has theoretical value. To compute efficiently on a GPU, we still need to seek other forms of solutions. Introducing the notation $\boldsymbol{M}\boldsymbol{O}^{\top}=\boldsymbol{A}, \boldsymbol{O}^{\top}\boldsymbol{M}=\boldsymbol{B}, \nabla_{\boldsymbol{O}}\mathcal{L}=\boldsymbol{C}$, equation $\eqref{eq:g-o}$ is actually a Sylvester equation:

\begin{equation}\boldsymbol{A}\boldsymbol{X}+\boldsymbol{X}\boldsymbol{B} = \boldsymbol{C}\end{equation}
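As an aside of mine (not part of the post's method), for small matrices one can cross-check any solver for this equation against SciPy's reference routine `scipy.linalg.solve_sylvester`, which solves exactly this form $\boldsymbol{A}\boldsymbol{X}+\boldsymbol{X}\boldsymbol{B}=\boldsymbol{C}$:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = A @ A.T + np.eye(4)   # SPD, like M O^T
B = rng.standard_normal((4, 4)); B = B @ B.T + np.eye(4)   # SPD, like O^T M
C = rng.standard_normal((4, 4))                            # like ∇_O L
X = solve_sylvester(A, B, C)
print(np.allclose(A @ X + X @ B, C))                       # True
```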

There are many methods for solving the Sylvester equation. Among them, the most elegant and GPU-friendly is probably the scheme based on $\mcsgn$ (not $\msign$), which we borrow here from "Fast Differentiable Matrix Square Root". First, starting from the above equation, we can verify that the following identity holds:

\begin{equation}\begin{bmatrix} \boldsymbol{A} & -\boldsymbol{C} \\ \boldsymbol{0} & -\boldsymbol{B}\end{bmatrix} = \begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}\begin{bmatrix} \boldsymbol{A} & \boldsymbol{0} \\ \boldsymbol{0} & -\boldsymbol{B}\end{bmatrix}\begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}^{-1} \end{equation}

Taking $\mcsgn$ on both sides, according to the properties of $\mcsgn$, we have:

\begin{equation}\mcsgn\left(\begin{bmatrix} \boldsymbol{A} & -\boldsymbol{C} \\ \boldsymbol{0} & -\boldsymbol{B}\end{bmatrix}\right) = \begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}\begin{bmatrix} \mcsgn(\boldsymbol{A}) & \boldsymbol{0} \\ \boldsymbol{0} & -\mcsgn(\boldsymbol{B})\end{bmatrix}\begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}^{-1} \end{equation}

Note that $\boldsymbol{A}=\boldsymbol{M}\boldsymbol{O}^{\top}=(\boldsymbol{M}\boldsymbol{M}^{\top})^{1/2}$ and $\boldsymbol{B}=\boldsymbol{O}^{\top}\boldsymbol{M}=(\boldsymbol{M}^{\top}\boldsymbol{M})^{1/2}$. Assuming $\boldsymbol{M}$ is a full-rank square matrix, then $\boldsymbol{A}$ and $\boldsymbol{B}$ are both positive definite symmetric matrices. The $\mcsgn$ of a positive definite symmetric matrix is the identity matrix $\boldsymbol{I}$, so:

\begin{equation}\mcsgn\left(\begin{bmatrix} \boldsymbol{A} & -\boldsymbol{C} \\ \boldsymbol{0} & -\boldsymbol{B}\end{bmatrix}\right) = \begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}\begin{bmatrix} \boldsymbol{I} & \boldsymbol{0} \\ \boldsymbol{0} & -\boldsymbol{I}\end{bmatrix}\begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}^{-1} = \begin{bmatrix} \boldsymbol{I} & -2\boldsymbol{X} \\ \boldsymbol{0} & -\boldsymbol{I}\end{bmatrix} \end{equation}

The final simplification uses the identity $\begin{bmatrix} \boldsymbol{I} & \boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}^{-1}=\begin{bmatrix} \boldsymbol{I} & -\boldsymbol{X} \\ \boldsymbol{0} & \boldsymbol{I}\end{bmatrix}$. From this result we see that we only need to compute $\mcsgn$ of the block matrix $\begin{bmatrix} \boldsymbol{A} & -\boldsymbol{C} \\ \boldsymbol{0} & -\boldsymbol{B}\end{bmatrix}$ and then read off $-2\boldsymbol{X}$ (and hence $\boldsymbol{X}$) from the upper-right block of the result. Since $\mcsgn$ can be computed efficiently via Newton-Schulz iteration, this scheme is GPU-friendly.
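Here is a toy NumPy sketch of my own of this recipe; for clarity it computes $\mcsgn$ by eigendecomposition rather than by Newton-Schulz, and uses random symmetric positive definite matrices as stand-ins for $\boldsymbol{A}=\boldsymbol{M}\boldsymbol{O}^{\top}$ and $\boldsymbol{B}=\boldsymbol{O}^{\top}\boldsymbol{M}$.

```python
import numpy as np

def mcsgn(M):
    lam, P = np.linalg.eig(M)                  # eigenvalues are real here
    return (P @ np.diag(np.sign(lam.real)) @ np.linalg.inv(P)).real

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); A = A @ A.T + np.eye(n)   # stand-in for M O^T
B = rng.standard_normal((n, n)); B = B @ B.T + np.eye(n)   # stand-in for O^T M
C = rng.standard_normal((n, n))                            # stand-in for ∇_O L

block = np.block([[A, -C], [np.zeros((n, n)), -B]])
X = -0.5 * mcsgn(block)[:n, n:]                # upper-right block is -2X
print(np.allclose(A @ X + X @ B, C))           # True
```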

When $\boldsymbol{M}$ is not full-rank or not square, $\boldsymbol{A}$ and $\boldsymbol{B}$ are only positive semi-definite, so their $\mcsgn$ is not $\boldsymbol{I}$. However, our experience from the previous section tells us that since $\nabla_{\boldsymbol{O}}\mathcal{L}$ "must cooperate," we only need to add a small perturbation to $\boldsymbol{\Sigma}$ to make it positive definite. Adding the perturbation here is equivalent to adding $\epsilon \boldsymbol{I}$ to $\boldsymbol{A}$ and $\boldsymbol{B}$, so:

\begin{equation}\boldsymbol{X} = -\frac{1}{2} \left(\lim_{\epsilon \to 0}\,\, \mcsgn\left(\begin{bmatrix} \boldsymbol{A} + \epsilon \boldsymbol{I} & -\boldsymbol{C} \\ \boldsymbol{0} & -\boldsymbol{B} - \epsilon \boldsymbol{I}\end{bmatrix}\right)\right)_{[:n,n:]} \end{equation}

In actual computation we can only choose a relatively small $\epsilon > 0$ and compute approximately; $\epsilon=10^{-3}$ is a reasonable choice, since it lies within the lower-bound range we assumed when deriving the Newton-Schulz iteration coefficients earlier.
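Putting the pieces together, here is a compact sketch (my own arrangement of the formulas above, not code from the post) of the whole backward pass; `mcsgn_fn` is assumed to be any routine computing the matrix sign of a square matrix with real eigenvalues, e.g. the Newton-Schulz version sketched earlier.

```python
import numpy as np

def msign_vjp(M, O, G_O, mcsgn_fn, eps=1e-3):
    """Given M (n×m), O = msign(M) and G_O = ∇_O L, return ∇_M L."""
    n, m = M.shape
    A = M @ O.T + eps * np.eye(n)              # (M M^T)^{1/2} + εI
    B = O.T @ M + eps * np.eye(m)              # (M^T M)^{1/2} + εI
    block = np.block([[A, -G_O], [np.zeros((m, n)), -B]])
    X = -0.5 * mcsgn_fn(block)[:n, n:]         # upper-right block gives -2X
    return X - O @ X.T @ O                     # first equation of the system above
```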

Article Summary

This article has discussed the calculation of the derivative of the $\msign$ operator. If you are interested in the "TTT + Muon" combination, then this article may be helpful to you.