By 苏剑林 | June 07, 2025
In two previous articles, "Newton-Schulz Iteration for the msign Operator (Part 1)" and "Newton-Schulz Iteration for the msign Operator (Part 2)", we discussed the numerical computation of the $\newcommand{msign}{\mathop{\text{msign}}}\newcommand{sign}{\mathop{\text{sign}}}\newcommand{clip}{\mathop{\text{clip}}}\newcommand{mclip}{\mathop{\text{mclip}}}\msign$ operator for matrices. In this article, we shift our focus to the "Singular Value Clipping" operation, which recently sparked heated discussion on @_arohan_'s Twitter and which we also touched upon in "Higher-Order MuP: A Simpler yet Smarter Spectral Condition Scaling". In what follows, we will refer to it simply as $\mclip$.
For a scalar $x$, the $\clip$ operation is defined as: \begin{equation}\clip(x) = \max(\min(x, 1), -1) = \left\{\begin{aligned}1, &\quad x\geq 1 \\ x, &\quad x\in(-1, 1)\\ -1, &\quad x\leq -1 \end{aligned}\right.\end{equation} That is, values greater than $1$ or less than $-1$ are truncated, while everything in between is left unchanged. For a matrix $\boldsymbol{M}\in\mathbb{R}^{n\times m}$, we define its $\mclip$ as: \begin{equation}\mclip(\boldsymbol{M}) = \boldsymbol{U}\clip(\boldsymbol{\Sigma})\boldsymbol{V}^{\top} \end{equation} where $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$ is the SVD of $\boldsymbol{M}$, $\boldsymbol{U}\in\mathbb{R}^{n\times n}$ and $\boldsymbol{V}\in\mathbb{R}^{m\times m}$ are orthogonal matrices, and $\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}$ is the (rectangular) diagonal matrix of singular values. Applying $\clip$ to a diagonal matrix means applying $\clip$ to each diagonal element individually. Since the diagonal elements of $\boldsymbol{\Sigma}$ are always non-negative, we equivalently have: \begin{equation}\mclip(\boldsymbol{M}) = \boldsymbol{U}\min(\boldsymbol{\Sigma}, 1)\boldsymbol{V}^{\top} \end{equation}
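For concreteness, here is a minimal NumPy sketch that computes $\mclip$ straight from this definition (the helper name mclip_svd is mine):

import numpy as np

def mclip_svd(m):
    # Follow the definition directly: SVD, clip the singular values at 1, recompose
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ np.diag(np.minimum(s, 1)) @ vh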
Naturally, SVD is the standard way to compute $\mclip$, as the snippet above does, but SVD is not particularly efficient. Given our experience with $\msign$, it is natural to look for a Newton-Schulz iteration for $\mclip$ as well. That line of thinking is sound, but there is an even smarter approach that builds directly on $\msign$.
This clever idea comes from @leloykun. In his blog post "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration", he observed that we can stand on the shoulders of $\msign$ and express $\mclip$ in terms of it, thereby avoiding the need to find a separate Newton-Schulz iteration specifically for $\mclip$. His solution is ingenious, but personally I find it somewhat unintuitive and not especially efficient. Below, I present my own derivation.
My starting point is a scalar identity (provided by Kimi): \begin{equation}\min(x, 1) = \frac{1}{2} [x + 1 - (x-1)\sign(x-1)]\end{equation} which is easily verified case by case: for $x\geq 1$ the bracket equals $2$, and for $x < 1$ it equals $2x$. For simplicity, let us first assume $\boldsymbol{M}$ is a full-rank square matrix. Then: \begin{equation}\begin{aligned} 2\mclip(\boldsymbol{M}) =&\, \boldsymbol{U} [2\min(\boldsymbol{\Sigma},1)] \boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\sign(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{U} [\boldsymbol{\Sigma} + \boldsymbol{I} - (\boldsymbol{\Sigma} - \boldsymbol{I})\msign(\boldsymbol{\Sigma} - \boldsymbol{I})] \boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} - \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\msign(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \end{aligned}\label{eq:2-mclip-M}\end{equation} Note that: \begin{equation}\begin{aligned} &\, \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I})\msign(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{U}(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{U}^{\top} \boldsymbol{U}\msign(\boldsymbol{\Sigma} - \boldsymbol{I}) \boldsymbol{V}^{\top} \\[6pt] =&\, (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{U}^{\top} - \boldsymbol{I}) \msign(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] =&\, (\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \msign(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \\[6pt] =&\, (\boldsymbol{M} (\boldsymbol{U}\boldsymbol{V}^{\top})^{\top} - \boldsymbol{I}) \msign(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top}) \end{aligned}\end{equation} where the second equality uses the property that for any orthogonal matrices $\boldsymbol{P}, \boldsymbol{Q}$, the identity $\boldsymbol{P}\msign(\boldsymbol{R})\boldsymbol{Q} = \msign(\boldsymbol{P}\boldsymbol{R}\boldsymbol{Q})$ holds. Substituting this back into Equation $\eqref{eq:2-mclip-M}$, we get: \begin{equation}2\mclip(\boldsymbol{M}) = \boldsymbol{M} + \boldsymbol{U}\boldsymbol{V}^{\top} + (\boldsymbol{I} - \boldsymbol{M}(\boldsymbol{U}\boldsymbol{V}^{\top})^{\top}) \msign(\boldsymbol{M} - \boldsymbol{U}\boldsymbol{V}^{\top})\label{eq:mclip-M-core}\end{equation} If $\boldsymbol{M}$ is a general matrix of rank $r$, then $\boldsymbol{U}\boldsymbol{V}^{\top}$ is replaced by $\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$; we can substitute $\boldsymbol{M} = \boldsymbol{U}_{[:,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$ into the equation above to verify that the equality still holds.
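Explicitly, abbreviating $\boldsymbol{U}_r = \boldsymbol{U}_{[:,:r]}$, $\boldsymbol{V}_r = \boldsymbol{V}_{[:,:r]}$ and $\boldsymbol{\Sigma}_r = \boldsymbol{\Sigma}_{[:r,:r]}$, we have $\msign(\boldsymbol{M} - \boldsymbol{U}_r\boldsymbol{V}_r^{\top}) = \boldsymbol{U}_r\sign(\boldsymbol{\Sigma}_r - \boldsymbol{I})\boldsymbol{V}_r^{\top}$ and $\boldsymbol{M}(\boldsymbol{U}_r\boldsymbol{V}_r^{\top})^{\top} = \boldsymbol{U}_r\boldsymbol{\Sigma}_r\boldsymbol{U}_r^{\top}$, so the right-hand side of Equation $\eqref{eq:mclip-M-core}$ reduces to \begin{equation}\boldsymbol{U}_r[\boldsymbol{\Sigma}_r + \boldsymbol{I} - (\boldsymbol{\Sigma}_r - \boldsymbol{I})\sign(\boldsymbol{\Sigma}_r - \boldsymbol{I})]\boldsymbol{V}_r^{\top} = \boldsymbol{U}_r[2\min(\boldsymbol{\Sigma}_r, 1)]\boldsymbol{V}_r^{\top} = 2\mclip(\boldsymbol{M})\end{equation}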
(Note: I would like to thank @YouJiacheng for the discussions regarding this section.)
We know that $\boldsymbol{U}\boldsymbol{V}^{\top}=\msign(\boldsymbol{M})$, so computing $\mclip$ via Equation $\eqref{eq:mclip-M-core}$ requires only two $\msign$ evaluations: \begin{equation}2\mclip(\boldsymbol{M}) = \boldsymbol{M} + \msign(\boldsymbol{M}) + (\boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top}) \msign(\boldsymbol{M} - \msign(\boldsymbol{M}))\end{equation} The computational cost is therefore roughly twice that of a single $\msign$. In contrast, the method in "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration" computes $\msign$ of a matrix about four times the size (twice as large in each dimension), making its cost roughly eight times that of a single $\msign$ call.
Building upon $\msign$, the implementation of Equation $\eqref{eq:mclip-M-core}$ requires only two lines of code. A reference is provided below:
import numpy as np

def msign(m):
    # SVD-based reference implementation: msign(M) = U V^T
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

def mclip(m):
    # Equation (mclip-M-core):
    # 2 mclip(M) = M + msign(M) + (I - M msign(M)^T) msign(M - msign(M))
    ms2 = msign(m - (ms := msign(m)))
    return (m + ms + ms2 - m @ ms.T @ ms2) / 2

# Verify against the definition: clip the singular values directly via SVD
m = np.random.randn(10, 20)
u, s, vh = np.linalg.svd(m, full_matrices=False)
result1 = u @ np.diag(s.clip(0, 1)) @ vh
result2 = mclip(m)
print(np.abs(result1 - result2).mean())  # should be tiny (floating-point error)
Here, SVD is used directly to calculate $\msign$ to quickly verify the correctness of Equation $\eqref{eq:mclip-M-core}$. In actual computations, readers can replace the msign function with the corresponding Newton-Schulz iteration.
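For instance, the following is a minimal sketch of such a replacement, using the basic cubic Newton-Schulz iteration; the coefficients $(3/2, -1/2)$ and the fixed step count are a simple untuned choice, not the optimized schedules from the two earlier articles:

import numpy as np

def msign_ns(m, steps=30):
    # Scale so all singular values lie in (0, 1]; on this interval the map
    # sigma -> 1.5 * sigma - 0.5 * sigma^3 converges to 1 under iteration
    x = m / np.linalg.norm(m)  # Frobenius norm upper-bounds the spectral norm
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x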
We can use the same logic to compute matrix versions of other functions, such as the step function. Define the scalar step function $\newcommand{mstep}{\mathop{\text{mstep}}}\newcommand{step}{\mathop{\text{step}}}$ as: \begin{equation}\step(x) = \frac{1}{2}[\sign(x - 1) + 1]\end{equation} That is, inputs greater than $1$ map to $1$, and inputs less than $1$ map to $0$. Correspondingly, we can define: \begin{equation}\mstep(\boldsymbol{M}) = \boldsymbol{U}\step(\boldsymbol{\Sigma})\boldsymbol{V}^{\top}\end{equation} This operation sets singular values greater than $1$ to $1$ and zeros out those less than $1$. Following the same steps as before, we obtain: \begin{equation}\mstep(\boldsymbol{M}) = \frac{1}{2}[\msign(\boldsymbol{M}) + \msign(\boldsymbol{M} - \msign(\boldsymbol{M}))]\end{equation}

We can even represent even functions. For example, define: \begin{equation}\mathop{\text{msquare}}(\boldsymbol{M}) = \boldsymbol{U} \boldsymbol{\Sigma}^2\boldsymbol{V}^{\top} = \boldsymbol{U}\boldsymbol{V}^{\top}(\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{U}^{\top})(\boldsymbol{U} \boldsymbol{\Sigma}\boldsymbol{V}^{\top}) = \msign(\boldsymbol{M})\boldsymbol{M}^{\top}\boldsymbol{M}\end{equation} This differs from the matrix square $\boldsymbol{M}^2$: the latter is defined only for square matrices and squares the eigenvalues under an eigendecomposition, whereas the formula above squares the singular values under the SVD. More generally, we have: \begin{equation}\boldsymbol{U} \boldsymbol{\Sigma}^{2n}\boldsymbol{V}^{\top} = \msign(\boldsymbol{M})(\boldsymbol{M}^{\top}\boldsymbol{M})^n,\quad \boldsymbol{U} \boldsymbol{\Sigma}^{2n+1}\boldsymbol{V}^{\top} = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^n\end{equation} This indicates that for any polynomial $f(x)$ (not just odd polynomials), $\boldsymbol{U}f(\boldsymbol{\Sigma})\boldsymbol{V}^{\top}$ can be obtained from $\boldsymbol{M}$ and $\msign(\boldsymbol{M})$ through a finite number of matrix additions and multiplications.
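The $\mstep$ and $\mathop{\text{msquare}}$ identities can be checked numerically in the same way as before, reusing the SVD-based msign; in the sketch below, the scaling by $1/3$ is just to make the random matrix's singular values fall on both sides of $1$:

import numpy as np

def msign(m):
    u, s, vh = np.linalg.svd(m, full_matrices=False)
    return u @ vh

def mstep(m):
    # mstep(M) = [msign(M) + msign(M - msign(M))] / 2
    ms = msign(m)
    return (ms + msign(m - ms)) / 2

def msquare(m):
    # msquare(M) = msign(M) M^T M squares the singular values
    return msign(m) @ m.T @ m

m = np.random.randn(10, 20) / 3  # singular values straddle 1
u, s, vh = np.linalg.svd(m, full_matrices=False)
print(np.abs(u @ np.diag((s > 1).astype(float)) @ vh - mstep(m)).mean())
print(np.abs(u @ np.diag(s ** 2) @ vh - msquare(m)).mean())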
This article introduced an approach for applying general functions to the singular values of a matrix using only the matrix itself and its $\msign$, covering singular value clipping, the step function, and polynomials of arbitrary degree (not just odd ones).