Calculating Singular Value Clipping mclip via msign (Part 2)

By 苏剑林 | June 23, 2025

Previously, in "Calculating Singular Value Clipping mclip via msign (Part 1)", we discussed the numerical computation of singular value clipping $\mclip$. The core idea came from @leloykun's article "Numerically Stable Spectral Clipping Via Newton-Schulz Iteration" (since revised and renamed): by finding an expression based on $\msign$, we avoid having to construct a dedicated Newton-Schulz iteration. In Part 1, the author proposed a nested-$\msign$ scheme with lower computational cost. However, a few days ago @leloykun pointed out on Twitter that this scheme suffers from large numerical errors in actual computation. This article analyzes that issue and presents a new scheme that is both more efficient and lower-error.

Basic Concepts

Following convention, let us first review the basic concepts. We begin with the $\clip$ operator for a scalar $x$, defined in general as:

\begin{equation}\clip_{[\alpha,\beta]}(x) = \max(\min(x, \beta), \alpha) = \left\{\begin{aligned}\beta, & \quad x \geq \beta \\ x, & \quad x\in(\alpha, \beta)\\ \alpha, & \quad x\leq \alpha \end{aligned}\right.\end{equation}

When the interval is not specified, the default is $[-1,1]$, i.e., $\clip(x) = \clip_{[-1,1]}(x)$. Let the SVD of the matrix $\boldsymbol{M}\in\mathbb{R}^{n\times m}$ be $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, where $\boldsymbol{U}\in\mathbb{R}^{n\times n}$ and $\boldsymbol{V}\in\mathbb{R}^{m\times m}$ are orthogonal matrices and $\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}$ is the diagonal matrix of singular values; recall that $\msign(\boldsymbol{M}) = \boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$, where $r$ is the rank of $\boldsymbol{M}$. We then define:

\begin{equation}\mclip_{[\alpha,\beta]}(\boldsymbol{M}) = \boldsymbol{U}\clip_{[\alpha,\beta]}(\boldsymbol{\Sigma})\boldsymbol{V}^{\top}\end{equation}

Applying $\clip$ to a diagonal matrix means applying $\clip$ to its diagonal elements individually. Simply put, $\mclip_{[\alpha,\beta]}$ clips the singular values of $\boldsymbol{M}$ to within $[\alpha,\beta]$. Since singular values are non-negative, when $\alpha < 0$, we have $\mclip_{[\alpha,\beta]}(\boldsymbol{M})=\mclip_{[0,\beta]}(\boldsymbol{M})$. However, as we will see later, due to actual calculation errors, considering negative $\alpha$ can have some magical error-canceling effects.
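
For reference, this definition can be realized directly, though expensively, by materializing the SVD. Below is a minimal JAX sketch (the name mclip_svd is our own choice):

import jax.numpy as jnp

def mclip_svd(m, alpha=-1.0, beta=1.0):
    # reference implementation: clip the singular values via an explicit SVD;
    # since singular values are non-negative, [-1, 1] behaves like [0, 1] here
    u, s, vh = jnp.linalg.svd(m, full_matrices=False)
    return u @ jnp.diag(s.clip(alpha, beta)) @ vh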

Theoretical General Solution

The goal of this section is to represent $\mclip$ in terms of $\msign$. The starting point is the identity:

\begin{equation}\clip_{[\alpha,\beta]}(x) = \frac{\alpha + \beta + (\alpha - x)\sign(\alpha - x) - (\beta - x)\sign(\beta - x)}{2}\end{equation}
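
As a quick numerical sanity check of this identity (a throwaway NumPy snippet of our own):

import numpy as np

x = np.linspace(-3, 3, 7)  # covers points below, inside, at, and above [-1, 1]
alpha, beta = -1.0, 1.0
lhs = np.clip(x, alpha, beta)
rhs = (alpha + beta + (alpha - x) * np.sign(alpha - x)
       - (beta - x) * np.sign(beta - x)) / 2
assert np.allclose(lhs, rhs)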

The key to finding this identity is to first write $\clip$ as a linear combination of absolute values and the variable itself, namely $\clip_{[\alpha,\beta]}(x) = \frac{1}{2}\big(\alpha + \beta + |x - \alpha| - |x - \beta|\big)$, and then pass to the $\sign$ operator via $|x|=x\sign(x)$. We won't expand on that here. For simplicity, let us first assume $\boldsymbol{M}$ is a full-rank square matrix. Based on this identity, we have:

\begin{equation}2\mclip_{[\alpha,\beta]}(\boldsymbol{M}) = \boldsymbol{U}\Big((\alpha + \beta)\boldsymbol{I} + (\alpha \boldsymbol{I} - \boldsymbol{\Sigma})\sign(\alpha \boldsymbol{I} - \boldsymbol{\Sigma}) - (\beta \boldsymbol{I} - \boldsymbol{\Sigma})\sign(\beta \boldsymbol{I} - \boldsymbol{\Sigma})\Big)\boldsymbol{V}^{\top}\end{equation}

Expanding the right-hand side yields several types of terms, each of which can be rewritten in terms of $\boldsymbol{M}$ alone (where $\gamma\in\{\alpha,\beta\}$):

\begin{equation}\begin{aligned}
\boldsymbol{U}\boldsymbol{V}^{\top} &= \msign(\boldsymbol{M}) \\[6pt]
\boldsymbol{U}\sign(\gamma \boldsymbol{I} - \boldsymbol{\Sigma})\boldsymbol{V}^{\top} &= \msign(\gamma \boldsymbol{U}\boldsymbol{V}^{\top} - \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}) = \msign(\gamma \msign(\boldsymbol{M}) - \boldsymbol{M}) \\[6pt]
\boldsymbol{U}\boldsymbol{\Sigma}\sign(\gamma \boldsymbol{I} - \boldsymbol{\Sigma})\boldsymbol{V}^{\top} &= \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}\boldsymbol{V}\boldsymbol{U}^{\top}\boldsymbol{U}\sign(\gamma \boldsymbol{I} - \boldsymbol{\Sigma})\boldsymbol{V}^{\top} = \boldsymbol{M}\msign(\boldsymbol{M})^{\top}\msign(\gamma \msign(\boldsymbol{M}) - \boldsymbol{M})
\end{aligned}\end{equation}

Substituting and rearranging, we get:

\begin{equation}\mclip_{[\alpha,\beta]}(\boldsymbol{M}) = \frac{1}{2}\left\{\begin{aligned}&\,(\alpha + \beta)\msign(\boldsymbol{M}) \\ + &\, (\alpha \boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top})\msign(\alpha \msign(\boldsymbol{M}) - \boldsymbol{M})\\ - &\, (\beta \boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top})\msign(\beta \msign(\boldsymbol{M}) - \boldsymbol{M}) \end{aligned}\right\}\label{eq:general}\end{equation}

For non-square or non-full-rank matrices, one can verify the result by substituting $\msign(\boldsymbol{M})=\boldsymbol{U}_{[:,:r]}\boldsymbol{V}_{[:,:r]}^{\top}$, where $r$ is the rank of $\boldsymbol{M}$. Thus, the above equation is the theoretical general solution for $\mclip$.
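
For concreteness, Equation \eqref{eq:general} transcribes directly into code. A minimal JAX sketch follows (the name mclip_general is ours; msign denotes the Newton-Schulz routine listed in the comparison code at the end of this article):

import jax.numpy as jnp

def mclip_general(m, alpha, beta):
    # direct transcription of the general solution above
    ms = msign(m)
    eye = jnp.eye(m.shape[-2])
    term_a = (alpha * eye - m @ ms.mT) @ msign(alpha * ms - m)
    term_b = (beta * eye - m @ ms.mT) @ msign(beta * ms - m)
    return ((alpha + beta) * ms + term_a - term_b) / 2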

Initial Form

Equation \eqref{eq:general} appears to require at least three $\msign$ calls, and the inputs of the last two involve the result of the first, so it is formally a nested $\msign$. When we take $\alpha=0, \beta=1$, the count drops to two: since $\msign(-\boldsymbol{M}) = -\msign(\boldsymbol{M})$ and $\boldsymbol{M}\msign(\boldsymbol{M})^{\top}\msign(\boldsymbol{M}) = \boldsymbol{M}$, the $\alpha$-term collapses to $\boldsymbol{M}$, giving:

\begin{equation}\mclip(\boldsymbol{M}) = \frac{1}{2}\Big[\boldsymbol{M} + \msign(\boldsymbol{M}) + (\boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top}) \msign(\boldsymbol{M} - \msign(\boldsymbol{M}))\Big]\label{eq:mclip-1}\end{equation}

This is the result the author gave in the previous article "Calculating Singular Value Clipping mclip via msign (Part 1)", requiring only two $\msign$ calls. However, actual tests show that when the singular values of $\boldsymbol{M}$ are large and the $\msign$ computation is low-precision, this equation produces significant errors, far greater than those of @leloykun's scheme. On the other hand, @leloykun's scheme requires computing $\msign$ of the block matrix $\begin{bmatrix}\boldsymbol{I} & \boldsymbol{M} \\ \boldsymbol{M}^{\top} & \boldsymbol{I}\end{bmatrix}$, which has roughly four times as many entries and is therefore costly. We would thus like to see whether the scheme here can be improved.

Removing Nesting

Intuitively, the source of error is the cumulative error caused by the nested $\msign$. Fortunately, using a simple trick, the nesting can indeed be removed! First, it can be proven that:

\begin{equation}\begin{aligned} &\,(\boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top}) \msign(\boldsymbol{M} - \msign(\boldsymbol{M})) \\[6pt] =&\, (\msign(\boldsymbol{M}) - \boldsymbol{M}) \msign(\msign(\boldsymbol{M})^{\top}\boldsymbol{M} - \boldsymbol{I}) \end{aligned}\end{equation}
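
Indeed, expanding via the SVD: the left-hand side equals $(\boldsymbol{I} - \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{U}^{\top})\,\boldsymbol{U}\sign(\boldsymbol{\Sigma} - \boldsymbol{I})\boldsymbol{V}^{\top}$ and the right-hand side equals $\boldsymbol{U}(\boldsymbol{I} - \boldsymbol{\Sigma})\boldsymbol{V}^{\top}\,\boldsymbol{V}\sign(\boldsymbol{\Sigma} - \boldsymbol{I})\boldsymbol{V}^{\top}$, and both reduce to $\boldsymbol{U}(\boldsymbol{I} - \boldsymbol{\Sigma})\sign(\boldsymbol{\Sigma} - \boldsymbol{I})\boldsymbol{V}^{\top}$.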

Then we have:

\begin{equation}\msign(\boldsymbol{M})^{\top}\boldsymbol{M} - \boldsymbol{I} = \boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top} - \boldsymbol{I} = \boldsymbol{V}(\boldsymbol{\Sigma}-\boldsymbol{I})\boldsymbol{V}^{\top}\end{equation}

Based on the above, we assert:

\begin{equation}\msign(\msign(\boldsymbol{M})^{\top}\boldsymbol{M} - \boldsymbol{I}) = \msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I}) = \msign(\boldsymbol{V}(\boldsymbol{\Sigma}^2-\boldsymbol{I})\boldsymbol{V}^{\top})\end{equation}

This utilizes a very simple property: $\forall x \geq 0, \sign(x-1) = \sign(x^2-1)$. Using this result, we can obtain:

\begin{equation}\mclip(\boldsymbol{M}) = \frac{1}{2}\Big[\boldsymbol{M} + \msign(\boldsymbol{M}) + (\msign(\boldsymbol{M}) - \boldsymbol{M}) \msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})\Big]\label{eq:mclip-2}\end{equation}

It still requires two $\msign$ operations, but they are no longer nested, meaning that theoretically, the cumulative error from nesting is gone. Tests show that the error of Equation \eqref{eq:mclip-2} is indeed about half that of Equation \eqref{eq:mclip-1}, but it is still inferior to @leloykun's scheme in extreme cases. This indicates that nesting is not the primary source of error.

Mutual Cancellation

Is there more room for improvement? @leloykun's scheme is constructed to be an odd function, so it actually computes $\mclip_{[-1,1]}$ rather than $\mclip_{[0,1]}$. Could it be that this choice lets errors in different parts cancel one another, yielding better numerical precision? To check, we substitute $\alpha=-1, \beta=1$ into Equation \eqref{eq:general}, obtaining:

\begin{equation}\mclip(\boldsymbol{M}) = \frac{1}{2}\left\{\begin{aligned} &\,(\boldsymbol{I} + \boldsymbol{M}\msign(\boldsymbol{M})^{\top})\msign(\msign(\boldsymbol{M}) + \boldsymbol{M}) \\ - &\,(\boldsymbol{I} - \boldsymbol{M}\msign(\boldsymbol{M})^{\top})\msign(\msign(\boldsymbol{M}) - \boldsymbol{M}) \end{aligned}\right\}\end{equation}

Based on the same de-nesting trick as in the previous section, we get:

\begin{equation}\mclip(\boldsymbol{M}) = \frac{1}{2}\left\{\begin{aligned} &\,(\msign(\boldsymbol{M}) + \boldsymbol{M})\msign(\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I}) \\ + &\,(\msign(\boldsymbol{M}) - \boldsymbol{M})\msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I}) \end{aligned}\right\}\label{eq:mclip-3}\end{equation}

Note that $\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I}$ is always a symmetric positive-definite matrix, so in theory $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I})=\boldsymbol{I}$ and Equation \eqref{eq:mclip-2} is recovered. In actual computation, however, the deviation of $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I})$ from $\boldsymbol{I}$ may cancel the error introduced by $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})$, so we let experiments decide whether to keep the extra term. As it turns out, the numerical error of Equation \eqref{eq:mclip-3} is even smaller than that of @leloykun's scheme! This confirms the hypothesis that choosing $\alpha=-1, \beta=1$, which makes $\mclip$ an odd function, helps the errors cancel.

Reflections on the Cause

Why do the errors cancel so conveniently? We can do a brief quantitative analysis. Large errors have two prerequisites: first, $\boldsymbol{M}$ has very large singular values; second, the number of $\msign$ iterations is small, so the $\msign$ result itself is low-precision. Equation \eqref{eq:mclip-3} can be split into a sum of four terms. The terms $\msign(\boldsymbol{M})\msign(\boldsymbol{M}^{\top}\boldsymbol{M} \pm \boldsymbol{I})$ are bounded; even with low $\msign$ precision they essentially cannot blow up. The main error therefore comes from:

\begin{equation}\boldsymbol{M}\msign(\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I}) - \boldsymbol{M}\msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})\label{eq:error-1}\end{equation}

This is proportional to $\boldsymbol{M}$ and is most likely to amplify the error. Correspondingly, the main error term in Equation \eqref{eq:mclip-2} is:

\begin{equation}\boldsymbol{M} - \boldsymbol{M}\msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})\label{eq:error-2}\end{equation}

Consider singular values much larger than 1. If $\msign$ were exact, the corresponding singular values of $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} \pm \boldsymbol{I})$ would equal 1, and the parts of both expressions above associated with large singular values would vanish, as expected. With few iteration steps, however, $\msign$ may yield a value like $0.6$ or $1.4$ instead, in which case the corresponding part of Equation \eqref{eq:error-2} carries a huge error on the order of $\pm 0.4 \boldsymbol{M}$. For Equation \eqref{eq:error-1}, by contrast, when the singular values are very large the relative difference between $\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I}$ and $\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I}$ is small, so $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} + \boldsymbol{I})$ and $\msign(\boldsymbol{M}^{\top}\boldsymbol{M} - \boldsymbol{I})$ are nearly identical and most of the error cancels. Keep in mind, though, that this argument assumes $\boldsymbol{M}$ has singular values well above 1 and the iteration count is low. When these conditions fail, the original error of Equation \eqref{eq:mclip-2} is small to begin with, and Equation \eqref{eq:mclip-3} may even increase the error because of its extra $\msign$ call. Which formula performs best in practice must therefore be judged case by case.
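
A scalar caricature of this mechanism makes the cancellation visible. The sketch below is entirely our own construction: ns_sign is a toy stand-in for $\msign$, using the simple $(1.875, -1.25, 0.375)$ Newton-Schulz coefficients on a pre-normalized input, and the numbers are purely illustrative:

import numpy as np

def ns_sign(t, steps=4):
    # toy scalar Newton-Schulz sign iteration; t plays the role of a
    # singular value after Frobenius-norm normalization, so t is in (0, 1]
    for _ in range(steps):
        t = 1.875 * t - 1.25 * t**3 + 0.375 * t**5
    return t

sigma = 5.0  # a singular value above 1, dwarfed by the matrix's largest one
scale = 1e5  # stands in for the (large) Frobenius norm of M^T M
sm = ns_sign((sigma**2 - 1) / scale)  # ≈ 0.003, far from the ideal value 1
sp = ns_sign((sigma**2 + 1) / scale)  # ≈ 0.003, nearly identical to sm
print(sigma * (1 - sm))   # error term as in Eq. \eqref{eq:error-2}: ≈ 5, huge
print(sigma * (sp - sm))  # error term as in Eq. \eqref{eq:error-1}: ≈ 0.001, tiny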

Comparison Code

We construct a set of singular values spanning both sides of 1, with the maximum near 1,000, and then test each algorithm under bfloat16 precision. Reference code follows (approximate results are given in the comments):

import numpy as np
import jax.numpy as jnp
import jax.lax as lax

def msign(x, steps=4, eps=1e-20):
    """Newton-Schulz-type iteration for msign.
    The coefficients come from https://kexue.fm/archives/10996
    """
    abc = [
        (8.287212018145622, -23.59588651909882, 17.300387312530923),
        (4.107059111542197, -2.9478499167379084, 0.54484310829266),
        (3.9486908534822938, -2.908902115962947, 0.5518191394370131),
        (3.3184196573706055, -2.488488024314878, 0.5100489401237208),
        (2.3006520199548186, -1.6689039845747518, 0.4188073119525678),
        (1.8913014077874002, -1.2679958271945908, 0.37680408948524996),
        (1.875, -1.25, 0.375)
    ]
    # work in the wide orientation so that y @ y.mT is the smaller Gram matrix
    y = x.mT if x.shape[-2] > x.shape[-1] else x
    # normalize by the Frobenius norm so all singular values lie in (0, 1]
    y = y * lax.rsqrt((y**2).sum(axis=[-2, -1], keepdims=True) + eps)
    # beyond 7 steps, keep reusing the last (convergent) coefficient triple
    for a, b, c in abc[:steps] + max(steps - 7, 0) * abc[-1:]:
        a, b, c = a / 1.01, b / 1.01**3, c / 1.01**5  # slight damping for stability
        y = a * y + (b * (u := y @ y.mT) + c * u @ u) @ y
    return y.mT if x.shape[-2] > x.shape[-1] else y

def mclip1(m):
    """1st version (2 nested msign)
    """
    ms2 = msign(m - (ms1 := msign(m)))
    return (m + ms1 + ms2 - m @ ms1.mT @ ms2) / 2

def mclip2(m):
    """2nd version (2 non-nested msign)
    """
    ms1 = msign(m)
    ms2 = msign(m.mT @ m - jnp.eye(m.shape[-1]))
    return (m + ms1 + (ms1 - m) @ ms2) / 2

def mclip3(m):
    """3rd version (3 non-nested msign)
    """
    ms1 = msign(m)
    ms2 = msign(m.mT @ m + jnp.eye(m.shape[-1]))
    ms3 = msign(m.mT @ m - jnp.eye(m.shape[-1]))
    return ((ms1 + m) @ ms2 + (ms1 - m) @ ms3) / 2

def spectral_clip(W):
    """@leloykun's version: https://leloykun.github.io/ponder/spectral-clipping/
    """
    m, n = W.shape
    # per @leloykun, msign of the block matrix [[I, W], [W^T, I]] yields
    # blocks from which mclip_{[-1,1]}(W) can be read off
    H = jnp.block([[jnp.eye(m), W], [W.T, jnp.eye(n)]])
    OH = msign(H)
    P, Q = OH[:m, :m], OH[:m, m:]
    return Q + P @ W

m = np.random.randn(4096, 1024)
u, s, vh = jnp.linalg.svd(m, full_matrices=False)  # keep only the orthogonal factors
s = np.concatenate([np.linspace(1, 1000, 128), np.linspace(0, 1, 896)])
s = np.sort(s)[::-1]  # descending order, matching the SVD convention
m = u @ jnp.diag(s) @ vh  # matrix with large singular values
result0 = u @ np.diag(s.clip(0, 1)) @ vh  # exact result via SVD

result1 = mclip1(m.astype('bfloat16'))
result2 = mclip2(m.astype('bfloat16'))
result3 = mclip3(m.astype('bfloat16'))
result4 = spectral_clip(m.astype('bfloat16'))

# spectral norm of the resulting matrix, closer to 1 is better.
# jnp.linalg.svd(result0.astype('float32'))[1][0] # = 1
# jnp.linalg.svd(result1.astype('float32'))[1][0] # ≈ 700
# jnp.linalg.svd(result2.astype('float32'))[1][0] # ≈ 250
# jnp.linalg.svd(result3.astype('float32'))[1][0] # ≈ 1.5
# jnp.linalg.svd(result4.astype('float32'))[1][0] # ≈ 13

# mean absolute error of singular values, closer to 0 is better.
# jnp.abs(jnp.linalg.svd(result1.astype('float32'))[1] - s.clip(0, 1)).mean() # ≈ 20
# jnp.abs(jnp.linalg.svd(result2.astype('float32'))[1] - s.clip(0, 1)).mean() # ≈ 10
# jnp.abs(jnp.linalg.svd(result3.astype('float32'))[1] - s.clip(0, 1)).mean() # ≈ 0.5
# jnp.abs(jnp.linalg.svd(result4.astype('float32'))[1] - s.clip(0, 1)).mean() # ≈ 0.7

# mean absolute error of total matrix, closer to 0 is better.
# jnp.abs(result0 - result1).mean() # ≈ 1
# jnp.abs(result0 - result2).mean() # ≈ 0.5
# jnp.abs(result0 - result3).mean() # ≈ 0.01
# jnp.abs(result0 - result4).mean() # ≈ 0.02

Summary

This article continues to refine the $\msign$-based scheme for computing $\mclip$ from the previous article. By removing the nested $\msign$ and introducing an additional error-canceling term, we successfully reduced the computational error.