How to Estimate the Spectral Norm of a Matrix More Scientifically?

By 苏剑林 | May 04, 2026

The spectral norm is a core concept in matrix analysis and plays an important role in the field of deep learning—from the Lipschitz constraints required in the WGAN era to the training stability of modern LLMs, and to the emerging Muon optimizer, all are closely related to the spectral norm of matrix parameters. Therefore, how to efficiently and accurately estimate the spectral norm is becoming increasingly important and worthy of our in-depth study.

As is well known, Power Iteration is the standard scheme for estimating the spectral norm, but it still has significant room for improvement. This article will briefly organize some ideas for estimating the spectral norm, including improving the convergence speed of power iteration and how to estimate a strict upper bound of the spectral norm, among others.

Spectral Norm

The spectral norm is defined as

\begin{equation}\Vert\boldsymbol{W}\Vert_2 = \max_{\Vert\boldsymbol{x}\Vert_2=1} \Vert \boldsymbol{W}\boldsymbol{x}\Vert_2\end{equation}

where $\boldsymbol{W}\in\mathbb{R}^{n\times m}, \boldsymbol{x}\in\mathbb{R}^m$, and without loss of generality, let $n\geq m$. From the definition, the spectral norm represents the "expansion rate" of a linear layer—after an input vector passes through the linear layer, the length of the output vector is expanded by at most $\Vert\boldsymbol{W}\Vert_2$ times. The spectral norm is also equal to the largest singular value of the matrix (for proof, refer to "The Path to Low-Rank Approximation (II): SVD"); we will not expand further on these basic contents.

The author first encountered the spectral norm when GANs were popular and WGAN "emerged," emphasizing the necessity for the discriminator to satisfy the Lipschitz constraint. The Lipschitz constant of a linear layer $\boldsymbol{W}\boldsymbol{x}$ is exactly the spectral norm of the matrix $\boldsymbol{W}$, leading to techniques such as spectral normalization and spectral regularization. Interested readers can refer to "Lipschitz Constraint in Deep Learning: Generalization and Generative Models".

In recent years, the increasingly popular Muon optimizer is also closely related to the spectral norm, as it is often viewed as the steepest descent under the spectral norm. Meanwhile, in "Beyond MuP: 4. Defending Parameter Stability," we also proposed constraining the spectral norm of parameters to ensure training stability. These contents all raise requirements for the accurate calculation of the spectral norm.

Power Iteration

Calculating the spectral norm through SVD is the most direct method, but it is clearly too expensive. We usually use Power Iteration:

\begin{equation}\boldsymbol{v}^{(t)} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}^{(t-1)}}{\Vert\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{v}^{(t-1)}\Vert_2}\end{equation}

It converges to the dominant right singular vector $\boldsymbol{v}_1$ at a rate of $(\sigma_2/\sigma_1)^{2t}$. After $T$ steps of iteration, we have

\begin{equation}\sigma_1 \approx \Vert\boldsymbol{W}\boldsymbol{v}^{(t)}\Vert_2\end{equation}

Proof can be found in "Thoughts from Spectral Norm Gradient to a New Type of Weight Decay." In practice, if we only care about the spectral norm and not the singular vectors, the convergence speed is often more optimistic than $(\sigma_2/\sigma_1)^{2t}$. Furthermore, the result of power iteration is easily influenced by the initial value $\boldsymbol{v}^{(0)}$, causing fluctuations; we can consider calculating $k$ paths in parallel and then taking the maximum value.

If we calculate $k$ paths simultaneously and replace the L2 Normalize with orthogonalization of these $k$ vectors at each iteration, i.e.,

\begin{equation}\newcommand{QR}{\mathop{\text{QR}}}\boldsymbol{V}^{(t)} = \QR(\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{V}^{(t-1)})\end{equation}

then the result will converge to the first $k$ right singular vectors, from which the first $k$ singular values can be obtained. The principle can be found in "Muon Implementation Based on Streaming Power Iteration: 4. Principles." However, this extension will not be expanded upon in this article, focusing instead on the estimation of the spectral norm—the largest singular value.

Computing the Gradient

First, here is a reference implementation of power iteration based on Jax:

import jax, jax.lax as lax
import jax.numpy as jnp

@jax.jit(static_argnums=(1,))
def l2_normalize(x, axis=-2):
    return x / jnp.linalg.vector_norm(x, axis=axis, keepdims=True)

@jax.jit(static_argnums=(1, 2, 3))
def spec_norm_v1(w, T=10, k=1, key=42):
    v_shape = w.shape[:-2] + w.shape[-1:] + (k,)
    v_init = jax.random.normal(jax.random.PRNGKey(key), v_shape)
    v_step = lambda i, v: l2_normalize(w.mT @ (w @ v))
    v = lax.fori_loop(0, T, v_step, v_init)
    return jnp.linalg.vector_norm(w @ v, axis=-2).max(axis=-1)

Obviously, the complexity of power iteration is $\mathcal{O}(mnTk)$, which is quite ideal in terms of complexity alone. If it is only for monitoring purposes, the above writing style is sufficient. However, if it is for spectral normalization or regularization purposes, we need the gradient of the spectral norm. According to the above implementation, it would require backpropagation step-by-step through the power iteration loop, which has high computational complexity.

In fact, in "Thoughts from Spectral Norm Gradient to a New Type of Weight Decay," we already derived the gradient of the spectral norm $\nabla_{\boldsymbol{W}}\sigma_1 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$. Therefore, we can directly define its gradient as $\nabla_{\boldsymbol{W}}\sigma_1 = \boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$ to reduce the backpropagation complexity. Besides, we can use $\sigma_1 = \boldsymbol{u}_1^{\top}\boldsymbol{W}\boldsymbol{v}_1$ and directly output during the forward pass:

\begin{equation}\sigma_1 = \color{skyblue}{\mathop{\text{sg}}[}\boldsymbol{u}_1^{\top}\color{skyblue}{]}\boldsymbol{W}\color{skyblue}{\mathop{\text{sg}}[}\boldsymbol{v}_1\color{skyblue}{]}\end{equation}

where $\color{skyblue}{\mathop{\text{sg}}[\,]}$ is the stop gradient operator. In this way, automatic differentiation will only apply to $\boldsymbol{W}$, and the result will be $\boldsymbol{u}_1 \boldsymbol{v}_1^{\top}$, avoiding backpropagation into the power iterations of $\boldsymbol{u}_1, \boldsymbol{v}_1$.

"Waste Not"

Through $T$ steps of power iteration, we obtain a sequence of vectors $\boldsymbol{v}^{(1)}, \boldsymbol{v}^{(2)}, \cdots, \boldsymbol{v}^{(T)}$, but ultimately we only use $\boldsymbol{v}^{(T)}$ to estimate $\sigma_1$. This seems somewhat "wasteful," so some have thought about utilizing all of them to form a better estimate of the spectral norm.

Usually, $\boldsymbol{v}^{(1)}, \boldsymbol{v}^{(2)}, \cdots, \boldsymbol{v}^{(T)}$ are distinct but get closer to $\boldsymbol{v}_1$. We can imagine them as "$\boldsymbol{v}_1$ plus some noise." If we perform "noise reduction" together, it is indeed possible to obtain more accurate results. Specifically, we consider using a linear combination of $\boldsymbol{v}^{(1)}, \boldsymbol{v}^{(2)}, \cdots, \boldsymbol{v}^{(T)}$ to construct a better approximation of $\boldsymbol{v}_1$. The subspace spanned by this group of vectors is called the "Krylov subspace."

For simplicity, we first perform orthogonalization to obtain an equivalent orthonormal basis $\boldsymbol{Q} = [\boldsymbol{q}_1, \boldsymbol{q}_2, \cdots, \boldsymbol{q}_T] \in \mathbb{R}^{m \times T}$. Next, we want to find coefficients $\boldsymbol{x} = [x_1, x_2, \cdots, x_T]^{\top} \in \mathbb{R}^T$ such that the vector $\sum_{i=1}^T x_i \boldsymbol{q}_i = \boldsymbol{Q}\boldsymbol{x}$ satisfies as much as possible:

\begin{equation}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{Q}\boldsymbol{x} \approx \sigma_1^2\boldsymbol{Q}\boldsymbol{x}\end{equation}

Multiplying both sides by $\boldsymbol{Q}^{\top}$ gives $\boldsymbol{Q}^{\top}\boldsymbol{W}^{\top}\boldsymbol{W}\boldsymbol{Q}\boldsymbol{x} \approx \sigma_1^2\boldsymbol{x}$. This indicates that $\sigma_1^2, \boldsymbol{x}$ are the eigenvalue and eigenvector of $(\boldsymbol{W}\boldsymbol{Q})^{\top}\boldsymbol{W}\boldsymbol{Q}$, respectively. This means we only need to perform eigenvalue decomposition on it, find its largest eigenvalue, and take the square root to get a better estimate. Since this is only a $T \times T$ matrix, when $T$ is small, performing eigenvalue decomposition is not costly, so we can just call an existing function.

This idea is essentially a simplified version of the "Lanczos algorithm" used in modern SVD solvers; real SVD implementations involve more sophisticated treatments. Additionally, this acceleration strategy shares similarities with Randomized SVD, which typically uses a Gaussian random matrix for projection, whereas here we use the orthonormal basis within the Krylov subspace for projection.

Acceleration Techniques

Reference code for power iteration incorporating Krylov subspace acceleration is as follows:

@jax.jit(static_argnums=(1, 2))
def spec_norm_v2(w, T=10, key=42):
    v_shape = w.shape[:-2] + w.shape[-1:] + (1,)
    v_init = jax.random.normal(jax.random.PRNGKey(key), v_shape)
    v_step = lambda v, _: (l2_normalize(w.mT @ (w @ v)),) * 2
    v = lax.scan(v_step, v_init, length=T)[1].swapaxes(0, -1)[0]
    v = jnp.linalg.qr(v)[0]
    return jnp.linalg.eigvalsh((u := w @ v).mT @ u)[..., -1]**0.5

Empirical tests show that the effect of subspace acceleration is very significant. At $T=5$, its accuracy already exceeds that of the original power iteration at $T=10$, and it is faster. At $T=10$, the function takes roughly 2/3 of its time for power iteration and 1/3 for QR decomposition, while the eigenvalue decomposition takes very little time.

Readers of the "Streaming Power Iteration" series might think of using Cholesky QR as introduced there to speed up the QR decomposition. Unfortunately, since the results of power iteration become increasingly collinear, the condition number of the matrix to be QR-decomposed will continue to deteriorate. This is exactly the range where Cholesky QR is "powerless," with a very high failure rate, so it basicallly cannot be used for acceleration.

A more effective acceleration method is to use only the results of the last few steps of power iteration, such as the last three steps $\boldsymbol{v}^{(T-2)}, \boldsymbol{v}^{(T-1)}, \boldsymbol{v}^{(T)}$, to construct the Krylov subspace acceleration. This captures most of the performance benefits and reduces the computation of QR decomposition and eigenvalue decomposition, thereby achieving acceleration.

Estimating the Upper Bound

Strictly speaking, power iteration and its accelerated versions only provide a lower bound for the spectral norm, which is often sufficient. But if we must obtain an upper bound in certain scenarios—for example, to ensure through spectral normalization that the norm of a matrix is strictly less than 1—we need other methods.

A basic solution here is to calculate the Schatten norm, which is an upper bound on the spectral norm. Let the SVD of matrix $\boldsymbol{W}$ be $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, with singular values $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_m$. Then its Schatten-$p$ norm is defined as

\begin{equation}\Vert\boldsymbol{W}\Vert_{S,p} = \sqrt[p]{\sum_{i=1}^m \sigma_i^p} \quad\geq\quad \sigma_1\end{equation}

The larger $p$ is, the closer it is to the accurate value. When $p$ is an even number, calculating the Schatten-$p$ norm is relatively feasible, primarily utilizing the following identities:

\begin{gather}\newcommand{tr}{\mathop{\text{tr}}}\Vert\boldsymbol{W}\Vert_{S,2k} = \sqrt[2k]{\sum_{i=1}^m \sigma_i^{2k}} = \sqrt[2k]{\tr(\boldsymbol{V}\boldsymbol{\Sigma}^{2k}\boldsymbol{V}^{\top})} = \sqrt[2k]{\tr((\boldsymbol{W}^{\top}\boldsymbol{W})^k)} \\ \Vert\boldsymbol{W}\Vert_{S,4k} = \sqrt[4k]{\sum_{i=1}^m \sigma_i^{4k}} = \sqrt[2k]{\Vert\boldsymbol{V}\boldsymbol{\Sigma}^{2k}\boldsymbol{V}^{\top}\Vert_{S,2}} = \sqrt[2k]{\Vert(\boldsymbol{W}^{\top}\boldsymbol{W})^k\Vert_{S,2}}\label{eq:S-4k}\end{gather}

It is not hard to see that the Schatten-2 norm is actually the Frobenius norm, which can be calculated by the square root of the sum of squares of each element. Thus, the two identities above show that as long as we calculate $(\boldsymbol{W}^{\top}\boldsymbol{W})^k$, we can find $\Vert\boldsymbol{W}\Vert_{S,2k}$ and $\Vert\boldsymbol{W}\Vert_{S,4k}$ at a lower cost.

The first step $\boldsymbol{W}^{\top}\boldsymbol{W}$ has a complexity of $\mathcal{O}(nm^2)$, and subsequent repeated self-multiplications have a complexity of $\mathcal{O}(m^3)$ per step. Thus, $(\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}$ can be calculated with a complexity of $\mathcal{O}(nm^2 + Tm^3)$. Theoretically, this complexity far exceeds that of power iteration; even when $n, m$ are relatively large, the first step $\boldsymbol{W}^{\top}\boldsymbol{W}$ already far exceeds power iteration. Therefore, if one only wants to estimate the spectral norm without requiring an upper bound, power iteration should be prioritized.

However, matrix multiplication can usually be fully parallelized, so when $n, m$ are not large or $n \gg m$, the Schatten-$p$ norm does not become a computational bottleneck, and we can consider it. Or, when a strict upper bound is mandatory, this seems to be the most direct approach.

Numerical Stability

To calculate $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$, a naive implementation starting from $\boldsymbol{M} = \boldsymbol{W}^{\top}\boldsymbol{W}$ and repeatedly executing $\boldsymbol{M} \leftarrow \boldsymbol{M}^2$ for $T$ times to get $(\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}$ and then substituting into formula \eqref{eq:S-4k} would quickly explode to NaN or collapse to zero due to the super-exponential operations.

Generally, $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$ itself will not numerically explode; the problem lies in the explicit calculation of $(\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}$. The solution is to re-normalize after each squaring step [$\boldsymbol{M} \leftarrow \boldsymbol{M}^2 / \tr(\boldsymbol{M}^2)$], which prevents numerical explosion and ensures the scaling is compact enough not to collapse to zero; at the same time, we accumulate the normalization factor of each step in the log domain for the final calculation of $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$.

The calculation process is summarized as follows:

Calculate $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$ as the upper bound of the spectral norm

1: Initialize $\log S = \log\tr(\boldsymbol{W}^{\top}\boldsymbol{W}), \boldsymbol{M} = \frac{\boldsymbol{W}^{\top}\boldsymbol{W}}{\tr(\boldsymbol{W}^{\top}\boldsymbol{W})}$
2: For $t=1, 2, \cdots, T$ do
3: $\quad \log S \leftarrow 2\log S + \log \tr(\boldsymbol{M}^2)$
4: $\quad \boldsymbol{M} \leftarrow \frac{\boldsymbol{M}^2}{\tr(\boldsymbol{M}^2)}$
5: Output $\exp\left(\frac{\log S + \log\Vert\boldsymbol{M}\Vert_F}{2^{T+1}}\right)$

A simple reference implementation:

@jax.jit
def tr(w):
    return w.trace(axis1=-1, axis2=-2)[..., None, None]

@jax.jit(static_argnums=(1,))
def spec_norm_v3(w, T=5):
    m = (m := w.mT @ w) / (s := tr(m))
    ms_step = lambda i, ms: ((m := ms[0] @ ms[0]) / (s := tr(m)), 2 * ms[1] + jnp.log(s))
    m, logs = lax.fori_loop(0, T, ms_step, (m, jnp.log(s)))
    logf = 0.5 * jnp.log((m**2).sum(axis=[-1, -2], keepdims=True))
    return jnp.exp((logs + logf) / 2**(T + 1))

Multiple Moments

To calculate $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$, we calculated $\boldsymbol{W}^{\top}\boldsymbol{W}, \cdots, (\boldsymbol{W}^{\top}\boldsymbol{W})^{2^{T-1}}, (\boldsymbol{W}^{\top}\boldsymbol{W})^{2^T}$ in some way, meaning we can simultaneously obtain $\Vert\boldsymbol{W}\Vert_{S,2}, \cdots, \Vert\boldsymbol{W}\Vert_{S,2^{T+1}}, \Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$, but in the end we only used $\Vert\boldsymbol{W}\Vert_{S,2^{T+2}}$, which again seems a bit "wasteful."

Is there a way, similar to the Krylov subspace method for power iteration, to utilize all these results to improve estimation accuracy? Yes, there is! "Fast Tight Spectral-Norm Bounds" provides an idea to improve the estimation result through nonlinear programming. Let's look at a simple case: suppose we have obtained the 2nd and 4th moments of the singular values:

\begin{equation}\sum_{i=1}^m \sigma_i^2 = S_2, \qquad \sum_{i=1}^m \sigma_i^4 = S_4\end{equation}

Then according to the results of the previous two sections, $S_4^{1/4}$ is closer to the spectral norm than $S_2^{1/2}$, so we return $S_4^{1/4}$ as the upper bound and discard $S_2$. But is $S_2$ really useless? Imagine if $m=2$, then there are exactly two equations and two unknowns, and theoretically we can solve for $\sigma_1, \sigma_2$! When $m > 2$, although an exact solution is not possible, we can still narrow the range of $\sigma_1$. Specifically, we have:

\begin{equation}\frac{S_2 - \sigma_1^2}{m-1} = \frac{1}{m-1}\sum_{i=2}^m \sigma_i^2 \leq \sqrt{\frac{1}{m-1}\sum_{i=2}^m \sigma_i^4} = \sqrt{\frac{S_4 - \sigma_1^4}{m-1}}\end{equation}

That is, $(S_2 - \sigma_1^2)^2 \leq (m-1)(S_4 - \sigma_1^4)$, which is essentially a univariate quadratic inequality, easily solved as:

\begin{equation}\sigma_1 \leq \sqrt{\frac{S_2+\sqrt{(m-1)(mS_4-S_2^2)}}{m}}\label{eq:s2-s4}\end{equation}

This gives an upper bound for $\sigma_1$ expressed in terms of $S_2$ and $S_4$, which is a better estimate than $S_4^{1/4}$. "Fast Tight Spectral-Norm Bounds" also generalizes this to a form that uses $S_2, S_4, S_6, S_8$ simultaneously, which is more compact but also more cumbersome. Interested readers can study the original paper. Using more moments to improve the estimation is theoretically possible, but in practice, it often requires solving complex nonlinear equations, which has limited practical value.

Limitations

Theoretically, the result \eqref{eq:s2-s4} can be generalized to obtain more accurate upper bounds using any $S_{2k}$ and $S_{4k}$:

\begin{equation}\sigma_1 \leq \sqrt[2k]{\frac{S_{2k}+\sqrt{(m-1)(mS_{4k}-S_{2k}^2)}}{m}}\end{equation}

However, when $k$ is large, the practical significance of this result is very limited because it requires explicitly calculating $S_{2k}$ and $S_{4k}$, which will explode or collapse when $k$ is large, making it unrealistic. One possible improvement is to factor out $S_{4k}^{1/4k}$:

\begin{equation}\sqrt[2k]{\frac{S_{2k}+\sqrt{(m-1)(mS_{4k}-S_{2k}^2)}}{m}} = S_{4k}^{1/4k}\cdot\sqrt[2k]{\frac{S_{2k}/S_{4k}^{1/2}+\sqrt{(m-1)(m-S_{2k}^2/S_{4k})}}{m}}\end{equation}

Then let $S_{2k}/S_{4k}^{1/2} = e^{\epsilon}$ and perform an approximate expansion in the log domain:

\begin{equation}\sqrt[2k]{\frac{S_{2k}/S_{4k}^{1/2}+\sqrt{(m-1)(m-S_{2k}^2/S_4)}}{m}} \approx 1 - \frac{\epsilon^2}{4k(m-1)}\end{equation}

However, our goal is to find an upper bound for the spectral norm; how to ensure the upper bound property while performing an approximate expansion seems complicated. On the other hand, if we have already calculated for a large $k$, the accuracy of $S_{4k}^{1/4k}$ itself is already quite high, and the significance of using programming ideas to further improve accuracy is not great.

Conclusion

This article briefly summarized several ideas for estimating the spectral norm. In practical applications, if only an approximate monitoring of the spectral norm is needed, power iteration and its subspace-accelerated version are usually sufficient; if a strict upper bound must be guaranteed, calculating the Schatten norm can be considered. The "waste not" ideas derived from both routes—utilizing iteration history in the Krylov subspace and multi-moment information in nonlinear programming—also provide good algorithmic design inspiration for us.