Muon Implementation Based on Streaming Power Iteration: 4. Principles

By 苏剑林 | April 13, 2026

After the three articles "Muon Implementation Based on Streaming Power Iteration: 1. First Encounter", "2. Acceleration", and "3. Refinement", I believe everyone has gained a good understanding of the ideas, implementations, and acceleration details of Streaming Power Iteration. Overall, this can be considered a highly competitive way to implement Muon, and because it directly approximates SVD, it also possesses better extensibility.

Limited by length at the time, our description of the mathematical principles of the relevant operations was relatively brief. Therefore, in this article, we supplement some mathematical derivations regarding power iteration and QR decomposition to establish a more complete theoretical picture. However, the derivations here still emphasize explanation over strictness, primarily to help everyone (including the author) clarify their thinking; we ask for the indulgence of professional readers.

Coaxial Equivalence

Before starting the derivation, we need to introduce the concept of "Coaxial Equivalence." For matrices $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times m}$, if there exists a signature matrix $\boldsymbol{S}$ such that $\boldsymbol{A} = \boldsymbol{B}\boldsymbol{S}$, then $\boldsymbol{A}$ and $\boldsymbol{B}$ are said to be "Coaxial Equivalent," and they are each other's "coaxial matrices." Here, a "Signature matrix" refers to a diagonal matrix with $\pm 1$ on the diagonal, i.e., $\newcommand{diag}{\mathop{\text{diag}}}\diag(\pm 1, \pm 1, \cdots, \pm 1)$.

It should be noted that the term "Coaxial" is proposed by the author to describe this equivalence relationship because, from a coordinate system perspective, the matrices $\boldsymbol{A}$ and $\boldsymbol{B}$ satisfying the condition actually describe the same coordinate system, except that the selection of the positive direction for some axes is different. Some literature seems to call them "Sign Equivalent," but the author feels "Coaxial" is more intuitive.

The reason for introducing the concept of coaxiality is that many matrix decompositions are unique only in the sense of coaxial equivalence (the author is also puzzled why such a commonly used equivalence relation lacks a standard name). For example, in QR decomposition, let matrix $\boldsymbol{A} \in \mathbb{R}^{n \times m} (n \geq m)$ be full rank, and let $\boldsymbol{Q}_1\boldsymbol{R}_1$ and $\boldsymbol{Q}_2\boldsymbol{R}_2$ be two QR decompositions of it; then $\boldsymbol{Q}_1$ is coaxial with $\boldsymbol{Q}_2$, and $\boldsymbol{R}_1$ is coaxial with $\boldsymbol{R}_2$.

This uniqueness is not hard to understand, because in the process of Gram-Schmidt orthogonalization, we can arbitrarily flip the orthogonalized vectors without changing their orthogonality. General tutorials usually express the uniqueness of QR decomposition by restricting the diagonal elements of $\boldsymbol{R}$ to be positive, which is one approach, but it sometimes brings unnecessary trouble. Additionally, the uniqueness of SVD also holds under coaxial equivalence.

Power of Iteration

In "Muon Implementation Based on Streaming Power Iteration: 1. First Encounter", we introduced power iteration by solving for eigenvectors one by one and then switched directly to the parallel version, assuming it has the same convergence result. Here, we start directly from the parallel version and prove its convergence.

Again, let matrix $\boldsymbol{A} \in \mathbb{R}^{n \times m} (n \geq m)$ be full rank (rank $m$), and consider the power iteration:

\begin{equation}\newcommand{QR}{\mathop{\text{QR}}}\boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}),\qquad \boldsymbol{V}_0 = \boldsymbol{I}\end{equation}

Suppose the QR decomposition of $\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}$ is $\boldsymbol{Q}_t \boldsymbol{R}_t$, so $\QR(\boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}) = \boldsymbol{A}^{\top}\boldsymbol{A} \boldsymbol{V}_{t-1}\boldsymbol{R}_t^{-1}$. Then, iteratively, we get:

\begin{equation}\boldsymbol{V}_t = (\boldsymbol{A}^{\top}\boldsymbol{A})^t \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1}\end{equation}

Next, let the SVD of $\boldsymbol{A}$ be $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, so $\boldsymbol{A}^{\top}\boldsymbol{A}=\boldsymbol{V}\boldsymbol{\Sigma}^2\boldsymbol{V}^{\top}$. Substituting into the above equation gives:

\begin{equation}\boldsymbol{V}_t = \boldsymbol{V}\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1} = \boldsymbol{V}\underbrace{(\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t})}_{\boldsymbol{B}}\,\underbrace{(\boldsymbol{\Sigma}^{2t} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1})}_{\boldsymbol{C}}\end{equation}

The result is divided into three parts: $\boldsymbol{V}, \boldsymbol{B}, \boldsymbol{C}$. We want to prove that $\boldsymbol{V}_t \to \boldsymbol{V}$ in the coaxial sense. It is known that $\boldsymbol{V}_t, \boldsymbol{V}$ are both orthogonal matrices. From the closure property of upper triangular matrices, we know that $\boldsymbol{C}$ is an upper triangular matrix. If we can prove that $\boldsymbol{B}$ tends towards an upper triangular matrix, then by the uniqueness of QR decomposition, we get $\boldsymbol{V}_t \to \boldsymbol{V}$. Regarding $\boldsymbol{B}=\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t}$, writing it in component form yields:

\begin{equation}B_{i,j} = V_{j,i} (\sigma_i / \sigma_j)^{2t}\end{equation}

Assuming $\sigma_1 > \sigma_2 > \cdots > \sigma_m$, then when $i > j$, $(\sigma_i / \sigma_j)^{2t} \to 0$. That is, the lower triangular part of $\boldsymbol{B}$ tends towards 0; in other words, $\boldsymbol{B}$ tends towards an upper triangular matrix, and the condition is proven. From this, we also see that the convergence speed of power iteration depends on the ratio of adjacent singular values; the larger $\sigma_i / \sigma_{i+1}$ is, the faster the convergence. Once equal singular values exist, power iteration fails, but in practice, we can assume the probability of two singular values being exactly equal is zero, thus avoiding this problem.

Fundamental Thinking

Let's think carefully about the whole proof process and understand what it is doing in essence.

First, the block $\boldsymbol{C} = \boldsymbol{\Sigma}^{2t} \boldsymbol{R}_1^{-1}\boldsymbol{R}_2^{-1} \cdots \boldsymbol{R}_t^{-1}$ looks complex, but its only role is to be "an upper triangular matrix." What it exactly equals is not important, as long as it maintains the form of an upper triangular matrix. What truly plays the core role is that $\boldsymbol{B}=\boldsymbol{\Sigma}^{2t}\boldsymbol{V}^{\top}\boldsymbol{\Sigma}^{-2t}$ will tend towards an upper triangular matrix, allowing the preceding $\boldsymbol{V}$ to be extracted by the QR decomposition.

What actually achieves this effect is the step $(\boldsymbol{A}^{\top}\boldsymbol{A})^t$! In other words, theoretically $\boldsymbol{V}_t = \QR((\boldsymbol{A}^{\top}\boldsymbol{A})^t)$. That is, all those previous $\QR$ operations are theoretically redundant; you only need to perform $\QR$ or orthogonalization once at the very end. Of course, this equivalence has only theoretical value; directly calculating $(\boldsymbol{A}^{\top}\boldsymbol{A})^t$ would lead to numerical explosion/collapse, making the results unusable. Therefore, performing $\QR$ periodically (rather than just once at the end) essentially acts to maintain numerical stability.

We can also understand the repeated QR operations from the perspective of a "Preconditioner." A preconditioner refers to some preprocessing operations on the input that theoretically do not change the output but are usually beneficial for numerical computation. For $\QR$, right-multiplying by any full-rank upper triangular matrix $\boldsymbol{R}$ does not change the result, i.e., $\QR(\boldsymbol{A}) = \QR(\boldsymbol{A}\boldsymbol{R})$. Thus, any right-multiplied full-rank upper triangular matrix can be called a preconditioner for $\QR$.

We know that $\QR$ itself can be written in the form of right-multiplying by an upper triangular matrix, so $\QR$ itself is a preconditioner for $\QR$. Therefore, we can insert some $\QR$ operations inside $(\boldsymbol{A}^{\top}\boldsymbol{A})^t$, turning it into:

\begin{equation}(\boldsymbol{A}^{\top}\boldsymbol{A})^t\qquad\to\qquad \boldsymbol{A}^{\top}\boldsymbol{A}\QR(\boldsymbol{A}^{\top}\boldsymbol{A}\QR(\boldsymbol{A}^{\top}\boldsymbol{A}\cdots\QR(\boldsymbol{A}^{\top}\boldsymbol{A})))\end{equation}

Applying $\QR$ at both ends, the result will not change. However, since each $\QR$ outputs an orthogonal matrix, and right-multiplying by an orthogonal matrix has almost no risk of numerical explosion, $\QR$ is an excellent preconditioner for itself.

Related Variants

From the perspective of preconditioners, we can also understand many power iteration variants. For example, in the opening article "1. First Encounter", we used a $\newcommand{ColNorm}{\mathop{\text{ColNorm}}}\ColNorm$ version:

\begin{equation}\boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{V}_{t-1}) \qquad\to\qquad \boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\ColNorm(\boldsymbol{A}\boldsymbol{V}_{t-1}))\end{equation}

The $\boldsymbol{V}_t$ on the left and right are equal because applying $\ColNorm$ to a matrix is equivalent to right-multiplying by a diagonal matrix:

\begin{equation}\underbrace{\left[\frac{\boldsymbol{a}_1}{\Vert\boldsymbol{a}_1\Vert},\frac{\boldsymbol{a}_2}{\Vert\boldsymbol{a}_2\Vert},\cdots,\frac{\boldsymbol{a}_m}{\Vert\boldsymbol{a}_m\Vert}\right]}_{\ColNorm([\boldsymbol{a}_1,\boldsymbol{a}_2,\cdots,\boldsymbol{a}_m])} = [\boldsymbol{a}_1,\boldsymbol{a}_2,\cdots,\boldsymbol{a}_m]\begin{bmatrix}\Vert\boldsymbol{a}_1\Vert^{-1} & 0 & \cdots & 0 \\ 0 & \Vert\boldsymbol{a}_2\Vert^{-1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Vert\boldsymbol{a}_m\Vert^{-1}\end{bmatrix}\end{equation}

A diagonal matrix is a special case of an upper triangular matrix, so $\ColNorm$ is also a preconditioner for $\QR$ and does not change the result. For the same reason, we can also move $\ColNorm$ one step outward; this similarly does not change the $\QR$ result:

\begin{equation}\boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{V}_{t-1}) \qquad\to\qquad \boldsymbol{V}_t = \QR(\ColNorm(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{V}_{t-1}))\end{equation}

In addition, the protagonists of "2. Acceleration" and "3. Refinement" were power iterations with double $\QR$:

\begin{equation}\boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{V}_{t-1}) \qquad\to\qquad \boldsymbol{V}_t = \QR(\boldsymbol{A}^{\top}\QR(\boldsymbol{A}\boldsymbol{V}_{t-1}))\end{equation}

Based on the discussions in these two sections, it is easy to prove that adding an extra $\QR$ step to $\boldsymbol{A}\boldsymbol{V}_{t-1}$ theoretically does not change the result for $\boldsymbol{V}_t$.

Finite Error

We have used the term "theoretically" several times because many results are only obtainable based on infinite-precision, exact $\QR$. $\QR$ in actual calculations has finite precision, and for efficiency, we usually use the less accurate "SCQR (Shifted Cholesky QR)." Therefore, analyzing the cumulative effect of errors is very necessary.

For SCQR, everyone should be familiar with it by now after reading these articles. It splits the QR decomposition of matrix $\boldsymbol{A}$ into two steps: 1. Perform Cholesky decomposition on $\boldsymbol{A}^{\top}\boldsymbol{A}+\lambda \boldsymbol{I}$ to get $\boldsymbol{R}$; 2. Obtain orthogonal matrix $\boldsymbol{Q}$ via $\boldsymbol{Q}=\boldsymbol{A}\boldsymbol{R}^{-1}$. If $\lambda=0$, then SCQR is equivalent to exact QR. However, $\lambda=0$ in actual calculation almost always fails due to an excessive condition number, so setting an appropriate $\lambda > 0$ introduces error.

Fortunately, this error does not accumulate! Whether it is $\boldsymbol{A}^{\top}\boldsymbol{A}$ or $\boldsymbol{A}^{\top}\boldsymbol{A}+\lambda \boldsymbol{I}$, the result of their Cholesky decomposition is an upper triangular matrix, and $\boldsymbol{A}$ is then right-multiplied by the inverse of this upper triangular matrix. We know that right-multiplying by an upper triangular matrix does not change the $\QR$ result. Thus, if matrix multiplication errors are ignored, we can consider SCQR to be "lossless" with respect to $\QR$ itself.

In other words, as long as the final step is replaced with an exact $\QR$, then no matter how many imprecise SCQRs were performed before, the result will still be exact. Conversely, the total error is equivalent to the error of a single SCQR step (the last step) and will not accumulate. The "SCQR2" acceleration trick mentioned in "2. Acceleration" also relies on this principle.

On the contrary, if some of the transformations we proposed cannot be written in the form of "right-multiplying by an upper triangular matrix," then they will lossily destroy the original matrix's information. Over long-term iteration, errors will accumulate until it completely fails. The simplification scheme proposed by @Ji_Ha_Kim mentioned in the previous article "3. Refinement" is a classic counterexample.

Cholesky Decomposition

Finally, let's briefly review the "Cholesky Decomposition." Its purpose is to represent a given positive-definite symmetric matrix $\boldsymbol{B}$ in the form $\boldsymbol{L}\boldsymbol{L}^{\top}$, where $\boldsymbol{L}$ is a lower triangular matrix. Then, by letting $\boldsymbol{R}=\boldsymbol{L}^{\top}$, we can get the upper triangular form $\boldsymbol{R}^{\top}\boldsymbol{R}$.

Cholesky decomposition is fast because it has an analytical solution that can be directly calculated recursively. Specifically, from the following equality:

\begin{equation}\begin{bmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,m} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m,1} & a_{m,2} & \cdots & a_{m,m} \end{bmatrix}=\begin{bmatrix} l_{1,1} & & & \\ l_{2,1} & l_{2,2} & & \\ \vdots & \vdots & \ddots & \\ l_{m,1} & l_{m,2} & \cdots & l_{m,m} \end{bmatrix}\begin{bmatrix} l_{1,1} & l_{2,1} & \cdots & l_{m,1} \\ & l_{2,2} & \cdots & l_{m,2} \\ & & \ddots & \vdots \\ & & & l_{m,m} \end{bmatrix}\end{equation}

By making corresponding items equal term by term, we get the recurrence formulas:

\begin{equation}l_{j,j}=\pm\sqrt {a_{j,j}-\sum_{k=1}^{j-1}l_{j,k}^2},\qquad l_{i,j}=\frac{1}{l_{j,j}}\left(a_{i,j}-\sum_{k=1}^{j-1}l_{i,k}l_{j,k}\right)\quad \text{for }i > j\end{equation}

Cholesky QR is one of the classic applications of Cholesky decomposition. It is similar to Gram-Schmidt orthogonalization but allows us to skip the orthogonalization flow and directly obtain the triangular matrix $\boldsymbol{R}$. They also face the same numerical difficulties: Gram-Schmidt orthogonalization is performed vector by vector; once a linearly dependent vector or a zero vector is encountered, orthogonalization cannot proceed. This is reflected in a low effective rank or a large condition number, conditions under which Cholesky decomposition is also prone to failure.

Another classic application of Cholesky decomposition is the inversion of positive-definite symmetric matrices. For example, to solve the equation $\boldsymbol{B}\boldsymbol{X}=\boldsymbol{C}$ where $\boldsymbol{B}$ is positive-definite and symmetric, we can first decompose it into $\boldsymbol{L}\boldsymbol{L}^{\top}$, turning it into $\boldsymbol{L}(\boldsymbol{L}^{\top}\boldsymbol{X})=\boldsymbol{C}$. This only requires solving two triangular coefficient equations, and each step is efficient. Among these, the fractional iteration for $\newcommand{msign}{\mathop{\text{msign}}}\msign$ proposed by @Ji_Ha_Kim is based on this idea for finding the inverse matrix.

Article Summary

This article primarily supplemented the discussion of some mathematical derivations for Streaming Power Iteration, including the convergence of power iteration, preconditioners for QR decomposition, and a brief introduction to Cholesky decomposition, hoping to help everyone better understand this method from a fundamental level.