By 苏剑林 | October 01, 2024
In the previous article, we introduced the "pseudo-inverse," which gives the optimal solution of the optimization objective $\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2$ when the matrices $\boldsymbol{M}$ and $\boldsymbol{A}$ (or $\boldsymbol{B}$) are given. In this article, we focus on the optimal solution when neither $\boldsymbol{A}$ nor $\boldsymbol{B}$ is given, i.e.,
\begin{equation}\mathop{\text{argmin}}_{\boldsymbol{A},\boldsymbol{B}}\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2\label{eq:loss-ab}\end{equation}
where $\boldsymbol{A}\in\mathbb{R}^{n\times r}, \boldsymbol{B}\in\mathbb{R}^{r\times m}, \boldsymbol{M}\in\mathbb{R}^{n\times m}, r < \min(n,m)$. Simply put, this is the search for the "optimal $r$-rank approximation (the best approximation with rank not exceeding $r$)" of the matrix $\boldsymbol{M}$. To solve this problem, we need to bring out the famous "SVD (Singular Value Decomposition)." Although this series started with the pseudo-inverse, its "fame" is far less than that of SVD. There are likely many people who have heard of or even used SVD without knowing about the pseudo-inverse, including the author, who learned about SVD before encountering the pseudo-inverse.
Next, we will introduce SVD around the concept of optimal low-rank matrix approximation.
A First Look at the Conclusion
For any matrix $\boldsymbol{M}\in\mathbb{R}^{n\times m}$, a Singular Value Decomposition (SVD) of the following form can be found:
\begin{equation}\boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\end{equation}
where $\boldsymbol{U}\in\mathbb{R}^{n\times n}$ and $\boldsymbol{V}\in\mathbb{R}^{m\times m}$ are both orthogonal matrices, and $\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}$ is a non-negative diagonal matrix:
\begin{equation}\boldsymbol{\Sigma}_{i,j} = \left\{\begin{aligned}&\sigma_i, &i = j \\ &0,&i \neq j\end{aligned}\right.\end{equation}
The diagonal elements are sorted from largest to smallest by default, i.e., $\sigma_1\geq \sigma_2\geq\sigma_3\geq\cdots\geq 0$. These diagonal elements are called Singular Values. From a numerical calculation perspective, we can retain only the non-zero elements in $\boldsymbol{\Sigma}$, reducing the sizes of $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}$ to $n\times r, r\times r, m\times r$ (where $r$ is the rank of $\boldsymbol{M}$). Retaining the full orthogonal matrices is more convenient for theoretical analysis.
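For readers who want to see this concretely, here is a minimal NumPy sketch of the full and reduced forms (note that `np.linalg.svd` returns $\boldsymbol{V}^{\top}$ rather than $\boldsymbol{V}$, and its reduced form keeps $\min(n,m)$ singular values, some of which may be zero):

```python
import numpy as np

np.random.seed(0)
n, m = 5, 3
M = np.random.randn(n, m)

# Full SVD: U is n x n, V is m x m; the singular values are returned as a vector.
U, s, Vt = np.linalg.svd(M, full_matrices=True)
Sigma = np.zeros((n, m))
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, M)

# Reduced ("thin") SVD: keep only the first min(n, m) columns of U and rows of V^T.
# Truncating further to the rank r of M would only drop zero singular values.
U_r, s_r, Vt_r = np.linalg.svd(M, full_matrices=False)
assert np.allclose(U_r @ np.diag(s_r) @ Vt_r, M)
```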
SVD also holds for complex matrices, but orthogonal matrices must be replaced by unitary matrices, and the transpose by the conjugate transpose. However, here we focus primarily on the results for real matrices, which are more closely related to machine learning. The basic theory of SVD includes its existence, calculation methods, and its connection to optimal low-rank approximation, all of which I will provide my own understanding of later.
In a two-dimensional plane, SVD has a very intuitive geometric meaning. A 2D orthogonal matrix essentially represents a rotation (possibly combined with a reflection, which we can gloss over for geometric intuition). Thus, $\boldsymbol{M}\boldsymbol{x}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\boldsymbol{x}$ means that any linear transformation applied to a (column) vector $\boldsymbol{x}$ can be decomposed into three steps: rotation, stretching, and rotation, as shown in the figure below:
Geometric meaning of SVD
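To make this rotate–stretch–rotate picture concrete, here is a small illustrative sketch (the 2×2 matrix and test vector are made up for the example):

```python
import numpy as np

# A made-up 2x2 linear map, decomposed as rotate -> stretch -> rotate.
M = np.array([[3.0, 1.0],
              [1.0, 2.0]])
U, s, Vt = np.linalg.svd(M)

x = np.array([1.0, 1.0])
step1 = Vt @ x               # rotation (possibly with a reflection)
step2 = np.diag(s) @ step1   # stretching along the axes by the singular values
step3 = U @ step2            # rotation again
assert np.allclose(step3, M @ x)
```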
Some Applications
Whether in theoretical analysis or numerical computation, SVD has wide-ranging applications. One of the underlying principles is that commonly used matrix/vector norms remain invariant under orthogonal transformations. Therefore, the characteristic of SVD having two orthogonal matrices sandwiching a diagonal matrix can often be used to transform many matrix-related optimization objectives into equivalent special cases involving non-negative diagonal matrices, thereby simplifying the problem.
General Solution for Pseudo-Inverse
Taking the pseudo-inverse as an example, when the rank of $\boldsymbol{A}\in\mathbb{R}^{n\times r}$ is $r$, we have
\begin{equation}\boldsymbol{A}^{\dagger} = \mathop{\text{argmin}}_{\boldsymbol{B}\in\mathbb{R}^{r\times n}}\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2\end{equation}
In the previous article, we derived the expression for $\boldsymbol{A}^{\dagger}$ by taking derivatives and then put some effort into generalizing it to the case where the rank of $\boldsymbol{A}$ is less than $r$. However, if we introduce SVD, the problem becomes much simpler. We can decompose $\boldsymbol{A}$ into $\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$ and then represent $\boldsymbol{B}$ as $\boldsymbol{V} \boldsymbol{Z} \boldsymbol{U}^{\top}$. Note that we do not require $\boldsymbol{Z}$ to be a diagonal matrix, so $\boldsymbol{B}=\boldsymbol{V} \boldsymbol{Z} \boldsymbol{U}^{\top}$ can always be achieved. Thus:
\begin{equation}\begin{aligned}
\min_\boldsymbol{B}\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{I}_n\Vert_F^2 =&\ \min_\boldsymbol{Z}\Vert \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\boldsymbol{V} \boldsymbol{Z} \boldsymbol{U}^{\top} - \boldsymbol{I}_n\Vert_F^2 \\
=&\ \min_\boldsymbol{Z}\Vert \boldsymbol{U}(\boldsymbol{\Sigma} \boldsymbol{Z} - \boldsymbol{I}_n) \boldsymbol{U}^{\top}\Vert_F^2 \\
=&\ \min_\boldsymbol{Z}\Vert \boldsymbol{\Sigma} \boldsymbol{Z} - \boldsymbol{I}_n\Vert_F^2
\end{aligned}\end{equation}
The last equality is based on the conclusion we proved in the previous article that "orthogonal transformations do not change the $F$-norm." This simplifies the problem to the pseudo-inverse of the diagonal matrix $\boldsymbol{\Sigma}$. We can then represent $\boldsymbol{\Sigma} \boldsymbol{Z} - \boldsymbol{I}_n$ in block matrix form as:
\begin{equation}\begin{aligned}\boldsymbol{\Sigma} \boldsymbol{Z} - \boldsymbol{I}_n =&\ \begin{pmatrix}\boldsymbol{\Sigma}_{[:r,:r]} \\ \boldsymbol{0}_{(n-r)\times r}\end{pmatrix} \begin{pmatrix}\boldsymbol{Z}_{[:r,:r]} & \boldsymbol{Z}_{[:r,r:]}\end{pmatrix} - \begin{pmatrix}\boldsymbol{I}_r & \boldsymbol{0}_{r\times(n-r)} \\ \boldsymbol{0}_{(n-r)\times r} & \boldsymbol{I}_{n-r}\end{pmatrix} \\
=&\ \begin{pmatrix}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{Z}_{[:r,:r]} - \boldsymbol{I}_r & \boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{Z}_{[:r,r:]}\\ \boldsymbol{0}_{(n-r)\times r} & -\boldsymbol{I}_{n-r}\end{pmatrix}
\end{aligned}\end{equation}
The slices here follow Python array rules. From the final form, it is clear that to minimize the $F$-norm of $\boldsymbol{\Sigma} \boldsymbol{Z} - \boldsymbol{I}_n$, the unique solution is $\boldsymbol{Z}_{[:r,:r]}=\boldsymbol{\Sigma}_{[:r,:r]}^{-1}$ and $\boldsymbol{Z}_{[:r,r:]}=\boldsymbol{0}_{r\times(n-r)}$. Simply put, $\boldsymbol{Z}$ is obtained by transposing $\boldsymbol{\Sigma}$ and then taking the reciprocal of its non-zero (diagonal) elements; we denote this as $\boldsymbol{\Sigma}^{\dagger}$. Thus, under SVD, we have:
\begin{equation}\boldsymbol{A}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\quad\Rightarrow\quad \boldsymbol{A}^{\dagger} = \boldsymbol{V}\boldsymbol{\Sigma}^{\dagger}\boldsymbol{U}^{\top}\end{equation}
It can be further proven that this result also applies to the case where the rank of $\boldsymbol{A}$ is less than $r$, so it is a general form. Some tutorials even use this directly as the definition of the pseudo-inverse. Additionally, we can observe that this form does not distinguish between left and right pseudo-inverses, indicating that the left and right pseudo-inverses of the same matrix are equal. Therefore, when discussing the pseudo-inverse, there is no need to specifically distinguish between left and right.
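As a sanity check of $\boldsymbol{A}^{\dagger} = \boldsymbol{V}\boldsymbol{\Sigma}^{\dagger}\boldsymbol{U}^{\top}$, the following sketch builds the pseudo-inverse from an SVD and compares it against `np.linalg.pinv`; since the Moore-Penrose pseudo-inverse is unique, the two should agree:

```python
import numpy as np

np.random.seed(0)
n, r = 6, 3
A = np.random.randn(n, r)    # generically of full column rank, but lower rank works too

U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Sigma^dagger: transpose the shape of Sigma and invert the non-zero singular values.
tol = max(n, r) * np.finfo(float).eps * s.max()
Sigma_pinv = np.zeros((r, n))
for i, sigma in enumerate(s):
    if sigma > tol:
        Sigma_pinv[i, i] = 1.0 / sigma

A_pinv = Vt.T @ Sigma_pinv @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
```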
Matrix Norms
Using the conclusion that orthogonal transformations do not change the $F$-norm, we can also obtain:
\begin{equation}\Vert \boldsymbol{M}\Vert_F^2 = \Vert \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\Vert_F^2 = \Vert \boldsymbol{\Sigma} \Vert_F^2 = \sum_{i=1}^{\min(n,m)}\sigma_i^2\end{equation}
In other words, the sum of the squares of the singular values equals the square of the $F$-norm. Besides the $F$-norm, SVD can also be used to calculate the "spectral norm." In the previous article, we mentioned that the $F$-norm is just one type of matrix norm; another commonly used matrix norm is the spectral norm induced by vector norms, defined as:
\begin{equation}\Vert \boldsymbol{M}\Vert_2 = \max_{\Vert \boldsymbol{x}\Vert = 1} \Vert \boldsymbol{M}\boldsymbol{x}\Vert\end{equation}
Note that the norms on the right side are vector norms (magnitude, $L_2$-norm), so the above definition is unambiguous. Since it is induced by the vector $2$-norm, it is also called the matrix $2$-norm. Numerically, the spectral norm of a matrix is equal to its largest singular value, i.e., $\Vert \boldsymbol{M}\Vert_2 = \sigma_1$. To prove this, simply perform SVD on $\boldsymbol{M}$ as $\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$ and substitute it into the spectral norm definition:
\begin{equation}\max_{\Vert \boldsymbol{x}\Vert = 1} \Vert \boldsymbol{M}\boldsymbol{x}\Vert = \max_{\Vert \boldsymbol{x}\Vert = 1} \Vert \boldsymbol{U}\boldsymbol{\Sigma} (\boldsymbol{V}^{\top}\boldsymbol{x})\Vert = \max_{\Vert \boldsymbol{y}\Vert = 1} \Vert \boldsymbol{\Sigma} \boldsymbol{y}\Vert\end{equation}
The second equality utilizes the property that orthogonal matrices do not change the vector norm. Now the problem is simplified to the spectral norm of the diagonal matrix $\boldsymbol{\Sigma}$. This is relatively simple; let $\boldsymbol{y} = (y_1,y_2,\cdots,y_m)$, then:
\begin{equation}\Vert \boldsymbol{\Sigma} \boldsymbol{y}\Vert^2 = \sum_{i=1}^m \sigma_i^2 y_i^2 \leq \sum_{i=1}^m \sigma_1^2 y_i^2 = \sigma_1^2\sum_{i=1}^m y_i^2 = \sigma_1^2\end{equation}
So $\Vert \boldsymbol{\Sigma} \boldsymbol{y}\Vert$ does not exceed $\sigma_1$, and the equality holds when $\boldsymbol{y}=(1,0,\cdots,0)$. Therefore, $\Vert \boldsymbol{M}\Vert_2=\sigma_1$. Comparing this with the result for the $F$-norm, we can also see that $\Vert \boldsymbol{M}\Vert_2\leq \Vert \boldsymbol{M}\Vert_F$ always holds.
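Both identities are easy to check numerically; a minimal sketch (`np.linalg.norm(M, 2)` computes the spectral norm of a matrix and `'fro'` the $F$-norm):

```python
import numpy as np

np.random.seed(0)
M = np.random.randn(4, 6)
s = np.linalg.svd(M, compute_uv=False)   # singular values, sorted from largest to smallest

assert np.isclose(np.linalg.norm(M, 'fro') ** 2, np.sum(s ** 2))  # ||M||_F^2 = sum of sigma_i^2
assert np.isclose(np.linalg.norm(M, 2), s[0])                     # ||M||_2 = sigma_1
assert np.linalg.norm(M, 2) <= np.linalg.norm(M, 'fro')           # spectral norm <= F-norm
```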
Low-Rank Approximation
Finally, we return to the main theme of this article: optimal low-rank approximation, which is the objective $\eqref{eq:loss-ab}$. Decomposing $\boldsymbol{M}$ into $\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$, we can write:
\begin{equation}\begin{aligned}
\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 =&\ \Vert \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{A}\boldsymbol{B}\boldsymbol{V}\boldsymbol{V}^{\top} - \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\Vert_F^2 \\
=&\ \Vert \boldsymbol{U}(\boldsymbol{U}^{\top}\boldsymbol{A}\boldsymbol{B}\boldsymbol{V} - \boldsymbol{\Sigma}) \boldsymbol{V}^{\top}\Vert_F^2 \\
=&\ \Vert \boldsymbol{U}^{\top}\boldsymbol{A}\boldsymbol{B}\boldsymbol{V} - \boldsymbol{\Sigma}\Vert_F^2
\end{aligned}\end{equation}
Note that $\boldsymbol{U}^{\top}\boldsymbol{A}\boldsymbol{B}\boldsymbol{V}$ can still represent any matrix with rank not exceeding $r$. Thus, through SVD, we have simplified the optimal $r$-rank approximation of matrix $\boldsymbol{M}$ into the optimal $r$-rank approximation of the non-negative diagonal matrix $\boldsymbol{\Sigma}$.
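This reduction is easy to verify numerically; a minimal sketch with made-up random $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{M}$:

```python
import numpy as np

np.random.seed(0)
n, m, r = 5, 4, 2
M = np.random.randn(n, m)
A = np.random.randn(n, r)
B = np.random.randn(r, m)

U, s, Vt = np.linalg.svd(M)
Sigma = np.zeros((n, m))
Sigma[:len(s), :len(s)] = np.diag(s)

# ||AB - M||_F equals ||U^T A B V - Sigma||_F by orthogonal invariance.
lhs = np.linalg.norm(A @ B - M, 'fro')
rhs = np.linalg.norm(U.T @ A @ B @ Vt.T - Sigma, 'fro')
assert np.isclose(lhs, rhs)
```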
In "Matching Full Fine-Tuning! This is the most brilliant LoRA improvement I've seen (I)", we used a similar approach to solve a related optimization problem:
\begin{equation}\mathop{\text{argmin}}_{\boldsymbol{A},\boldsymbol{B}} \Vert \boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{M} + \boldsymbol{M}\boldsymbol{B}^{\top}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2\end{equation}
Utilizing SVD and the fact that orthogonal transformations do not change the $F$-norm, we can get:
\begin{equation}\begin{aligned}
&\ \Vert \boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{M} + \boldsymbol{M}\boldsymbol{B}^{\top}\boldsymbol{B} - \boldsymbol{M}\Vert_F^2 \\
=&\ \Vert \boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} + \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\boldsymbol{B}^{\top}\boldsymbol{B} - \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\Vert_F^2 \\
=&\ \Vert \boldsymbol{U}\boldsymbol{U}^{\top}\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} + \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\boldsymbol{B}^{\top}\boldsymbol{B}\boldsymbol{V}\boldsymbol{V}^{\top} - \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\Vert_F^2 \\
=&\ \Vert \boldsymbol{U}[(\boldsymbol{U}^{\top}\boldsymbol{A})(\boldsymbol{U}^{\top}\boldsymbol{A})^{\top}\boldsymbol{\Sigma} + \boldsymbol{\Sigma} (\boldsymbol{B}\boldsymbol{V})^{\top} (\boldsymbol{B}\boldsymbol{V}) - \boldsymbol{\Sigma}] \boldsymbol{V}^{\top}\Vert_F^2 \\
=&\ \Vert (\boldsymbol{U}^{\top}\boldsymbol{A})(\boldsymbol{U}^{\top}\boldsymbol{A})^{\top}\boldsymbol{\Sigma} + \boldsymbol{\Sigma} (\boldsymbol{B}\boldsymbol{V})^{\top} (\boldsymbol{B}\boldsymbol{V}) - \boldsymbol{\Sigma}\Vert_F^2 \\
\end{aligned}\end{equation}
This transforms the optimization problem for a general matrix $\boldsymbol{M}$ into the specific case where $\boldsymbol{M}$ is a non-negative diagonal matrix, reducing the difficulty of analysis. Note that if the ranks of $\boldsymbol{A}$ and $\boldsymbol{B}$ do not exceed $r$, then the rank of $\boldsymbol{A}\boldsymbol{A}^{\top}\boldsymbol{M} + \boldsymbol{M}\boldsymbol{B}^{\top}\boldsymbol{B}$ is at most $2r$ (assuming $2r < \min(n,m)$). Thus, the original problem is also seeking the optimal $2r$-rank approximation of $\boldsymbol{M}$, which, after transformation, becomes seeking the optimal $2r$-rank approximation of a non-negative diagonal matrix—essentially the same problem as before.
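The same kind of numerical check applies to this identity as well (again with made-up random matrices):

```python
import numpy as np

np.random.seed(0)
n, m, r = 6, 5, 2
M = np.random.randn(n, m)
A = np.random.randn(n, r)
B = np.random.randn(r, m)

U, s, Vt = np.linalg.svd(M)
Sigma = np.zeros((n, m))
Sigma[:len(s), :len(s)] = np.diag(s)

A2, B2 = U.T @ A, B @ Vt.T   # the transformed factors U^T A and B V
lhs = np.linalg.norm(A @ A.T @ M + M @ B.T @ B - M, 'fro')
rhs = np.linalg.norm(A2 @ A2.T @ Sigma + Sigma @ B2.T @ B2 - Sigma, 'fro')
assert np.isclose(lhs, rhs)
```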
Theoretical Foundation
After confirming the utility of SVD, we need to supplement some theoretical proofs. First, we must ensure the existence of SVD, and second, find at least one calculation scheme, so that the various applications of SVD can be truly feasible. Next, we will use a single process to address both concerns.
Spectral Theorem
Before that, we need to introduce a "Spectral Theorem," which can be seen as either a special case of SVD or its foundation:
Spectral Theorem: For any real symmetric matrix $\boldsymbol{M}\in\mathbb{R}^{n\times n}$, there exists a spectral decomposition (also called eigenvalue decomposition):
\begin{equation}\boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top}\end{equation}
where $\boldsymbol{U},\boldsymbol{\Lambda}\in\mathbb{R}^{n\times n}$, $\boldsymbol{U}$ is an orthogonal matrix, and $\boldsymbol{\Lambda}=\text{diag}(\lambda_1,\cdots,\lambda_n)$ is a diagonal matrix.
Simply put, the spectral theorem asserts that any real symmetric matrix can be diagonalized by an orthogonal matrix, based on the following two properties:
1. The eigenvalues and eigenvectors of a real symmetric matrix are all real;
2. Eigenvectors of a real symmetric matrix corresponding to different eigenvalues are orthogonal.
The proofs of these two properties are actually very simple, so we won't expand on them here. Based on these two points, we can immediately conclude that if the real symmetric matrix $\boldsymbol{M}$ has $n$ distinct eigenvalues, the spectral theorem holds:
\begin{equation}\begin{aligned} \boldsymbol{M}\boldsymbol{u}_1 = \lambda_1 \boldsymbol{u}_1 \\
\boldsymbol{M}\boldsymbol{u}_2 = \lambda_2 \boldsymbol{u}_2\\
\vdots \\
\boldsymbol{M}\boldsymbol{u}_n = \lambda_n \boldsymbol{u}_n\end{aligned} \quad\Rightarrow\quad \boldsymbol{M}\underbrace{(\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_n)}_{\boldsymbol{U}} = \underbrace{(\boldsymbol{u}_1, \boldsymbol{u}_2, \cdots, \boldsymbol{u}_n)}_{\boldsymbol{U}}\underbrace{\begin{pmatrix}\lambda_1 & 0 & \cdots & 0 \\
0 & \lambda_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \lambda_n \\
\end{pmatrix}}_{\boldsymbol{\Lambda}}\end{equation}
where $\lambda_1,\lambda_2,\cdots,\lambda_n$ are the eigenvalues, and $\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_n$ are the corresponding unit eigenvectors. Written in matrix multiplication form, this is $\boldsymbol{M}\boldsymbol{U}=\boldsymbol{U}\boldsymbol{\Lambda}$, so $\boldsymbol{M} = \boldsymbol{U}\boldsymbol{\Lambda} \boldsymbol{U}^{\top}$. The difficulty lies in generalizing to the case of equal eigenvalues. However, before thinking about a complete proof, we can first gain an intuitive, if less rigorous, sense that the result for distinct eigenvalues can definitely be extended to the general case.
Why say this? From a numerical perspective, the probability of two real numbers being exactly equal is essentially zero, so equal eigenvalues can be ignored. In more mathematical terms, real symmetric matrices with pairwise distinct eigenvalues are dense in the set of all real symmetric matrices. Therefore, we can always find a family of real symmetric matrices $\boldsymbol{M}_{\epsilon}$ whose eigenvalues are pairwise distinct for $\epsilon > 0$ and which converges to $\boldsymbol{M}$ as $\epsilon \to 0$. Each $\boldsymbol{M}_{\epsilon}$ can then be decomposed as $\boldsymbol{U}_{\epsilon}\boldsymbol{\Lambda}_{\epsilon}\boldsymbol{U}_{\epsilon}^{\top}$; taking $\epsilon\to 0$ yields a spectral decomposition of $\boldsymbol{M}$.
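Numerically, the spectral decomposition of a real symmetric matrix is available via `np.linalg.eigh`; a minimal sketch of the theorem's statement:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 5)
M = (X + X.T) / 2                       # a random real symmetric matrix

lam, U = np.linalg.eigh(M)              # eigenvalues (ascending) and orthonormal eigenvectors
assert np.allclose(U @ U.T, np.eye(5))            # U is orthogonal
assert np.allclose(U @ np.diag(lam) @ U.T, M)     # M = U Lambda U^T
```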
Mathematical Induction
Unfortunately, the above argument can only serve as an intuitive but non-rigorous understanding, as converting it into a strict proof is quite difficult. In fact, the simplest way to strictly prove the spectral theorem might be mathematical induction—that is, assuming any $(n-1)$-th order real symmetric matrix can be spectrally decomposed, we prove that $\boldsymbol{M}$ can also be spectrally decomposed.
The key idea of the proof is to split the space into an eigenvector direction of $\boldsymbol{M}$ and its $(n-1)$-dimensional orthogonal complement, so that the induction hypothesis can be applied. Specifically, let $\lambda_1$ be an eigenvalue of $\boldsymbol{M}$ (a real symmetric matrix always has at least one real eigenvalue) and $\boldsymbol{u}_1$ a corresponding unit eigenvector, so $\boldsymbol{M}\boldsymbol{u}_1 = \lambda_1 \boldsymbol{u}_1$. We can extend $\boldsymbol{u}_1$ with $n-1$ orthonormal vectors $\boldsymbol{Q}=(\boldsymbol{q}_2,\cdots,\boldsymbol{q}_n)$, all orthogonal to $\boldsymbol{u}_1$, so that $(\boldsymbol{u}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_n)=(\boldsymbol{u}_1,\boldsymbol{Q})$ is an orthogonal matrix. Now consider:
\begin{equation}(\boldsymbol{u}_1,\boldsymbol{Q})^{\top} \boldsymbol{M} (\boldsymbol{u}_1, \boldsymbol{Q}) = \begin{pmatrix}\boldsymbol{u}_1^{\top} \boldsymbol{M} \boldsymbol{u}_1 & \boldsymbol{u}_1^{\top} \boldsymbol{M} \boldsymbol{Q} \\ \boldsymbol{Q}^{\top} \boldsymbol{M} \boldsymbol{u}_1 & \boldsymbol{Q}^{\top} \boldsymbol{M} \boldsymbol{Q}\end{pmatrix} = \begin{pmatrix}\lambda_1 & \boldsymbol{0}_{1\times (n-1)} \\ \boldsymbol{0}_{(n-1)\times 1} & \boldsymbol{Q}^{\top} \boldsymbol{M} \boldsymbol{Q}\end{pmatrix}\end{equation}
Note that $\boldsymbol{Q}^{\top} \boldsymbol{M} \boldsymbol{Q}$ is an $(n-1)$-th order square matrix and clearly a real symmetric matrix. Thus, by hypothesis, it can be spectrally decomposed into $\boldsymbol{V} \boldsymbol{\Lambda}_2 \boldsymbol{V}^{\top}$, where $\boldsymbol{V}$ is an $(n-1)$-th order orthogonal matrix and $\boldsymbol{\Lambda}_2$ is an $(n-1)$-th order diagonal matrix, giving $(\boldsymbol{Q}\boldsymbol{V})^{\top} \boldsymbol{M} \boldsymbol{Q}\boldsymbol{V}= \boldsymbol{\Lambda}_2$. Based on this, consider $\boldsymbol{U} = (\boldsymbol{u}_1, \boldsymbol{Q}\boldsymbol{V})$, which we can verify is also an orthogonal matrix, and:
\begin{equation}\boldsymbol{U}^{\top}\boldsymbol{M} \boldsymbol{U} = (\boldsymbol{u}_1,\boldsymbol{Q}\boldsymbol{V})^{\top} \boldsymbol{M} (\boldsymbol{u}_1, \boldsymbol{Q}\boldsymbol{V}) = \begin{pmatrix}\lambda_1 & \boldsymbol{0}_{1\times (n-1)} \\ \boldsymbol{0}_{(n-1)\times 1} & \boldsymbol{\Lambda}_2\end{pmatrix}\end{equation}
That is, $\boldsymbol{U}$ is precisely the orthogonal matrix that can diagonalize $\boldsymbol{M}$, so $\boldsymbol{M}$ can complete the spectral decomposition, which fulfills the most critical step of mathematical induction.
Singular Value Decomposition
At this point, all the preparations are in place, and we can formally prove the existence of SVD while also providing a practical scheme to compute it.
In the previous section, we introduced spectral decomposition. It is easy to notice its similarity to SVD, but there are two obvious differences: 1. Spectral decomposition only applies to real symmetric matrices, whereas SVD applies to any real matrix; 2. The diagonal matrix $\boldsymbol{\Sigma}$ in SVD is non-negative, whereas $\boldsymbol{\Lambda}$ in spectral decomposition might not be. So, what exactly is the connection between them? It is easy to verify that if the SVD of $\boldsymbol{M}$ is $\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$, then:
\begin{equation}\begin{aligned}
\boldsymbol{M}\boldsymbol{M}^{\top} = \boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\boldsymbol{V}\boldsymbol{\Sigma}^{\top} \boldsymbol{U}^{\top} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top} \boldsymbol{U}^{\top}\\
\boldsymbol{M}^{\top}\boldsymbol{M} = \boldsymbol{V}\boldsymbol{\Sigma}^{\top} \boldsymbol{U}^{\top}\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top} = \boldsymbol{V}\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}\\
\end{aligned}\end{equation}
Since $\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top}$ and $\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma}$ are both diagonal matrices, the spectral decompositions of $\boldsymbol{M}\boldsymbol{M}^{\top}$ and $\boldsymbol{M}^{\top}\boldsymbol{M}$ are $\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{\top} \boldsymbol{U}^{\top}$ and $\boldsymbol{V}\boldsymbol{\Sigma}^{\top}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$ respectively. Does this suggest that performing spectral decompositions of $\boldsymbol{M}\boldsymbol{M}^{\top}$ and $\boldsymbol{M}^{\top}\boldsymbol{M}$ separately yields the SVD of $\boldsymbol{M}$? Yes, and this can serve as a way to compute the SVD. However, this alone does not prove that the resulting $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}$ satisfy $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$, since each spectral decomposition fixes its orthogonal matrix only up to choices such as column signs, and the two choices must be made consistently.
The key to solving the problem is to perform spectral decomposition on only one of $\boldsymbol{M}\boldsymbol{M}^{\top}$ or $\boldsymbol{M}^{\top}\boldsymbol{M}$ and then construct the orthogonal matrix on the other side by other means. Without loss of generality, let the rank of $\boldsymbol{M}$ be $r \leq m$. Consider performing the spectral decomposition of $\boldsymbol{M}^{\top}\boldsymbol{M}$ as $\boldsymbol{V}\boldsymbol{\Lambda} \boldsymbol{V}^{\top}$. Note that $\boldsymbol{M}^{\top}\boldsymbol{M}$ is a positive semi-definite matrix, so $\boldsymbol{\Lambda}$ is non-negative. Assuming the diagonal elements are already sorted from largest to smallest, a rank of $r$ means only the first $r$ eigenvalues $\lambda_i$ are greater than 0. We define:
\begin{equation}\boldsymbol{\Sigma}_{[:r,:r]} = (\boldsymbol{\Lambda}_{[:r,:r]})^{1/2},\quad \boldsymbol{U}_{[:n,:r]} = \boldsymbol{M}\boldsymbol{V}_{[:m,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1}\end{equation}
It can be verified that:
\begin{equation}\begin{aligned}
\boldsymbol{U}_{[:n,:r]}^{\top}\boldsymbol{U}_{[:n,:r]} =&\ \boldsymbol{\Sigma}_{[:r,:r]}^{-1}\boldsymbol{V}_{[:m,:r]}^{\top} \boldsymbol{M}^{\top}\boldsymbol{M}\boldsymbol{V}_{[:m,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1} \\
=&\ \boldsymbol{\Sigma}_{[:r,:r]}^{-1}\boldsymbol{V}_{[:m,:r]}^{\top} \boldsymbol{V}\boldsymbol{\Lambda} \boldsymbol{V}^{\top}\boldsymbol{V}_{[:m,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1} \\
=&\ \boldsymbol{\Sigma}_{[:r,:r]}^{-1}\boldsymbol{I}_{[:r,:m]}\boldsymbol{\Lambda} \boldsymbol{I}_{[:m,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1} \\
=&\ \boldsymbol{\Sigma}_{[:r,:r]}^{-1}\boldsymbol{\Lambda}_{[:r,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1} \\
=&\ \boldsymbol{I}_r \\
\end{aligned}\end{equation}
The convention here is that slice priority is higher than matrix operations like transpose and inverse, e.g., $\boldsymbol{U}_{[:n,:r]}^{\top}=(\boldsymbol{U}_{[:n,:r]})^{\top}$, $\boldsymbol{\Sigma}_{[:r,:r]}^{-1}=(\boldsymbol{\Sigma}_{[:r,:r]})^{-1}$, etc. The above result indicates that $\boldsymbol{U}_{[:n,:r]}$ is part of an orthogonal matrix. Then we have:
\begin{equation}\boldsymbol{U}_{[:n,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:m,:r]}^{\top} = \boldsymbol{M}\boldsymbol{V}_{[:m,:r]}\boldsymbol{\Sigma}_{[:r,:r]}^{-1}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:m,:r]}^{\top} = \boldsymbol{M}\boldsymbol{V}_{[:m,:r]}\boldsymbol{V}_{[:m,:r]}^{\top}\end{equation}
Note that $\boldsymbol{M}\boldsymbol{V}\boldsymbol{V}^{\top} = \boldsymbol{M}$ always holds, and $\boldsymbol{V}_{[:m,:r]}$ is the first $r$ columns of $\boldsymbol{V}$. According to $\boldsymbol{M}^{\top}\boldsymbol{M}=\boldsymbol{V}\boldsymbol{\Lambda} \boldsymbol{V}^{\top}$, we can write $(\boldsymbol{M}\boldsymbol{V})^{\top}\boldsymbol{M}\boldsymbol{V} = \boldsymbol{\Lambda}$. Let $\boldsymbol{V}=(\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_m)$, then $\Vert \boldsymbol{M}\boldsymbol{v}_i\Vert^2=\lambda_i$. Due to the rank-$r$ setting, when $i > r$, $\lambda_i=0$, which implies that $\boldsymbol{M}\boldsymbol{v}_i$ is actually a zero vector. Therefore:
\begin{equation}\begin{aligned}\boldsymbol{M} = \boldsymbol{M}\boldsymbol{V}\boldsymbol{V}^{\top} =&\ (\boldsymbol{M}\boldsymbol{V}_{[:m,:r]}, \boldsymbol{M}\boldsymbol{V}_{[:m,r:]})\begin{pmatrix}\boldsymbol{V}_{[:m,:r]}^{\top} \\ \boldsymbol{V}_{[:m,r:]}^{\top}\end{pmatrix} \\[8pt]
=&\ (\boldsymbol{M}\boldsymbol{V}_{[:m,:r]}, \boldsymbol{0}_{n\times(m-r)} )\begin{pmatrix}\boldsymbol{V}_{[:m,:r]}^{\top} \\ \boldsymbol{V}_{[:m,r:]}^{\top}\end{pmatrix}\\[8pt]
=&\ \boldsymbol{M}\boldsymbol{V}_{[:m,:r]}\boldsymbol{V}_{[:m,:r]}^{\top}
\end{aligned}\end{equation}
This shows that $\boldsymbol{U}_{[:n,:r]}\boldsymbol{\Sigma}_{[:r,:r]}\boldsymbol{V}_{[:m,:r]}^{\top}=\boldsymbol{M}$. Combined with the fact that $\boldsymbol{U}_{[:n,:r]}$ is part of an orthogonal matrix, we have already obtained the critical parts of the SVD of $\boldsymbol{M}$. We only need to pad $\boldsymbol{\Sigma}_{[:r,:r]}$ with zeros to an $n\times m$ size $\boldsymbol{\Sigma}$, and complete $\boldsymbol{U}_{[:n,:r]}$ into an $n\times n$ orthogonal matrix $\boldsymbol{U}$, thus obtaining the full SVD form $\boldsymbol{M}=\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$.
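The construction above translates directly into code. Here is a sketch that assumes a random $\boldsymbol{M}$ of full column rank (so that $r = m$) for simplicity:

```python
import numpy as np

np.random.seed(0)
n, m = 6, 4
M = np.random.randn(n, m)               # generically of full column rank, so r = m

lam, V = np.linalg.eigh(M.T @ M)        # spectral decomposition of M^T M
order = np.argsort(lam)[::-1]           # sort eigenvalues from largest to smallest
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))          # singular values
U_r = M @ V @ np.diag(1.0 / sigma)                # U_{[:n,:r]} = M V_{[:m,:r]} Sigma_{[:r,:r]}^{-1}

assert np.allclose(U_r.T @ U_r, np.eye(m))        # columns of U_r are orthonormal
assert np.allclose(U_r @ np.diag(sigma) @ V.T, M) # M = U_{[:n,:r]} Sigma_{[:r,:r]} V_{[:m,:r]}^T
```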
Approximation Theorem
Finally, we must not forget our initial goal: the optimization problem $\eqref{eq:loss-ab}$. With SVD, we can provide the answer:
If the SVD of $\boldsymbol{M}\in\mathbb{R}^{n\times m}$ is $\boldsymbol{U}\boldsymbol{\Sigma} \boldsymbol{V}^{\top}$, then the optimal $r$-rank approximation of $\boldsymbol{M}$ is $\boldsymbol{U}_{[:n,:r]}\boldsymbol{\Sigma}_{[:r,:r]} \boldsymbol{V}_{[:m,:r]}^{\top}$.
This is known as the "Eckart-Young-Mirsky Theorem." In the "Low-Rank Approximation" section of SVD applications, we showed that through SVD, the optimal $r$-rank approximation problem for a general matrix can be simplified to that for a non-negative diagonal matrix. Therefore, the "Eckart-Young-Mirsky Theorem" is equivalent to saying that the optimal $r$-rank approximation of a non-negative diagonal matrix is the matrix formed by retaining only the $r$ largest diagonal elements.
Some readers might think, "Doesn't this obviously follow?" The reality is that although the conclusion is quite intuitive, it is not obviously true. Let's focus on solving:
\begin{equation}\min_{\boldsymbol{A},\boldsymbol{B}}\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{\Sigma}\Vert_F^2\end{equation}
where $\boldsymbol{A}\in\mathbb{R}^{n\times r}, \boldsymbol{B}\in\mathbb{R}^{r\times m}, \boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}, r < \min(n,m)$. If $\boldsymbol{A}$ is given, the optimal solution for $\boldsymbol{B}$ was found in the previous article to be $\boldsymbol{A}^{\dagger} \boldsymbol{\Sigma}$. Thus we have:
\begin{equation}\min_{\boldsymbol{A},\boldsymbol{B}}\Vert \boldsymbol{A}\boldsymbol{B} - \boldsymbol{\Sigma}\Vert_F^2 = \min_\boldsymbol{A}\Vert (\boldsymbol{A}\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{\Sigma}\Vert_F^2\end{equation}
Let the SVD of matrix $\boldsymbol{A}$ be $\boldsymbol{U}_\boldsymbol{A}\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{V}_\boldsymbol{A}^{\top}$, then $\boldsymbol{A}^{\dagger}=\boldsymbol{V}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} \boldsymbol{U}_\boldsymbol{A}^{\top}$, and:
\begin{equation}\begin{aligned}
\Vert (\boldsymbol{A}\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{\Sigma}\Vert_F^2 =&\ \Vert (\boldsymbol{U}_\boldsymbol{A}\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{V}_\boldsymbol{A}^{\top}\boldsymbol{V}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} \boldsymbol{U}_\boldsymbol{A}^{\top} - \boldsymbol{I}_n)\boldsymbol{\Sigma}\Vert_F^2 \\
=&\ \Vert (\boldsymbol{U}_\boldsymbol{A}\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} \boldsymbol{U}_\boldsymbol{A}^{\top} - \boldsymbol{I}_n)\boldsymbol{\Sigma}\Vert_F^2 \\
=&\ \Vert \boldsymbol{U}_\boldsymbol{A} (\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{U}_\boldsymbol{A}^{\top}\boldsymbol{\Sigma}\Vert_F^2 \\
=&\ \Vert (\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{U}_\boldsymbol{A}^{\top}\boldsymbol{\Sigma}\Vert_F^2 \\
\end{aligned}\end{equation}
From the pseudo-inverse calculation formula, it's known that $\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger}$ is a diagonal matrix where the first $r_\boldsymbol{A}$ diagonal elements are 1 ($r_\boldsymbol{A}\leq r$ is the rank of $\boldsymbol{A}$) and the rest are 0. Thus, $(\boldsymbol{\Sigma}_\boldsymbol{A} \boldsymbol{\Sigma}_\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{U}_\boldsymbol{A}^{\top}$ effectively retains only the last $k=n-r_\boldsymbol{A}$ rows of the orthogonal matrix $\boldsymbol{U}_\boldsymbol{A}^{\top}$. This finally simplifies to:
\begin{equation}\min_\boldsymbol{A}\Vert (\boldsymbol{A}\boldsymbol{A}^{\dagger} - \boldsymbol{I}_n)\boldsymbol{\Sigma}\Vert_F^2 = \min_{k,\boldsymbol{U}}\Vert \boldsymbol{U}\boldsymbol{\Sigma}\Vert_F^2\quad\text{s.t.}\quad k\geq n-r, \boldsymbol{U}\in\mathbb{R}^{k\times n}, \boldsymbol{U}\boldsymbol{U}^{\top} = \boldsymbol{I}_k\end{equation}
Now, according to the $F$-norm definition, we can write:
\begin{equation}\Vert \boldsymbol{U}\boldsymbol{\Sigma}\Vert_F^2=\sum_{i=1}^k \sum_{j=1}^n u_{i,j}^2 \sigma_j^2 =\sum_{j=1}^n \sigma_j^2 \underbrace{\sum_{i=1}^k u_{i,j}^2}_{w_j}=\sum_{j=1}^n \sigma_j^2 w_j\end{equation}
Note that $0 \leq w_j \leq 1$, since each column of $\boldsymbol{U}$ is part of a unit-norm column of a full orthogonal matrix, and $w_1+w_2+\cdots+w_n = k$, since the $k$ rows of $\boldsymbol{U}$ are orthonormal. Under these constraints the right-hand side, being linear in the $w_j$, is minimized by placing weight $1$ on the $k$ smallest values of $\sigma_j^2$, so its minimum is the sum of the $k$ smallest $\sigma_j^2$. Since the $\sigma_j$ are sorted from largest to smallest, we have:
\begin{equation}\min_{k,\boldsymbol{U}}\Vert \boldsymbol{U}\boldsymbol{\Sigma}\Vert_F^2=\min_k \sum_{j=n-k+1}^n \sigma_j^2 = \sum_{j=r+1}^n \sigma_j^2\end{equation}
That is to say, the error (squared $F$-norm) between $\boldsymbol{\Sigma}$ and its optimal $r$-rank approximation is $\sum_{j=r+1}^n \sigma_j^2$, which is exactly the error produced by retaining the $r$ largest diagonal elements. Hence, we have proven that "the optimal $r$-rank approximation of a non-negative diagonal matrix is simply the matrix formed by keeping only the $r$ largest diagonal elements." Of course, this only identifies one solution; it doesn't rule out the possibility of multiple solutions.
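To close the loop, here is a sketch that checks the error formula numerically and confirms that random rank-$r$ candidates do not beat the truncated SVD:

```python
import numpy as np

np.random.seed(0)
n, m, r = 8, 6, 2
M = np.random.randn(n, m)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]       # truncated SVD: keep the r largest singular values
best = np.linalg.norm(M_r - M, 'fro') ** 2
assert np.isclose(best, np.sum(s[r:] ** 2))       # error = sum of the tail sigma_j^2

# Random rank-r candidates AB should never do better than the truncated SVD.
for _ in range(100):
    A = np.random.randn(n, r)
    B = np.random.randn(r, m)
    assert np.linalg.norm(A @ B - M, 'fro') ** 2 >= best - 1e-9
```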
It is worth noting that the Eckart-Young-Mirsky Theorem holds not only for the $F$-norm but also for the spectral norm. The proof for the spectral norm is actually slightly simpler and will not be expanded upon here. Interested readers can refer to the Wikipedia entry on "Low-rank approximation."
Article Summary
The protagonist of this article is the renowned SVD (Singular Value Decomposition), which many readers are likely already familiar with. We revolved around SVD and low-rank approximation, giving proofs, as simple as possible, of SVD's existence, a way to compute it, and its connection to optimal low-rank approximation.