By 苏剑林 (Jianlin Su) | April 26, 2025
SVD (Singular Value Decomposition) is a common matrix decomposition algorithm that many readers are likely familiar with; we previously introduced it in "The Road to Low-Rank Approximation (II): SVD". However, have you ever considered that SVD is actually differentiable? When I first learned of this conclusion, I was quite surprised, as intuition suggests that "decompositions" are often non-differentiable. But the truth is, SVD is indeed differentiable in general cases, which means that theoretically, we can embed SVD into models and train them end-to-end using gradient-based optimizers.
The question, then, is: since SVD is differentiable, what do its derivatives actually look like? In what follows, we refer to the paper "Differentiating the Singular Value Decomposition" and derive the differentiation formulas for SVD step by step.
Assume $\boldsymbol{W}$ is a full-rank $n \times n$ matrix and all singular values are mutually distinct. This is a relatively easy case to discuss, and we will later address which conditions can be relaxed. Next, we set the SVD of $\boldsymbol{W}$ as:
\begin{equation}\boldsymbol{W} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}\end{equation}The so-called SVD differentiation involves finding the gradients or differentials of $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}$ with respect to $\boldsymbol{W}$. To this end, we first take the differential of both sides of the above equation:
\begin{equation}d\boldsymbol{W} = (d\boldsymbol{U})\boldsymbol{\Sigma}\boldsymbol{V}^{\top} + \boldsymbol{U}(d\boldsymbol{\Sigma})\boldsymbol{V}^{\top} + \boldsymbol{U}\boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\end{equation}Multiplying by $\boldsymbol{U}^{\top}$ on the left and $\boldsymbol{V}$ on the right, and using $\boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}, \boldsymbol{V}^{\top}\boldsymbol{V} = \boldsymbol{I}$, we obtain:
\begin{equation}\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V} = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}\label{eq:core}\end{equation}This is the foundation for the subsequent derivation. Note that by differentiating the identities $\boldsymbol{U}^{\top}\boldsymbol{U} = \boldsymbol{I}$ and $\boldsymbol{V}^{\top}\boldsymbol{V} = \boldsymbol{I}$, we also get:
\begin{equation}(d\boldsymbol{U})^{\top}\boldsymbol{U} + \boldsymbol{U}^{\top}(d\boldsymbol{U}) = \boldsymbol{0},\quad (d\boldsymbol{V})^{\top}\boldsymbol{V} + \boldsymbol{V}^{\top}(d\boldsymbol{V}) = \boldsymbol{0}\end{equation}This indicates that $\boldsymbol{U}^{\top}(d\boldsymbol{U})$ and $(d\boldsymbol{V})^{\top}\boldsymbol{V}$ are both skew-symmetric matrices.
A characteristic of skew-symmetric matrices is that their diagonal elements are all zero. Since $\boldsymbol{\Sigma}$ is a diagonal matrix where off-diagonal elements are zero, this suggests we might need to handle the diagonal and off-diagonal elements separately.
First, we define matrices $\boldsymbol{I}$ and $\bar{\boldsymbol{I}}$: $\boldsymbol{I}$ is the identity matrix, where diagonal elements are all 1 and off-diagonal elements are 0; $\bar{\boldsymbol{I}}$ is the complementary matrix of the identity, where diagonal elements are all 0 and off-diagonal elements are 1. Using $\boldsymbol{I}$, $\bar{\boldsymbol{I}}$, and the Hadamard product $\otimes$, we can respectively extract the diagonal and off-diagonal parts of Eq $\eqref{eq:core}$:
\begin{align} \boldsymbol{I}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) =&\, \boldsymbol{I}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}) = d\boldsymbol{\Sigma} \label{eq:value} \\[8pt] \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) =&\, \bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}\label{eq:vector} \end{align}Now let's first look at Eq $\eqref{eq:value}$, which can be equivalently written as:
\begin{equation}d\sigma_i = \boldsymbol{u}_i^{\top}(d\boldsymbol{W})\boldsymbol{v}_i\label{eq:d-sigma}\end{equation}This is the differential of the $i$-th singular value $\sigma_i$, where $\boldsymbol{u}_i, \boldsymbol{v}_i$ are the $i$-th columns of $\boldsymbol{U}$ and $\boldsymbol{V}$ respectively. The spectral norm gradient discussed in "Reflections from Spectral Norm Gradients to New Forms of Weight Decay" is actually just a special case where $i=1$.
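Eq $\eqref{eq:d-sigma}$ is easy to verify numerically. Below is a minimal NumPy sketch (illustrative only; a random square matrix almost surely has distinct singular values) comparing the prediction $d\sigma_i = \boldsymbol{u}_i^{\top}(d\boldsymbol{W})\boldsymbol{v}_i$ with a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.standard_normal((n, n))
dW = rng.standard_normal((n, n))   # an arbitrary perturbation direction
eps = 1e-6

U, s, Vh = np.linalg.svd(W)
# predicted first-order change: dσ_i = u_i^T dW v_i
pred = np.array([U[:, i] @ dW @ Vh[i, :] for i in range(n)])
# finite-difference change of the singular values
num = (np.linalg.svd(W + eps * dW, compute_uv=False) - s) / eps

print(np.max(np.abs(pred - num)))  # should be small (on the order of eps)
```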
Then we look at Eq $\eqref{eq:vector}$ again:
\begin{equation}\bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V}\label{eq:vector-1}\end{equation}Taking the transpose:
\begin{equation}\begin{aligned} \bar{\boldsymbol{I}}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) =&\, \boldsymbol{\Sigma}(d\boldsymbol{U})^{\top}\boldsymbol{U} + \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} \\[6pt] =&\, -\boldsymbol{\Sigma}\boldsymbol{U}^{\top}(d\boldsymbol{U}) - (d\boldsymbol{V})^{\top}\boldsymbol{V}\boldsymbol{\Sigma} \end{aligned}\label{eq:vector-2}\end{equation}The second equals sign uses the fact that "$\boldsymbol{U}^{\top}(d\boldsymbol{U})$ and $(d\boldsymbol{V})^{\top}\boldsymbol{V}$ are skew-symmetric matrices." Eq $\eqref{eq:vector-1}$ and Eq $\eqref{eq:vector-2}$ form a system of linear equations for $d\boldsymbol{U}$ and $d\boldsymbol{V}$, which we need to solve.
The solution approach is simple elimination. First, from $\eqref{eq:vector-1}\times\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\times\eqref{eq:vector-2}$, we get:
\begin{equation}\bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) = \boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma}^2 - \boldsymbol{\Sigma}^2\boldsymbol{U}^{\top}(d\boldsymbol{U})\end{equation}Here we use the fact that diagonal matrix $\boldsymbol{\Sigma}$ satisfies $\boldsymbol{\Sigma}(\bar{\boldsymbol{I}}\otimes \boldsymbol{M}) = \bar{\boldsymbol{I}}\otimes (\boldsymbol{\Sigma}\boldsymbol{M})$ and $(\bar{\boldsymbol{I}}\otimes \boldsymbol{M})\boldsymbol{\Sigma} = \bar{\boldsymbol{I}}\otimes (\boldsymbol{M}\boldsymbol{\Sigma})$. We know that multiplying on the left (right) by a diagonal matrix is equivalent to multiplying each row (column) of the matrix by the corresponding element of the diagonal matrix. So, if we define matrix $\boldsymbol{E}$ where $\boldsymbol{E}_{i,j} = \sigma_j^2 - \sigma_i^2$, then $\boldsymbol{U}^{\top}(d\boldsymbol{U})\boldsymbol{\Sigma}^2 - \boldsymbol{\Sigma}^2\boldsymbol{U}^{\top}(d\boldsymbol{U}) = \boldsymbol{E}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U}))$. Thus, the above equation becomes:
\begin{equation}\bar{\boldsymbol{I}}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}) = \boldsymbol{E}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U}))\label{eq:dU-0}\end{equation}Dividing elementwise by $\boldsymbol{E}$ on the off-diagonal entries (the diagonal of $\boldsymbol{U}^{\top}(d\boldsymbol{U})$ is zero anyway, by skew-symmetry) and then left-multiplying by $\boldsymbol{U}$, using $\boldsymbol{U}\boldsymbol{U}^{\top}=\boldsymbol{I}$, we can solve:
\begin{equation}d\boldsymbol{U} = \boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}))\label{eq:dU}\end{equation}Similarly, from $\boldsymbol{\Sigma}\times \eqref{eq:vector-1} + \eqref{eq:vector-2}\times \boldsymbol{\Sigma}$, we solve for:
\begin{equation}d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{F}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}))\label{eq:dV}\end{equation}Eq $\eqref{eq:dU}, \eqref{eq:dV}$ are the differentials of the singular vectors. Where:
\begin{equation}\boldsymbol{F}_{i,j} = \left\{\begin{aligned} &\, 1/(\sigma_j^2 - \sigma_i^2), &\, i\neq j \\ &\, 0, &\, i = j \end{aligned}\right.\end{equation}Now that we have the differentials, how do we find the gradients? This is actually a little tricky, not technically, but in terms of how to represent the result. For example, if $\boldsymbol{W}$ and $\boldsymbol{U}$ are both $n \times n$ matrices, the full gradient of $\boldsymbol{U}$ with respect to $\boldsymbol{W}$ is an $n \times n \times n \times n$ fourth-order tensor, and higher-order tensors are unfamiliar objects to most people, myself included.
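As an aside, Eq $\eqref{eq:dU}$ can be checked numerically in the same way as Eq $\eqref{eq:d-sigma}$; the only extra care needed is fixing the sign ambiguity of the singular vectors before differencing. A NumPy sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.standard_normal((n, n))
dW = rng.standard_normal((n, n))
eps = 1e-6

def svd_aligned(M, U_ref=None):
    U, s, Vh = np.linalg.svd(M)
    if U_ref is not None:
        # flip column signs so that U matches the reference orientation
        signs = np.sign(np.sum(U * U_ref, axis=0))
        U, Vh = U * signs, Vh * signs[:, None]
    return U, s, Vh

U, s, Vh = svd_aligned(W)
U2, _, _ = svd_aligned(W + eps * dW, U_ref=U)
V, S = Vh.T, np.diag(s)

# F_ij = 1/(σ_j^2 − σ_i^2) off the diagonal, 0 on the diagonal
F = np.zeros((n, n))
off = ~np.eye(n, dtype=bool)
denom = s[None, :] ** 2 - s[:, None] ** 2
F[off] = 1.0 / denom[off]

dU_formula = U @ (F * (U.T @ dW @ V @ S + S @ V.T @ dW.T @ U))
dU_numeric = (U2 - U) / eps
print(np.max(np.abs(dU_formula - dU_numeric)))  # should be small
```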
To bypass the trouble of higher-order tensors, we have two alternatives. The first is a programming perspective: we don't actually need to derive the explicit form of the gradient. Instead, based on the differential result, we can write a forward computation with an equivalent gradient and leave the rest to the framework's automatic differentiation. For instance, from Eq $\eqref{eq:d-sigma}$ we can conclude that the gradient of $\sigma_i$ equals the gradient of $\sg{\boldsymbol{u}_i}^{\top} \boldsymbol{W} \sg{\boldsymbol{v}_i}$, i.e.,
\begin{equation}\nabla_{\boldsymbol{W}} \sigma_i = \nabla_{\boldsymbol{W}} (\sg{\boldsymbol{u}_i}^{\top} \boldsymbol{W} \sg{\boldsymbol{v}_i})\end{equation}Here a change of symbol color, as in $\sg{\blacksquare}$, denotes the stop_gradient operator applied to that quantity; using color keeps the formulas from becoming too bloated. Since $\sigma_i$ also happens to equal $\boldsymbol{u}_i^{\top}\boldsymbol{W}\boldsymbol{v}_i$, we only need to replace every occurrence of $\sigma_i$ in the code with $\sg{\boldsymbol{u}_i}^{\top} \boldsymbol{W} \sg{\boldsymbol{v}_i}$ to automatically obtain the correct gradient. More generally, we have:
\begin{equation}\nabla_{\boldsymbol{W}} \boldsymbol{\Sigma} = \nabla_{\boldsymbol{W}} (\boldsymbol{I}\otimes(\sg{\boldsymbol{U}}^{\top} \boldsymbol{W} \sg{\boldsymbol{V}}))\end{equation}That is, simply replace every $\boldsymbol{\Sigma}$ with $\boldsymbol{I}\otimes(\sg{\boldsymbol{U}}^{\top} \boldsymbol{W} \sg{\boldsymbol{V}})$.
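In a framework such as PyTorch, stop_gradient is just `.detach()`, so this replacement is a one-liner. A minimal sketch for a single singular value (the spectral norm, i.e. $i=1$):

```python
import torch

torch.manual_seed(0)
W = torch.randn(5, 5, dtype=torch.float64, requires_grad=True)

U, S, Vh = torch.linalg.svd(W)
# forward value equals Σ; backward only sees the W in the middle
S_equiv = torch.diagonal(U.detach().T @ W @ Vh.detach().T)

sigma1 = S_equiv[0]        # spectral norm of W
sigma1.backward()
# gradient of σ_1 should be u_1 v_1^T
print(torch.allclose(W.grad, (U[:, :1] @ Vh[:1, :]).detach()))
```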
Similarly, from Eq $\eqref{eq:dU}$, we know:
\begin{equation}\nabla_{\boldsymbol{W}}\boldsymbol{U} = \nabla_{\boldsymbol{W}}(\sg{\boldsymbol{U}}(\sg{\boldsymbol{F}}\otimes(\sg{\boldsymbol{U}}^{\top}\boldsymbol{W}\sg{\boldsymbol{V}\boldsymbol{\Sigma}} + \sg{\boldsymbol{V}\boldsymbol{\Sigma}}^{\top}\boldsymbol{W}^{\top}\sg{\boldsymbol{U}})))\end{equation}It can be verified that $\boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}\boldsymbol{W}\boldsymbol{V}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}\boldsymbol{W}^{\top}\boldsymbol{U}))$ is exactly the zero matrix, so we only need to replace every occurrence of $\boldsymbol{U}$ in the code with:
\begin{equation}\boldsymbol{U} \quad \to \quad \sg{\boldsymbol{U}} + \sg{\boldsymbol{U}}(\sg{\boldsymbol{F}}\otimes(\sg{\boldsymbol{U}}^{\top}\boldsymbol{W}\sg{\boldsymbol{V}\boldsymbol{\Sigma}} + \sg{\boldsymbol{V}\boldsymbol{\Sigma}}^{\top}\boldsymbol{W}^{\top}\sg{\boldsymbol{U}}))\end{equation}This leaves the forward result unchanged while producing the correct gradient. By the same principle, the replacement for $\boldsymbol{V}$ is:
\begin{equation}\boldsymbol{V} \quad \to \quad \sg{\boldsymbol{V}} + \sg{\boldsymbol{V}}(\sg{\boldsymbol{F}}\otimes(\sg{\boldsymbol{V}}^{\top}\boldsymbol{W}^{\top}\sg{\boldsymbol{U}\boldsymbol{\Sigma}} + \sg{\boldsymbol{U}\boldsymbol{\Sigma}}^{\top}\boldsymbol{W}\sg{\boldsymbol{V}}))\end{equation}The second option is to solve directly for the gradient of the loss function with respect to $\boldsymbol{W}$. Specifically, suppose the loss is a function of $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}$, denoted $\mathcal{L}(\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V})$, and we compute $\nabla_{\boldsymbol{W}}\mathcal{L}$ directly. This is a matrix, and it can be expressed in terms of $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}, \nabla_{\boldsymbol{U}}\mathcal{L}, \nabla_{\boldsymbol{\Sigma}}\mathcal{L}, \nabla_{\boldsymbol{V}}\mathcal{L}$, all of which are also just matrices, so no higher-order tensors are involved.
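Before working through the second option, here is a PyTorch sketch collecting the three replacements above into one function (assuming, as before, a square full-rank $\boldsymbol{W}$ with distinct singular values); the forward values are untouched, while the gradients come entirely from the equivalent forms:

```python
import torch

def svd_with_grad(W):
    """SVD whose outputs carry gradients via the stop_gradient replacements above.
    A sketch only: assumes W is square, full-rank, with distinct singular values."""
    U, S, Vh = torch.linalg.svd(W)
    U, S, V = U.detach(), S.detach(), Vh.T.detach()
    n = W.shape[0]

    # F_ij = 1/(σ_j^2 − σ_i^2) off the diagonal, 0 on the diagonal
    F = torch.zeros(n, n, dtype=W.dtype, device=W.device)
    off = ~torch.eye(n, dtype=torch.bool, device=W.device)
    denom = S[None, :] ** 2 - S[:, None] ** 2
    F[off] = 1.0 / denom[off]

    Sig = torch.diag(S)
    S_eq = torch.diagonal(U.T @ W @ V)                                    # Σ (its diagonal)
    U_eq = U + U @ (F * (U.T @ W @ (V @ Sig) + (V @ Sig).T @ W.T @ U))    # equals U in forward
    V_eq = V + V @ (F * (V.T @ W.T @ (U @ Sig) + (U @ Sig).T @ W @ V))    # equals V in forward
    return U_eq, S_eq, V_eq

# quick check that the forward values are unchanged
W = torch.randn(5, 5, dtype=torch.float64, requires_grad=True)
U_eq, S_eq, V_eq = svd_with_grad(W)
U, S, Vh = torch.linalg.svd(W)
print(torch.allclose(U_eq, U), torch.allclose(S_eq, S), torch.allclose(V_eq, Vh.T))
```

In practice one would wrap this in a custom layer or just use the equivalent forms inline; the point is only that the forward pass is unchanged while the gradients follow Eqs $\eqref{eq:d-sigma}$, $\eqref{eq:dU}$ and $\eqref{eq:dV}$.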
In the previous section, we already derived equivalent functions for $\boldsymbol{U}, \boldsymbol{\Sigma}, \boldsymbol{V}$ with the same gradients. Except for the parts wrapped in stop_gradient, these equivalent functions are linear with respect to $\boldsymbol{W}$, so the problem essentially becomes calculating the gradient of linear composite functions. We previously discussed the related methods in the "Matrix Derivatives" section of "The Road to Low-Rank Approximation (I): Pseudo-inverse". Specifically, we have:
\begin{equation}\begin{gathered}\nabla_{\boldsymbol{W}}\,\mathcal{L}(\boldsymbol{A}\boldsymbol{W}\boldsymbol{B}) = \boldsymbol{A}^{\top}(\nabla_{\boldsymbol{X}}\mathcal{L})\boldsymbol{B}^{\top},\qquad \nabla_{\boldsymbol{W}}\,\mathcal{L}(\boldsymbol{A}\boldsymbol{W}^{\top}\boldsymbol{B}) = \boldsymbol{B}(\nabla_{\boldsymbol{X}}\mathcal{L})^{\top}\boldsymbol{A},\\[6pt] \nabla_{\boldsymbol{W}}\,\mathcal{L}(\boldsymbol{C}\otimes\boldsymbol{W}) = \boldsymbol{C}\otimes(\nabla_{\boldsymbol{X}}\mathcal{L})\end{gathered}\end{equation}where $\boldsymbol{A},\boldsymbol{B},\boldsymbol{C}$ are constant (stop_gradient) matrices and $\boldsymbol{X}$ denotes the corresponding argument of $\mathcal{L}$ ($\boldsymbol{A}\boldsymbol{W}\boldsymbol{B}$, $\boldsymbol{A}\boldsymbol{W}^{\top}\boldsymbol{B}$, or $\boldsymbol{C}\otimes\boldsymbol{W}$, respectively). Utilizing these basic formulas and the chain rule for composite functions, we can write:
\begin{equation}\begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} \quad = \qquad &\,\boldsymbol{U}(\boldsymbol{F}\otimes(\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{U}}\mathcal{L}) - (\nabla_{\boldsymbol{U}}\mathcal{L})^{\top}\boldsymbol{U}))\boldsymbol{\Sigma}\boldsymbol{V}^{\top} \\[6pt] + &\,\boldsymbol{U}(\boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}))\boldsymbol{V}^{\top} \\[6pt] + &\,\boldsymbol{U}\boldsymbol{\Sigma}(\boldsymbol{F}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L}) - (\nabla_{\boldsymbol{V}}\mathcal{L})^{\top}\boldsymbol{V}))\boldsymbol{V}^{\top} \end{aligned}\end{equation}The entire process is just a repeated application of the basic formulas and the chain rule, together with the property $\boldsymbol{F}^{\top} = -\boldsymbol{F}$. There is no difficulty in principle, but it does require care; I suggest readers complete the derivation themselves, as it is a very practical exercise in matrix calculus. Finally, introducing the two notations:
\begin{equation} \sym{\boldsymbol{X}} = \frac{1}{2}(\boldsymbol{X} + \boldsymbol{X}^{\top}),\qquad \skew{\boldsymbol{X}} = \frac{1}{2}(\boldsymbol{X} - \boldsymbol{X}^{\top})\end{equation}We can simplify the gradient result as:
\begin{equation}\nabla_{\boldsymbol{W}}\mathcal{L} = \boldsymbol{U}\Big(2(\boldsymbol{F}\otimes\skew{\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{U}}\mathcal{L})})\boldsymbol{\Sigma} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}) + 2\boldsymbol{\Sigma}(\boldsymbol{F}\otimes\skew{\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L})})\Big)\boldsymbol{V}^{\top} \label{eq:w-grad-l}\end{equation}Now let's try a small example: finding the gradient of $\boldsymbol{O}=\mathop{\text{msign}}(\boldsymbol{W})=\boldsymbol{U}\boldsymbol{V}^{\top}$. The concept of $\mathop{\text{msign}}$ was discussed in "Muon Optimizer Appreciation: An Essential Leap from Vectors to Matrices" when introducing Muon.
According to the definition of $\mathop{\text{msign}}$, $\mathcal{L}$ depends on $\boldsymbol{U},\boldsymbol{V}$ only through $\boldsymbol{O}=\boldsymbol{U}\boldsymbol{V}^{\top}$ (and does not involve $\boldsymbol{\Sigma}$, so $\nabla_{\boldsymbol{\Sigma}}\mathcal{L}=\boldsymbol{0}$); hence:
\begin{equation}\nabla_{\boldsymbol{U}}\mathcal{L} = (\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V},\qquad \nabla_{\boldsymbol{V}}\mathcal{L} = (\nabla_{\boldsymbol{O}}\mathcal{L})^{\top} \boldsymbol{U}\end{equation}Substituting into Eq $\eqref{eq:w-grad-l}$ yields:
\begin{equation}\begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} =&\, 2\boldsymbol{U}\Big((\boldsymbol{F}\otimes\skew{\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}})\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(\boldsymbol{F}\otimes\skew{\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})^{\top}\boldsymbol{U}})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, 2\boldsymbol{U}\Big((\boldsymbol{F}\otimes\skew{\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}})\boldsymbol{\Sigma} - \boldsymbol{\Sigma}(\boldsymbol{F}\otimes\skew{\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, 2\boldsymbol{U}\big(\boldsymbol{G}\otimes\skew{\boldsymbol{U}^{\top}(\nabla_{\boldsymbol{O}}\mathcal{L})\boldsymbol{V}}\big)\boldsymbol{V}^{\top} \end{aligned}\end{equation}Where $\boldsymbol{G}_{i,j} = 1/(\sigma_i + \sigma_j)$.
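Since `torch.linalg.svd` is itself differentiable, this closed form is easy to cross-check against automatic differentiation. A sketch (double precision, random square $\boldsymbol{W}$; the loss is $\mathcal{L}=\langle\boldsymbol{C},\boldsymbol{O}\rangle$ so that $\nabla_{\boldsymbol{O}}\mathcal{L}=\boldsymbol{C}$):

```python
import torch

torch.manual_seed(0)
n = 6
W = torch.randn(n, n, dtype=torch.float64, requires_grad=True)
C = torch.randn(n, n, dtype=torch.float64)   # plays the role of ∇_O L

# reference: autograd through the differentiable SVD
U, S, Vh = torch.linalg.svd(W)
O = U @ Vh                                   # msign(W) = U V^T
loss = (C * O).sum()
loss.backward()

# closed form: ∇_W L = 2 U (G ⊗ skew(U^T C V)),  with  G_ij = 1/(σ_i + σ_j)
with torch.no_grad():
    V = Vh.T
    G = 1.0 / (S[:, None] + S[None, :])
    A = U.T @ C @ V
    grad_formula = 2.0 * U @ (G * 0.5 * (A - A.T)) @ V.T

print(torch.allclose(W.grad, grad_formula))   # expected: True
```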
Finally, let's consider a common special case: when $\boldsymbol{W}$ is a positive definite symmetric matrix, its SVD has the form $\boldsymbol{V}\boldsymbol{\Sigma}\boldsymbol{V}^{\top}$, meaning $\boldsymbol{U}=\boldsymbol{V}$. Repeating the previous derivation, we first take the differential of both sides:
\begin{equation}d\boldsymbol{W} = (d\boldsymbol{V})\boldsymbol{\Sigma}\boldsymbol{V}^{\top} + \boldsymbol{V}(d\boldsymbol{\Sigma})\boldsymbol{V}^{\top} + \boldsymbol{V}\boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\end{equation}Then left multiply by $\boldsymbol{V}^{\top}$ and right multiply by $\boldsymbol{V}$:
\begin{equation}\begin{aligned} \boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V} =&\, \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} + \boldsymbol{\Sigma}(d\boldsymbol{V})^{\top}\boldsymbol{V} \\[6pt] =&\, \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} + d\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{V}) \end{aligned}\end{equation}From this, it is evident that:
\begin{gather} d\boldsymbol{\Sigma} = \boldsymbol{I}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}) \\[8pt] \boldsymbol{V}^{\top}(d\boldsymbol{V})\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{V}) = \bar{\boldsymbol{I}}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}) \end{gather}From the second equation, we can further solve for:
\begin{equation}d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}))\end{equation}Where:
\begin{equation}\boldsymbol{K}_{i,j} = \left\{\begin{aligned} &\, 1/(\sigma_i - \sigma_j), &\, i\neq j \\ &\, 0, &\, i = j \end{aligned}\right.\end{equation}Based on this result, we have:
\begin{equation}\nabla_{\boldsymbol{W}}\mathcal{L} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L})) + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L}))\boldsymbol{V}^{\top} \end{equation}Wait, trap! The equation above is actually incorrect. It's the result of the chain rule, but the chain rule only applies to unconstrained differentiation. Here, there is a constraint $\boldsymbol{W}=\boldsymbol{W}^{\top}$, so the correct gradient should also include symmetrization:
\begin{equation}\begin{aligned} \nabla_{\boldsymbol{W}}\mathcal{L} =&\, \boldsymbol{V}\Big(\sym{\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L}))} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L})\Big)\boldsymbol{V}^{\top} \\[6pt] =&\, \boldsymbol{V}\Big(\boldsymbol{K}^{\top}\otimes\skew{\boldsymbol{V}^{\top}(\nabla_{\boldsymbol{V}}\mathcal{L})} + \boldsymbol{I}\otimes(\nabla_{\boldsymbol{\Sigma}}\mathcal{L})\Big)\boldsymbol{V}^{\top}\end{aligned}\end{equation}Another trap is directly substituting $\boldsymbol{U}=\boldsymbol{V}$ into Eq $\eqref{eq:w-grad-l}$ to derive this. That would lead to the $\boldsymbol{K}$ term being doubled. This is because Eq $\eqref{eq:w-grad-l}$ distinguishes between $\nabla_{\boldsymbol{U}}\mathcal{L}$ and $\nabla_{\boldsymbol{V}}\mathcal{L}$, whereas under the positive definite symmetric assumption $\boldsymbol{U}$ and $\boldsymbol{V}$ are identical. Finding the gradient with respect to $\boldsymbol{V}$ actually sums up the original $\nabla_{\boldsymbol{U}}\mathcal{L}$ and $\nabla_{\boldsymbol{V}}\mathcal{L}$ gradients, leading to double counting. For related literature, one can also refer to "Matrix Backpropagation for Deep Networks With Structured Layers".
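The formula $d\boldsymbol{V} = \boldsymbol{V}(\boldsymbol{K}^{\top}\otimes(\boldsymbol{V}^{\top}(d\boldsymbol{W})\boldsymbol{V}))$ for this symmetric case can also be checked by finite differences, just like before. A NumPy sketch (illustrative only; eigenvectors are sign-aligned before differencing, and the eigendecomposition plays the role of the SVD here):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)                 # symmetric positive definite
B = rng.standard_normal((n, n))
dW = 0.5 * (B + B.T)                        # symmetric perturbation direction
eps = 1e-6

def eig_desc(M, V_ref=None):
    # eigendecomposition with eigenvalues in descending order (= singular values here)
    w, V = np.linalg.eigh(M)
    w, V = w[::-1], V[:, ::-1]
    if V_ref is not None:
        V = V * np.sign(np.sum(V * V_ref, axis=0))   # fix the sign ambiguity
    return w, V

sig, V = eig_desc(W)
_, V2 = eig_desc(W + eps * dW, V_ref=V)
dV_numeric = (V2 - V) / eps

# K_ij = 1/(σ_i − σ_j) off the diagonal, 0 on the diagonal
K = np.zeros((n, n))
off = ~np.eye(n, dtype=bool)
K[off] = 1.0 / (sig[:, None] - sig[None, :])[off]

dV_formula = V @ (K.T * (V.T @ dW @ V))
print(np.max(np.abs(dV_numeric - dV_formula)))   # should be small
```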
Some readers might wonder: I can only guarantee initialized matrices have distinct singular values, but how can I ensure the matrix still satisfies this condition after training? The answer is that the gradient will naturally help guarantee it. From Eq $\eqref{eq:dU}$ and $\eqref{eq:dV}$, it's clear that the gradient contains $\boldsymbol{F}$, and since $\boldsymbol{F}_{i,j} = \frac{1}{\sigma_j^2 - \sigma_i^2}$, once two singular values become close, the gradient will become very large, so the optimizer will automatically push them apart.
However, this characteristic also brings numerical instability to actual training, mainly manifesting as gradient explosion when $\sigma_i$ and $\sigma_j$ are close. To address this, the paper "Backpropagation-Friendly Eigendecomposition" proposes using power iteration instead of an exact SVD. Later, the paper "Robust Differentiable SVD" proved that this is theoretically equivalent to a Taylor approximation of the factor $\frac{1}{\sigma_i - \sigma_j}$ in $\boldsymbol{F}_{i,j} = -\frac{1}{(\sigma_i + \sigma_j)(\sigma_i - \sigma_j)}$ (assuming $\sigma_j < \sigma_i$):
\begin{equation}\frac{1}{\sigma_i - \sigma_j} = \frac{1}{\sigma_i}\frac{1}{1-(\sigma_j/\sigma_i)}\approx \frac{1}{\sigma_i}\left(1 + \left(\frac{\sigma_j}{\sigma_i}\right) + \left(\frac{\sigma_j}{\sigma_i}\right)^2 + \cdots + \left(\frac{\sigma_j}{\sigma_i}\right)^N \right)\end{equation}The Taylor approximation avoids gradient explosion in the case $\sigma_j \to \sigma_i$. Later still, the authors of "Why Approximate Matrix Square Root Outperforms Accurate SVD in Global Covariance Pooling?" extended this to more general Padé approximations. I am not too familiar with this line of work, so I won't expand on it further.
I just have one question: if the goal is simply to avoid numerical explosion, is it necessary to use these tools? Could we not just add a truncation to $\sigma_j/\sigma_i$? For example:
\begin{equation}\frac{1}{1-(\sigma_j/\sigma_i)}\approx \frac{1}{1-\min(\sigma_j/\sigma_i, 0.99)}\end{equation}Wouldn't this simply prevent it from going to infinity? Or is there something I'm missing? If readers are familiar with the context, please feel free to correct me in the comments. Thank you.
Until now, all matrices appearing above have been $n \times n$ matrices, because the derivation started with the assumption that "$\boldsymbol{W}$ is a full-rank $n \times n$ matrix with distinct singular values." In this section, let's discuss how far these conditions can be relaxed.
Simply put, the condition for SVD being differentiable is that "all non-zero singular values are mutually distinct." In other words, the matrix doesn't have to be square, and it doesn't have to be full-rank, but non-zero singular values must still be distinct. This is because if there are identical singular values, SVD is not unique, which fundamentally breaks differentiability. Of course, if only partial derivatives are needed, we can relax the conditions further. For example, to find the derivative of the spectral norm $\sigma_1$, we only need $\sigma_1 > \sigma_2$.
So, how does the differential result change after relaxing the conditions? Let's assume generally that $\boldsymbol{W}\in\mathbb{R}^{n\times m}$ with rank $r$, and the SVD is:
\begin{equation}\boldsymbol{W} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^{\top},\quad\boldsymbol{U}\in\mathbb{R}^{n\times n},\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m},\boldsymbol{V}\in\mathbb{R}^{m\times m}\end{equation}The reason we allow zero singular values is that we only care about $d\boldsymbol{U}_{[:, :r]}$ and $d\boldsymbol{V}_{[:, :r]}$; under the assumption that the non-zero singular values are distinct, these can be uniquely determined. The derivation up to Eq $\eqref{eq:dU-0}$ carries over unchanged to this general setting, so we start from there. Eq $\eqref{eq:dU-0}$ also shows why repeated singular values must be avoided: when $\sigma_i = \sigma_j$ with $i\neq j$, the corresponding entry $\sigma_j^2 - \sigma_i^2$ of $\boldsymbol{E}$ is zero and cannot be inverted. Keeping only the parts related to $d\boldsymbol{U}_{[:, :r]}$ in Eq $\eqref{eq:dU-0}$, we get:
\begin{equation}\bar{\boldsymbol{I}}_{[:, :r]}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}) = \boldsymbol{E}_{[:, :r]}\otimes (\boldsymbol{U}^{\top}(d\boldsymbol{U}_{[:, :r]}))\end{equation}Here the definition of $\boldsymbol{E}$ is still $\boldsymbol{E}_{i,j} = \sigma_j^2 - \sigma_i^2$, but the off-diagonal entries of $\boldsymbol{E}_{[:, :r]}$ are all non-zero (and its diagonal entries don't matter, since both sides vanish there), so the elementwise inversion goes through. The result is:
\begin{equation}d\boldsymbol{U}_{[:, :r]} = \boldsymbol{U}(\boldsymbol{F}_{[:, :r]}\otimes(\boldsymbol{U}^{\top}(d\boldsymbol{W})\boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} + \boldsymbol{\Sigma}\boldsymbol{V}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}))\end{equation}Furthermore, using the following block partitions:
\begin{equation}\boldsymbol{U} = \begin{pmatrix}\boldsymbol{U}_{[:,:r]} & \boldsymbol{U}_{[:,r:]}\end{pmatrix},\quad \boldsymbol{F}_{[:, :r]} = \begin{pmatrix} \boldsymbol{F}_{[:r, :r]} \\ \boldsymbol{F}_{[r:, :r]}\end{pmatrix}\end{equation}Combined with $\boldsymbol{V}\boldsymbol{\Sigma}_{[:, :r]} = \boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}$, we obtain:
\begin{equation}\begin{aligned} d\boldsymbol{U}_{[:, :r]} =&\, \boldsymbol{U}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]})) \\[6pt] &\, \qquad + \boldsymbol{U}_{[:, r:]}(\boldsymbol{F}_{[r:, :r]}\otimes(\boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]})) \end{aligned}\label{eq:dU-r}\end{equation}According to the assumption, when $i > r$, $\sigma_i=0$, so every row of $\boldsymbol{F}_{[r:, :r]}$ is $(\sigma_1^{-2},\sigma_2^{-2},\cdots,\sigma_r^{-2})$. Thus, $\boldsymbol{F}_{[r:, :r]}\otimes$ is equivalent to multiplying on the right by $\boldsymbol{\Sigma}_{[:r, :r]}^{-2}$. Therefore:
\begin{equation}\boldsymbol{F}_{[r:, :r]}\otimes(\boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}) = \boldsymbol{U}_{[:, r:]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1}\end{equation}Finally, using $\boldsymbol{U}\boldsymbol{U}^{\top} = \boldsymbol{I}$ and $\boldsymbol{U} = (\boldsymbol{U}_{[:,:r]},\boldsymbol{U}_{[:,r:]})$, we get $\boldsymbol{U}_{[:, r:]}\boldsymbol{U}_{[:, r:]}^{\top} = \boldsymbol{I} - \boldsymbol{U}_{[:, :r]}\boldsymbol{U}_{[:, :r]}^{\top}$. Substituting into Eq $\eqref{eq:dU-r}$ gives:
\begin{equation}\begin{aligned} d\boldsymbol{U}_{[:, :r]} =&\, \boldsymbol{U}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]})) \\[6pt] &\, \qquad \color{orange}{+ (\boldsymbol{I} - \boldsymbol{U}_{[:, :r]}\boldsymbol{U}_{[:, :r]}^{\top})(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1}} \end{aligned}\end{equation}This expresses $d\boldsymbol{U}_{[:, :r]}$ as a function of $\boldsymbol{U}_{[:, :r]}, \boldsymbol{\Sigma}_{[:r, :r]}, \boldsymbol{V}_{[:, :r]}$. Under the assumption that the non-zero singular values are distinct, these three are uniquely determined, so $d\boldsymbol{U}_{[:, :r]}$ is uniquely determined. Compared to Eq $\eqref{eq:dU}$, this version adds the term in orange. Similarly:
\begin{equation}\begin{aligned} d\boldsymbol{V}_{[:, :r]} =&\, \boldsymbol{V}_{[:, :r]}(\boldsymbol{F}_{[:r, :r]}\otimes(\boldsymbol{V}_{[:, :r]}^{\top}(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]} + \boldsymbol{\Sigma}_{[:r, :r]}\boldsymbol{U}_{[:, :r]}^{\top}(d\boldsymbol{W})\boldsymbol{V}_{[:, :r]})) \\[6pt] &\, \qquad \color{orange}{+ (\boldsymbol{I} - \boldsymbol{V}_{[:, :r]}\boldsymbol{V}_{[:, :r]}^{\top})(d\boldsymbol{W})^{\top}\boldsymbol{U}_{[:, :r]}\boldsymbol{\Sigma}_{[:r, :r]}^{-1}} \end{aligned}\end{equation}This article has provided a detailed derivation of the differentiation formulas for SVD.
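As a closing numerical footnote, the generalized formula for $d\boldsymbol{U}_{[:, :r]}$ (including the orange term) can be checked by finite differences on a non-square, rank-deficient matrix, in the same spirit as the earlier checks. A NumPy sketch (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, r = 7, 5, 3
W = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))   # rank r, non-square
dW = rng.standard_normal((n, m))
eps = 1e-6

def top_vectors(M, U_ref=None):
    U, s, Vh = np.linalg.svd(M, full_matrices=False)
    U, V = U[:, :r], Vh.T[:, :r]
    if U_ref is not None:
        signs = np.sign(np.sum(U * U_ref, axis=0))   # fix the sign ambiguity
        U, V = U * signs, V * signs
    return U, s[:r], V

Ur, sr, Vr = top_vectors(W)
Ur2, _, _ = top_vectors(W + eps * dW, U_ref=Ur)
dU_numeric = (Ur2 - Ur) / eps

# formula with the extra (orange) term
Sr = np.diag(sr)
Frr = np.zeros((r, r))
off = ~np.eye(r, dtype=bool)
denom = sr[None, :] ** 2 - sr[:, None] ** 2
Frr[off] = 1.0 / denom[off]

core = Ur.T @ dW @ Vr @ Sr + Sr @ Vr.T @ dW.T @ Ur
dU_formula = Ur @ (Frr * core) + (np.eye(n) - Ur @ Ur.T) @ dW @ Vr @ np.diag(1.0 / sr)

print(np.max(np.abs(dU_numeric - dU_formula)))   # small when the non-zero singular values are well separated
```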