By 苏剑林 | Apr 10, 2025
Rank is an important concept in linear algebra, representing the inherent dimensionality of a matrix. However, the strict mathematical definition of rank is often not fully applicable to numerical computation scenarios. This is because rank equals the number of non-zero singular values, and the mathematical notion of "equal to zero" differs from the numerical one. In mathematics, "equal to zero" means absolutely and strictly zero; even $10^{-100}$ is not zero. Numerical computation is different: in many cases, $10^{-10}$ can already be treated as zero.
Therefore, we hope to generalize the concept of rank to a form that better fits the characteristics of numerical computation. This is the origin of the concept of Effective Rank.
It should be noted that there is currently no unified definition of effective rank in academic circles. In the following, we introduce several approaches to defining effective rank from different perspectives. For practical problems, readers can choose the definition that best suits their needs.
We primarily consider rank from the perspective of singular values. For a matrix $\boldsymbol{M}\in\mathbb{R}^{n\times m}$, let $n\leq m$ without loss of generality. Its standard rank is
\begin{equation}\mathop{\text{rank}}(\boldsymbol{M}) \triangleq \max\{i\,|\,\sigma_i > 0\}\leq n\end{equation}where $\sigma_1\geq \sigma_2\geq \cdots\geq\sigma_n \geq 0$ are the singular values of $\boldsymbol{M}$. Intuitively, the concept of effective rank is intended to treat singular values close to zero as zero. Therefore, a basic property of effective rank is that it does not exceed the standard rank, and in special cases, it can degenerate into the standard rank. A simple definition satisfying this property is
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \max\{i\,|\,\sigma_i > \epsilon\}\end{equation}However, the concepts of "large" and "small" should be relative. 1 is large compared to 0.01, but small compared to 100. Thus, it seems more scientific to standardize by dividing by $\sigma_1$:
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \max\big\{i\,\big|\,\sigma_i/\sigma_1 > \epsilon\big\}\end{equation}In addition to directly truncating singular values close to zero, we can also approach this from the perspective of low-rank approximation. In "The Path to Low-Rank Approximation (II): SVD", we proved that the optimal rank-$r$ approximation of a matrix is the SVD result obtained by retaining only the largest $r$ singular values. Conversely, we can specify a relative error $\epsilon$ and define the effective rank as the minimum rank required to achieve this relative error:
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M},\epsilon) \triangleq \min\left\{ r\,\,\middle|\,\,\sqrt{\left(\sum_{i=1}^r \sigma_i^2\right)\bigg/\left(\sum_{i=1}^n \sigma_i^2\right)} \geq 1-\epsilon\right\}\end{equation}The numerical meaning of this definition is clearer, but because it only controls the overall error, it behaves inelegantly in some cases. For instance, for an $n\times n$ identity matrix, we expect its effective rank to always be $n$, because every singular value is identically 1 and no truncation should occur. Yet under the above definition, once $n$ is large enough ($1/n < \epsilon$), the effective rank falls below $n$.
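For concreteness, here is a minimal NumPy sketch of these three $\epsilon$-dependent definitions (the function names `erank_abs`, `erank_rel`, `erank_err` are my own, not standard), including the identity-matrix caveat just mentioned:

```python
import numpy as np

def erank_abs(M, eps):
    # Count singular values above an absolute threshold eps.
    s = np.linalg.svd(M, compute_uv=False)   # descending order
    return int(np.sum(s > eps))

def erank_rel(M, eps):
    # Count singular values above eps relative to sigma_1 (assumes M != 0).
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s / s[0] > eps))

def erank_err(M, eps):
    # Smallest r whose rank-r truncation retains a (1 - eps) fraction of
    # the total singular-value energy, in the Frobenius sense.
    s = np.linalg.svd(M, compute_uv=False)
    retained = np.sqrt(np.cumsum(s**2) / np.sum(s**2))
    return int(np.searchsorted(retained, 1 - eps)) + 1

# The identity-matrix caveat: every sigma_i equals 1, yet for n large
# enough the error-based definition still truncates.
print(erank_err(np.eye(100), 0.05))  # 91 = ceil(100 * 0.95^2), not 100
```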
Although the definitions of effective rank in the previous section are intuitive, they all rely on a hyperparameter $\epsilon$, which is ultimately not clean enough. Our basic understanding so far is that effective rank should depend only on relative magnitudes, so the problem is equivalent to constructing an effective rank from $1\geq \sigma_2/ \sigma_1\geq\cdots\geq\sigma_n/\sigma_1\geq 0$. Given that these values all lie within $[0,1]$, a clever idea is simply to sum them:
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) \triangleq \sum_{i=1}^n\frac{\sigma_i}{\sigma_1}\end{equation}From "The Path to Low-Rank Approximation (II): SVD", we know that the largest singular value $\sigma_1$ is the matrix's Spectral Norm, denoted as $\Vert\boldsymbol{M}\Vert_2$, and the sum of all singular values is actually also a matrix norm, called the "Nuclear Norm", usually denoted as $\Vert\boldsymbol{M}\Vert_*$. Thus, the above expression can be simplified as
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) \triangleq \frac{\Vert\boldsymbol{M}\Vert_*}{\Vert\boldsymbol{M}\Vert_2}\label{eq:n-2}\end{equation}The earliest source for this might be "An Introduction to Matrix Concentration Inequalities", where it is called the "Intrinsic Dimension," though related properties were already explored in "Guaranteed Minimum-Rank Solutions of Linear Matrix Equations via Nuclear Norm Minimization".
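As a small illustration (my own sketch, not from the papers above), consider a matrix whose singular values decay geometrically: its standard rank is numerically full, yet its intrinsic dimension is tiny:

```python
import numpy as np

rng = np.random.default_rng(0)
# Build a 100x100 matrix with geometrically decaying singular values 0.8^i.
U, _ = np.linalg.qr(rng.standard_normal((100, 100)))
V, _ = np.linalg.qr(rng.standard_normal((100, 100)))
M = U @ np.diag(0.8 ** np.arange(100)) @ V.T

s = np.linalg.svd(M, compute_uv=False)
print(np.linalg.matrix_rank(M))  # 100: every sigma_i clears the default tolerance
print(s.sum() / s[0])            # ~5.0: the nuclear/spectral ratio is tiny
```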
Similarly, we can construct the effective rank through a sum of squares:
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) \triangleq \sum_{i=1}^n\frac{\sigma_i^2}{\sigma_1^2} = \frac{\Vert\boldsymbol{M}\Vert_F^2}{\Vert\boldsymbol{M}\Vert_2^2}\label{eq:f-2}\end{equation}Here $\Vert\cdot\Vert_F$ is the $F$-norm (Frobenius norm). This definition likely stems from "Sampling from large matrices: an approach through geometric functional analysis", where it was called "Numerical Rank"; today it is more commonly known as "Stable Rank" and is one of the popular concepts of effective rank.
From the perspective of computational complexity, Eq. $\eqref{eq:f-2}$ is less costly than Eq. $\eqref{eq:n-2}$. Computing the nuclear norm requires all singular values, which implies a full SVD; in contrast, $\Vert\boldsymbol{M}\Vert_F^2$ is simply the sum of the squares of all matrix elements. The main computational cost of Eq. $\eqref{eq:f-2}$ is therefore the largest singular value, which is significantly cheaper to obtain than the full spectrum. If desired, one could also generalize Eq. $\eqref{eq:n-2}$ and Eq. $\eqref{eq:f-2}$ to a sum of $k$-th powers, in which case the result is a ratio of a more general Schatten norm to the spectral norm.
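To make the cost argument concrete, here is a rough sketch that computes the stable rank without any SVD, estimating $\Vert\boldsymbol{M}\Vert_2$ by plain power iteration (the fixed iteration count is a simplification; a practical implementation would monitor convergence or call an off-the-shelf routine):

```python
import numpy as np

def stable_rank(M, iters=100):
    # ||M||_F^2 is just the sum of squared entries; no SVD required.
    fro2 = np.sum(M * M)
    # Estimate sigma_1^2 = lambda_max(M^T M) by power iteration.
    v = np.random.randn(M.shape[1])
    for _ in range(iters):
        v = M.T @ (M @ v)
        v /= np.linalg.norm(v)
    sigma1_sq = v @ (M.T @ (M @ v))  # Rayleigh quotient at the converged v
    return fro2 / sigma1_sq
```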
If readers search directly for "Effective Rank," they are highly likely to find the paper "The Effective Rank: a Measure of Effective Dimensionality". This is one of the earlier works discussing effective rank, and it proposes a definition based on entropy.
First, since singular values are always non-negative, we can normalize them to obtain a probability distribution:
\begin{equation}p_i = \frac{\sigma_i^{\gamma}}{\sum_{j=1}^n \sigma_j^{\gamma}}\end{equation}where $\gamma > 0$. According to the literature, both $\gamma=1$ and $\gamma=2$ are frequently used (we used $\gamma=2$ in Moonlight). For the following, we use $\gamma=1$ as an example. Given a probability distribution, we can calculate the information entropy (Shannon entropy):
\begin{equation}H = -\sum_{i=1}^n p_i \log p_i\end{equation}Recall that the range of entropy is $[0, \log n]$. Taking the exponential gives $e^H \in [1, n]$. When the distribution is One-Hot, $e^H=1$ (only one non-zero singular value); when the distribution is uniform, $e^H=n$ (all singular values are equal). These correspond exactly to the two special cases of standard rank, which inspires us to define the effective rank as
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) \triangleq e^H = \exp\left(-\sum_{i=1}^n p_i \log p_i\right)\label{eq:h-erank}\end{equation}Substituting the definition of $p_i$, we find that it can be further transformed into
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) = \exp\left(\log\sum_{i=1}^n \sigma_i -\frac{\sum_{i=1}^n \sigma_i\log\sigma_i}{\sum_{i=1}^n \sigma_i}\right)\end{equation}Inside the parentheses, exponentiating the first term gives $\Vert\boldsymbol{M}\Vert_*$; the second term is a weighted average of $\log\sigma_i$ with weights proportional to $\sigma_i$. Since larger singular values carry larger weights, this average is dominated by the leading terms and is approximately $\log\sigma_1$; exponentiating it yields roughly $\sigma_1=\Vert\boldsymbol{M}\Vert_2$. The overall result is therefore approximately Eq. $\eqref{eq:n-2}$. This shows that defining effective rank via entropy, while seemingly a completely different path, is essentially similar to the norm ratios discussed in the previous section.
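A quick numerical sanity check of this approximation (the name `erank_entropy` is my own):

```python
import numpy as np

def erank_entropy(M, gamma=1.0):
    # Normalize sigma_i^gamma into a distribution, then take exp(entropy).
    s = np.linalg.svd(M, compute_uv=False)
    p = s**gamma / np.sum(s**gamma)
    p = p[p > 0]  # drop exact zeros; 0*log(0) counts as 0 by convention
    return np.exp(-np.sum(p * np.log(p)))

M = np.random.randn(50, 80)
s = np.linalg.svd(M, compute_uv=False)
print(erank_entropy(M))  # entropy-based effective rank (gamma = 1)
print(s.sum() / s[0])    # nuclear/spectral ratio: a comparable value
```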
We know that standard rank satisfies the triangle inequality $\mathop{\text{rank}}(\boldsymbol{A}+\boldsymbol{B})\leq \mathop{\text{rank}}(\boldsymbol{A}) + \mathop{\text{rank}}(\boldsymbol{B})$. The original paper proved that for symmetric positive semi-definite matrices $\boldsymbol{A}$ and $\boldsymbol{B}$, the effective rank defined by Eq. $\eqref{eq:h-erank}$ satisfies $\mathop{\text{erank}}(\boldsymbol{A}+\boldsymbol{B})\leq \mathop{\text{erank}}(\boldsymbol{A}) + \mathop{\text{erank}}(\boldsymbol{B})$; whether the inequality extends to arbitrary matrices remains unclear. In general, proving whether an effective rank preserves a given inequality of standard rank is not an easy task.
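While not a proof, the positive semi-definite case can be probed numerically; in random trials such as the sketch below, no violation appears:

```python
import numpy as np

def erank_entropy_psd(A):
    # For symmetric PSD matrices, singular values coincide with eigenvalues.
    lam = np.linalg.eigvalsh(A)
    lam = lam[lam > 1e-12]
    p = lam / lam.sum()
    return np.exp(-np.sum(p * np.log(p)))

rng = np.random.default_rng(1)
for _ in range(1000):
    X, Y = rng.standard_normal((2, 8, 8))
    A, B = X @ X.T, Y @ Y.T  # random symmetric PSD matrices
    assert erank_entropy_psd(A + B) <= erank_entropy_psd(A) + erank_entropy_psd(B)
print("no violation found")
```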
From the various definitions of effective rank discussed above, especially the transition from singular values to distributions and then to entropy, some readers may have vaguely sensed that effective rank shares clear commonalities with sparsity. In fact, effective rank can be understood as a sparsity measure for the vector of singular values. Unlike standard sparsity metrics, its range is normalized to $1\leq \mathop{\text{erank}}(\boldsymbol{M}) \leq \mathop{\text{rank}}(\boldsymbol{M}) \leq n$ so that it aligns with the concept of rank, making the degree of sparsity easier to perceive intuitively.
Regarding sparsity measures, we previously gave a systematic discussion in "How to Measure the Sparsity of Data?". In principle, the results there can all be used to construct definitions of effective rank, and indeed we have already done so: the construction in the "Norm Ratios" section, based on the ratio of a Schatten norm to the spectral norm, corresponds to formula (1) of that article. We could equally use other formulas, such as (16), which amounts to defining effective rank as the squared ratio of the nuclear norm to the Frobenius norm:
\begin{equation}\mathop{\text{erank}}(\boldsymbol{M}) \triangleq \frac{\Vert\boldsymbol{M}\Vert_*^2}{\Vert\boldsymbol{M}\Vert_F^2} = \frac{(\sum_{i=1}^n\sigma_i)^2}{\sum_{i=1}^n\sigma_i^2}\end{equation}This similarly satisfies $1\leq \mathop{\text{erank}}(\boldsymbol{M}) \leq \mathop{\text{rank}}(\boldsymbol{M})$, making it a valid definition for effective rank.
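This variant is equally easy to compute from the singular values; a minimal sketch (with my own function name) that also checks the bounds:

```python
import numpy as np

def erank_nuc_fro(M):
    # Squared ratio of nuclear norm to Frobenius norm.
    s = np.linalg.svd(M, compute_uv=False)
    return s.sum() ** 2 / np.sum(s ** 2)

M = np.random.randn(30, 50)
print(1 <= erank_nuc_fro(M) <= np.linalg.matrix_rank(M))  # True, by Cauchy-Schwarz
```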
It is quite remarkable how our understanding of effective rank and sparsity has invisibly formed a closed loop. It must be said that this is a wonderful experience: when I first began studying sparsity measures, I knew nothing about effective rank. While studying effective rank over these past few days, I gradually realized it is essentially connected to sparsity. It seems as if there is a mysterious force quietly linking the knowledge we accumulate across different fields and disciplines, eventually converging in the same correct direction.
This article has explored the concept of the effective rank of a matrix. It extends the rank concept of linear algebra to numerical computation scenarios and offers a more practical way to measure the inherent dimensionality of a matrix.