How to Measure Data Sparsity?

By 苏剑林 | May 05, 2023

In machine learning, we often talk about sparsity; for example, we frequently say that attention matrices are usually very sparse. However, have you noticed that we never seem to provide a standard method for measuring the degree of sparsity? In other words, our previous discussions of sparsity have stayed at an intuitive level, without quantitative analysis. So the question arises: is there a standard method for measuring sparsity? After some searching, I found that several indicators are indeed available, such as $l_1/l_2$, entropy, etc. However, because each of them emphasizes a different aspect, there is no single "standard answer" for measuring sparsity. This article briefly records my findings.

Basic Results

In a narrow sense, "sparsity" refers to the presence of a large number of zeros in the data. Therefore, the simplest sparsity indicator is the proportion of zeros. But if we only use this definition, attention matrices could hardly be considered sparse because the results from a softmax are always positive. Thus, it is necessary to generalize the concept of sparsity. A naive idea is to count the proportion of elements whose absolute value does not exceed $\epsilon$. But how do we determine this $\epsilon$?

The number 1 is very large compared to 0.0001 but very small compared to 10,000; "large" and "small" are not absolute concepts. Intuitively, if a vector is sparse, i.e., it contains many values close to zero, then the average of its absolute values will be relatively small. Since large and small are only meaningful in relative terms, we might as well divide this average by the maximum absolute value to obtain a relative result. A seemingly reasonable indicator is therefore:

\begin{equation}S_0(\boldsymbol{x})=\frac{(|x_1|+|x_2|+\cdots+|x_n|)/n}{\max(|x_1|,|x_2|,\cdots,|x_n|)}\label{eq:start}\end{equation}

where $\boldsymbol{x}=[x_1,x_2,\cdots,x_n]\in\mathbb{R}^n$ is the vector whose sparsity needs to be evaluated (hereafter we assume $\boldsymbol{x}$ is not a constant vector, i.e., it has at least two different elements). Vectors with smaller indicators are considered sparser. However, while this indicator has some rationality, it is not "smooth" enough—primarily because the $\max$ operation is extremely sensitive to outliers and does not reflect statistical characteristics well.
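
To make this concrete, here is a minimal NumPy sketch of indicator $\eqref{eq:start}$ (the function name `s0` is just for illustration):

```python
import numpy as np

def s0(x):
    """Indicator S_0: mean absolute value divided by maximum absolute value.
    Smaller values indicate a sparser vector (x is assumed not to be all zeros)."""
    x = np.abs(np.asarray(x, dtype=float))
    return x.mean() / x.max()

print(s0([1.0, 0.001, 0.0, 0.002]))  # ~0.25: close to one-hot, hence sparse
print(s0([1.0, 1.0, 1.0, 1.0]))      # 1.0: perfectly uniform, hence dense
```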

Smooth Homogeneity

Following the line of thought in "Seeking a Smooth Maximum Function", we replace $\max$ with its smooth approximation. The standard smooth approximation of $\max$ is $\text{logsumexp}$:

\begin{equation}\max(|x_1|,|x_2|,\cdots,|x_n|)\approx \frac{1}{k}\log\sum_{i=1}^n e^{k|x_i|}\end{equation}

However, using $\text{logsumexp}$ here does not bring much improvement. First, due to the amplification effect of $e^{k|x_i|}$, $\text{logsumexp}$ is just as susceptible to outliers. Second, $\text{logsumexp}$ lacks positive homogeneity, which is undesirable here: if all $x_i$ are multiplied by a positive constant $\alpha$, the sparsity should not change. In "Seeking a Smooth Maximum Function", we also provided a smooth approximation of $\max$ that does possess positive homogeneity, which happens to be the $l_p$ norm ($p > 1$):

\begin{equation}\max(|x_1|,|x_2|,\cdots,|x_n|)\approx \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}\triangleq l_p(\boldsymbol{x})\end{equation}

Observing Equation $\eqref{eq:start}$ again, we can see that its numerator is exactly $l_1(\boldsymbol{x}) / n$. Combining the above results, we obtain a sparsity measurement indicator as:

\begin{equation}S_{1,p}(\boldsymbol{x})=\frac{l_1(\boldsymbol{x})/n}{l_p(\boldsymbol{x})}\label{eq:s1p}\end{equation}

If we are comparing vectors of a fixed dimension, the $/n$ can be omitted. A common choice is $p=2$, which yields the $l_1/l_2$ indicator for sparsity measurement. If we let $p\to\infty$, it essentially returns to indicator $\eqref{eq:start}$.
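
As a minimal NumPy sketch (with `s1p` as an illustrative name), indicator $\eqref{eq:s1p}$ can be computed as follows; note that scaling the input leaves the value unchanged:

```python
import numpy as np

def s1p(x, p=2):
    """Indicator S_{1,p} = (l1(x)/n) / lp(x); smaller = sparser.
    p=2 gives the common l1/l2 indicator (up to the 1/n factor)."""
    x = np.abs(np.asarray(x, dtype=float))
    l1 = x.sum()
    lp = (x ** p).sum() ** (1.0 / p)
    return (l1 / len(x)) / lp

print(s1p([1.0, 0.0, 0.0, 0.0]))   # 0.25: one-hot, maximally sparse for n=4
print(s1p([1.0, 1.0, 1.0, 1.0]))   # 0.5 = 1/sqrt(4): uniform, dense
print(s1p([10.0, 0.0, 0.0, 0.0]))  # 0.25 again: positive homogeneity
```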

Ideal Properties

When introducing the concept of "entropy" in "Entropy: From Entropy and the Maximum Entropy Principle to Maximum Entropy Models (I)", we found that the mathematical form of entropy could be determined through several ideal properties it should possess. Can we do the same for sparsity?

The paper "Comparing Measures of Sparsity" made an attempt in this direction. It proposed that a sparsity measure $S(\boldsymbol{x})$ should satisfy the following ideal properties (without loss of generality, assume $\boldsymbol{x}$ is a non-negative vector; if not, take the absolute value of each term):

D1: $S([\cdots,x_i - \alpha,\cdots,x_j + \alpha,\cdots]) > S(\boldsymbol{x})$, where $x_i > x_j$ and $0 < \alpha < \frac{x_i - x_j}{2}$. This property states that if the sum remains constant, a more uniform vector is less sparse.

D2: $S(\alpha\boldsymbol{x}) = S(\boldsymbol{x})$, where $\alpha > 0$. This is easy to understand: sparsity is a relative property. Multiplying all elements by the same factor does not change relative sizes, hence it should not change sparsity.

D3: $S(\alpha + \boldsymbol{x}) > S(\boldsymbol{x})$, where $\alpha > 0$. This is also intuitive: if a positive constant is added to all elements, they all move further away from zero, and sparsity naturally decreases.

D4: $S(\boldsymbol{x}) = S(\boldsymbol{x}\circ\boldsymbol{x}) = S(\boldsymbol{x}\circ\boldsymbol{x}\circ\boldsymbol{x}) = \cdots$, where $\circ$ denotes vector concatenation. This means that simple replication of data does not change sparsity.

P1: Given any $i\in\{1,2,\cdots,n\}$, there exists $\beta_i > 0$ such that for any $\alpha > 0$, $S([\cdots,x_{i-1},x_i + \beta_i + \alpha,x_{i+1},\cdots]) < S([\cdots,x_{i-1},x_i + \beta_i,x_{i+1},\cdots])$. This property suggests that when a single element is large enough, it dominates the sparsity of the entire vector.

P2: $S(\boldsymbol{x}\circ[0]) < S(\boldsymbol{x})$. This is straightforward: adding a zero to a vector should increase its sparsity.

The original paper evaluated various common indicators and found that only an indicator called the "Gini Index" satisfies all six properties simultaneously (the Gini Index is rather involved and will not be introduced here). However, I should remind readers to examine the derivations carefully when reading the original paper, as I found its proof that $l_1/l_2$ does not satisfy D3 to be incorrect: in fact, $l_1/l_2$ does satisfy D3. I did not check the other derivations in detail; readers should verify them during their own reading.
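
As a quick numerical sanity check of that claim (using randomly generated non-negative vectors, an assumption made purely for illustration), the following sketch verifies that adding a positive constant always increases the $l_1/l_2$ value, i.e., D3 does hold for $l_1/l_2$:

```python
import numpy as np

def l1_over_l2(x):
    return np.abs(x).sum() / np.sqrt((x ** 2).sum())

# D3: adding a positive constant to every element should make the vector
# less sparse, i.e., increase l1/l2 (smaller values mean sparser here).
rng = np.random.default_rng(0)
violations = 0
for _ in range(10000):
    x = rng.random(8)          # non-negative, non-constant with probability 1
    alpha = 10 * rng.random()
    if l1_over_l2(x + alpha) <= l1_over_l2(x):
        violations += 1
print("D3 violations for l1/l2:", violations)  # expected: 0
```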

Reference Proofs

Regarding the two indicators in this article: indicator $\eqref{eq:start}$ satisfies all the properties except D1 (if the $>$ in D1 is relaxed to $\geq$, it is satisfied as well). The verification is fairly routine and will not be detailed here; readers are encouraged to work through it themselves.

As for indicator $\eqref{eq:s1p}$, it can be proven to satisfy all properties except D4. The proofs for D3 and P1 are slightly more complex, so I provide reference proofs for them here.

To prove D3, we only need to show that

\begin{equation}\frac{\left(\sum\limits_{i=1}^n (x_i + \alpha)\right)^p}{\sum\limits_{i=1}^n (x_i + \alpha)^p}\end{equation}

is monotonically increasing with respect to $\alpha > 0$. Taking the logarithm of this expression, we get

\begin{equation}p\log\sum\limits_{i=1}^n (x_i + \alpha) - \log\sum\limits_{i=1}^n (x_i + \alpha)^p\triangleq f(\alpha)\end{equation}

We only need to prove $f'(\alpha) > 0$. Differentiating directly gives

\begin{equation}f'(\alpha) = \frac{pn}{\sum\limits_{i=1}^n (x_i + \alpha)} - \frac{p\sum\limits_{i=1}^n (x_i + \alpha)^{p-1}}{\sum\limits_{i=1}^n (x_i + \alpha)^p}\end{equation}

$f'(\alpha) > 0$ is equivalent to

\begin{equation}\frac{1}{n}\sum\limits_{i=1}^n (x_i + \alpha)^p > \left(\frac{1}{n}\sum\limits_{i=1}^n (x_i + \alpha)\right)\left(\frac{1}{n}\sum\limits_{i=1}^n (x_i + \alpha)^{p-1}\right)\end{equation}

This follows from the Power Mean Inequality: writing $y_i = x_i + \alpha$, we have $\frac{1}{n}\sum\limits_{i=1}^n y_i \leq \left(\frac{1}{n}\sum\limits_{i=1}^n y_i^p\right)^{1/p}$ and $\frac{1}{n}\sum\limits_{i=1}^n y_i^{p-1} \leq \left(\frac{1}{n}\sum\limits_{i=1}^n y_i^p\right)^{(p-1)/p}$, and multiplying the two bounds yields the desired inequality (strictly, since $\boldsymbol{x}$ is not a constant vector). The proof for P1 is similar; we only need to prove that when $x_i$ is sufficiently large,

\begin{equation}\frac{\left(x_i + \alpha + \sum\limits_{j\neq i} x_j\right)^p}{(x_i + \alpha)^p + \sum\limits_{j\neq i} x_j^p}\end{equation}

is monotonically decreasing with respect to $\alpha > 0$. Taking the logarithm of this expression:

\begin{equation}p\log\left(x_i + \alpha + \sum\limits_{j\neq i} x_j\right) - \log\left((x_i + \alpha)^p + \sum\limits_{j\neq i} x_j^p\right)\triangleq g(\alpha)\end{equation}

We only need to show $g'(\alpha) < 0$. Differentiating directly:

\begin{equation}g'(\alpha) = \frac{p}{x_i + \alpha + \sum\limits_{j\neq i} x_j} - \frac{p (x_i + \alpha)^{p-1}}{(x_i + \alpha)^p + \sum\limits_{j\neq i} x_j^p}\end{equation}

$g'(\alpha) < 0$ is equivalent to

\begin{equation}p \sum\limits_{j\neq i} x_j^p < p(x_i+\alpha)^{p-1}\sum\limits_{j\neq i} x_j\end{equation}

As long as at least one $x_j$ (for $j \neq i$) is non-zero, then for a sufficiently large $x_i$, the above inequality holds for all $\alpha > 0$.
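
For readers who prefer a numerical sanity check of the two monotonicity claims above, here is a small sketch with arbitrarily chosen example values (the specific numbers are illustrative only):

```python
import numpy as np

p = 3.0
x = np.array([0.2, 1.5, 0.7, 3.0])    # arbitrary non-negative, non-constant vector
alphas = np.linspace(0.0, 10.0, 200)
y = x[None, :] + alphas[:, None]      # each row is x shifted by one alpha

# D3 proof: f(alpha) = p*log(sum(x+alpha)) - log(sum((x+alpha)^p)) should increase.
f = p * np.log(y.sum(axis=1)) - np.log((y ** p).sum(axis=1))
print("f increasing:", bool(np.all(np.diff(f) > 0)))

# P1 proof: treat x as the other components x_j and xi as a sufficiently large
# dominant element; g(alpha) should then decrease.
xi = 100.0
g = p * np.log(xi + alphas + x.sum()) - np.log((xi + alphas) ** p + (x ** p).sum())
print("g decreasing:", bool(np.all(np.diff(g) < 0)))
```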

A Perfect Metric

Although indicator $\eqref{eq:s1p}$ does not satisfy D4, if we modify it slightly to

\begin{equation}S_{1,p}^*(\boldsymbol{x})=\frac{l_1(\boldsymbol{x})/n}{l_p(\boldsymbol{x})/n^{1/p}}=n^{(1-p)/p}\frac{l_1(\boldsymbol{x})}{l_p(\boldsymbol{x})}\label{eq:s1p-plus}\end{equation}

it satisfies D4. It can also be verified that it satisfies P2. Since the other properties do not involve changes in the dimension $n$, they are satisfied just as they were for indicator $\eqref{eq:s1p}$. In other words, $S_{1,p}^*(\boldsymbol{x})$ is a "perfect metric" that satisfies all six properties simultaneously!

From the Power Mean Inequality, we know $l_p(\boldsymbol{x})/n^{1/p} \geq l_1(\boldsymbol{x})/n$, so $S_{1,p}^*(\boldsymbol{x})\leq 1$. Furthermore, because

\begin{equation}\frac{l_p(\boldsymbol{x})}{l_1(\boldsymbol{x})}=\left(\left(\frac{x_1}{l_1(\boldsymbol{x})}\right)^p+\cdots+\left(\frac{x_n}{l_1(\boldsymbol{x})}\right)^p\right)^{1/p}\leq\left(\frac{x_1}{l_1(\boldsymbol{x})}+\cdots+\frac{x_n}{l_1(\boldsymbol{x})}\right)^{1/p}=1\end{equation}

we have $S_{1,p}^*(\boldsymbol{x})\geq n^{(1-p)/p}$. Thus, $S_{1,p}^*(\boldsymbol{x})\in[n^{(1-p)/p},1]$. Furthermore, $S_{1,p}^*(\boldsymbol{x})$ can be more generally extended to:

\begin{equation}S_{q,p}^*(\boldsymbol{x})=\frac{l_q(\boldsymbol{x})/n^{1/q}}{l_p(\boldsymbol{x})/n^{1/p}}=n^{1/p-1/q}\frac{l_q(\boldsymbol{x})}{l_p(\boldsymbol{x})}\in\left[n^{1/p-1/q},1\right]\end{equation}

As long as $p > q > 0$, this is also a "perfect metric" that satisfies the six properties.
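
A minimal NumPy sketch of $S_{q,p}^*$ (the name `s_star` is illustrative), together with a quick check of scale invariance (D2) and replication invariance (D4):

```python
import numpy as np

def s_star(x, q=1.0, p=2.0):
    """'Perfect' sparsity metric S*_{q,p} = (l_q(x)/n^{1/q}) / (l_p(x)/n^{1/p}),
    for p > q > 0. Values lie in [n^{1/p-1/q}, 1]; smaller = sparser."""
    x = np.abs(np.asarray(x, dtype=float))
    n = len(x)
    lq = (x ** q).sum() ** (1.0 / q)
    lp = (x ** p).sum() ** (1.0 / p)
    return (lq / n ** (1.0 / q)) / (lp / n ** (1.0 / p))

x = np.array([3.0, 0.01, 0.0, 0.02, 5.0])
print(s_star(x))                          # baseline value
print(s_star(10.0 * x))                   # D2: unchanged under positive scaling
print(s_star(np.concatenate([x, x, x])))  # D4: unchanged under replication
```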

When $p=2, q=1$, we have

\begin{equation}S_{1,2}^*(\boldsymbol{x})=\frac{1}{\sqrt{n}}\frac{l_1(\boldsymbol{x})}{l_2(\boldsymbol{x})}\in\left[\frac{1}{\sqrt{n}},1\right]\end{equation}

This result helps answer questions regarding sparsification. For instance, why does L1 regularization favor sparsity? Because L1 appears in the numerator of the formula above; the smaller it is, the sparser the result. Why is L2 regularization unfavorable for sparsity? Because L2 appears in the denominator; the smaller it is, the less sparse the result. To more accurately achieve sparsity, one should use $S_{1,2}^*(\boldsymbol{x})$ as a regularization term, which both minimizes L1 and maximizes L2, directly optimizing the sparsity metric.
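
As an illustration of this last point, here is a minimal PyTorch sketch that uses $S_{1,2}^*$ directly as a regularization term on a toy least-squares problem; the data, learning rate, and weight `lam` are all assumptions made up for the example, not a prescription from the article:

```python
import torch

def s12_star(w, eps=1e-12):
    # S*_{1,2}(w) = l1(w) / (sqrt(n) * l2(w)); lies in [1/sqrt(n), 1], smaller = sparser.
    l2 = w.pow(2).sum().sqrt()
    return w.abs().sum() / (w.numel() ** 0.5 * l2 + eps)

# Toy setup: fit y ~ X w while directly penalizing the sparsity metric of w.
torch.manual_seed(0)
X, y = torch.randn(200, 50), torch.randn(200)
w = (0.1 * torch.randn(50)).requires_grad_()
opt = torch.optim.SGD([w], lr=0.05)
lam = 0.1  # regularization weight, chosen arbitrarily

for _ in range(500):
    opt.zero_grad()
    loss = ((X @ w - y) ** 2).mean() + lam * s12_star(w)
    loss.backward()
    opt.step()

print("final S*_{1,2}(w):", s12_star(w).item())
```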

In particular, when none of the $x_i$ are zero, we also have the relationship

\begin{equation}S_{1,2}^*(\boldsymbol{x})=\cos(\text{sign}(\boldsymbol{x}),\boldsymbol{x})\end{equation}

where $\text{sign}$ applies the sign function to each component of the vector. This formula is very illustrative: $\text{sign}(\boldsymbol{x})$ can be seen as the densest derived vector of $\boldsymbol{x}$. The degree of sparsity is simply the similarity between $\boldsymbol{x}$ and its densest derivative; a lower similarity naturally represents higher sparsity.
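
A quick NumPy check of this identity on an arbitrary random vector (which has no exact zeros with probability 1):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)

s = np.sign(x)                            # the "densest derived vector" of x
cos_sim = (s @ x) / (np.linalg.norm(s) * np.linalg.norm(x))
s12 = np.abs(x).sum() / (np.sqrt(len(x)) * np.linalg.norm(x))
print(np.isclose(cos_sim, s12))           # True: the two expressions agree
```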

If we treat the elements of $\boldsymbol{x}$ as samples of a random variable $x$, we can also write

\begin{equation}S_{1,2}^*(\boldsymbol{x})=\frac{\mathbb{E}[|x|]}{\sqrt{\mathbb{E}[x^2]}}=\frac{\mathbb{E}[|x|]}{\sqrt{\mathbb{E}[|x|]^2 + \mathbb{V}ar[|x|]}} = \frac{1}{\sqrt{1 + \mathbb{V}ar[|x|] / \mathbb{E}[|x|]^2}}\end{equation}

The ratio $\mathbb{E}[|x|]^2/\mathbb{V}ar[|x|]$, whose reciprocal appears inside the square root above, is precisely the "signal-to-noise ratio" of $|x|$ (the ratio of the squared mean to the variance). This tells us that the lower the signal-to-noise ratio of $|x|$, the sparser $\boldsymbol{x}$ is.
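
The identity between $S_{1,2}^*$ and the signal-to-noise ratio of $|x|$ can also be checked numerically (the random samples below are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a = np.abs(rng.standard_normal(10000))    # samples of |x|

s12 = a.mean() / np.sqrt((a ** 2).mean())                # S*_{1,2} in expectation form
snr = a.mean() ** 2 / a.var()                            # squared mean / variance
print(np.isclose(s12, 1.0 / np.sqrt(1.0 + 1.0 / snr)))   # True
```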

Connection to Entropy

Now returning to the attention matrix, its defining characteristic is that each row corresponds to a probability distribution, which automatically satisfies $x_i \geq 0$ and $x_1+\cdots+x_n=1$. The most deterministic probability distribution is a one-hot distribution, which is also the sparsest; the most uncertain distribution is the uniform distribution, which is clearly the least sparse. From these two extremes, one can guess that the sparsity of a probability distribution is related to its uncertainty.

We know that the uncertainty of a probability distribution is generally measured by (Shannon) entropy:

\begin{equation}H(\boldsymbol{x}) = -\sum_{i=1}^n x_i \log x_i\in[0,\log n] \end{equation}

Meanwhile, for such a distribution the indicator $\eqref{eq:s1p-plus}$ reduces to $\frac{n^{(1-p)/p}}{l_p(\boldsymbol{x})}$, a quantity built purely from the $l_p$ norm. Since sparsity and uncertainty appear to be related, does this mean that entropy and the $l_p$ norm are also correlated to some extent?

Indeed they are. In fact, based on the $l_p$ norm, we can construct the Rényi Entropy:

\begin{equation}H_p(\boldsymbol{x}) = -\frac{1}{p-1}\log \sum_{i=1}^n x_i^p \end{equation}

It can be proven that $H_1(\boldsymbol{x}) = \lim\limits_{p\to 1} H_p(\boldsymbol{x}) = H(\boldsymbol{x})$, which corresponds to the classic Shannon entropy when $p\to 1$. When $p \neq 1$, it is general Rényi entropy (in some contexts, Rényi entropy specifically refers to the case where $p=2$). Each type of Rényi entropy can serve as a measure of uncertainty; their ranges are all $[0,\log n]$, reaching the minimum at one-hot distributions and the maximum at uniform distributions. In this sense, all Rényi entropies are equivalent to some extent, which explains the connection between entropy and the $l_p$ norm.

It is worth mentioning that Rényi entropy for $p \neq 1$ is often more computationally friendly. This is because $\sum\limits_{i=1}^n x_i^p$ is necessarily a positive, bounded result, avoiding the $\log 0$ issue during calculation, and only a single $\log$ operation is needed at the very end. In contrast, standard Shannon entropy requires calculating $\log$ for every $x_i$, which increases computational cost and requires special $\text{clip}$ operations to prevent the appearance of $\log 0$.
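
To illustrate the computational point, here is a small NumPy comparison of Shannon entropy (which needs a clip to dodge $\log 0$ and a $\log$ for every component) with Rényi entropy at $p=2$ (a single $\log$ at the end); the one-hot and uniform examples are chosen purely for illustration:

```python
import numpy as np

def shannon_entropy(x, eps=1e-12):
    # Needs clipping to avoid log(0), plus a log for every component.
    x = np.asarray(x, dtype=float)
    return -np.sum(x * np.log(np.clip(x, eps, None)))

def renyi_entropy(x, p=2.0):
    # H_p(x) = -log(sum(x^p)) / (p - 1); sum(x^p) is positive, so no log(0) issue.
    x = np.asarray(x, dtype=float)
    return -np.log(np.sum(x ** p)) / (p - 1.0)

onehot = np.array([1.0, 0.0, 0.0, 0.0])
uniform = np.full(4, 0.25)
print(shannon_entropy(onehot), renyi_entropy(onehot))    # both ~0 (minimum)
print(shannon_entropy(uniform), renyi_entropy(uniform))  # both log(4) ≈ 1.386 (maximum)
```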

Summary

This article systematically organized methods for measuring sparsity and discussed their relationships with concepts such as the L1 norm, the L2 norm, and entropy.