The Amazing Johnson-Lindenstrauss Lemma: Applications

By 苏剑林 | September 24, 2021

In the previous article "The Amazing Johnson-Lindenstrauss Lemma: Theoretical Edition", we introduced the theoretical derivation of the Johnson-Lindenstrauss Lemma (JL Lemma) in detail. In this piece, we will focus on its applications.

As a result that is inherently about dimensionality reduction, the most basic application of the JL Lemma is, naturally, as a dimensionality reduction method. Beyond this direct use, however, many seemingly unrelated algorithms, such as Locality-Sensitive Hashing (LSH) and Randomized SVD, essentially rely on the JL Lemma. Furthermore, for machine learning models, the JL Lemma often provides a theoretical explanation for our choice of dimensions.

Tools for Dimensionality Reduction

The JL Lemma provides a very simple and direct "Random Projection" approach to dimensionality reduction:

Given $N$ vectors $v_1, v_2, \cdots, v_N \in \mathbb{R}^m$, if you want to reduce them to $n$ dimensions, you only need to sample an $n \times m$ matrix $A$ from $\mathcal{N}(0, 1/n)$. Then $Av_1, Av_2, \cdots, Av_N$ are the results after dimensionality reduction.
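
For concreteness, here is a minimal numpy sketch of this recipe; the sizes and the spot check at the end are purely illustrative, not taken from the original post.

```python
# A minimal sketch of JL-style random projection, assuming numpy;
# the sizes N, m, n below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 1000, 2048, 128                            # N vectors in R^m, reduced to R^n

V = rng.normal(size=(N, m))                          # rows are v_1, ..., v_N
A = rng.normal(scale=np.sqrt(1 / n), size=(n, m))    # entries ~ N(0, 1/n)

V_low = V @ A.T                                      # rows are A v_1, ..., A v_N

# Spot-check that a pairwise distance is roughly preserved.
i, j = 3, 7
print(np.linalg.norm(V[i] - V[j]), np.linalg.norm(V_low[i] - V_low[j]))
```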

The simplicity and speed of this approach are beyond doubt. The immediate question readers may have is: how does it compare to methods like PCA or t-SNE?

In fact, much as the saying "whatever exists has its reasons" suggests, the fact that more complex methods such as PCA and t-SNE have not been displaced indicates that they must hold some advantage over random projection. Indeed, the random projection of the JL Lemma is only a very basic dimensionality reduction method; its significance is in showing that even such a simple method needs no more than $\mathcal{O}(\log N)$ dimensions after reduction. It serves more as a theoretical result than as a practical tool.

Therefore, if accuracy of the reduction is what you are after, specialized methods such as PCA or t-SNE will in most cases perform better than random projection. Moreover, as mentioned in the previous article, the JL Lemma gives only a (very) sufficient condition: bounds such as $n > \frac{24\log N}{\varepsilon^2}$, or even $n > \frac{16\log N}{\varepsilon^2}$, are quite loose. For instance, taking $\varepsilon=0.1$ gives $n > 1600 \log N$, which has essentially no practical value. Switching to more refined methods such as PCA or t-SNE relaxes this requirement, achieving better results at much lower dimensions.
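
To get a feel for how impractical that bound is, here is a quick back-of-the-envelope computation (a sketch; natural logarithms assumed, as in the theoretical edition, and the values of $N$ are illustrative):

```python
# Size of the sufficient bound n > 16 log N / eps^2 at eps = 0.1, i.e. n > 1600 log N.
import numpy as np

for N in (10**4, 10**6, 10**9):
    print(N, int(1600 * np.log(N)))   # roughly 14700, 22100, 33200 dimensions
```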

Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is a scheme for approximate nearest-neighbor search under a given metric. We rarely link LSH with the JL Lemma, but I believe the choice of hash functions in LSH is in fact closely related to it. Simply put, LSH is an algorithm that binarizes vectors in such a way that the binarized vectors approximately preserve the metric. A common scheme uses random projection to (approximately) preserve the cosine similarity.

Specifically, according to the JL Lemma, if we sample an $n \times m$ matrix $A$ from $\mathcal{N}(0, 1/n)$, then for any $v_i, v_j \in \mathbb{R}^m$ we have $\cos(v_i, v_j) \approx \cos(Av_i, Av_j)$. Of course, random projection is not all there is to LSH. Noting that after projection by $A$ the elements of $Av_i, Av_j$ are roughly evenly split between positive and negative, we make a further approximation:

\begin{equation}\cos(v_i,v_j)\approx \cos(Av_i, Av_j)\approx \cos(\text{sign}(Av_i), \text{sign}(Av_j))\end{equation}

That is, each element is binarized to $\pm 1$ based on its sign. This achieves vector binarization while keeping the cosine value approximately unchanged. Once we have binarized vectors, we can build indices or buckets to speed up retrieval; we won't go into detail on those here.
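
A minimal numpy sketch of this binarization step (the sizes and the synthetic vectors below are made up for illustration):

```python
# Sign-binarization after a random projection, assuming numpy.
import numpy as np

rng = np.random.default_rng(0)
m, n = 512, 256                                   # original dimension m, code length n

v1 = rng.normal(size=m)
v2 = v1 + 0.7 * rng.normal(size=m)                # a vector correlated with v1
A = rng.normal(scale=np.sqrt(1 / n), size=(n, m))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

b1, b2 = np.sign(A @ v1), np.sign(A @ v2)         # binary codes in {-1, +1}^n

print(cos(v1, v2))            # original cosine
print(cos(A @ v1, A @ v2))    # after random projection: close to the original
print(cos(b1, b2))            # after binarization: a coarser approximation (see below)
```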

In short, in the LSH process, a key step is also random projection, which itself is closely related to the JL Lemma. Of course, binarization usually results in a significant sacrifice of precision. Therefore, depending on the actual scenario, we don't always "reduce dimensions"; $n$ will not always be less than $m$, and sometimes we might even choose $n > m$. Readers can refer to a previously written article by the author: "How is a binarized word vector model related to fruit flies?"

Randomized Decomposition

Matrix decomposition is a powerful tool for solving many machine learning problems, and Singular Value Decomposition (SVD) is one of the typical methods. However, when the matrix is quite large, the cost of calculating an exact SVD is considerable. In actual scenarios, while the matrix to be decomposed is large, it is often low-rank, making exact SVD calculation unnecessary. This is where "Randomized SVD" comes into play.

Suppose the matrix to be decomposed is $M \in \mathbb{R}^{m \times n}$, where $m, n$ are both large. According to the JL Lemma, we can choose a relatively small $k < \min(m, n)$ such that an $n \times k$ matrix $Q$ sampled from $\mathcal{N}(0, 1/k)$ still satisfies $QQ^{\top} \approx I$ to good accuracy (it is approximately orthogonal), so that $M \approx MQQ^{\top}$. We then only need to perform SVD on the $m \times k$ matrix $B = MQ$, obtaining $MQ = B = U_B \Sigma_B V_B^{\top}$, and therefore:

\begin{equation}M\approx MQQ^{\top} = U_B\Sigma_B V_B^{\top}Q^{\top} = U_B \Sigma_B (QV_B)^{\top}\end{equation}

This gives an approximate SVD of the original matrix $M$. Note that the aforementioned $Q$ is only approximately orthogonal; we can make it strictly orthogonal via QR decomposition (or Gram-Schmidt orthogonalization), but that is a minor detail. Throughout this process, what the JL Lemma tells us is that $k$ can be chosen small enough that the SVD of $B=MQ$ is very cheap, while the overall accuracy does not suffer too much.
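
For readers who want to try this, below is a minimal numpy sketch. It follows the range-finding formulation that is standard in practice (sketch the column space of $M$ with a random matrix, orthogonalize it, then decompose the resulting small matrix), which differs in some details from the projection written above but rests on the same underlying idea; all sizes are illustrative.

```python
# A minimal sketch of randomized SVD in its common range-finding form, assuming numpy.
import numpy as np

rng = np.random.default_rng(0)
m, n, rank, k = 3000, 2000, 10, 30

# A large but exactly low-rank test matrix.
M = rng.normal(size=(m, rank)) @ rng.normal(size=(rank, n))

Omega = rng.normal(scale=np.sqrt(1 / k), size=(n, k))  # random sketching matrix
Q, _ = np.linalg.qr(M @ Omega)        # m x k orthonormal basis for (approx.) range(M)

B = Q.T @ M                           # small k x n matrix, cheap to decompose
U_B, S, Vt = np.linalg.svd(B, full_matrices=False)

U = Q @ U_B                           # approximate left singular vectors of M
M_approx = U @ np.diag(S) @ Vt
print(np.linalg.norm(M - M_approx) / np.linalg.norm(M))  # tiny when M is low-rank
```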

Word Vector Dimensions

We have said that the intuitive reading of the JL Lemma is that "accommodating $N$ vectors requires only an $\mathcal{O}(\log N)$-dimensional space." Returning to the question of choosing word vector dimensions, this means that if the vocabulary size is $N$, a word vector dimension of $\mathcal{O}(\log N)$ is sufficient.

What is quite striking is that in a previous article, "Principle of Minimum Entropy (VI): How to choose the dimension of word vectors?", I derived a dimension selection formula for the Skip Gram word vector model:

\begin{equation}n > 8.33\log N\end{equation}

The scaling agrees exactly with the $\mathcal{O}(\log N)$ given by the JL Lemma! The formula above was estimated from entropy considerations and shares almost no common ground with the starting point of the JL Lemma, yet the two arrive at the same $\log N$ by entirely different routes.

Moreover, beyond the leading $\log N$ term, the entropy-based estimate also pins down the coefficient $8.33$ in front of $\log N$. Past experiments suggest that $8.33 \log N$ agrees reasonably well with practice; it may not be optimal, but it is at least in the right ballpark. Does this conversely suggest that, for a specific problem, entropy can be used to estimate the coefficient in front of $\log N$ fairly accurately?
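
As a quick sanity check of the formula (a sketch; natural logarithms assumed, as in that derivation, and the vocabulary sizes are illustrative), common vocabulary sizes land in the familiar range of word vector dimensions:

```python
# Numeric check of n > 8.33 log N for a few vocabulary sizes.
import numpy as np

for N in (10_000, 100_000, 1_000_000):
    print(N, 8.33 * np.log(N))   # roughly 77, 96, 115 dimensions
```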

Multi-head Attention

Regarding the Attention mechanism, common interview questions include "Why multi-head?", "What is the difference between a single-head attention with head_size 768 and 12-head attention with head_size 64?", etc. That is to say, for an Attention model like BERT, why reduce head_size to 64 before doing dot products? Is 64 really enough?

This question essentially boils down to whether the Attention mechanism is sufficient to fit any probability pattern. Specifically, the formula for calculating Attention is:

\begin{equation}a_{i,j} = \frac{e^{\langle q_i, k_j\rangle}}{\sum\limits_{j'=1}^L e^{\langle q_i, k_{j'}\rangle}}\end{equation}

where $q_i, k_j \in \mathbb{R}^d$. The question of "is it enough" refers to whether the defined $a_{i,j}$ can well-approximate any given probability matrix $p_{i,j}$.
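
Written out in numpy, $a_{i,j}$ is just a row-wise softmax of the $L \times L$ matrix of dot products (a minimal sketch; the shapes and random inputs are illustrative, and the $1/\sqrt{d}$ scaling used in practical implementations is omitted to match the formula above):

```python
# The attention matrix defined above, as a row-wise softmax of dot products.
import numpy as np

rng = np.random.default_rng(0)
L, d = 512, 64
q = rng.normal(size=(L, d))                               # q_1, ..., q_L as rows
k = rng.normal(size=(L, d))                               # k_1, ..., k_L as rows

logits = q @ k.T                                          # <q_i, k_j>
a = np.exp(logits - logits.max(axis=1, keepdims=True))    # numerically stable softmax
a /= a.sum(axis=1, keepdims=True)                         # each row of a sums to 1
```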

Looking at the definition of $a_{i,j}$, I wonder if any readers find it familiar? If we set aside the context of Attention and treat $q_i, k_j$ as two "word vectors," the definition of $a_{i,j}$ is exactly that of the Skip Gram model! In other words, judged purely by its calculation formula, the Attention matrix is essentially the Skip Gram model. Therefore, choosing the head_size in Attention is essentially the same problem as choosing the word vector dimension.

Let's re-examine the process. To answer "what head_size is enough," the question becomes "can $a_{i,j}$ approximate any arbitrary probability matrix $p_{i,j}$." That is, for a given $p_{i,j}$, can we find a set $q_1, \cdots, q_L, k_1, \cdots, k_L \in \mathbb{R}^d$ such that $a_{i,j}$ is sufficiently close to $p_{i,j}$? This problem is mathematically equivalent to the dimension selection for Skip Gram word vectors.

Therefore, the result for word vector dimension selection also applies to the choice of Attention's head_size, except that the vocabulary size becomes the sequence length, i.e., $d > 8.33 \log L$. A common pre-training length is $L=512$; substituting this in gives approximately 52. Again rather striking: this is very close to the common head_size=64! So 64 is indeed enough; increasing it further is unlikely to bring significant gains, and the extra computation is better spent on more heads.
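
A quick computation of this estimate for a few common sequence lengths (natural logarithms assumed, as above; the lengths are illustrative):

```python
# The same estimate applied to head_size: d > 8.33 log L.
import numpy as np

for L in (128, 512, 2048):
    print(L, 8.33 * np.log(L))   # roughly 40, 52, 64
```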

(Note: Related discussions can also be found in the reference "On the Expressive Power of Self-Attention Matrices".)

Summary

This article mainly introduced several direct or indirect applications of the Johnson-Lindenstrauss Lemma (JL Lemma). As we can see, from dimensionality reduction and hashing methods to word vector dimensions and attention head sizes, all are more or less related to the JL Lemma. This further demonstrates the wide scope of the JL Lemma's applicability.

Original Link: https://kexue.fm/archives/8706