The Amazing Johnson-Lindenstrauss Lemma: Theoretical Edition

By 苏剑林 | September 17, 2021

Today we are going to learn about the Johnson-Lindenstrauss Lemma. Since the name is quite long, we will refer to it as the "JL Lemma" hereafter.

In my personal opinion, the JL Lemma is one of the miraculous conclusions that every computer science student must understand. It is a famous result regarding dimensionality reduction and one of the classic examples of counter-intuitive phenomena in high-dimensional spaces. It can be said that the JL Lemma is the theoretical foundation for various dimensionality reduction and hashing techniques in machine learning. Furthermore, in modern machine learning, the JL Lemma provides important theoretical support for our understanding and tuning of model hyperparameters such as dimensions.

Logarithmic Dimensions

The JL Lemma can be expressed very popularly as:

Popular Version of the JL Lemma: To accommodate $N$ vectors, you only need $\mathcal{O}(\log N)$ dimensional space.

Specifically, the JL Lemma states that regardless of the original dimensionality of these $N$ vectors, we can reduce them to $\mathcal{O}(\log N)$ while keeping the relative distance errors within a certain range. Imagine how powerful, counter-intuitive, and practical this conclusion is. For instance, in vector retrieval, the original vector dimensionality might be very large, making full retrieval extremely costly. The JL Lemma tells us that we can transform them into $\mathcal{O}(\log N)$ dimensions with retrieval performance remaining nearly unchanged—it's simply a "free lunch"!

Readers might wonder: with such a powerful conclusion, would the corresponding dimensionality reduction method be extremely complex? The answer is quite the opposite; the reduction process involves only random linear projection! There is even a saying that the JL Lemma is a "conclusion that is easier to prove than to understand." That is, proving it mathematically is not particularly difficult, but intuitively grasping this counter-intuitive conclusion is actually harder.

Coincidentally, we have previously introduced two counter-intuitive results: in the article "Distribution of the Angle Between Two Random Vectors in n-dimensional Space", we introduced the idea that "any two vectors in high-dimensional space are almost perpendicular," which is clearly far from our results in 2D or 3D space. In the article "Understanding Model Parameter Initialization Strategies from a Geometric Perspective", this result was further upgraded to "an $n \times n$ matrix sampled from $\mathcal{N}(0, 1/n)$ is almost an orthogonal matrix," which significantly contradicts our standard understanding that "orthogonality is very strict (requiring the transpose to equal the inverse)."

But in fact, these two conclusions are not only correct but also directly related to the JL Lemma. One could say the JL Lemma is a refinement and application of them. Therefore, we first need to characterize these two conclusions with more quantitative language, such as exactly what the probability of being "almost perpendicular" is, or how large the error in "approximate orthogonality" actually is.

Probability Inequalities

For this, we need some probability knowledge, most importantly "Markov's Inequality":

Markov's Inequality: If $x$ is a non-negative random variable and $a > 0$, then \begin{equation}P(x\geq a)\leq \frac{\mathbb{E}[x]}{a}\end{equation}

Note that this inequality does not impose other special restrictions on the distribution followed by $x$, requiring only that the sample space for the random variable is non-negative (or equivalently, the probability of negative $x$ is zero). The proof is quite simple: \begin{equation}\mathbb{E}[x]=\int_0^{\infty}x p(x)\,dx \geq \int_a^{\infty}x p(x)\,dx \geq \int_a^{\infty} a p(x)\,dx = a P(x\geq a)\end{equation}

Markov's inequality requires the random variable to be non-negative, but the variables we deal with are not always non-negative, so a transformation is often needed. For example, $x - \mathbb{E}[x]$ is not necessarily non-negative, but $|x - \mathbb{E}[x]|$ is. Thus, applying Markov's inequality to $(x - \mathbb{E}[x])^2$: \begin{equation}P(|x - \mathbb{E}[x]|\geq a) = P((x - \mathbb{E}[x])^2\geq a^2) \leq \frac{\mathbb{E}[(x - \mathbb{E}[x])^2]}{a^2}=\frac{\mathbb{V}ar[x]}{a^2}\end{equation} This is "Chebyshev's Inequality."
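As a quick numerical illustration (not part of the original article), the following sketch compares both inequalities with empirical tail probabilities; the exponential and normal distributions and the threshold $a$ are arbitrary choices made purely for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 3.0

# Markov: a non-negative variable with E[x] = 1 (exponential distribution, chosen only for illustration).
x = rng.exponential(scale=1.0, size=1_000_000)
print("P(x >= a) ≈", (x >= a).mean(), " vs Markov bound E[x]/a =", x.mean() / a)

# Chebyshev: deviation of a Gaussian from its mean.
y = rng.normal(loc=2.0, scale=1.0, size=1_000_000)
print("P(|y - E[y]| >= a) ≈", (np.abs(y - y.mean()) >= a).mean(),
      " vs Chebyshev bound Var[y]/a^2 =", y.var() / a**2)
```

Both bounds hold but are quite loose, which is exactly why sharper tools such as the Cramér-Chernoff method below are worth the extra effort.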

Another classic technique is called the "Cramér-Chernoff Method," which is the primary method we will use later. It uses an exponential function to turn a random variable into a non-negative one: for any $\lambda > 0$, we have \begin{equation}x \geq a \quad\Leftrightarrow\quad \lambda x \geq \lambda a \quad\Leftrightarrow\quad e^{\lambda x} \geq e^{\lambda a}\end{equation} So, using Markov's inequality: \begin{equation}P(x \geq a) = P(e^{\lambda x} \geq e^{\lambda a})\leq e^{-\lambda a}\mathbb{E}[e^{\lambda x}]\end{equation} While the left side is independent of $\lambda$, there is a $\lambda$ on the right side, and this inequality holds for any $\lambda > 0$. Thus, theoretically, we can find the $\lambda$ that minimizes the right side to obtain the sharpest estimate: \begin{equation}P(x \geq a) \leq \min_{\lambda > 0} e^{-\lambda a}\mathbb{E}[e^{\lambda x}]\end{equation}
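To make the method concrete, here is a small illustrative sketch (not from the original article) for a standard Gaussian, where the moment generating function $\mathbb{E}[e^{\lambda x}]=e^{\lambda^2/2}$ is known in closed form; minimizing over $\lambda$ numerically recovers the analytic bound $e^{-a^2/2}$, which can then be compared with the exact tail.

```python
import numpy as np
from math import erfc, sqrt

# For x ~ N(0, 1): E[exp(lambda*x)] = exp(lambda^2/2), so the Cramér-Chernoff bound is
# min over lambda > 0 of exp(-lambda*a + lambda^2/2), attained at lambda = a.
a = 3.0
lambdas = np.linspace(1e-3, 10.0, 100_000)
numeric_bound = np.exp(-lambdas * a + lambdas**2 / 2).min()

true_tail = 0.5 * erfc(a / sqrt(2))   # exact P(x >= a) for the standard Gaussian
print(f"Chernoff bound (numeric) = {numeric_bound:.4e}, analytic exp(-a^2/2) = {np.exp(-a**2 / 2):.4e}")
print(f"true tail P(x >= a)      = {true_tail:.4e}")
```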

The Lemma of the Lemma

Now, we can introduce the following result, which is a lemma for the JL Lemma, or one could say, the theoretical foundation for all conclusions in this article:

Unit Norm Lemma: Let $u\in\mathbb{R}^{n}$ be a vector where each component is independently and identically sampled from $\mathcal{N}(0,1/n)$. Let $\varepsilon \in (0, 1)$ be a given constant. We have \begin{equation}P(|\| u\|^2 - 1| \geq \varepsilon) \leq 2\exp\left(-\frac{\varepsilon^2 n}{8}\right)\end{equation}

This lemma tells us that when $n$ is large enough, the probability that the norm of $u$ significantly deviates from 1 is very small (for a given $\varepsilon$, it decreases exponentially with respect to $n$ to 0). Thus, an $n$-dimensional vector sampled from $\mathcal{N}(0,1/n)$ will be very close to a unit vector.
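Before the proof, here is a minimal numerical sketch of the lemma (purely illustrative; the values of $\varepsilon$, $n$, and the number of samples are arbitrary choices).

```python
import numpy as np

# Sample many vectors u with i.i.d. N(0, 1/n) components and estimate
# P(| ||u||^2 - 1 | >= eps), comparing it with the bound 2*exp(-eps^2 * n / 8).
rng = np.random.default_rng(0)
eps, num_samples = 0.2, 10_000
for n in (100, 400, 1600):
    u = rng.normal(0.0, 1.0 / np.sqrt(n), size=(num_samples, n))
    deviation = np.abs((u**2).sum(axis=1) - 1.0)
    empirical = (deviation >= eps).mean()
    bound = 2.0 * np.exp(-eps**2 * n / 8)
    print(f"n = {n:5d}: empirical = {empirical:.4f}, bound = {bound:.4f}")
```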

Its proof uses the "Cramér-Chernoff Method": first, $|\| u\|^2 - 1| \geq \varepsilon$ means that either $\| u\|^2 - 1 \geq \varepsilon$ or $1 - \| u\|^2\geq \varepsilon$, and we bound the probability of each case separately. We first bound the probability of $\| u\|^2 - 1 \geq \varepsilon$. According to the Cramér-Chernoff method: \begin{equation}P(\| u\|^2 - 1 \geq \varepsilon) \leq \min_{\lambda > 0} e^{-\lambda \varepsilon}\mathbb{E}\big[e^{\lambda (\| u\|^2 - 1)}\big] = \min_{\lambda > 0} e^{-\lambda (\varepsilon + 1)}\mathbb{E}\big[e^{\lambda \| u\|^2}\big]\end{equation} Writing $u$ in component form $(u_1, u_2, \cdots, u_n)$, where each component is independent and follows $\mathcal{N}(0,1/n)$, we have: \begin{equation}\mathbb{E}\big[e^{\lambda \| u\|^2}\big] = \mathbb{E}\big[e^{\lambda\sum\limits_i u_i^2}\big] = \mathbb{E}\big[\prod_i e^{\lambda u_i^2}\big]=\prod_i \mathbb{E}\big[ e^{\lambda u_i^2}\big]\end{equation} Since $\mathbb{E}\big[ e^{\lambda u_i^2}\big]=\int_{-\infty}^{\infty} \sqrt{\frac{n}{2\pi}}\,e^{-n u_i^2/2}e^{\lambda u_i^2}\, du_i=\sqrt{n/(n-2\lambda)}$ (convergent for $0 < \lambda < n/2$), we then have: \begin{equation}P(\| u\|^2 - 1 \geq \varepsilon) \leq \min_{\lambda > 0} e^{-\lambda (\varepsilon + 1)}\left(\frac{n}{n-2\lambda}\right)^{n/2}\end{equation} The minimum on the right side is attained at $\lambda = \frac{n\varepsilon}{2(1+\varepsilon)}$ (which indeed satisfies $\lambda < n/2$); the derivation is left to the reader. Substituting it back, we get: \begin{equation}P(\| u\|^2 - 1 \geq \varepsilon) \leq e^{n(\log(1+\varepsilon) - \varepsilon)/2}\leq e^{-n\varepsilon^2/8}\end{equation} The proof that $\log(1+\varepsilon) - \varepsilon \leq -\varepsilon^2/4$ is also left to the reader. Similarly, we can bound the probability of $1 - \| u\|^2\geq \varepsilon$, resulting in: \begin{equation}P(1 - \| u\|^2\geq \varepsilon) \leq e^{n(\log(1-\varepsilon) + \varepsilon)/2}\leq e^{-n\varepsilon^2/8}\end{equation} where it can be proved that $\log(1-\varepsilon) + \varepsilon \leq \log(1+\varepsilon) - \varepsilon$, so the last step reuses the bound for $\log(1+\varepsilon) - \varepsilon$. Since the two events are mutually exclusive, adding the two bounds gives $P(|\| u\|^2 - 1| \geq \varepsilon)\leq 2e^{-n\varepsilon^2/8}$. Q.E.D.

Starting from the "Unit Norm Lemma," we can prove the "Orthogonality Lemma":

Orthogonality Lemma: Let $u,v\in\mathbb{R}^{n}$ be two vectors independently and identically sampled from $\mathcal{N}(0,1/n)$. Let $\varepsilon \in (0, 1)$ be a given constant. We have \begin{equation}P(|\langle u, v\rangle| \geq \varepsilon) \leq 4\exp\left(-\frac{\varepsilon^2 n}{8}\right)\end{equation}

This lemma tells us that when $n$ is large enough, the probability that the inner product of $u, v$ significantly deviates from 0 is very small. Combined with the "Unit Norm Lemma," we get the conclusion that "an $n \times n$ matrix sampled from $\mathcal{N}(0, 1/n)$ is almost an orthogonal matrix."
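The following sketch (again illustrative; the matrix sizes are arbitrary) checks both lemmas at once by inspecting the Gram matrix $A^{\top}A$ of a random square matrix: the diagonal stays close to 1 and the off-diagonal entries stay close to 0.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in (100, 500, 2000):
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))   # entries i.i.d. from N(0, 1/n)
    G = A.T @ A                                           # Gram matrix of the columns
    diag_dev = np.abs(np.diag(G) - 1.0).max()             # worst | ||column||^2 - 1 |
    off_diag = np.abs(G - np.diag(np.diag(G))).max()      # largest |<column_i, column_j>|, i != j
    print(f"n = {n:5d}: max |diag - 1| = {diag_dev:.3f}, max |off-diag| = {off_diag:.3f}")
```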

With the "Unit Norm Lemma" as a foundation, the proof is not difficult. We know that if $u,v\sim \mathcal{N}(0,1/n)$, then $\frac{u\pm v}{\sqrt{2}}\sim \mathcal{N}(0,1/n)$. According to the proof of the "Unit Norm Lemma," we have: \begin{equation}P\left(\left\| \frac{u+v}{\sqrt{2}}\right\|^2 - 1 \geq \varepsilon\right) \leq e^{-n\varepsilon^2/8},\quad P\left(1-\left\| \frac{u-v}{\sqrt{2}}\right\|^2 \geq \varepsilon\right) \leq e^{-n\varepsilon^2/8}\end{equation} Note that adding $\left\| \frac{u+v}{\sqrt{2}}\right\|^2 - 1 \geq \varepsilon$ and $1-\left\| \frac{u-v}{\sqrt{2}}\right\|^2 \geq \varepsilon$ leads to $\langle u, v\rangle\geq \varepsilon$, hence: \begin{equation}P(\langle u, v\rangle\geq \varepsilon) \leq P\left(\left\| \frac{u+v}{\sqrt{2}}\right\|^2 - 1 \geq \varepsilon\right) + P\left(1-\left\| \frac{u-v}{\sqrt{2}}\right\|^2 \geq \varepsilon\right) \leq 2e^{-n\varepsilon^2/8}\end{equation} Similarly, we can prove $P(-\langle u, v\rangle\geq \varepsilon) \leq 2e^{-n\varepsilon^2/8}$. Combining both yields the "Orthogonality Lemma."

The Proof Process

Now we can proceed to prove the JL Lemma. Here is its mathematical statement:

Mathematical Version of the JL Lemma: Given $N$ vectors $v_1,v_2,\cdots,v_N\in\mathbb{R}^m$, a constant $\varepsilon \in (0, 1)$, and an integer $n > \frac{24\log N}{\varepsilon^2}$, let $A\in\mathbb{R}^{n\times m}$ be a random matrix whose elements are independently and identically sampled from $\mathcal{N}(0,1/n)$. Then, with probability at least $\frac{N-1}{N}$, the following holds for all $i\neq j$: \begin{equation}(1-\varepsilon)\| v_i - v_j\|^2 \leq \| Av_i - A v_j\|^2 \leq (1+\varepsilon)\| v_i - v_j\|^2\label{eq:bound}\end{equation}

The lemma tells us that regardless of the original vector dimensionality $m$, we only need $n > \frac{24\log N}{\varepsilon^2}$ dimensions to accommodate $N$ vectors such that the deviation in their relative distances does not exceed $\varepsilon$. Furthermore, the JL Lemma provides the dimensionality reduction method: just randomly sample an $n\times m$ matrix $A$ from $\mathcal{N}(0,1/n)$, and the transformation $v\to Av$ will achieve the goal with a probability of at least $\frac{N-1}{N}$. This is truly simple and practical!
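Here is a minimal sketch of this procedure (illustrative only; $N$, $m$, and $\varepsilon$ are arbitrary choices, and the target dimension $n$ is taken from the lemma's conservative bound).

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, eps = 500, 3000, 0.3
n = int(np.ceil(24 * np.log(N) / eps**2))       # conservative dimension from the lemma

V = rng.normal(size=(N, m))                     # N arbitrary vectors in R^m
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))
W = V @ A.T                                     # projected vectors: row i is A v_i

def pairwise_sq_dists(X):
    sq = (X**2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

iu = np.triu_indices(N, k=1)                    # all pairs i < j
ratio = pairwise_sq_dists(W)[iu] / pairwise_sq_dists(V)[iu]
print(f"n = {n}; distortion ratios lie in [{ratio.min():.3f}, {ratio.max():.3f}],"
      f" target [{1 - eps}, {1 + eps}]")
```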

The proof process is also a direct application of the "Unit Norm Lemma." First, if $u\in\mathbb{R}^m$ is a given unit vector and $A\in\mathbb{R}^{n\times m}$ is sampled from $\mathcal{N}(0,1/n)$, then each component of $Au$ is independently distributed as $\mathcal{N}(0,1/n)$. The proof is not hard: by definition, each component is $(Au)_i = \sum\limits_j A_{i,j}u_j$. Since different components involve disjoint rows of $A$ and all $A_{i,j}$ are independent, the $(Au)_i$ are independent. Since $A_{i,j}\sim\mathcal{N}(0,1/n)$ and a linear combination of independent normal random variables is still normal, $(Au)_i$ follows a normal distribution with mean $\sum\limits_j u_j\times 0=0$ and variance $\sum\limits_j u_j^2\times \frac{1}{n} = \frac{1}{n}$ (using $\| u\| = 1$).
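A quick simulation of this claim (illustrative; the sizes are arbitrary) looks at a single component of $Au$ across many independent draws of the corresponding row of $A$.

```python
import numpy as np

# (Au)_1 = sum_j A_{1j} u_j for a fixed unit vector u, with A_{1j} ~ N(0, 1/n):
# the result should again be distributed as N(0, 1/n).
rng = np.random.default_rng(0)
m, n, trials = 50, 200, 200_000
u = rng.normal(size=m)
u /= np.linalg.norm(u)                                        # fixed unit vector

rows = rng.normal(0.0, 1.0 / np.sqrt(n), size=(trials, m))    # many draws of the first row of A
samples = rows @ u                                            # corresponding values of (Au)_1
print(f"mean ≈ {samples.mean():.5f}, variance ≈ {samples.var():.5f}, expected 1/n = {1 / n:.5f}")
```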

So, essentially, $Au$ behaves like an $n$-dimensional vector whose components are sampled independently from $\mathcal{N}(0,1/n)$. Now substituting $u=\frac{v_i - v_j}{\| v_i - v_j\|}$ and using the "Unit Norm Lemma": \begin{equation}P\left(\left| \left\| \frac{A(v_i - v_j)}{\| v_i - v_j\|}\right\|^2 - 1\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 n}{8}\right)\end{equation} This result holds for any $i\neq j$. By the union bound over all ${N\choose 2}$ pairs, the probability that the deviation reaches $\varepsilon$ for at least one pair is at most: \begin{equation}P\left(\exists (i,j):\,\left| \left\| \frac{A(v_i - v_j)}{\| v_i - v_j\|}\right\|^2 - 1\right| \geq \varepsilon\right) \leq 2 {N\choose 2} \exp\left(-\frac{\varepsilon^2 n}{8}\right)\end{equation} Equivalently, the probability that $\left| \left\| \frac{A(v_i - v_j)}{\| v_i - v_j\|}\right\|^2 - 1\right| \leq \varepsilon$ holds for all $i\neq j$ (which is equivalent to $\eqref{eq:bound}$) is at least: \begin{equation}1 - 2 {N\choose 2} \exp\left(-\frac{\varepsilon^2 n}{8}\right) = 1 - N(N-1)\exp\left(-\frac{\varepsilon^2 n}{8}\right)\end{equation} Substituting $n > \frac{24\log N}{\varepsilon^2}$, we get: \begin{equation}1 - N(N-1)\exp\left(-\frac{\varepsilon^2 n}{8}\right)\geq 1 - N(N-1)N^{-3}\geq 1-N^{-1}\end{equation} With this, the proof is complete.

The JL Lemma above preserves the Euclidean distance approximately. Many times, retrieval uses the inner product (like cosine similarity) rather than Euclidean distance. For this, we have:

Inner Product Version of the JL Lemma: Given $N$ unit vectors $v_1,v_2,\cdots,v_N\in\mathbb{R}^m$, a constant $\varepsilon \in (0, 1)$, and $n > \frac{24\log N}{\varepsilon^2}$, let $A\in\mathbb{R}^{n\times m}$ be a random matrix whose elements are sampled from $\mathcal{N}(0,1/n)$. Then, with probability at least $\frac{N-2}{N}$, the following holds for all $i\neq j$: \begin{equation}\left|\langle Av_i, Av_j\rangle - \langle v_i, v_j\rangle\right|\leq\varepsilon\end{equation}

The proof is simple and mimics that of the "Orthogonality Lemma." By the argument used to prove the JL Lemma, applied once to the differences $v_i - v_j$ and once to the sums $v_i + v_j$, each of the following two families of bounds fails with probability at most $\frac{1}{N}$, so by the union bound both hold for all $i\neq j$ with probability at least $\frac{N-2}{N}$: \begin{equation}\begin{aligned} (1-\varepsilon)\| v_i - v_j\|^2 \leq \| Av_i - A v_j\|^2 \leq (1+\varepsilon)\| v_i - v_j\|^2 \\ (1-\varepsilon)\| v_i + v_j\|^2 \leq \| Av_i + A v_j\|^2 \leq (1+\varepsilon)\| v_i + v_j\|^2 \end{aligned}\end{equation} Multiplying the first by $-1$ gives $-(1+\varepsilon)\| v_i - v_j\|^2 \leq -\| Av_i - A v_j\|^2 \leq -(1-\varepsilon)\| v_i - v_j\|^2$, and adding it to the second gives: \begin{equation}4\langle v_i, v_j\rangle-2\varepsilon(\| v_i\|^2 + \| v_j\|^2)\leq 4\langle Av_i, Av_j\rangle \leq 4\langle v_i, v_j\rangle + 2\varepsilon(\| v_i\|^2 + \| v_j\|^2)\end{equation} Since $v_i, v_j$ are unit vectors, this simplifies to $|\langle Av_i, Av_j\rangle - \langle v_i, v_j\rangle|\leq\varepsilon$.
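A corresponding sketch for the inner-product version (again with arbitrary illustrative sizes) normalizes the vectors first and then measures the worst inner-product error after projection.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, eps = 500, 3000, 0.3
n = int(np.ceil(24 * np.log(N) / eps**2))

V = rng.normal(size=(N, m))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # make the v_i unit vectors
A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))
W = V @ A.T

iu = np.triu_indices(N, k=1)
err = np.abs((W @ W.T)[iu] - (V @ V.T)[iu])     # |<Av_i, Av_j> - <v_i, v_j>| over all pairs
print(f"n = {n}; max inner-product error = {err.max():.3f} (target <= {eps})")
```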

Extreme Sufficiency

Students who have worked through the proof of the JL Lemma can sense that the $\log N$ in the conclusion essentially arises because the probability term $2\exp\left(-\frac{\varepsilon^2 n}{8}\right)$ in the "Unit Norm Lemma" decays exponentially in $n$: to make this term small compared with the roughly $N^2$ pairs in the union bound, i.e., polynomially small in $N$, a dimension $n$ on the order of $\log N$ already suffices.

Overall, the JL Lemma tells us that to accommodate $N$ vectors with relative distance error at most $\varepsilon$, you only need $\mathcal{O}\left(\frac{\log N}{\varepsilon^2}\right)$ dimensions; the specific constant in front is not that important. In fact, the JL Lemma provides a very sufficient condition, and in practice the requirements are often much looser. For instance, in the proof of the JL Lemma, if we relax the condition to $n > \frac{16\log N}{\varepsilon^2}$, then the probability that equation $\eqref{eq:bound}$ holds is at least: \begin{equation}1 - N(N-1)\exp\left(-\frac{\varepsilon^2 n}{8}\right)\geq 1 - N(N-1)N^{-2}=1/N\end{equation} Note that while $1/N$ is small, it is still greater than 0, so there still exists an $A$ such that $\eqref{eq:bound}$ holds; it is just that finding such an $A$ is more costly (the guaranteed success rate is only $1/N$). If we only care about existence, then this is enough.
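If you want a feel for how conservative the constant is, a rough experiment along the following lines can be run (the sizes and the sweep over $n$ are arbitrary choices; it simply reports the observed worst-case distortion at each $n$).

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 500, 3000
V = rng.normal(size=(N, m))
iu = np.triu_indices(N, k=1)

def sq_dists(X):
    sq = (X**2).sum(axis=1)
    return (sq[:, None] + sq[None, :] - 2.0 * X @ X.T)[iu]

d_orig = sq_dists(V)
for n in (50, 100, 200, 400, 800):
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, m))
    worst = np.abs(sq_dists(V @ A.T) / d_orig - 1.0).max()
    print(f"n = {n:4d}: worst relative distortion ≈ {worst:.3f}")
```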

Moreover, the JL Lemma only considers dimensionality reduction by random linear projection, and already achieves $n > \frac{16\log N}{\varepsilon^2}$ for existence. With more sophisticated reduction techniques, such as SVD-based reduction, it is possible to get better results (smaller constants), and non-linear reduction methods could do better still. Therefore, do not worry too much about the constant in front of $\frac{\log N}{\varepsilon^2}$; what matters is the scale $\mathcal{O}\left(\frac{\log N}{\varepsilon^2}\right)$. If a concrete value is really needed, it is usually determined from the actual situation rather than from the theoretical result.

To Be Continued

In this article, we introduced the Johnson-Lindenstrauss Lemma (JL Lemma), a miraculous and important conclusion about dimensionality reduction and a manifestation of the unusual nature of high-dimensional space. It tells us that "only $\mathcal{O}(\log N)$ dimensional space is needed to accommodate $N$ vectors," allowing high-dimensional retrieval problems to be transferred into $\mathcal{O}(\log N)$ dimensions.

This article primarily discussed the details of the theoretical proof. In the next article, we will attempt to apply it to understand some machine learning problems. Stay tuned!