By 苏剑林 | August 20, 2020
With the development of NLP, word embedding models like Word2Vec and GloVe are gradually being replaced by Transformer-based models such as BERT. However, classics remain classics; word embedding models still shine in many scenarios, and there is still much about them worth studying. In this article, we address a common point of confusion regarding word embedding models: roughly what dimension is sufficient for word vectors? Let's give the conclusion first. The estimate obtained in this article is:
\begin{equation}n > 8.33\log N\label{eq:final}\end{equation}To be even more concise, one can simply remember $n > 8\log N$, where $N$ is the vocabulary size and $n$ is the word vector dimension, with $\log$ being the natural logarithm. When $n$ exceeds this threshold, it indicates that the model has sufficient capacity to accommodate these $N$ words (of course, a larger $n$ also increases the risk of overfitting). Thus, when $N=100,000$, the resulting $n$ is approximately 96; so for a word embedding model with 100,000 words, a dimension of 96 is sufficient. If you need to accommodate 5 million words, then $n$ would be approximately 128.
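For a quick sanity check of these numbers, here is a minimal Python sketch of the rule of thumb (the function name is mine; only the standard library is used):

```python
import math

def min_word_vector_dim(N):
    """Rule-of-thumb lower bound n > 8.33 * ln(N) on the word vector dimension."""
    return 8.33 * math.log(N)

print(min_word_vector_dim(100_000))    # ≈ 95.9, so n = 96 suffices for a 100k-word vocabulary
print(min_word_vector_dim(5_000_000))  # ≈ 128.5, i.e. roughly 128 for 5 million words
```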
The reason I thought of this question is that yesterday I came across the paper "Word2vec Skip-gram Dimensionality Selection via Sequential Normalized Maximum Likelihood" on arXiv. Unfortunately, I did not find the answer I was looking for in that paper. I did a quick search and found other literature studying the same problem, such as "On the Dimensionality of Word Embedding", but the answers were still not what I wanted.
Why do I say this? Obviously, the most standard answer to this question is to determine the optimal dimension through repeated experiments, so one cannot expect theoretical analysis to give an extremely precise answer. The word vector dimensions we commonly use are generally 64, 100, 128, 256, 300, etc., and the difference in performance between different dimensions is often not very significant. All I hoped for was to derive, in the most concise and intuitive way possible and without overly complex analysis, the order of magnitude of the dimension a typical word vector model requires, such as tens or hundreds.
Since I couldn't find a satisfactory existing result, I analyzed it from the perspective of the Principle of Minimum Entropy and obtained an answer close to what I had in mind.
This article analyzes word vector models based on the Skip-Gram concept. Most word vector models are essentially variations of it. As for CBOW-type models, in previous experiments, their performance has been similar to Skip-Gram (especially when the data volume is large), so the analysis results for Skip-Gram can be considered generalizable.
Our starting point is information entropy. We know that entropy is a measure of uncertainty. Language itself carries a certain amount of uncertainty, and when we encode words as vectors, the uncertainty of the encoded result should be no greater than that of the language itself, so that the encoding is effective and preserves enough of the original language's information. In other words, we want to eliminate uncertainty, which means we want to minimize entropy.
It should be noted that word vectors are based on the Skip-Gram model, so what we need to calculate is not the average word entropy, but the average entropy of the entire Skip-Gram model. Assuming the frequency of the word pair $(w_i, w_j)$ is $\tilde{p}(w_i, w_j)$, we can estimate its entropy as:
\begin{equation}\tilde{H}=-\sum_{i, j} \tilde{p}(w_i, w_j)\log \tilde{p}(w_i, w_j)\end{equation}The training objectives of different word vector models also differ; some fit the joint probability $p(w_i, w_j)$, while others fit the conditional probability $p(w_j|w_i)$. However, the difference is minor; as mentioned earlier, this article only seeks an approximate figure. Therefore, we uniformly assume the word vector model is:
\begin{equation}p(w_i, w_j) = \frac{e^{ \langle \boldsymbol{u}_i, \boldsymbol{v}_j \rangle }}{Z},\quad Z = \sum_{i,j}e^{ \langle \boldsymbol{u}_i, \boldsymbol{v}_j \rangle }\end{equation}where $\boldsymbol{u}$ and $\boldsymbol{v}$ are two different sets of word vectors (center-word vectors and context-word vectors), with $\boldsymbol{u}_i$ the vector of the center word $w_i$ and $\boldsymbol{v}_j$ the vector of the context word $w_j$. The information entropy of this model is:
\begin{equation}H=-\sum_{i, j} p(w_i, w_j)\log p(w_i, w_j)=\log Z-\frac{1}{Z}\sum_{i, j} e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle} \langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\end{equation}To approximate the calculation of the above formula, we use sampling approximation for the summation, for example:
\begin{equation}Z = \sum_{i, j} e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle} = N^2\times \frac{1}{N^2}\sum_{i, j} e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}\approx N^2\mathbb{E}_{\boldsymbol{u}_i,\boldsymbol{v}_j}\left[e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}\right]\end{equation}Here $N$ is the vocabulary size. Similarly:
\begin{equation}\sum_{i, j} e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle} \langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\approx N^2\mathbb{E}_{\boldsymbol{u}_i,\boldsymbol{v}_j}\left[e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle} \langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\right]\end{equation}So we have the approximation:
\begin{equation}H\approx\log N^2 + \log \mathbb{E}_{\boldsymbol{u}_i,\boldsymbol{v}_j}\left[e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}\right]-\frac{\mathbb{E}_{\boldsymbol{u}_i,\boldsymbol{v}_j}\left[e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle} \langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\right]}{\mathbb{E}_{\boldsymbol{u}_i,\boldsymbol{v}_j}\left[e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}\right]}\end{equation}Looking at existing word vector models, we find that the values in each dimension can be positive or negative, and their absolute values are fairly uniform in size. So we might as well assume that each element has absolute value roughly 1, in which case the norm of each word vector is approximately $\sqrt{n}$ ($n$ is the word vector dimension, which is exactly what we want to estimate; if you feel this approximation is too crude, you can adjust it yourself). We further assume that all word vectors are uniformly distributed on an $n$-dimensional hypersphere of radius $\sqrt{n}$, so that $\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle=n\cos\theta$, where $\theta$ is the angle between them. Thus:
\begin{equation}H\approx\log N^2 + \log \mathbb{E}_{\theta}\left[e^{n\cos\theta}\right]-\frac{\mathbb{E}_{\theta}\left[e^{n\cos\theta} n\cos\theta\right]}{\mathbb{E}_{\theta}\left[e^{n\cos\theta}\right]}\label{eq:H-theta}\end{equation}Now $\theta$ corresponds to the angle between any two random vectors in $n$-dimensional space. In "Distribution of the angle between two random vectors in n-dimensional space", we derived its distribution as:
\begin{equation}p_n(\theta) = \frac{\Gamma\left(\frac{n}{2}\right)}{\Gamma\left(\frac{n-1}{2}\right)\sqrt{\pi}}\sin^{n-2}\theta\label{eq:jqfb}\end{equation}Since the probability density function is determined, for a given $N$ and $n$, the approximation formula \eqref{eq:H-theta} can be fully computed numerically. From $\tilde{H} > H$, we can solve for the corresponding $n$.
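As a sanity check on this density, one can sample pairs of random directions in $n$ dimensions and compare their angles against $p_n(\theta)$. Below is a rough sketch assuming NumPy and SciPy are available; the choices $n=64$ and 100,000 sampled pairs are arbitrary:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)
n, m = 64, 100_000                     # dimension and number of sampled pairs (arbitrary choices)
u = rng.normal(size=(m, n))            # isotropic Gaussian samples give uniform random directions
v = rng.normal(size=(m, n))
cos_theta = (u * v).sum(1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
theta = np.arccos(cos_theta)

def p_n(t, n):
    # p_n(θ) = Γ(n/2) / (Γ((n-1)/2) √π) · sin^{n-2}(θ), evaluated in log space for stability
    return np.exp(gammaln(n / 2) - gammaln((n - 1) / 2) - 0.5 * np.log(np.pi)
                  + (n - 2) * np.log(np.sin(t)))

print(theta.mean(), np.pi / 2)          # angles concentrate around a right angle
print((cos_theta ** 2).mean(), 1 / n)   # E[cos²θ] = 1/n under p_n(θ); the two values agree closely
# A histogram of `theta` (density=True) lies almost exactly on the curve p_n(θ, n).
```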
First, we numerically calculate some results for $h_n=\log \mathbb{E}_{\theta}\left[e^{n\cos\theta}\right]-\frac{\mathbb{E}_{\theta}\left[e^{n\cos\theta} n\cos\theta\right]}{\mathbb{E}_{\theta}\left[e^{n\cos\theta}\right]}$:
| $n$ | 32 | 64 | 96 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|
| $h_n$ | -7.77471 | -15.4734 | -23.1726 | -30.8718 | -61.6692 | -123.264 |
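These values can be reproduced by direct numerical integration, since all the expectations in $h_n$ are taken under the density \eqref{eq:jqfb}, i.e. $p_n(\theta)\propto\sin^{n-2}\theta$. Below is a short sketch assuming SciPy is available:

```python
import numpy as np
from scipy.integrate import quad

def h(n):
    """h_n = log E[e^{n cosθ}] - E[n cosθ e^{n cosθ}] / E[e^{n cosθ}], with θ ~ p_n(θ) ∝ sin^{n-2}θ."""
    w = lambda t: np.exp((n - 2) * np.log(np.sin(t)) + n * np.cos(t))  # sin^{n-2}θ · e^{n cosθ}
    I0, _ = quad(lambda t: np.sin(t) ** (n - 2), 0, np.pi)             # ∫ sin^{n-2}θ dθ
    I1, _ = quad(w, 0, np.pi)                                          # ∫ sin^{n-2}θ e^{n cosθ} dθ
    I2, _ = quad(lambda t: w(t) * np.cos(t), 0, np.pi)                 # ∫ sin^{n-2}θ e^{n cosθ} cosθ dθ
    return np.log(I1 / I0) - n * I2 / I1

for n in [32, 64, 96, 128, 256, 512]:
    print(n, h(n))  # ≈ -7.77, -15.47, -23.17, -30.87, -61.67, -123.26, as in the table above
```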
Thus, for example, if $n=64, N=100,000$, then $H\approx \log 100000^2 - 15.4734 = 7.55245$. Readers might find it strange: when $n=128, N=100,000$, doesn't $H$ become negative? How can discrete entropy be negative? In fact, this is because in our derivation process, we combined sampling approximation with exact integration. When the spatial dimension $n$ is large enough, even sampling hundreds of thousands of points may not accurately estimate certain statistics, so the sampling approximation step introduces error.
However, this gives us another way to determine $n$: When $H < 0$ occurs, it indicates that $N$ samples are no longer sufficient to estimate the statistics well, which conversely means that the $n$-dimensional space is "more than enough" to accommodate $N$ samples. Therefore, we can simply determine a boundary using $H < 0$ without needing to estimate $\tilde{H}$. (Alternatively, from another perspective: $\tilde{H}$ is always greater than 0, so $H < 0$ is a sufficient condition for $H < \tilde{H}$.)
Finally, the table shows that $h_n$ is roughly linear in $n$, with $h_n/n\approx -0.24$, so $H\approx\log N^2 -0.24n = 2\log N - 0.24n$. Requiring this to be less than 0 gives $n > \frac{2}{0.24}\log N \approx 8.33\log N$, which is exactly formula \eqref{eq:final}.
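As a cross-check of both the $H<0$ criterion and the resulting formula, one can search for the boundary dimension directly from the quadrature rather than from the linear fit. This is a self-contained sketch that repeats the $h_n$ helper from above (again assuming SciPy):

```python
import numpy as np
from scipy.integrate import quad

def h(n):
    # Same h_n as in the earlier sketch: expectations under p_n(θ) ∝ sin^{n-2}θ.
    w = lambda t: np.exp((n - 2) * np.log(np.sin(t)) + n * np.cos(t))
    I0 = quad(lambda t: np.sin(t) ** (n - 2), 0, np.pi)[0]
    I1 = quad(w, 0, np.pi)[0]
    I2 = quad(lambda t: w(t) * np.cos(t), 0, np.pi)[0]
    return np.log(I1 / I0) - n * I2 / I1

def boundary_dim(N):
    # Smallest n with H ≈ log N^2 + h_n < 0.
    n = 2
    while 2 * np.log(N) + h(n) >= 0:
        n += 1
    return n

print(boundary_dim(100_000))  # 96, in line with n > 8.33 log N ≈ 95.9
```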
Regarding the result $h_n/n\approx -0.24$, we can also obtain it theoretically. According to the definition of $h_n$, we have:
\begin{equation}h_n = \log\frac{\int_0^{\pi} \sin^{n-2}\theta\, e^{n\cos\theta}d\theta}{\int_0^{\pi} \sin^{n-2}\theta d\theta} - n\frac{\int_0^{\pi} \sin^{n-2}\theta\, e^{n\cos\theta}\cos\theta d\theta}{\int_0^{\pi} \sin^{n-2}\theta\,e^{n\cos\theta} d\theta}\end{equation}Essentially, this is an asymptotic estimation problem for several definite integrals. We mainly use the idea of Laplace's approximation. Assuming $n$ is large enough such that we can ignore the difference between $n$ and $n-2$, then:
\begin{equation}\log\left[\sin^{n-2}\theta\, e^{n\cos\theta}\right]=(n-2)\log \sin\theta + n\cos\theta \approx n (\log \sin\theta + \cos\theta)\end{equation}Setting the derivative $\cot\theta - \sin\theta$ to zero gives $\cos\theta = \frac{\sqrt{5}-1}{2}$, so the maximum of $\log\sin\theta + \cos\theta$ on $[0,\pi]$ is attained at $\theta = \arctan\sqrt{\frac{\sqrt{5}+1}{2}}\approx 0.904557$. Expanding to second order around that point gives:
\begin{equation}\log \sin\theta + \cos\theta\approx 0.377428 -1.11803 (\theta - 0.904557)^2\end{equation}Thus:
\begin{equation}\begin{aligned} \int_0^{\pi} \sin^{n-2}\theta\, e^{n\cos\theta}d\theta\approx&\, \int_{-\infty}^{\infty} e^{n[0.377428 -1.11803 (\theta - 0.904557)^2]}d\theta\\ \approx&\, \frac{1.67629}{\sqrt{n}}e^{0.377428n} \end{aligned}\end{equation}Note also that when $n$ is large enough, the integrand is appreciably nonzero only near $\theta=0.904557$ (equivalently, the limiting normal density can be treated as a Dirac delta function), which yields:
\begin{equation}\begin{aligned} \frac{\int_0^{\pi} \sin^{n-2}\theta\, e^{n\cos\theta}\cos\theta d\theta}{\int_0^{\pi} \sin^{n-2}\theta\, e^{n\cos\theta} d\theta}\approx&\, \frac{\int_{-\infty}^{\infty} e^{n[0.377428 -1.11803 (\theta - 0.904557)^2]}\cos\theta d\theta}{\int_{-\infty}^{\infty} e^{n[0.377428 -1.11803 (\theta - 0.904557)^2]}d\theta} \\ \approx&\, \int_{-\infty}^{\infty} \delta(\theta - 0.904557) \cos\theta d\theta\\ =&\,\cos 0.904557 \approx 0.618034 \end{aligned}\end{equation}Similarly, using Laplace's approximation, we can get $\int_0^{\pi} \sin^{n-2}\theta d\theta\approx\frac{2.50663}{\sqrt{n}}$. Combining all these results, we obtain:
\begin{equation}h_n \approx \log \frac{\frac{1.67629}{\sqrt{n}}e^{0.377428n}}{\frac{2.50663}{\sqrt{n}}}-0.618034n\approx -0.240606 n\end{equation}This yields the result $h_n/n\approx -0.24$. The derivation above used rounded numerical values; if exact values are kept throughout, the coefficient works out analytically to $h_n/n \to -\frac{1}{2}\log\frac{\sqrt{5}+1}{2} \approx -0.240606$.
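The constants in this derivation can be verified with a short symbolic computation (a sketch assuming SymPy is available; variable names are mine), and the exact slope can be compared against the $n=512$ entry of the table above:

```python
import math
import sympy as sp

theta = sp.symbols('theta', positive=True)
f = sp.log(sp.sin(theta)) + sp.cos(theta)
theta_star = sp.atan(sp.sqrt((sp.sqrt(5) + 1) / 2))         # claimed location of the maximum

print(float(theta_star))                                     # ≈ 0.904557
print(float(f.diff(theta).subs(theta, theta_star)))          # ≈ 0, so it is a stationary point
print(float(f.subs(theta, theta_star)))                      # ≈ 0.377428, the value at the maximum
print(float(f.diff(theta, 2).subs(theta, theta_star)) / 2)   # ≈ -1.118034, the quadratic coefficient

# Exact asymptotic slope of h_n, compared with the n = 512 entry of the table above
print(-0.5 * math.log((math.sqrt(5) + 1) / 2))               # ≈ -0.240606
print(-123.264 / 512)                                        # ≈ -0.240750
```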
In this article, we analyzed the problem of selecting word vector dimensions based on the Principle of Minimum Entropy and finally provided an approximate estimation formula. The calculations show that this estimation formula is consistent with our past "alchemy" experience in deep learning.