By 苏剑林 | September 27, 2021
In the previous article "Minimum Entropy Principle (VI): How to choose the dimension of word vectors?", we derived a formula for the word vector dimension, $n > 8.33 \log N$, from minimum entropy principles. Then, in "The Amazing Johnson-Lindenstrauss Lemma: Application Edition", we further pointed out that this result is consistent with the $\mathcal{O}(\log N)$ bound given by the JL lemma.
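For reference, one common quantitative form of the JL lemma (the Dasgupta–Gupta bound, quoted here only for comparison and not taken from either of the articles above) reads:

```latex
% JL lemma, Dasgupta–Gupta form: any N points can be mapped into R^n
% with all pairwise squared distances preserved up to a factor (1 ± ε),
% provided the target dimension satisfies
\[
n \;\ge\; \frac{4\ln N}{\varepsilon^2/2 - \varepsilon^3/3}
  \;=\; \mathcal{O}\!\left(\varepsilon^{-2}\log N\right).
\]
```

The point of comparison is only the $\log N$ scaling; the constant in front depends on the tolerated distortion $\varepsilon$, and pinning that constant down via entropy is exactly what $n > 8.33 \log N$ attempts.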
Since the result looks neat in theory, readers will naturally ask: how does it hold up in experiments? Is the coefficient 8.33 optimal? This article gives a brief summary of what is currently known about these questions.
Word Vectors
First, we can directly compute that when $N$ is 100,000, $8.33 \log N \approx 96$, and when $N$ is 5 million, $8.33 \log N \approx 128$ (here $\log$ denotes the natural logarithm). This indicates that, at least in order of magnitude, the formula agrees well with the dimensions we actually use: in the era of word vectors, the word vectors we trained ourselves were typically around 100 dimensions. Some readers might object that most open-source word vectors today are 300-dimensional, and the Embedding layer of BERT even reaches 768 dimensions—doesn't this clearly contradict the result?
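As a quick sanity check, the bound is easy to evaluate directly. The following is a minimal sketch (the vocabulary sizes are illustrative, not taken from any specific experiment):

```python
import math

def min_embedding_dim(N: int, coeff: float = 8.33) -> int:
    """Smallest integer dimension n satisfying n > coeff * ln(N)."""
    return math.floor(coeff * math.log(N)) + 1  # math.log is the natural log

for N in [10_000, 100_000, 500_000, 5_000_000]:
    print(f"N = {N:>9,}  ->  n >= {min_embedding_dim(N)}")
# N =    10,000  ->  n >= 77
# N =   100,000  ->  n >= 96
# N =   500,000  ->  n >= 110
# N = 5,000,000  ->  n >= 129
```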
In fact, while open-source word vectors such as FastText's are 300-dimensional, this does not rule out the possibility that 128 dimensions could achieve comparable results. As for BERT, it is not a word vector model per se, so its hidden dimension has no direct bearing on the choice of word vector dimension. Furthermore, ALBERT has already shown that applying a low-rank decomposition to the Embedding layer (reducing it to 128 dimensions) hardly changes the model's performance, so BERT's 768-dimensional Embedding is likely redundant to some extent.
Regarding the evaluation of word vectors, one can refer to the fairly comprehensive 2015 paper "How to Generate a Good Word Embedding?". It shows that the gains from increasing the word vector dimension become quite weak beyond 50 dimensions, which serves as one piece of evidence for $n > 8.33 \log N$.
Attention
Another indirect experimental proof for the formula $n > 8.33 \log N$ comes from the attention mechanism. In "The Amazing Johnson-Lindenstrauss Lemma: Application Edition", we analyzed that the calculation formula for the Attention matrix is mathematically equivalent to the Skip Gram model of word vectors. This means the formula $n > 8.33 \log N$ can also be applied to the choice of head_size in attention mechanisms.
In the attention mechanism, $N$ is the length of the sequence being processed. A common pre-training length is 512; substituting this gives $8.33 \log 512 \approx 52$, which is very close to the standard head_size of 64 used in current mainstream models. This indirectly supports the usability of $n > 8.33 \log N$. Conversely, if we accept the formula, it explains why a head_size of 64 suffices for the attention mechanism, and indirectly explains why the attention mechanism uses multiple small heads rather than a single large head.
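The same back-of-the-envelope calculation can be repeated for other sequence lengths (a sketch with illustrative lengths, compared against the common head_size of 64):

```python
import math

for seq_len in [128, 512, 1024]:
    bound = 8.33 * math.log(seq_len)  # N = sequence length, natural log
    print(f"seq_len = {seq_len:>4d}:  8.33 * ln(N) ≈ {bound:.1f}  (common head_size: 64)")
# seq_len =  128:  8.33 * ln(N) ≈ 40.4
# seq_len =  512:  8.33 * ln(N) ≈ 52.0
# seq_len = 1024:  8.33 * ln(N) ≈ 57.7
```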
Regarding the choice of head_size and expressive power in attention mechanisms, you can also refer to "On the Expressive Power of Self-Attention Matrices".
Graph Networks
If we treat each word as a node and the co-occurrence between words as an edge, then Skip Gram can also be viewed as a simple graph model. Therefore, the results regarding word vector dimension selection can theoretically be used for embedding dimension selection in graph networks.
Results in this area can be found in the paper "Graph Entropy Guided Node Embedding Dimension Selection for Graph Neural Networks". The paper considers both the feature entropy and the structural entropy of the graph. The feature entropy part is similar to Skip Gram and adopts the same approximation as used in "Minimum Entropy Principle (VI): How to choose the dimension of word vectors?". Thus, this part is essentially the formula $n > 8.33 \log N$.
After combining feature entropy and structural entropy, the resulting value is used as the embedding dimension for various graph tasks. The experiments show that this approach indeed leads to better dimension choices:
[Figure: experimental results of entropy-based dimension selection]
Summary
This article examined the usability of the previously derived dimension selection formula $n > 8.33 \log N$. Drawing on existing experimental results from word vectors, attention mechanisms, and graph networks, it shows that the formula yields reasonable dimension estimates. It also suggests that using entropy to further pin down the constant in front of $\log N$ in the JL lemma may be a viable approach.