How Did a Binarized Word Vector Model Get Involved with Fruit Flies?

By 苏剑林 | Feb 09, 2021

Fruit Fly (Image from Google Search)

Some readers may have noticed the recent ICLR 2021 paper 《Can a Fruit Fly Learn Word Embeddings?》. It presents a binarized word vector model that the authors describe as biomimetic, inspired by the olfactory circuit of the fruit fly. The algorithmic part of the paper is actually not hard to read; the main confusion after reading it is more likely to be "What does this have to do with fruit flies?" or "Were the authors really inspired by fruit flies?" In this article, let's trace the origins of the algorithm and try to answer how this word vector model got involved with fruit flies.

BioWord

The original paper did not give this word vector model a specific name. For convenience, I will take the liberty of calling it "BioWord." In general, the content of the paper consists of three parts:

1. Constructing a bag-of-words representation vector for each n-gram;
2. Applying the BioHash algorithm to these n-gram vectors to obtain so-called (binarized) static/dynamic word vectors;
3. "Struggling" to tell a story.

We will introduce BioHash later, but in short, it is an existing vector binarization algorithm. From these three points, it is clear that the word vector model itself does not have an obvious connection to biomimetics or fruit flies. Even if there is a connection, it should lie within the BioHash algorithm. However, since this paper is not the one proposing the BioHash algorithm, talking too much about biomimetic aspects seems a bit far-fetched. Let's explain these points in detail below.

First, constructing a bag-of-words representation vector for each n-gram is a very simple step. The result is a $2V$-dimensional binary vector (i.e., a 0/1 vector), where $V$ is the vocabulary size. As shown in the figure below, the first $V$ dimensions represent the context part, where a 1 indicates that the corresponding word appears in the context; the last $V$ dimensions are a one-hot vector representing the center word. The authors then made a slight adjustment to the BioHash algorithm (though they did not clearly explain why, and I cannot see how to interpret the change). After BioHash, each n-gram vector is mapped to a $K$-dimensional 0/1 vector, where each vector has a fixed number $k$ of ones.

Constructing bag-of-words representation for each n-gram
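
To make this construction concrete, here is a minimal sketch in NumPy (the function name and the toy indices are my own, not from the paper) of how such a $2V$-dimensional input vector can be assembled:

```python
import numpy as np

def ngram_to_input(center_id, context_ids, V):
    """Build the 2V-dimensional 0/1 input vector for one n-gram.

    The first V dimensions are a bag-of-words indicator of the context;
    the last V dimensions are a one-hot encoding of the center word.
    """
    vec = np.zeros(2 * V, dtype=np.int8)
    vec[list(context_ids)] = 1   # context part: 1 where a word occurs
    vec[V + center_id] = 1       # center word: one-hot
    return vec

# Toy example: vocabulary of 6 words, center word 3, context words {0, 5}.
V = 6
x = ngram_to_input(center_id=3, context_ids=[0, 5], V=V)
print(x)  # [1 0 0 0 0 1 0 0 0 1 0 0]  (context half, then center half)
```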

Finally, why do I say the authors "struggled" to tell a story? First, as mentioned, overemphasizing biomimetic aspects feels forced. Second, the authors compared BioWord with word vector models like Word2Vec and GloVe, and its performance was generally inferior to theirs. Third, as a word vector binarization algorithm, BioWord's performance has no clear advantage over existing methods like RandExp or LSH. If it were just that, it would be fine, but what is even more awkward is that the authors forced a comparison with BERT to highlight their "advantages."

The authors first introduced the notions of static and dynamic word vectors: if the first $V$ dimensions of the bag-of-words vector are all zeros, the mapped $K$-dimensional vector is the static word vector of that word; if the first $V$ dimensions depend on the context, the mapped $K$-dimensional vector is called the (BERT-like) dynamic word vector of that word. But this notion is hardly elegant. By this logic, even Word2Vec gives dynamic word vectors if you average the context word vectors and concatenate the result with the center word vector. It is little more than a talking point, and the experiments showed no advantage from it. Furthermore, even though BioWord is just a word vector model, the authors compared its training cost with BERT's to highlight an advantage, which is rather awkward. All of this shows the "pains" the authors took to tell the story.

BioHash

BioHash comes from the paper 《Bio-Inspired Hashing for Unsupervised Similarity Search》. It is a method for vector binarization. Unlike traditional LSH, it is sample-dependent and "tailor-made" for a specific dataset, which usually allows it to obtain better and sparser binary vectors.

Given a set of vectors $\{\boldsymbol{x}_i\}_{i=1}^N$, the BioHash algorithm is roughly as follows:

1. Use K-Means to cluster $\{\boldsymbol{x}_i\}_{i=1}^N$ into $K$ clusters, obtaining $K$ vector centers;
2. Map each $\boldsymbol{x}_i$ to a $K$-dimensional 0/1 vector, where the positions corresponding to the $k$ classes closest to $\boldsymbol{x}_i$ are set to 1, and the others to 0.

The reason I say "roughly" is that the actual algorithm differs from plain K-Means in a few details. First, the distance used during clustering is not Euclidean distance but the normalized inner product, i.e., $d(\boldsymbol{x}, \boldsymbol{w}) = -\langle \boldsymbol{x}, \boldsymbol{w} / \Vert\boldsymbol{w}\Vert\rangle$. We used the same idea before when exploring Capsules; readers can refer to 《Another New Year Feast: From K-Means to Capsule》. Second, the cluster centers are solved with SGD instead of the conventional EM algorithm. I don't quite understand this choice: while SGD can be run in mini-batches and is more memory-friendly, the EM algorithm can in principle also be executed in batches and should not run out of memory. Finally, a point I find completely incomprehensible is that the distance used during clustering is the normalized inner product $-\langle \boldsymbol{x}, \boldsymbol{w} / \Vert\boldsymbol{w}\Vert\rangle$, yet when deciding which classes each sample is assigned to (i.e., which positions of the hash are set to 1), the unnormalized inner product $-\langle \boldsymbol{x}, \boldsymbol{w}\rangle$ is used as the distance. This is truly bizarre.
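
Putting the pieces together, here is a rough sketch of what such a procedure might look like. This is my own simplified reconstruction, not the authors' exact training rule: it uses plain full-batch center updates instead of their SGD update, the normalized inner product as the clustering "distance", and the unnormalized inner product for the final top-$k$ activation; all names and parameter values are illustrative.

```python
import numpy as np

def biohash(X, K=64, k=4, iters=50):
    """Simplified BioHash-style binarization: cluster, then top-k assignment."""
    N, D = X.shape
    rng = np.random.default_rng(0)
    W = rng.normal(size=(K, D))                    # K cluster centers

    for _ in range(iters):
        Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
        # Clustering "distance" is the negative normalized inner product,
        # so the closest center maximizes <x, w/||w||>.
        assign = np.argmax(X @ Wn.T, axis=1)
        for j in range(K):
            members = X[assign == j]
            if len(members):
                W[j] = members.mean(axis=0)        # simplified full-batch update

    # Final hashing: the k largest *unnormalized* inner products are set to 1.
    scores = X @ W.T                               # shape (N, K)
    H = np.zeros((N, K), dtype=np.int8)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]
    np.put_along_axis(H, topk, 1, axis=1)
    return H

X = np.random.default_rng(1).normal(size=(1000, 32))
H = biohash(X)
print(H.shape, H.sum(axis=1)[:5])                  # each row has exactly k ones
```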

Of course, putting aside these details, the performance of BioHash is still very impressive, so if a suitable scenario arises, it is worth learning from and using. Attentive readers may notice that BioWord and BioHash share some authors; in fact, they come from the same lab, so it is not hard to understand why BioWord was built on top of BioHash. Still, applying BioHash to word vector construction is, in my opinion, not particularly compelling in terms of either motivation or reported results.

FlyHash

After all this, we still have not answered the question in the title: what exactly is the connection between BioWord and fruit flies? Or, which part of BioHash resembles a fruit fly? Tracing the references of BioHash reveals that it is actually an improvement on an algorithm named FlyHash, so to find the source we have to go back to FlyHash.

As the name suggests, FlyHash is a new vector binarization method inspired mainly by the fruit fly olfactory circuit. It is more efficient than conventional LSH. For readers unfamiliar with LSH, we will provide a brief introduction later. Here, let's introduce FlyHash directly. Actually, both FlyHash and LSH follow the idea of "random projection + binarization," but the fruit fly inspired a new direction for optimization: high dimensionality + low activation.

Specifically, let the original data be $\boldsymbol{x}_i \in \mathbb{R}^{D}$. FlyHash chooses a random binary matrix $\boldsymbol{W}\in\{0,1\}^{D\times K}$ (fixed once chosen), where generally $K > D$ (high dimensionality). After projection, $\boldsymbol{x}_i \boldsymbol{W}$ is a $K$-dimensional vector. Then a WTA (Winner Take All) operation achieves "low activation": the $k$ largest elements of $\boldsymbol{x}_i \boldsymbol{W}$ are set to 1 and the rest to 0. The result is a binary vector with $k$ ones and $K-k$ zeros, which serves as the hash vector of $\boldsymbol{x}_i$.
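
As a quick illustration, a FlyHash-style projection can be sketched in a few lines (the dense 0/1 projection matrix and the parameter values here are placeholders of my own choosing, not the paper's settings):

```python
import numpy as np

def flyhash(X, K=2048, k=32, seed=0):
    """FlyHash-style binarization: random 0/1 projection + winner-take-all."""
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    W = rng.integers(0, 2, size=(D, K))       # random binary projection, fixed
    scores = X @ W                            # expand to K dimensions (K > D)
    H = np.zeros_like(scores, dtype=np.int8)
    topk = np.argpartition(-scores, k, axis=1)[:, :k]
    np.put_along_axis(H, topk, 1, axis=1)     # keep only the k largest ("WTA")
    return H

X = np.random.default_rng(1).normal(size=(100, 128))   # 100 vectors, D = 128
H = flyhash(X)
print(H.shape, H.sum(axis=1)[:3])             # (100, 2048), each row has k ones
```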

Since only $k$ values are active, the storage and retrieval costs do not increase despite the dimensionality expansion, while the effectiveness improves. This is the benefit of the "high dimensionality + low activation" idea taken from the fruit fly. FlyHash was first published in the Science paper 《A neural algorithm for a fundamental computing problem》; for more details on the "high dimensionality + low activation" of the fruit fly olfactory circuit, refer to that paper. Later, the paper 《Improving Similarity Search with High-dimensional Locality-sensitive Hashing》 further refined the theoretical part. These two papers come from the same line of work.

So, now we can answer "how it got involved with fruit flies." Essentially, any algorithm that includes the "high dimensionality + low activation" idea in the binarization process can be said to be "Inspired by Fly." Since FlyHash uses random projection to get its final result, it must expand to a sufficiently high dimension to ensure effectiveness. BioHash is trained on specific datasets, so it often doesn't need as many dimensions as FlyHash, and its performance is even better. However, BioHash also clearly incorporates the "Winner Take All" idea, so it is also called "Inspired by Fly." And because BioWord uses BioHash, it also claims to be "Inspired by Fly."

LSH

Finally, let's briefly introduce LSH (Locality Sensitive Hashing) for readers who are not familiar with it. A full explanation of LSH could be a long discourse, so here I will mainly mention the part most closely related to FlyHash.

Simply put, LSH is an algorithm for vector binarization in which the binarized vectors approximately preserve the original metric. A common scheme is to (approximately) preserve the cosine similarity through random projection. From previous articles such as 《Angle Distribution of Two Random Vectors in n-Dimensional Space》 and 《Understanding Model Parameter Initialization Strategies from a Geometric Perspective》, we know that any two Gaussian random vectors in high-dimensional space are almost always orthogonal. So if we sample $DK$ random numbers from $\mathcal{N}(0,1/D)$ to form a matrix $\boldsymbol{W}\in\mathbb{R}^{D\times K}$, it is "almost" an orthogonal matrix. In this way, when two original vectors $\boldsymbol{x}_i, \boldsymbol{x}_j\in\mathbb{R}^{D}$ are multiplied by $\boldsymbol{W}$, they become two $K$-dimensional vectors whose angle remains approximately the same:

\begin{equation}\cos(\boldsymbol{x}_i \boldsymbol{W}, \boldsymbol{x}_j \boldsymbol{W})\approx \cos(\boldsymbol{x}_i, \boldsymbol{x}_j)\end{equation}

Thus, if the retrieval metric is the cosine similarity, we can use the projected $\boldsymbol{x} \boldsymbol{W}$ instead of the original vector $\boldsymbol{x}$ for approximate retrieval. Furthermore, the cosine similarity is basically preserved even after binarizing the results:

\begin{equation}\boldsymbol{x}\quad\to\quad\text{sgn}(\boldsymbol{x} \boldsymbol{W})\end{equation}

where $\text{sgn}$ is the sign function, changing values greater than 0 to 1 and values less than or equal to 0 to -1. This maps the original vector to a binary vector. Here, the target dimension $K$ is generally not larger than $D$. Due to the randomness of the projection, we can roughly assume that 1s and -1s are split half-and-half in the results. This is a clear difference from FlyHash's "high dimensionality + low activation," because whether you view 1 or -1 as the activation value, their numbers are roughly the same, which cannot be called "low activation."
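
Both claims are easy to check numerically. The following sketch is purely illustrative (the dimensions are arbitrary, and the $\cos(\pi \cdot \text{Hamming fraction})$ estimate is the standard SimHash-style argument, which I add here for the check rather than take from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 256, 128
W = rng.normal(0, 1 / np.sqrt(D), size=(D, K))  # entries ~ N(0, 1/D): "almost" orthogonal

x, y = rng.normal(size=D), rng.normal(size=D)
x += 0.5 * y                                    # make the pair somewhat correlated

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos(x, y), cos(x @ W, y @ W))             # cosine roughly preserved by the projection

hx, hy = np.sign(x @ W), np.sign(y @ W)         # binarize with the sign function
ham = np.mean(hx != hy)                         # fraction of differing signs
print(np.cos(np.pi * ham))                      # also roughly recovers cos(x, y)
```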

Although the above introduction is just a heuristic guide, it is actually supported by rigorous probability theory (theoretical analysis was also done in 《Performer: Linearizing Attention Complexity with Random Projections》). Therefore, LSH is a set of rigorous and quantifiable algorithms, not purely arbitrary approximations. After binarizing the vectors, we can treat them as a bag-of-words model (even if the originals were continuous vectors), and then build indexes for accelerated retrieval, such as inverted indexes. This is the significance of vector binarization. Vector retrieval libraries like Faiss all include LSH algorithms.
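
As a toy illustration of why binarization matters for retrieval, one can bucket vectors by their binary code and compare a query only against candidates in the same bucket. This is a deliberately simplistic single-table scheme with made-up sizes; practical systems use more refined strategies, such as multiple hash tables or probing nearby buckets.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
D, K, N = 64, 12, 10000
W = rng.normal(0, 1 / np.sqrt(D), size=(D, K))
data = rng.normal(size=(N, D))

codes = (data @ W > 0).astype(np.int8)          # K-bit sign-LSH codes
index = defaultdict(list)                       # bucket key -> list of vector ids
for i, c in enumerate(codes):
    index[c.tobytes()].append(i)

q = data[42] + 0.05 * rng.normal(size=D)        # a slightly perturbed query
q_code = (q @ W > 0).astype(np.int8)
candidates = index[q_code.tobytes()]            # only this bucket is scanned
print(len(candidates), 42 in candidates)        # far fewer than N; likely contains 42
```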

Summary

This article traced the origins of a binarized word vector model and explored how it connects to fruit flies. From this, we also learned about the ideas and connections between vector binarization methods like BioHash, FlyHash, and LSH. This article is also my first attempt at reverse-chronological writing; I hope you like it~