Minimum Entropy Principle (IV): "Birds of a Feather" — From Libraries to Word Vectors

By 苏剑林 | December 02, 2018

Reading from the first article to this point, we know that the so-called "Minimum Entropy Principle" is dedicated to reducing learning costs, attempting to complete the same task with the minimum possible cost. Therefore, this entire series can be seen as a "Laziness Guide." So, what is the secret to being lazy? The answer is "routines" (patterns), which is why this series is also called the "Manual of Routines."

In this article, we will introduce the "routines" found in libraries.

First, let me pose a question: When did word vectors first appear? Was it Mikolov's Word2Vec in 2013? Or Bengio's neural language model in 2003? Neither. In fact, word vectors can be traced back thousands of years to those ancient libraries...

[Image: a corner of a library]

Entering the Library

Are there word vectors in a library? From a thousand years ago? In which book? Should I go and borrow it?

The Routine of Placing Books

In fact, it's not about a specific book, but the routine of how books are placed.

Obviously, the arrangement of books in a library follows a "routine": they are not placed randomly but are categorized. For example, the mathematics section is in one area, literature in another, and computer science in yet another. Within the same category, there are many subcategories; for instance, in mathematics, mathematical analysis is in one sub-area, algebra in another, geometry in another, and so on. Have you ever wondered why they are categorized this way? What are the benefits of categorization? And what does this have to do with minimum entropy?

Some readers might think it's simple: isn't it just to make books easier to find? That answer is not entirely accurate. If the goal were only to find a single book, it would be enough to record the coordinates of every book in a database and mark those coordinates on the floor: to borrow a book, you would just look up its coordinates in the database and walk straight to it. This entire process would not require "book classification" at all. So classification cannot be explained well by the ease of finding a single book alone.

Borrowing Books Effortlessly

The core reason is: We usually do not borrow just one book.

As mentioned before, as long as an index is built, finding one book in a library is not difficult. The problem is: what if you need two? In general, a person's interests and research focus are relatively concentrated. Therefore, it is reasonable to assume that the two books you want to borrow are related. For instance, if you borrow a book on Neural Networks, the probability of borrowing a book on Deep Learning is quite high, but the probability of borrowing Dream of the Red Chamber is very low. With a database, you can find Neural Networks quickly. But what about Deep Learning? If it is nearby, you only need to walk a few more steps. If books were randomly scattered, you might have to walk from the southeast corner to the northwest corner to find Deep Learning. Borrow a few more books, and you'd have to run several laps around the library to gather them all.

Thus, the purpose of book classification becomes obvious. Classification puts similar books together. Since a person tends to borrow similar books at the same time, classification makes the process of finding and borrowing books more efficient for most people! This is another "Laziness Guide." In other words, organizing things by category and placing similar items together also satisfies the Minimum Entropy Principle. In daily life, we also categorize common items and place them within reach based on the same principle.

Library Planning

Now, let's examine this process more closely from a mathematical perspective.

Simplified Borrowing Model

Suppose we go to the library to borrow two books, denoted as $i, j$. Let the cost of finding the first book be $d(i)$, and the cost function between two books be $d(i, j)$. This means after finding the first book $i$, I have to spend $d(i, j)$ more effort to find the second book $j$. We can consider the average of this process for everyone, namely: \begin{equation}S = \sum_{i,j} p(i)p(j|i) [d(i)+d(i,j)] = \sum_{i,j} p(i,j) [d(i)+d(i,j)]\end{equation} where $p(i)$ is the probability that book $i$ is borrowed, and $p(j|i)$ is the probability of borrowing book $j$ after borrowing book $i$. To arrange books properly, the library must minimize $S$.

Now, taking the library entrance as the origin, let's establish a three-dimensional coordinate system. Every book's position can be represented by a vector $\boldsymbol{v}$. Without loss of generality, we can simply consider $d(i)$ as the Euclidean distance from the book to the library origin, and $d(i, j)$ as the Euclidean distance between two books. Then the expression for $S$ becomes: \begin{equation}S = \sum_{i,j} p(i,j) \left[\|\boldsymbol{v}_i\| + \|\boldsymbol{v}_i - \boldsymbol{v}_j\|\right] \label{eq:chengben}\end{equation}

Let's spell out the meaning of each term again. Here, $(i,j)$ represents a borrowing habit: borrowing book $j$ after book $i$. $p(i,j)$ is the probability of this habit occurring, which in practice can be estimated from library borrowing records. $\|\boldsymbol{v}_i\| + \|\boldsymbol{v}_i - \boldsymbol{v}_j\|$ is the total cost of borrowing $i$ and then $j$. Minimizing $\|\boldsymbol{v}_i\|$ means popular books should be placed close to the entrance (the origin); minimizing $\|\boldsymbol{v}_i - \boldsymbol{v}_j\|$ means related books should be placed together.
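To make this concrete, here is a minimal numpy sketch (with made-up positions and probabilities, purely for illustration) that evaluates the cost $S$ of $\eqref{eq:chengben}$ for a toy three-book layout.

```python
import numpy as np

# Toy layout: 3 books, positions relative to the entrance (the origin).
V = np.array([[1.0, 0.0, 0.0],   # book 0
              [1.5, 0.5, 0.0],   # book 1 (related to book 0, shelved nearby)
              [9.0, 8.0, 2.0]])  # book 2 (unrelated, shelved far away)

# Joint borrowing probabilities p(i, j) (made up here; estimated from records in practice).
P = np.array([[0.00, 0.45, 0.05],
              [0.35, 0.00, 0.05],
              [0.05, 0.05, 0.00]])

def layout_cost(V, P):
    """S = sum_{i,j} p(i,j) * (||v_i|| + ||v_i - v_j||)."""
    d_origin = np.linalg.norm(V, axis=1)                              # d(i)
    d_pair = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)   # d(i, j)
    return np.sum(P * (d_origin[:, None] + d_pair))

print(layout_cost(V, P))  # lower is better: a more convenient arrangement
```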

Constrained Optimization Planning

If we have the borrowing records, meaning $p(i,j)$ is known, can we obtain the "optimal book arrangement" by minimizing $\eqref{eq:chengben}$? The idea is right but incomplete, because the minimum value of equation $\eqref{eq:chengben}$ is obviously 0, attained by setting every $\boldsymbol{v}$ to 0, that is, squeezing all the books into a single spot at the entrance.

Obviously, this is impossible because books are not infinitely small. There is a minimum distance between books $d_{\min} > 0$. So the complete formulation should be: \begin{equation}\begin{aligned}S =& \min_{\boldsymbol{v}}\sum_{i,j} p(i,j) \left[\|\boldsymbol{v}_i\| + \|\boldsymbol{v}_i - \boldsymbol{v}_j\|\right] \\ &\text{s.t.}\quad\forall i\neq j,\, \|\boldsymbol{v}_i - \boldsymbol{v}_j\| \geq d_{\min} \end{aligned}\label{eq:chengben-2}\end{equation} In other words, this is a constrained optimization problem. Solving this would give us the most reasonable arrangement for the library (theoretically). Of course, for actual library planning, we would introduce more constraints like the shape of the library and the layout of corridors, but $\eqref{eq:chengben-2}$ is sufficient for understanding the fundamental idea.
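As a rough illustration of how $\eqref{eq:chengben-2}$ might be attacked numerically, the sketch below folds the minimum-spacing constraint into a quadratic penalty and hands the result to scipy.optimize.minimize. This is a generic penalty-method sketch on made-up data, not a recipe from the article itself.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, dim, d_min = 10, 3, 1.0                      # 10 books, 3-D positions, minimum spacing 1
P = rng.random((n, n))                          # fake borrowing probabilities p(i, j)
np.fill_diagonal(P, 0.0)
P /= P.sum()

def penalized_cost(flat_v, weight=100.0):
    V = flat_v.reshape(n, dim)
    d_origin = np.linalg.norm(V, axis=1)
    d_pair = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    cost = np.sum(P * (d_origin[:, None] + d_pair))
    # Quadratic penalty for pairs that violate ||v_i - v_j|| >= d_min (i != j).
    mask = ~np.eye(n, dtype=bool)
    violation = np.clip(d_min - d_pair[mask], 0.0, None)
    return cost + weight * np.sum(violation ** 2)

res = minimize(penalized_cost, rng.random(n * dim), method="L-BFGS-B")
V_opt = res.x.reshape(n, dim)                   # approximate shelf positions
```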

General Cost Minimization

Now, let's generalize this problem to gain deeper insights from a more abstract perspective.

Uniformization and De-constraining

First, we replace the cost function $\|\boldsymbol{v}_i\| + \|\boldsymbol{v}_i - \boldsymbol{v}_j\|$ with a general $f(\boldsymbol{v}_i, \boldsymbol{v}_j)$, considering: \begin{equation}S = \sum_{i,j} p(i,j) f(\boldsymbol{v}_i,\boldsymbol{v}_j)\label{eq:yibanchengben}\end{equation} Meanwhile, $\boldsymbol{v}$ is no longer restricted to a 3-dimensional vector but can be a general $n$-dimensional vector. We still want to minimize the cost, but we dislike constraints like $\|\boldsymbol{v}_i - \boldsymbol{v}_j\| \geq d_{\min}$ because constrained optimization is often difficult to solve. If we can reflect this constraint directly in the choice of $f$, we would have an elegant "de-constraining" solution.

How do we achieve this? Returning to the library problem: without any constraint, the theoretical optimum is to pile all the books at the entrance. To rule out this unreasonable solution, we imposed a minimum spacing $d_{\min} > 0$ to keep the solution from collapsing. But that is not the only reasonable constraint; for example, we could instead require the books to be spread as uniformly as possible throughout the library, and under this requirement we can also obtain a sensible solution.

"Uniform as possible" can be understood as a form of normalization constraint. Because it is normalized, it cannot all concentrate on one point, as a single point is not normalized. "Normalization" inspires us to think in the direction of probability—that is, to construct a probability distribution first and use it as a measure for the cost function. Without being too far-fetched, I will directly give one possible choice: \begin{equation}f(\boldsymbol{v}_i,\boldsymbol{v}_j)=-\log\frac{e^{-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2}}{Z_i},\quad Z_i = \sum_j e^{-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2}\label{eq:chengben-l2}\end{equation}

Minimum Entropy = Maximum Likelihood

Let's understand this formula step by step. First, if we ignore the denominator $Z_i$, we get: \begin{equation}-\log \left(e^{-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2}\right) =\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2\end{equation} In other words, this $f$ is essentially the cost function $\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2$. Next, because of the denominator, we have: \begin{equation}\sum_j\frac{e^{-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2}}{Z_i}=1\end{equation} So $e^{-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2}/Z_i$ actually defines a conditional probability distribution $q(j|i)$. Simply put, it is a softmax over $-\left\Vert\boldsymbol{v}_i-\boldsymbol{v}_j\right\Vert^2$, and $\eqref{eq:yibanchengben}$ then becomes: \begin{equation}S = -\sum_{i,j} p(i)p(j|i) \log q(j|i)\label{eq:gailvchengben}\end{equation} Minimizing this is exactly maximum likelihood! Consequently, $q(j|i)$ is pushed as close as possible to $p(j|i)$. As a result, setting every $\boldsymbol{v}$ to 0 is no longer necessarily optimal: all-zero vectors correspond to a uniform $q(j|i)$, whereas the true $p(j|i)$ is generally not uniform.
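The correspondence is easy to check numerically. The sketch below (toy helper functions, not from the article) builds $q(j|i)$ as a softmax over negative squared distances, exactly as in $\eqref{eq:chengben-l2}$, and evaluates $\eqref{eq:gailvchengben}$ as a cross-entropy.

```python
import numpy as np

def q_given_i(V):
    """q(j|i): a softmax over -||v_i - v_j||^2; row i is a distribution over j."""
    sq_dist = np.sum((V[:, None, :] - V[None, :, :]) ** 2, axis=-1)
    logits = -sq_dist
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    Q = np.exp(logits)
    return Q / Q.sum(axis=1, keepdims=True)

def entropy_cost(V, P):
    """S = -sum_{i,j} p(i,j) * log q(j|i): a cross-entropy, i.e. an entropy-style objective."""
    Q = q_given_i(V)
    return -np.sum(P * np.log(Q + 1e-12))

# Sanity check of the argument in the text: with all vectors at 0, q(j|i) is uniform;
# if the observed p(j|i) is far from uniform, spreading the vectors out can lower the cost.
```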

Now think back: we started from the idea of minimum cost, designed an $f(\boldsymbol{v}_i,\boldsymbol{v}_j)$ in the form of a negative log-probability, and found that the result is exactly maximum likelihood. This is both unexpected and expected, because $-\log q(j|i)$ is precisely an information (entropy) term: wanting maximum likelihood means minimizing formula $\eqref{eq:gailvchengben}$, and that is literally minimum entropy. Maximum likelihood and minimum entropy are, in fact, the same thing.

Word2Vec

By slightly changing the object of study, Word2Vec emerges, or even everything2vec~

Diverse Metrics

Purely from a formal standpoint, while the choice in $\eqref{eq:chengben-l2}$ is intuitive, it is not unique. Another viable choice is: \begin{equation}f(\boldsymbol{v}_i,\boldsymbol{v}_j)=-\log\frac{e^{\left\langle\boldsymbol{v}_i,\boldsymbol{v}_j\right\rangle}}{Z_i},\quad Z_i = \sum_j e^{\left\langle\boldsymbol{v}_i,\boldsymbol{v}_j\right\rangle}\label{eq:chengben-dot}\end{equation} This uses the inner product as the distance metric, desiring that similar objects have a larger inner product (smaller negative inner product cost).

Skip Gram

In fact, if $i, j$ represent words within a sentence window, then formula $\eqref{eq:chengben-dot}$ corresponds to the famous word vector model—the Skip Gram model of Word2Vec. That is to say, minimizing: \begin{equation}S = -\sum_{i,j} p(i,j) \log\frac{e^{\left\langle\boldsymbol{v}_i,\boldsymbol{v}_j\right\rangle}}{Z_i}\label{eq:word2vec}\end{equation} is exactly the optimization goal of the Skip Gram model in Word2Vec.

Note: Word2Vec actually distinguishes between the context vector and the center word vector, using two sets of word vectors. However, to intuitively understand the underlying idea, we do not emphasize that distinction here.
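For concreteness, here is a minimal numpy sketch of the objective $\eqref{eq:word2vec}$, using a single set of vectors and a full softmax; real Word2Vec implementations use two vector sets plus negative sampling or hierarchical softmax, so this is only the idealized loss.

```python
import numpy as np

def skipgram_loss(V, pairs):
    """Idealized full-softmax Skip-Gram loss: -mean_{(i,j)} log softmax_j(<v_i, v_j>).

    V     : (vocab_size, dim) array, one vector per word (a single set, for simplicity)
    pairs : list of (center, context) index pairs collected from sentence windows
    """
    scores = V @ V.T                                        # all inner products <v_i, v_j>
    scores -= scores.max(axis=1, keepdims=True)             # numerical stability
    log_q = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean([log_q[i, j] for i, j in pairs])

# Usage sketch: V = 0.01 * np.random.randn(vocab_size, dim), then minimize
# skipgram_loss over V with any gradient-based optimizer.
```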

Analogical Analysis of Principles

Wait, how did word vectors suddenly appear??

Let's retrace our steps: we treat each word as a book, and each sentence as a "borrowing record" for a person. This way, we know which two "books" are often borrowed together, right? Following our lengthy discussion on the optimal book arrangement for a library, we can find the optimal position for the "books." Theoretically, using $\eqref{eq:chengben-2}, \eqref{eq:chengben-l2}$, or $\eqref{eq:chengben-dot}$ would all work—these are word vectors! If we use formula $\eqref{eq:chengben-dot}$, it is Word2Vec.

Conversely, finding an optimal library arrangement becomes simple: treat every borrowing record as a sentence, every book as a word, set the word vector dimension to 3, and feed it into Word2Vec for training. The resulting word vectors are the optimal layout.
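A sketch of that recipe, assuming gensim 4.x is available and that the borrowing records have already been turned into lists of book IDs (all identifiers below are hypothetical):

```python
from gensim.models import Word2Vec

# Each "sentence" is one borrowing record: the list of book IDs borrowed together.
borrow_records = [
    ["neural_networks", "deep_learning", "pattern_recognition"],
    ["dream_of_the_red_chamber", "commentary_on_the_red_chamber"],
    ["neural_networks", "machine_learning"],
    # ... in practice, many thousands of records
]

# Skip-Gram (sg=1) with 3-dimensional vectors: the learned vectors can be read
# as 3-D "shelf coordinates" for each book.
model = Word2Vec(borrow_records, vector_size=3, sg=1, window=5, min_count=1, epochs=50)

print(model.wv["neural_networks"])               # the "shelf position" of this book
print(model.wv.most_similar("neural_networks"))  # books that should be shelved nearby
```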

Models like doc2vec, node2vec, and everything2vec are basically done this way.

So the answer to the opening question is clear: the 3D position of every book in a library is, in effect, a real-world "book embedding." Similar books sit at nearby positions, which is exactly the defining property of word vectors. Libraries have therefore come with embeddings ever since they existed, even though nobody wrote the coordinates down and computers did not yet exist.

Revisiting t-SNE

With "borrowing records," i.e., $p(j|i), p(i)$, we can follow the process above to get an "optimal positional plan." This is the process of vectorization.

But what if we don't have them?

SNE

Then we create them! For example, if we have a set of high-dimensional samples $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_N$ (like an image dataset) and we want to obtain a low-dimensional representation $\boldsymbol{z}_1, \boldsymbol{z}_2, \dots, \boldsymbol{z}_N$. We construct: \begin{equation}p(\boldsymbol{x}_j|\boldsymbol{x}_i)=\frac{e^{-\Vert \boldsymbol{x}_i-\boldsymbol{x}_j\Vert^2/2\sigma^2}}{\sum\limits_{j}^{j\neq i}e^{-\Vert \boldsymbol{x}_i-\boldsymbol{x}_j\Vert^2/2\sigma^2}}\label{eq:pij}\end{equation} Then we still use equation $\eqref{eq:chengben-l2}$ as the cost function (assuming $p(i)$ is a constant, i.e., a uniform distribution, and the summation excludes the point itself), and optimize $\eqref{eq:yibanchengben}$: \begin{equation}S=-\sum_{i,j}^{i\neq j}p(\boldsymbol{x}_j|\boldsymbol{x}_i)\log q(j|i),\quad q(j|i)=\frac{e^{-\left\Vert\boldsymbol{z}_i-\boldsymbol{z}_j\right\Vert^2}}{\sum\limits_{j}^{j\neq i}e^{-\left\Vert\boldsymbol{z}_i-\boldsymbol{z}_j\right\Vert^2}}\end{equation} This is the dimensionality reduction method known as SNE.
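Below is a compact numpy sketch of this objective. It uses a single fixed bandwidth $\sigma$ for all points, whereas the original SNE chooses a per-point $\sigma_i$ via a perplexity target, so treat it only as a simplified illustration.

```python
import numpy as np

def cond_probs(X, sigma=1.0):
    """Softmax over -||x_i - x_j||^2 / (2 sigma^2), excluding j == i; row i is a distribution over j."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # exclude the point itself
    logits -= logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def sne_loss(Z, P_high):
    """S = -sum_{i != j} p(x_j | x_i) log q(j|i), with q built from the low-dimensional Z."""
    Q = cond_probs(Z, sigma=np.sqrt(0.5))         # 2*sigma^2 = 1, i.e. exp(-||z_i - z_j||^2)
    return -np.sum(P_high * np.log(Q + 1e-12))

# Usage sketch: P_high = cond_probs(X, sigma); minimize sne_loss(Z, P_high) over Z
# with a gradient-based optimizer, starting from small random Z.
```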

There are a number of variants in practice, but we won't sweat the details, since they are not the focus; the basic idea is what matters. SNE is essentially a dimensionality-reduction scheme that tries to preserve relative distances, and because it preserves relative distances it keeps the overall shape of the data better than methods like PCA. PCA only retains principal components and is suitable mainly for relatively regular data (e.g., centrally clustered, roughly isotropic), whereas the philosophy of SNE can be applied to data of essentially any connected shape.

t-SNE

SNE already embodies the idea of dimensionality reduction, but it has some issues, chief among them the "crowding problem." Roughly speaking, the problem arises because the low-dimensional distribution in $\eqref{eq:chengben-l2}$ is also a negative exponential of distance, and negative exponentials decay to 0 very rapidly at long range. Since the low-dimensional points $\boldsymbol{z}$ are what we are solving for, the optimization tends to pack almost all of them together near one location (thanks to the exponential decay, points that end up far apart contribute almost nothing to the cost), which degrades the result.

To address this, we can replace formula $\eqref{eq:chengben-l2}$ with a function that decays more slowly, such as a simple rational function: \begin{equation}f(\boldsymbol{z}_i,\boldsymbol{z}_j)=-\log\frac{(1+\left\Vert\boldsymbol{z}_i-\boldsymbol{z}_j\right\Vert^2)^{-1}}{Z_i},\quad Z_i = \sum_{j}^{j\neq i} (1+\left\Vert\boldsymbol{z}_i-\boldsymbol{z}_j\right\Vert^2)^{-1}\label{eq:t}\end{equation} This is the kernel of a Student's t-distribution (with one degree of freedom), hence the "t." Combining formula $\eqref{eq:t}$ with $\eqref{eq:pij}$ and $\eqref{eq:yibanchengben}$ gives the dimensionality-reduction method known as t-SNE, which alleviates the crowding problem considerably compared with SNE.
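In practice one rarely implements t-SNE by hand; a typical usage sketch with scikit-learn's implementation (placeholder data, near-default settings) looks like this:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).random((500, 64))   # placeholder high-dimensional data

# 2-D embedding; perplexity controls the effective neighbourhood size used when
# building p(j|i) from the high-dimensional distances.
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)   # (500, 2)
```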

Of course, the difference between t-SNE and SNE is not the main point of this article. The point is to reveal the similarities between dimension reduction algorithms like t-SNE and Word2Vec.

Although we don't often use t-SNE as a primary dimensionality-reduction tool in deep learning (there are more elegant options for dimensionality reduction and clustering, such as "Mutual Information in Deep Learning: Unsupervised Feature Extraction" or "Variational Autoencoders (IV): A One-Step Clustering Scheme"), the essential idea of t-SNE shows up in many scenarios. Digging into its principles, savoring them, and connecting them with other knowledge to form your own system of understanding is well worth doing.

Summary of this Article

Based on the idea of minimum cost, this article constructed an idealized model to analyze the principles of book arrangement in a library. This led to a connection with the Minimum Entropy Principle, and we explored its links with Word2Vec and t-SNE. This creates yet more vivid examples of the Minimum Entropy Principle: categorization and "birds of a feather" both reduce costs. For example, we can now understand why pre-trained word vectors can speed up convergence in NLP tasks and sometimes improve the final result: word vectors place words in suitable positions in advance, and their very construction is designed to minimize cost.

At the same time, linking many seemingly unrelated things together can promote mutual understanding and achieve a sense of mastery—the beauty of which is self-evident~