By 苏剑林 | November 19, 2017
Currently, the principle behind most word vector models is that the distribution of a word's context reveals its semantics, much like the saying "show me who your friends are, and I'll tell you who you are." The core of a word vector model is therefore modeling the relationship between a word and its context. With the exception of GloVe, almost all word vector models attempt to model the conditional probability $P(w|\text{context})$; for example, Word2Vec's skip-gram model models the conditional probability $P(w_2|w_1)$. However, this quantity has some drawbacks. First, it is asymmetric: $P(w_2|w_1)$ does not necessarily equal $P(w_1|w_2)$, so when modeling we must distinguish context vectors from target vectors, and the two cannot live in the same vector space. Second, it is a bounded, normalized quantity, so we are forced to compress and normalize it with something like Softmax, which leads to optimization difficulties.
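To make the asymmetry and the normalization burden concrete, here is a minimal sketch, in plain NumPy rather than Word2Vec's actual implementation, of how skip-gram parameterizes $P(w_2|w_1)$: two separate embedding tables and a softmax over the entire vocabulary. The vocabulary size, dimension, and initialization below are arbitrary toy choices.

```python
import numpy as np

# Toy sizes chosen only for illustration.
vocab_size, dim = 10000, 100
rng = np.random.default_rng(0)
center_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))   # vectors for the center word w1
context_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))  # vectors for the candidate context word w2

def p_w2_given_w1(w1):
    """Skip-gram-style P(w2|w1): a softmax over the whole vocabulary (the costly normalization)."""
    scores = context_vecs @ center_vecs[w1]   # one score per candidate w2
    scores -= scores.max()                    # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = p_w2_given_w1(42)
print(probs.shape, probs.sum())  # (10000,) 1.0
```

Because `center_vecs` and `context_vecs` are two distinct tables, nothing ties $P(w_2|w_1)$ to $P(w_1|w_2)$, which is exactly the asymmetry noted above, and every evaluation pays for a normalization over the full vocabulary.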
In fact, in the world of NLP, there is a more symmetric quantity that matters more than the plain $P(w_2|w_1)$, namely
\begin{equation} \frac{P(w_1,w_2)}{P(w_1)P(w_2)}=\frac{P(w_2|w_1)}{P(w_2)} \label{eq:1} \end{equation}Roughly speaking, this quantity represents "how many times more likely two words are to meet in reality compared to meeting by chance." If it is much greater than 1, it indicates a tendency to appear together rather than by random combination; conversely, if it is much less than 1, it means the two avoid each other deliberately. This quantity carries significant weight in the NLP field. For now, let's call it "relevance." Of course, its logarithmic value is even more famous, known as Pointwise Mutual Information (PMI):
\begin{equation} \text{PMI}(w_1,w_2)=\log \frac{P(w_1,w_2)}{P(w_1)P(w_2)} \label{eq:2} \end{equation}On this theoretical footing, we believe that directly modeling relevance is more reasonable than modeling the conditional probability $P(w_2|w_1)$, and this article develops around that point of view. Before that, let us further demonstrate the beautiful properties of mutual information itself.
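As a concrete illustration of the two formulas above, here is a rough sketch that estimates the relevance ratio and PMI from raw co-occurrence counts. The toy corpus, the window size, and the absence of any smoothing are all simplifying assumptions made just for this example.

```python
import math
from collections import Counter

# Tiny made-up corpus, purely for illustration.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

window = 2
unigram, pair = Counter(), Counter()
for sent in corpus:
    unigram.update(sent)
    for i, w1 in enumerate(sent):
        for w2 in sent[i + 1 : i + 1 + window]:
            pair[frozenset((w1, w2))] += 1   # symmetric co-occurrence counts

n_words = sum(unigram.values())
n_pairs = sum(pair.values())

def relevance(w1, w2):
    """Estimate P(w1,w2) / (P(w1) P(w2)) from raw counts (no smoothing)."""
    p12 = pair[frozenset((w1, w2))] / n_pairs
    p1, p2 = unigram[w1] / n_words, unigram[w2] / n_words
    return p12 / (p1 * p2)

def pmi(w1, w2):
    return math.log(relevance(w1, w2))   # raises ValueError for pairs never seen together (count 0)

print(relevance("cat", "sat"), pmi("cat", "sat"))
```

A ratio well above 1 (positive PMI) marks words that co-occur more often than chance would predict; a ratio near 0 marks words that avoid each other, and unseen pairs make the raw estimate degenerate, a point the Matchmaker will return to below.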
Under the naive assumption, relevance (or equivalently, mutual information) has a very elegant decomposition property. The naive assumption here means treating features as independent, both marginally and conditionally: $P(a,b)=P(a)P(b)$ and likewise $P(a,b|c)=P(a|c)P(b|c)$, which lets us decompose joint probabilities and simplify the model.
For example, consider the mutual information between two sets of variables $Q$ and $A$, where $Q$ and $A$ are not single features but combinations of multiple features: $Q=(q_1, \dots, q_k)$ and $A=(a_1, \dots, a_l)$. Now consider their relevance, which is:
\begin{equation} \frac{P(Q,A)}{P(Q)P(A)} = \frac{P(q_1,\dots,q_k;a_1,\dots,a_l)}{P(q_1,\dots,q_k)P(a_1,\dots,a_l)} = \frac{P(q_1,\dots,q_k|a_1,\dots,a_l)}{P(q_1,\dots,q_k)} \label{eq:3} \end{equation}Using the naive assumption, we get:
\begin{equation} \frac{P(q_1,\dots,q_k|a_1,\dots,a_l)}{P(q_1,\dots,q_k)} = \frac{\prod_{i=1}^k P(q_i|a_1,\dots,a_l)}{\prod_{i=1}^k P(q_i)} \label{eq:4} \end{equation}Using Bayes' theorem, we get:
\begin{equation} \frac{\prod_{i=1}^k P(q_i|a_1,\dots,a_l)}{\prod_{i=1}^k P(q_i)} = \frac{\prod_{i=1}^k P(a_1,\dots,a_l|q_i)P(q_i)/P(a_1,\dots,a_l)}{\prod_{i=1}^k P(q_i)} = \prod_{i=1}^k \frac{P(a_1,\dots,a_l|q_i)}{P(a_1,\dots,a_l)} \label{eq:5} \end{equation}Applying the naive assumption once more, we obtain:
\begin{equation} \prod_{i=1}^k \frac{P(a_1,\dots,a_l|q_i)}{P(a_1,\dots,a_l)} = \prod_{i=1}^k \frac{\prod_{j=1}^l P(a_j|q_i)}{\prod_{j=1}^l P(a_j)} = \prod_{i=1}^k \prod_{j=1}^l \frac{P(q_i,a_j)}{P(q_i)P(a_j)} \label{eq:6} \end{equation}This shows that under the naive assumption, the relevance of two multivariate variables equals the product of the pairwise relevance of their single variables. If we take the logarithm of both sides, the result becomes even more striking:
\begin{equation} \text{PMI}(Q,A)=\sum_{i=1}^k \sum_{j=1}^l \text{PMI}(q_i,a_j) \label{eq:7} \end{equation}In other words, the mutual information between two multivariate variables is equal to the sum of the pairwise mutual information between their single component variables. Put differently, mutual information is additive!
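To see how this additivity can be put to work, here is a minimal sketch that scores the relevance between two multi-word items by summing pairwise PMI values, exactly as the formula above suggests under the naive assumption. The `pmi_table` numbers are invented for illustration; in practice they would be estimated from corpus counts, for example as sketched earlier.

```python
# Made-up pairwise PMI values, for illustration only.
pmi_table = {
    ("cat", "pet"): 2.1, ("cat", "food"): 1.3,
    ("dog", "pet"): 2.0, ("dog", "food"): 1.1,
}

def pmi_multi(Q, A, table):
    """PMI(Q, A) under the naive assumption: the sum of PMI(q_i, a_j) over all word pairs."""
    return sum(table.get((q, a), 0.0) for q in Q for a in A)

# Relevance between the multi-word "question" and "answer":
print(pmi_multi(["cat", "dog"], ["pet", "food"], pmi_table))  # 2.1 + 1.3 + 2.0 + 1.1 = 6.5
```

Missing pairs default to a PMI of 0, i.e. "no information either way", which is just one simple convention; the point is only that the multivariate score decomposes into pairwise terms.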
To help everyone more intuitively understand the principles of word vector modeling, let's imagine ourselves as the "Matchmaker" (Yue Lao) of the language world. Our goal is to determine the "affinity" (fate) between any two words, paving the way for each word to find its best partner~
As the saying goes, "those with affinity meet from a thousand miles away, while those without remain strangers even face to face." For every word, its best partner must be a word with high "affinity." What makes two words have "affinity"? Naturally, it is "you have me in your eyes, and I have you in mine." As mentioned earlier, the skip-gram model cares about the conditional probability $P(w_2|w_1)$, which results in "$w_1$ has $w_2$ in its eyes, but $w_2$ may not have $w_1$ in its eyes." That is to say, $w_2$ is more of a "playboy" in the word world, like stop words (for example "的" and "了" in Chinese, or "of" and "the" in English): they can mix with anyone but may not be sincere toward anyone. Therefore, for "you in me, and me in you," one must simultaneously consider $P(w_2|w_1)$ and $P(w_1|w_2)$, or consider a more symmetric quantity: the "relevance" we discussed earlier. So the "Matchmaker" decides to use relevance to quantitatively describe the "affinity" between two words.
Next, the "Matchmaker" begins his work, calculating the "affinity" between words one by one. As he calculates, he discovers serious problems.
First, there are simply too many pairs to get through. Keep in mind that the word world already contains tens or hundreds of thousands of words, and may well contain millions in the future. Calculating and recording the affinity of every pair would require a table with billions or even trillions of entries; the workload is so immense that the Matchmaker would be long retired before finishing. Yet, being responsible, we cannot ignore the possibility of any two words ending up together!
Second, the $N$ encounters between words recorded so far are but a drop in the ocean compared with the long river of history. Does the fact that two words have not met really mean they have no affinity? Just because they have not met yet does not mean they never will. A cautious Matchmaker cannot jump to such a hasty conclusion. The relationships between words are complex, so even if two words have never crossed paths, one cannot write them off entirely and must still estimate their affinity.