A More Distinctive Word Vector Model (III): Models Describing Correlation

By 苏剑林 | Nov 19, 2017

Geometric Word Vectors

Although the "matchmaker" analogy mentioned earlier is just a metaphor, the problems it faces are real. According to traditional NLP methods, we can count the co-occurrence frequency of any two words and the frequency of each word itself, and then calculate their relevance to obtain a "relevance matrix." However, as mentioned before, this co-occurrence matrix is far too large; it must be compressed and reduced in dimensionality. At the same time, data smoothing is required to give a reasonable estimate for the relevance of word pairs that have not appeared together.

Existing machine learning already offers some experience with reducing the dimensionality of huge matrices, for example SVD and pLSA. SVD applies to dimensionality reduction of an arbitrary matrix, while pLSA applies to the transition probability matrix $P(j|i)$. The ideas behind both are similar: decompose a large matrix $\mathbf{A}$ into the product of two small matrices, $\mathbf{A} \approx \mathbf{BC}$, where $\mathbf{B}$ has as many rows as $\mathbf{A}$, $\mathbf{C}$ has as many columns as $\mathbf{A}$, and both are much smaller than $\mathbf{A}$. With no constraints on $\mathbf{B}$ and $\mathbf{C}$, this is SVD; with nonnegativity and normalization constraints on $\mathbf{B}$ and $\mathbf{C}$ (so that their entries can be read as probabilities), this is pLSA.
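As a quick, hedged illustration of the $\mathbf{A} \approx \mathbf{BC}$ compression (a sketch using a random matrix as a stand-in for the far larger real one):

```python
import numpy as np

def truncated_svd_factor(A, n):
    """Factor A ≈ B @ C with B of shape (rows, n) and C of shape (n, cols)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    B = U[:, :n] * s[:n]   # absorb the top-n singular values into B
    C = Vt[:n, :]
    return B, C

A = np.random.rand(2000, 2000)          # stand-in for a much larger matrix
B, C = truncated_svd_factor(A, n=64)
print(B.shape, C.shape)                 # (2000, 64) (64, 2000)
```

Adding nonnegativity and normalization constraints to the factors would instead give a pLSA-style decomposition.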

For a correlation matrix, however, the situation is slightly different: it is nonnegative but not normalized, so we need to design a new compression scheme for it. Borrowing from the experience of matrix factorization, we can imagine placing all words in an $n$-dimensional space, representing each word by a vector in that space, and assuming that their correlation is some function of the inner product of their vectors (why the inner product? because matrix multiplication is itself essentially a large collection of inner products):

\begin{equation} \frac{P(w_i, w_j)}{P(w_i)P(w_j)} = f(\langle \mathbf{v}_i, \mathbf{v}_j \rangle) \label{eq:8} \end{equation}

Here the bold $\mathbf{v}_i, \mathbf{v}_j$ denote the word vectors corresponding to the words $w_i, w_j$. Geometrically, we are embedding the words into an $n$-dimensional space and representing each word by a point in that space.
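In code, the assumption \eqref{eq:8} just says that the modeled correlation of a word pair is read off from the inner product of their two learned vectors through some function $f$. A minimal sketch (the embedding matrix `E` is a placeholder with random values, and `f` is the still-undetermined function):

```python
import numpy as np

vocab_size, n = 10000, 128
E = 0.01 * np.random.randn(vocab_size, n)   # one n-dimensional vector per word (to be learned)

def modeled_correlation(i, j, f):
    """Model of P(wi, wj) / (P(wi) P(wj)) as f(<v_i, v_j>); f is the unknown function."""
    return f(np.dot(E[i], E[j]))
```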

Because geometry feels intuitive to us while semantics feels complex, ideally we would like semantic relationships to be reflected in geometric relationships. Below, we determine the as-yet-unknown function $f$ from the geometric properties we want. A similar approach was in fact taken in the GloVe paper, which was very enlightening, although the derivation in GloVe is not particularly elegant. Note that the viewpoint here is the reverse of the usual one: we determine the model from the properties we want it to have, rather than deriving properties from a given model.

Airport - Airplane + Train = Train Station

One of the most frequently mentioned characteristics of word vectors is "word analogy," such as the classic "King - Man + Woman = Queen" (whether this property is essential for word vectors is debatable, but it is at least a plus). Since Chinese and English contexts differ, however, this particular example is hard to reproduce on Chinese corpora. There is no shortage of such examples, though, and no need to stick to "Western" ones: on Chinese corpora it is easy to find "Airport - Airplane + Train = Train Station," or more precisely:

\begin{equation} \mathbf{v}(\text{airport}) - \mathbf{v}(\text{airplane}) + \mathbf{v}(\text{train}) = \mathbf{v}(\text{train station}) \label{eq:9} \end{equation}
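Operationally, checking such an analogy amounts to vector arithmetic followed by a nearest-neighbor search. Here is a small sketch, assuming an embedding matrix `E` (one row per word) and a `vocab` list are given; both names are placeholders of mine:

```python
import numpy as np

def analogy(a, b, c, E, vocab):
    """Return the word whose vector is closest (by cosine) to v(a) - v(b) + v(c)."""
    idx = {w: i for i, w in enumerate(vocab)}          # E, vocab assumed given (trained vectors + word list)
    target = E[idx[a]] - E[idx[b]] + E[idx[c]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-9)
    for w in (a, b, c):                                # exclude the query words themselves
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

# Ideally: analogy("airport", "airplane", "train", E, vocab) == "train station"
```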

Why do word vectors have this property? A recent paper, "Skip-Gram – Zipf + Uniform = Vector Additivity," gives a theoretical analysis of the phenomenon, deriving the additivity under some fairly strong assumptions. What we do next may seem rather bold: we take this characteristic directly as one of the defining properties of the word vector model!

That is, the additivity of word meanings is reflected directly as the additivity of word vectors, and this property is taken as the definition of the word vector model. We start from this property and work backward to pin down the function $f$ left undetermined in the previous section. In doing so, we not only obtain a reasonable basis for choosing $f$, but also "explain" the linear arithmetic of word vectors for free: it holds because it is part of the definition of the model, not a corollary of it.

Since this is a linear relation, we can rearrange the terms to get "Airport + Train = Train Station + Airplane." Now consider what this equation says purely from a semantic point of view. As mentioned at the beginning of this article, word vector models essentially assume that a word's meaning can be inferred from the distribution of its context. So "Airport + Train = Train Station + Airplane" is saying that the combined context of "Airport" and "Train" is essentially the same as the combined context of "Train Station" and "Airplane." To put it bluntly, in terms of the earlier metaphor: if two people's standards for choosing a partner are very close, then they must have a great deal in common. At this point, the form of $f$ is ready to emerge!
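Before moving on, it helps to write this rearranged identity in vector form, since the next section groups the words in exactly this way:

\begin{equation} \mathbf{v}(\text{airport}) + \mathbf{v}(\text{train}) = \mathbf{v}(\text{train station}) + \mathbf{v}(\text{airplane}) \end{equation}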

Model Form

Since the degree of relatedness between words is measured by the correlation, if "Airport + Train = Train Station + Airplane," then we should have:

\begin{equation} \frac{P(\text{airport}, \text{train}; w)}{P(\text{airport}, \text{train})P(w)} = \frac{P(\text{train station}, \text{airplane}; w)}{P(\text{train station}, \text{airplane})P(w)} \label{eq:10} \end{equation}

Here $w$ is any word in the context. Since we are not particularly concerned with word order but only with the average distribution of the context itself, we can use the naive assumption to simplify the above formula. According to formula (6), we get:

\begin{equation} \frac{P(\text{airport}, w)}{P(\text{airport})P(w)} \times \frac{P(\text{train}, w)}{P(\text{train})P(w)} = \frac{P(\text{train station}, w)}{P(\text{train station})P(w)} \times \frac{P(\text{airplane}, w)}{P(\text{airplane})P(w)} \label{eq:11} \end{equation}
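Spelling out this step: the exact statement of formula (6) is in the earlier part of this series, but the form needed here says that, under the naive independence assumption, the correlation of a word pair with a context word factorizes as

\begin{equation} \frac{P(w_1, w_2; w)}{P(w_1, w_2)P(w)} = \frac{P(w_1, w)}{P(w_1)P(w)} \times \frac{P(w_2, w)}{P(w_2)P(w)} \end{equation}

Applying this to both sides of \eqref{eq:10}, with $(w_1, w_2)$ being (airport, train) on the left and (train station, airplane) on the right, gives \eqref{eq:11}.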

Substituting the assumed formula \eqref{eq:8}, we get:

\begin{equation} f(\langle \mathbf{v}_{\text{airport}}, \mathbf{v}_w \rangle) f(\langle \mathbf{v}_{\text{train}}, \mathbf{v}_w \rangle) = f(\langle \mathbf{v}_{\text{airplane}}, \mathbf{v}_w \rangle) f(\langle \mathbf{v}_{\text{train station}}, \mathbf{v}_w \rangle) \label{eq:12} \end{equation}

Finally, using \eqref{eq:9} together with \eqref{eq:12} and the linearity of the inner product, we get:

\begin{align} f(\langle \mathbf{v}_{\text{airport}}, \mathbf{v}_w \rangle) f(\langle \mathbf{v}_{\text{train}}, \mathbf{v}_w \rangle) / f(\langle \mathbf{v}_{\text{airplane}}, \mathbf{v}_w \rangle) &= f(\langle \mathbf{v}_{\text{airport}} - \mathbf{v}_{\text{airplane}} + \mathbf{v}_{\text{train}}, \mathbf{v}_w \rangle) \nonumber \\ &= f(\langle \mathbf{v}_{\text{airport}}, \mathbf{v}_w \rangle + \langle \mathbf{v}_{\text{train}}, \mathbf{v}_w \rangle - \langle \mathbf{v}_{\text{airplane}}, \mathbf{v}_w \rangle) \label{eq:13} \end{align}

Since $\mathbf{v}_w$ is arbitrary, the above is equivalent to requiring that, for all $x, y, z$:

\begin{equation} f(x+y-z) = f(x)f(y)/f(z) \end{equation}

Assuming continuity, the general solution of this functional equation (the derivation can be found in standard mathematical analysis textbooks; a brief sketch is given below) is:

\begin{equation} f(x) = e^{\alpha x} \end{equation}
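For completeness, a brief sketch of that textbook argument: assuming $f > 0$ (the modeled correlation should be positive), set $g(x) = \ln f(x)$, so the functional equation becomes

\begin{equation} g(x+y-z) = g(x) + g(y) - g(z) \end{equation}

Taking $z = 0$ and writing $h(x) = g(x) - g(0)$ reduces this to Cauchy's equation $h(x+y) = h(x) + h(y)$, whose continuous solutions are $h(x) = \alpha x$. Hence $f(x) = C e^{\alpha x}$ with $C = f(0)$; the multiplicative constant only adds a global shift $\ln C$ to every inner product, so it is dropped (taken as 1) in what follows.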

That is, an exponential. Since a positive scale factor $\alpha$ can be absorbed into the vectors by rescaling them, we may take $\alpha = 1$. We thus have the following result: for the resulting word vectors to be additive, the correlation should be modeled exponentially:

\begin{equation} \frac{P(w_i, w_j)}{P(w_i)P(w_j)} = e^{\langle \mathbf{v}_i, \mathbf{v}_j \rangle} \label{eq:14} \end{equation}

Equivalently, taking logarithms, we are modeling the pointwise mutual information (PMI):

\begin{equation} PMI(w_i, w_j) = \langle \mathbf{v}_i, \mathbf{v}_j \rangle \label{eq:15} \end{equation}

At this point, the derivation of the model form is complete. Functionally, it resembles an SVD of the mutual information matrix.
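To make the "SVD of the PMI matrix" reading concrete, here is a hedged sketch in the spirit of the well-known PMI-plus-SVD construction (reusing the count helpers from the first sketch above; leaving never-co-occurring pairs at 0 is a crude smoothing choice of mine, not a prescription of the post):

```python
import numpy as np

def pmi_matrix(word_count, pair_count, total_words, total_pairs, vocab):
    """Dense PMI matrix over vocab; pairs that never co-occur are left at 0."""
    idx = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for (wi, wj), c in pair_count.items():
        if wi not in idx or wj not in idx:
            continue
        p_ij = c / total_pairs
        p_i = word_count[wi] / total_words
        p_j = word_count[wj] / total_words
        M[idx[wi], idx[wj]] = np.log(p_ij / (p_i * p_j))
    return M

def pmi_svd_vectors(M, n=128):
    """Word vectors whose inner products approximate the PMI matrix, as in eq. (15)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :n] * np.sqrt(s[:n])   # symmetric split of the singular values
```

This is only the "functional resemblance" mentioned above; the post arrives at the model through the additivity argument, not by actually performing an SVD.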

Forgetting Normalization

We have not divided by a normalization factor to enforce proper probability normalization, as is usually done in probabilistic models. The consequence is that, for the model in this article (and likewise for GloVe), nothing can be said about normalization; trying to do so leads to contradictory results.

In fact, this is a way of trading space for time. Because we do not divide by a normalization factor, yet still want the results to stay close to normalized, we must pre-compute and store the mutual information of all co-occurring pairs, which often requires a significant amount of memory. The payoff is that the co-occurring pairs are actually quite limited in number (the number of "word pairs" is always smaller than the number of sentences). So when you have a large-scale corpus and sufficient memory, training the GloVe model is often much faster than training the word2vec Skip-gram model.

Furthermore, since the model in this article and the word2vec Skip-gram model differ essentially only by a normalization factor, whether the derivations here can be carried over directly to the Skip-gram model clearly depends mainly on whether the Skip-gram model's normalization factor ends up close to 1 after training.


Reprint Address: https://kexue.fm/archives/4671