A More Elegant Word Vector Model (Part 4): Solving the Model

By 苏剑林 | November 19, 2017

Loss Function

Now, we define the loss function in order to solve for each word vector. Using $\tilde{P}$ to represent the estimated frequency of $P$, we can directly define the loss as the following expression:

\[\sum_{w_i,w_j}\left(\langle \boldsymbol{v}_i, \boldsymbol{v}_j\rangle-\log\frac{\tilde{P}(w_i,w_j)}{\tilde{P}(w_i)\tilde{P}(w_j)}\right)^2 \tag{16}\]
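For concreteness, here is a minimal sketch of loss (16) in numpy; the row-per-word layout of `V` and the precomputed PMI targets in `pairs` are illustrative assumptions, not part of the model itself:

```python
import numpy as np

def simpler_glove_loss(V, pairs):
    """Loss (16): sum over observed pairs of (<v_i, v_j> - PMI(w_i, w_j))^2.

    V     : (vocab_size, dim) array, one word vector per row (assumed layout).
    pairs : iterable of (i, j, pmi) with pmi = log P~(wi,wj) / (P~(wi) P~(wj)).
    """
    loss = 0.0
    for i, j, pmi in pairs:
        loss += (V[i] @ V[j] - pmi) ** 2
    return loss
```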

Compared to GloVe, this approach is simpler in both the number of parameters and the model form; therefore, we call it "Simpler GloVe." The GloVe model is defined as:

\[\sum_{w_i,w_j}\left(\langle \boldsymbol{v}_i, \boldsymbol{\hat{v}}_j\rangle+b_i+\hat{b}_j-\log X_{ij}\right)^2 \tag{17}\]

In the GloVe model, a distinction is made between center word vectors and context vectors, and the model suggests outputting the sum of these two sets of word vectors, which is said to yield better results. This is a somewhat strained trick, but not a major issue. The biggest problem is that the parameters $b_i, \hat{b}_j$ are also trainable, which makes the model severely ill-posed! Consider:

\begin{equation} \begin{aligned} &\sum_{w_i,w_j}\left(\langle \boldsymbol{v}_i, \boldsymbol{\hat{v}}_j\rangle+b_i+\hat{b}_j-\log X_{ij}\right)^2 \\ =&\sum_{w_i,w_j}\left[\langle \boldsymbol{v}_i+\boldsymbol{c}, \boldsymbol{\hat{v}}_j+\boldsymbol{c}\rangle+\Big(b_i-\langle \boldsymbol{v}_i, \boldsymbol{c}\rangle - \frac{\|\boldsymbol{c}\|^2}{2}\Big)\right. \\ &\qquad\qquad\qquad\qquad\left.+\Big(\hat{b}_j-\langle \boldsymbol{\hat{v}}_j, \boldsymbol{c}\rangle - \frac{\|\boldsymbol{c}\|^2}{2}\Big)-\log X_{ij}\right]^2 \end{aligned} \tag{18} \end{equation}

This means that if you have a set of solutions, adding an arbitrary constant vector to all word vectors results in another valid set of solutions! This is a serious problem because we cannot predict which specific solution will be obtained. If the added vector is a very large constant vector, various metrics (such as the cosine similarity between any two words) become meaningless as they will all approach 1. In fact, verification of word vectors generated by GloVe reveals that the magnitudes of vectors for stop words are much larger than those for ordinary words. This means that when a group of words is considered together, the influence of stop words is overly dominant, which is clearly detrimental to subsequent model optimization. (Although based on current experimental results regarding GloVe, I might just be a bit obsessive about this.)
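This invariance is easy to verify numerically. The following self-contained sketch (random vectors and a random stand-in for $\log X_{ij}$, purely for illustration) shifts every vector by the same constant $\boldsymbol{c}$ and adjusts the biases exactly as in (18); the loss does not change:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 50, 10
V, V_hat = rng.normal(size=(n, dim)), rng.normal(size=(n, dim))
b, b_hat = rng.normal(size=n), rng.normal(size=n)
logX = rng.normal(size=(n, n))          # random stand-in for log X_ij

def glove_loss(V, V_hat, b, b_hat):
    pred = V @ V_hat.T + b[:, None] + b_hat[None, :]
    return np.sum((pred - logX) ** 2)

# shift every vector by the same constant c and absorb the change into the biases,
# exactly as in (18)
c = rng.normal(size=dim) * 100
V2, V_hat2 = V + c, V_hat + c
b2 = b - V @ c - 0.5 * (c @ c)
b_hat2 = b_hat - V_hat @ c - 0.5 * (c @ c)

print(np.allclose(glove_loss(V, V_hat, b, b_hat),
                  glove_loss(V2, V_hat2, b2, b_hat2)))  # True
```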

Mutual Information Estimation

To solve the model, the first problem to address is how to compute $P(w_i,w_j), P(w_i), P(w_j)$. $P(w_i)$ and $P(w_j)$ are easy: they can be estimated directly by counting. But what about $P(w_i,w_j)$? What counts as a "co-occurrence" of two words? Different applications may use different schemes. For example, we could consider two words to have "met" if they appear in the same article; this scheme is usually helpful for topic classification, but its computational cost is too high. A more common scheme is to fix an integer, the window size, denoted $window$: for any given word, the $window$ words appearing before and after it are considered to have "met" that word.

A detail worth noting: should the co-occurrence of a center word with itself be counted? By the definition of a window (words whose distance from the center word does not exceed $window$), it should be. However, including it carries little predictive value, since this term is always present; excluding it, on the other hand, would lower the mutual information between a word and itself. We therefore adopt a small trick: simply do not count co-occurrences of identical words, and let the model learn this term by itself. That is, even if the center word appears again within its own context window (at a position other than the center itself), that pair is not included in the loss. Since the volume of data is far greater than the number of parameters, this term can usually be learned implicitly.
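As a concrete illustration, here is a minimal counting sketch under these conventions; the tokenized-sentence input format is an assumption made for illustration, not the input format of the released code:

```python
from collections import Counter

def cooccurrence_counts(corpus, window=5):
    """Count co-occurrences within +/- window positions, skipping identical word pairs.

    corpus : iterable of tokenized sentences (lists of word ids or strings).
    Returns (pair_counts, word_counts); the frequencies P~ are these counts
    divided by their respective totals.
    """
    pair_counts, word_counts = Counter(), Counter()
    for sentence in corpus:
        for k, w in enumerate(sentence):
            word_counts[w] += 1
            for u in sentence[max(0, k - window): k + window + 1]:
                if u != w:                     # the small trick: drop identical pairs
                    pair_counts[(w, u)] += 1
    return pair_counts, word_counts
```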

Weighting and Downsampling

The GloVe model defines the following weighting formula:

\[\lambda_{ij}=\Big(\min\{x_{ij}/x_{max}, 1\}\Big)^{\alpha} \tag{19}\]

Where $x_{ij}$ represents the co-occurrence frequency of the word pair $(w_i, w_j)$, and $x_{max}, \alpha$ are fixed constants, typically $x_{max}=100, \alpha=3/4$. This means word pairs with low co-occurrence frequency are down-weighted as they are more likely to be noise. Thus, the final GloVe loss is:

\[\sum_{w_i,w_j}\lambda_{ij}\left(\langle \boldsymbol{v}_i, \boldsymbol{\hat{v}}_j\rangle+b_i+\hat{b}_j-\log X_{ij}\right)^2 \tag{20}\]

In the model of this article, we keep this weighting, but with some choices left open. First, raising the frequency to the power $\alpha$ increases the relative weight of low-frequency items, which is basically consistent with the approach in word2vec. What deserves more thought is the min truncation. If the truncation is applied, the weight of high-frequency word pairs is significantly reduced, somewhat like the downsampling of high-frequency words in word2vec. This can improve the learning of low-frequency words, but a possible consequence is that the magnitudes of high-frequency word vectors are not learned well; we will see this in the section "The Meaning of Vector Length." In general, different scenarios have different requirements, so in the released source code we let the user decide whether to apply this truncation.
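A sketch of this weight with the truncation made optional might look as follows; the function and flag names are hypothetical, chosen only to illustrate the switch just described:

```python
def pair_weight(x_ij, x_max=100.0, alpha=0.75, truncate=True):
    """Weight (19): (min(x_ij / x_max, 1))^alpha, with the truncation made optional.

    truncate=True  caps high-frequency pairs at weight 1, as in GloVe;
    truncate=False lets the weight keep growing with x_ij, so frequent
    pairs retain more influence on the vector magnitudes.
    """
    ratio = x_ij / x_max
    if truncate:
        ratio = min(ratio, 1.0)
    return ratio ** alpha
```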

Adagrad

Like GloVe, we also use the Adagrad algorithm for optimization. The reason for using Adagrad is that it is perhaps the simplest algorithm with an adaptive learning rate currently available.

However, I noticed that the implementation of Adagrad in the GloVe source code is wrong!! I do not know whether that particular implementation was an intentional improvement or a slip of the pen (a slip of the pen seems unlikely?), but if I used their iteration procedure unchanged for the Simpler GloVe model of this article, it would easily produce all sorts of intractable NaNs; written as standard Adagrad, no NaNs appear.

For a selected word pair $w_i, w_j$, we obtain the loss:

\[L=\lambda_{ij}\left(\langle \boldsymbol{v}_i, \boldsymbol{v}_j\rangle-\log\frac{\tilde{P}(w_i,w_j)}{\tilde{P}(w_i)\tilde{P}(w_j)}\right)^2 \tag{21}\]

Its gradients (dropping the constant factor of 2, which only rescales the loss) are:

\begin{equation} \begin{aligned}\nabla_{\boldsymbol{v}_i} L=\lambda_{ij}\left(\langle \boldsymbol{v}_i, \boldsymbol{v}_j\rangle-\log\frac{\tilde{P}(w_i,w_j)}{\tilde{P}(w_i)\tilde{P}(w_j)}\right)\boldsymbol{v}_j\\ \nabla_{\boldsymbol{v}_j} L=\lambda_{ij}\left(\langle \boldsymbol{v}_i, \boldsymbol{v}_j\rangle-\log\frac{\tilde{P}(w_i,w_j)}{\tilde{P}(w_i)\tilde{P}(w_j)}\right)\boldsymbol{v}_i \end{aligned} \tag{22} \end{equation}

Then, simply update according to the Adagrad algorithm formula. The default initial learning rate is chosen as $\eta=0.1$. The iteration formula is:

\begin{equation} \left\{\begin{aligned}\boldsymbol{g}_{\gamma}^{(n)} =& \nabla_{\boldsymbol{v}_{\gamma}^{(n)}} L\\ \boldsymbol{G}_{\gamma}^{(n)} =& \boldsymbol{G}_{\gamma}^{(n-1)} + \boldsymbol{g}_{\gamma}^{(n)}\otimes \boldsymbol{g}_{\gamma}^{(n)}\\ \boldsymbol{v}_{\gamma}^{(n)} =& \boldsymbol{v}_{\gamma}^{(n-1)} - \frac{\boldsymbol{g}_{\gamma}^{(n)}}{\sqrt{\boldsymbol{G}_{\gamma}^{(n)}}}\eta \end{aligned}\right.,\,\gamma=i,j \tag{23} \end{equation}

From the formula, it can be seen that Adagrad is essentially insensitive to the scale of the loss: multiplying the loss by 10 makes almost no difference to the final optimization result. In ordinary stochastic gradient descent, by contrast, multiplying the loss by 10 is equivalent to multiplying the learning rate by 10.
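Putting (21)–(23) together, one stochastic update might look like the following minimal numpy sketch; the in-place array layout and the small `eps` (a standard safeguard in Adagrad implementations) are my assumptions rather than details of the released implementation:

```python
import numpy as np

def adagrad_pair_update(V, G, i, j, target, weight, eta=0.1, eps=1e-8):
    """One stochastic update for the pair (w_i, w_j), following (21)-(23).

    V      : (vocab, dim) word vectors, updated in place.
    G      : (vocab, dim) accumulated squared gradients, updated in place.
    target : log P~(wi,wj) / (P~(wi) P~(wj)), the PMI target in (21).
    weight : lambda_ij from (19).
    """
    err = weight * (V[i] @ V[j] - target)      # common factor of both gradients in (22)
    g_i, g_j = err * V[j], err * V[i]          # both gradients use the pre-update vectors
    G[i] += g_i ** 2                           # accumulate g (x) g, as in (23)
    G[j] += g_j ** 2
    V[i] -= eta * g_i / np.sqrt(G[i] + eps)    # standard Adagrad step
    V[j] -= eta * g_j / np.sqrt(G[j] + eps)
```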