By 苏剑林 | June 13, 2018
When mentioning Noise Contrastive Estimation, or "Negative Sampling," everyone likely thinks of Word2Vec immediately. In fact, its meaning goes far beyond that. Noise Contrastive Estimation (NCE) is a roundabout yet exceptionally exquisite technique. It allows us to estimate the parameters of a probability distribution when the normalization factor (also known as the partition function) is impossible to calculate directly. In this article, let us appreciate the subtle elegance of NCE's "winding path."
Note: Due to different starting points, the "Noise Contrastive Estimation" introduced in this article is actually more inclined towards the so-called "Negative Sampling" technique, but the two are essentially the same and will not be distinguished here.
The root of the problem lies in intractable exponential probability distributions.
In many problems, exponential family distributions appear. That is, for the probability $p(\boldsymbol{x})$ of a variable $\boldsymbol{x}$, we write it as: \begin{equation} p(\boldsymbol{x}) = \frac{e^{G(\boldsymbol{x})}}{Z} \label{eq:1} \end{equation} where $G(\boldsymbol{x})$ is a certain "energy" function of $\boldsymbol{x}$, and $Z=\sum_{\boldsymbol{x}} e^{G(\boldsymbol{x})}$ is the normalization constant, also called the partition function. This type of distribution is also known as the "Boltzmann distribution."
In machine learning, there are two main sources of exponential family distributions. The first source is softmax: when we make classification predictions, we usually use softmax as the final activation for the output of the fully connected layer; this is a discrete Boltzmann distribution on a finite number of points. The second source is the Principle of Maximum Entropy: when we introduce a feature and can already estimate its expected value, the maximum entropy model tells us its distribution should be in an exponential form of the features. (Refer to "Entropy is Unaffordable: From Entropy and the Principle of Maximum Entropy to Maximum Entropy Models (II)".)
Overall, exponential family distributions are a very practical class of distributions. They can be found in machine learning, mathematics, and physics. However, they possess a significant problem: they are not easy to calculate—specifically, the partition function is difficult to compute.
To be precise, there are two reasons why it is difficult to compute. The first is that the computation is simply too expensive, as in language models (including Word2Vec): since we need to predict the distribution of the current word from its context, computing the normalization factor requires summing over hundreds of thousands or even millions of terms (depending on the vocabulary size). In this case it is not that $Z$ cannot be computed, but that the cost is too high to bear. The second is that it cannot be computed at all. For example, suppose $p(x)=\frac{e^{-ax^2-bx^4}}{Z}$; then: \begin{equation} Z = \int e^{-ax^2-bx^4} dx \label{eq:2} \end{equation} This integral has no simple closed-form solution, let alone for more complicated functions. From this perspective we can perhaps appreciate why the Gaussian distribution is so commonly used: because, because, because... with almost any other distribution, the calculation just can't be carried out...
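To make this concrete, here is a minimal sketch (assuming `numpy`/`scipy` are available; the values $a=1$, $b=0.5$ are purely illustrative) that evaluates this $Z$ by numerical quadrature. In one dimension this works fine, but the same route is hopeless once $\boldsymbol{x}$ is high-dimensional:

```python
import numpy as np
from scipy.integrate import quad

a, b = 1.0, 0.5  # hypothetical parameter values, chosen only for illustration

# Z = integral of exp(-a*x^2 - b*x^4) over the whole real line, by numerical quadrature
Z, err = quad(lambda x: np.exp(-a * x**2 - b * x**4), -np.inf, np.inf)
print(f"Z ≈ {Z:.6f}  (estimated quadrature error ≈ {err:.1e})")
```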
In machine learning, if the goal is only classification or prediction, it doesn't matter whether the normalization factor is calculated because we only need to compare relative values to pick the maximum. However, before prediction, we face the problem of training, which is parameter estimation. Specifically, $G(\boldsymbol{x})$ actually contains some unknown parameters $\boldsymbol{\theta}$. To be precise, it should be written as $G(\boldsymbol{x};\boldsymbol{\theta})$, so the probability distribution is: \begin{equation} p(\boldsymbol{x})=\frac{e^{G(\boldsymbol{x};\boldsymbol{\theta})}}{Z(\boldsymbol{\theta})} \label{eq:3} \end{equation} We need to infer $\boldsymbol{\theta}$ from samples of $\boldsymbol{x}$. Usually, we use maximum likelihood, but without calculating $Z(\boldsymbol{\theta})$, we cannot calculate the likelihood function, making it impossible to proceed.
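As a toy illustration of this distinction (the energies below are made up), prediction gets by with the unnormalized $G$ alone, while the log-likelihood also needs $\log Z$:

```python
import numpy as np

# Hypothetical unnormalized energies G(x; theta) over four classes.
G = np.array([2.3, -0.7, 1.1, 0.4])

prediction = int(np.argmax(G))         # prediction: no partition function needed
log_Z = np.log(np.sum(np.exp(G)))      # training: the log-likelihood needs log Z
log_likelihood_class0 = G[0] - log_Z   # log p(class 0) under the model
print(prediction, log_likelihood_class0)
```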
Fortunately, NCE was born, and it successfully bypasses this difficulty. For cases where the partition function cannot be computed at all, it makes estimation possible; for cases where computing the partition function is simply too expensive, it provides a scheme to reduce the cost.
The idea of NCE is simple: compare real samples against a set of "noise samples," and learn the patterns of the real samples from that contrast.
Specifically, the energy is still the original $G(\boldsymbol{x};\boldsymbol{\theta})$, but this time we do not calculate the probability $p(\boldsymbol{x})$ directly because the normalization factor is hard to compute. Instead, we calculate: \begin{equation} p(1|\boldsymbol{x})=\sigma\Big(G(\boldsymbol{x};\boldsymbol{\theta})-\gamma\Big)=\frac{1}{1+e^{-G(\boldsymbol{x};\boldsymbol{\theta})+\gamma}} \label{eq:4} \end{equation} Here, $\boldsymbol{\theta}$ is still the original parameter to be optimized, while $\gamma$ is a newly introduced parameter to be optimized.
Then, the loss function for NCE becomes: \begin{equation} \mathop{\text{argmin}}_{\boldsymbol{\theta},\gamma} - \mathbb{E}_{\boldsymbol{x}\sim \tilde{p}(\boldsymbol{x})}\log p(1|\boldsymbol{x})- \mathbb{E}_{\boldsymbol{x}\sim U(\boldsymbol{x})}\log p(0|\boldsymbol{x}) \label{eq:5} \end{equation} where $\tilde{p}(\boldsymbol{x})$ is the real distribution of samples, and $U(\boldsymbol{x})$ is a "uniform" distribution or another fixed distribution that is easy to sample from.
In short, the approach of NCE is to transform it into a binary classification problem, classifying real samples as 1 and samples drawn from another distribution as 0.
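Here is a minimal sketch of this binary-classification loss in plain numpy (the names `energy_real`, `energy_noise`, and `gamma` are this sketch's own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(energy_real, energy_noise, gamma):
    """energy_real : G(x; theta) evaluated on samples from the data distribution.
    energy_noise: G(x; theta) evaluated on samples drawn from the noise distribution U.
    gamma       : the extra trainable scalar that ends up playing the role of log Z."""
    p1_real = sigmoid(energy_real - gamma)           # classifier should say "1" on real samples
    p0_noise = 1.0 - sigmoid(energy_noise - gamma)   # and "0" on noise samples
    return -np.mean(np.log(p1_real)) - np.mean(np.log(p0_noise))
```

In practice one would backpropagate this loss through $G(\boldsymbol{x};\boldsymbol{\theta})$ and $\gamma$ with an autodiff framework; the point is that only the unnormalized energy is ever evaluated.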
The question now is whether the $\boldsymbol{\theta}$ estimated from equation $\eqref{eq:5}$ is the same as the result of a direct maximum likelihood estimation from equation $\eqref{eq:3}$ (if it were theoretically possible).
The answer is: essentially, yes. We rewrite the loss in equation $\eqref{eq:5}$ as: \begin{equation} -\int \tilde{p}(\boldsymbol{x})\log p(1|\boldsymbol{x}) d\boldsymbol{x}- \int U(\boldsymbol{x})\log p(0|\boldsymbol{x})d\boldsymbol{x} \label{eq:6} \end{equation} Since $\tilde{p}(\boldsymbol{x})$ and $U(\boldsymbol{x})$ do not depend on the parameters $\boldsymbol{\theta}$ and $\gamma$, we may instead minimize the following form without affecting the result (it differs from equation $\eqref{eq:6}$ only by terms that contain neither $\boldsymbol{\theta}$ nor $\gamma$): \begin{equation} \begin{aligned}&\int \big(\tilde{p}(\boldsymbol{x})+U(\boldsymbol{x})\big) \left(\tilde{p}(1|\boldsymbol{x}) \log \frac{\tilde{p}(1|\boldsymbol{x})}{p(1|\boldsymbol{x})} + \tilde{p}(0|\boldsymbol{x})\log \frac{\tilde{p}(0|\boldsymbol{x})}{p(0|\boldsymbol{x})}\right)d\boldsymbol{x}\\ =&\int \big(\tilde{p}(\boldsymbol{x})+U(\boldsymbol{x})\big) KL\Big(\tilde{p}(y|\boldsymbol{x})\Big\Vert p(y|\boldsymbol{x})\Big) d\boldsymbol{x}\end{aligned} \label{eq:7} \end{equation} where \begin{equation} \tilde{p}(1|\boldsymbol{x})=\frac{\tilde{p}(\boldsymbol{x})}{\tilde{p}(\boldsymbol{x})+U(\boldsymbol{x})},\qquad \tilde{p}(0|\boldsymbol{x})=\frac{U(\boldsymbol{x})}{\tilde{p}(\boldsymbol{x})+U(\boldsymbol{x})} \label{eq:8} \end{equation} Equation $\eqref{eq:7}$ is an integral of KL divergences. Since KL divergence is non-negative, provided the hypothesized form of the distribution is expressive enough and the optimization is carried out fully, equation $\eqref{eq:7}$ reaches 0, which forces $\tilde{p}(y|\boldsymbol{x})= p(y|\boldsymbol{x})$, that is: \begin{equation} \frac{\tilde{p}(\boldsymbol{x})}{\tilde{p}(\boldsymbol{x})+U(\boldsymbol{x})}=\tilde{p}(1|\boldsymbol{x})=p(1|\boldsymbol{x})=\sigma\Big(G(\boldsymbol{x};\boldsymbol{\theta})-\gamma\Big) \label{eq:9} \end{equation} Solving for $\tilde{p}(\boldsymbol{x})$ gives: \begin{equation} \begin{aligned}\tilde{p}(\boldsymbol{x})=&\frac{p(1|\boldsymbol{x})}{p(0|\boldsymbol{x})}U(\boldsymbol{x})\\ =&\exp\Big\{G(\boldsymbol{x};\boldsymbol{\theta})-\gamma\Big\}U(\boldsymbol{x})\\ =&\exp\Big\{G(\boldsymbol{x};\boldsymbol{\theta})-\big(\gamma-\log U(\boldsymbol{x})\big)\Big\}\end{aligned} \label{eq:10} \end{equation} If $U(\boldsymbol{x})$ is chosen to be a uniform distribution, then $U(\boldsymbol{x})$ is just a constant. The net effect is therefore that $\gamma - \log U(\boldsymbol{x})$ plays the role of $\log Z$, while the distribution remains the original distribution $\eqref{eq:3}$ and $\boldsymbol{\theta}$ remains the original $\boldsymbol{\theta}$.
This demonstrates that NCE is an ingenious indirect route to optimizing equation $\eqref{eq:3}$: the detour turns out to be equivalent, while the cost of evaluating equation $\eqref{eq:5}$ drops dramatically, since it depends only on the number of samples drawn rather than on the size of the whole state space.
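The equivalence can also be checked numerically. Below is a small sketch (not from the original paper; the toy distribution, learning rate, and iteration count are arbitrary choices of this sketch) that minimizes the NCE objective with exact expectations on a finite state space and verifies that $e^{G-\gamma}U$ recovers the true distribution, with $\gamma - \log U$ landing on $\log Z$:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 10
true_logits = rng.normal(size=K)
p_true = np.exp(true_logits) / np.exp(true_logits).sum()   # the "real" distribution
u = np.full(K, 1.0 / K)                                    # uniform noise distribution U

theta = np.zeros(K)    # energies G(x; theta) = theta[x]
gamma = 0.0            # the extra parameter standing in for log Z
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(20000):
    s = sigmoid(theta - gamma)
    # exact gradient of  -E_p[log s] - E_u[log(1 - s)]  with respect to (theta - gamma)
    grad = -p_true * (1.0 - s) + u * s
    theta -= lr * grad
    gamma -= lr * (-grad.sum())

p_fitted = np.exp(theta - gamma) * u                 # exp{G - gamma} * U, as in the derivation
print(np.abs(p_fitted - p_true).max())               # ~0: the two distributions coincide
print(gamma - np.log(u[0]), np.log(np.exp(theta).sum()))   # gamma - log U  vs  log Z(theta): should agree
```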
Some topics related to NCE are collected here.
NCE was systematically proposed in the 2010 paper "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models". Since then, training of large-scale neural language models has largely adopted NCE or similar losses. The paper's title already reveals the key point of NCE: it is an "estimation principle" for "unnormalized models," designed precisely for scenarios where the normalization factor is difficult to compute.
But in fact, the idea of "Negative Sampling" had been used earlier. For example, at ICML 2008, Ronan Collobert and Jason Weston, in their paper "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", had already used the negative sampling method to train word vectors. You know, that was four or five years before Word2Vec was released! For the story of word vectors and language models, please refer to licstar's "Word Vectors and Language Models".
Based on the same need to reduce computational costs, Google's Word2Vec later also used negative sampling techniques. In many tasks, it even performed better than Huffman-based Softmax, especially in the "word analogy" experiments. We will analyze the wisdom behind this soon.
Now let us apply this to Word2Vec and analyze a few things. Taking the Skip-gram model as an example, the target Word2Vec fits is: \begin{equation} p(w_j|w_i)=\frac{e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}}{Z_i} \label{eq:11} \end{equation} where $\boldsymbol{u}_i, \boldsymbol{v}_j$ are the parameters to be optimized: two different sets of word vectors, one for the center word and one for the context word. The problem is obvious: the normalization factor is enormously expensive to compute. The remedies are Huffman Softmax and Negative Sampling. We are not concerned with Huffman Softmax here; it suffices to know that it is an approximation of the original, standard Softmax. Let us look at Negative Sampling. Word2Vec changes the optimization target to: \begin{equation} \mathop{\text{argmin}}_{\boldsymbol{u},\boldsymbol{v}} - \mathbb{E}_{w_j\sim \tilde{p}(w_j|w_i)}\log \sigma\Big(\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\Big) - \mathbb{E}_{w_j\sim \tilde{p}(w_j)}\log \Big[1-\sigma\Big(\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle\Big)\Big] \label{eq:12} \end{equation} This formula might look dizzying, but in short it expresses the idea: "skip-grams that appear in the corpus are treated as positive samples, and randomly sampled words are treated as negative samples."
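For intuition, here is a rough numpy sketch of the per-pair objective in the last equation (the function name and shapes are this sketch's assumptions, not the reference implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_loss(u_center, v_context, v_negatives):
    """u_center   : center-word vector u_i, shape (d,)
    v_context  : observed context-word vector v_j, shape (d,)
    v_negatives: vectors of k words sampled as noise, shape (k, d)"""
    positive = -np.log(sigmoid(u_center @ v_context))          # real skip-gram pair -> label 1
    negative = -np.log(1.0 - sigmoid(v_negatives @ u_center))  # sampled pairs -> label 0
    return positive + negative.sum()

# Example usage with random vectors (dimensions chosen arbitrarily):
rng = np.random.default_rng(1)
print(sgns_loss(rng.normal(size=8), rng.normal(size=8), rng.normal(size=(5, 8))))
```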
First and most obviously, compared to equations $\eqref{eq:4}$ and $\eqref{eq:5}$, equation $\eqref{eq:12}$ omits the introduction of $\gamma$ as a training parameter, essentially defaulting to $\gamma=0$. Is this allowed? It is said that someone has indeed conducted comparative experiments, and the results showed that the trained $\gamma$ does fluctuate around 0, so this default operation is basically reasonable.
Secondly, for the negative samples, Word2Vec does not sample every word uniformly, but samples each word in proportion to its frequency in the corpus. In this way, equation $\eqref{eq:10}$ becomes: \begin{equation} \tilde{p}(w_j|w_i)=\frac{p(1|w_i, w_j)}{p(0|w_i, w_j)}\tilde{p}(w_j)=e^{\langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle}\tilde{p}(w_j) \label{eq:13} \end{equation} That is to say, what is ultimately fitted is: \begin{equation} \log \frac{\tilde{p}(w_j|w_i)}{\tilde{p}(w_j)} = \langle \boldsymbol{u}_i, \boldsymbol{v}_j\rangle \label{eq:14} \end{equation} As everyone can see, the left-hand side is the (pointwise) mutual information of the two words! Originally, our fitting target was for the inner product of two word vectors to equal the (logarithm of the) conditional probability $\tilde{p}(w_j|w_i)$; now, through Word2Vec's Negative Sampling, the inner product of two word vectors becomes their mutual information.
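As a toy illustration of this fitted target (the co-occurrence counts below are invented), the quantity on the left-hand side can be computed directly from counts:

```python
import numpy as np

# Made-up co-occurrence counts: counts[i, j] = number of times word j appears
# in the context window of word i (purely illustrative numbers).
counts = np.array([[10., 2., 1.],
                   [ 2., 8., 4.],
                   [ 1., 4., 6.]])

p_joint = counts / counts.sum()
p_center = p_joint.sum(axis=1, keepdims=True)    # marginal of the center word w_i
p_context = p_joint.sum(axis=0, keepdims=True)   # marginal of the context word w_j

# log p(w_j | w_i) / p(w_j)  =  log p(w_i, w_j) / (p(w_i) p(w_j)), the value <u_i, v_j> approximates
pmi = np.log(p_joint / (p_center * p_context))
print(np.round(pmi, 3))
```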
Now we can roughly explain why Word2Vec's Negative Sampling works better than Huffman Softmax. Huffman Softmax is just an approximation of Softmax; it essentially still fits $\tilde{p}(w_j|w_i)$, while the Negative Sampling technique fits the mutual information $\log\frac{\tilde{p}(w_j|w_i)}{\tilde{p}(w_j)}$. We know that Word2Vec relies on co-occurrence to reflect word meaning, and mutual information can reflect the "true" co-occurrence relationship between words better than conditional probability $\tilde{p}(w_j|w_i)$. In other words, $\tilde{p}(w_j|w_i)$ might reflect a relationship like "I know Jay Chou, but Jay Chou doesn't know me," while mutual information reflects "You know me, and I know you," the latter being better at capturing semantic relationships.
In another word vector model I constructed previously, "A More Elegant Word Vector Model (III): Models Describing Correlation", it was also shown that models built on mutual information can theoretically explain many experimental results such as "word analogy." This indirectly confirms that "Skip-gram + Negative Sampling," being based on mutual information, is an excellent combination for Word2Vec. So, fundamentally, it is not a question of whether Huffman Softmax or Negative Sampling is inherently superior; rather, their optimization targets are simply different.
The purpose of this article was to introduce NCE, an exquisite parameter estimation technique, and to point out that it lets us estimate the parameters of a probability distribution when the normalization factor is intractable. In principle this is a universal method, and in some scenarios it is likely the only feasible one.
Finally, we used Word2Vec as a specific example for a simple analysis, discussed some detailed issues when using NCE, and coincidentally explained why Negative Sampling is good~
Related link: "Word Embedding Series Blog Part 2: Comparing Several Methods of Approximating Softmax in Language Modeling"