By 苏剑林 | January 07, 2021
Recently, I have ventured into a new area: performing text generation tasks based on the idea of discrete optimization. Simply put, we quantitatively write down the objective for the text we want to generate, construct a distribution from it, and then search for the maximum point of this distribution or sample from it. This process usually requires no training on labeled data. Since language is discrete, continuous optimization methods such as gradient descent are unavailable; and since this distribution usually does not have a form that is easy to sample from, direct sampling is infeasible as well. Therefore, specially designed sampling algorithms are required, such as Rejection Sampling, MCMC (Markov Chain Monte Carlo), MH sampling (Metropolis-Hastings sampling), Gibbs Sampling, and so on.
Some readers might find this familiar: doesn't it feel like returning to those head-spinning years of learning LDA (Latent Dirichlet Allocation)? That's right, the aforementioned sampling algorithms are also the essential foundation for understanding the LDA model. In this article, we will review these sampling algorithms, which will reappear in the text generation applications to be introduced later in this series.
In many cases, we need to generate a target text $\boldsymbol{x}$ based on some specific information $\boldsymbol{c}$; in mathematical terms, this is a conditional language model $p(\boldsymbol{x}|\boldsymbol{c})$. However, we often cannot obtain enough paired corpora $(\boldsymbol{x},\boldsymbol{c})$ to train such a conditional language model directly in a supervised manner. What we can do is train an unconditional language model $p(\boldsymbol{x})$ and manually design a metric that quantitatively describes the relationship between $\boldsymbol{x}$ and $\boldsymbol{c}$. How to perform conditional text generation based on the unconditional language model $p(\boldsymbol{x})$ and this relationship metric then becomes our object of study. We can call this "Constrained Text Generation."
For example, in sentence generation using keywords, $\boldsymbol{c}$ is the set of keywords. We can define an indicator function:
\begin{equation}\chi(\boldsymbol{x}, \boldsymbol{c})=\left\{\begin{aligned}&1,\,\,\text{if }\boldsymbol{x}\text{ contains the keyword set }\boldsymbol{c} \\ &0,\,\,\text{if }\boldsymbol{x}\text{ does not contain the keyword set }\boldsymbol{c}\end{aligned}\right. \end{equation}Then define
\begin{equation}\rho(\boldsymbol{x}, \boldsymbol{c}) = p(\boldsymbol{x})\chi(\boldsymbol{x}, \boldsymbol{c})\end{equation}$p(\boldsymbol{x})$ ensures the fluency of the generated sentence, and $\chi(\boldsymbol{x}, \boldsymbol{c})$ ensures that the generated sentence contains the required keywords. The problem then becomes the maximization operation $\mathop{\text{argmax}}\limits_{\boldsymbol{x}} \rho(\boldsymbol{x}, \boldsymbol{c})$ or the sampling operation $\boldsymbol{x}\sim \rho(\boldsymbol{x}, \boldsymbol{c})$. Of course, $\rho(\boldsymbol{x}, \boldsymbol{c})$ is not a probability distribution yet; it only becomes a true probability distribution after normalization:
\begin{equation}\frac{\rho(\boldsymbol{x}, \boldsymbol{c})}{\sum\limits_{\boldsymbol{x}}\rho(\boldsymbol{x}, \boldsymbol{c})} = \frac{p(\boldsymbol{x})\chi(\boldsymbol{x}, \boldsymbol{c})}{\sum\limits_{\boldsymbol{x}}p(\boldsymbol{x})\chi(\boldsymbol{x}, \boldsymbol{c})}\end{equation}But the denominator is usually difficult to calculate explicitly. This means that for the distribution to be sampled, we only know that it is proportional to some function $\rho(\boldsymbol{x}, \boldsymbol{c})$, but we do not know the exact expression of the distribution.
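To make this concrete, here is a minimal sketch of the unnormalized score for the keyword case. Everything here is a toy stand-in of my own: in particular, `lm_prob` plays the role of a trained unconditional language model $p(\boldsymbol{x})$, which we are assumed to already have.

```python
# Toy stand-in for a trained language model: a fixed table of sentence
# probabilities over a tiny candidate space. In practice p(x) would come
# from a real unconditional language model.
CANDIDATES = {
    "the cat sat on the mat": 0.40,
    "the dog sat on the mat": 0.35,
    "mat cat the": 0.05,
}

def lm_prob(x):
    return CANDIDATES.get(x, 1e-9)

def chi(x, c):
    """Indicator: 1 if sentence x contains every keyword in set c, else 0."""
    return 1.0 if all(kw in x.split() for kw in c) else 0.0

def rho(x, c):
    """Unnormalized score rho(x, c) = p(x) * chi(x, c): fluent sentences
    that contain all the keywords score high; everything else scores 0."""
    return lm_prob(x) * chi(x, c)

print(rho("the cat sat on the mat", {"cat", "mat"}))  # 0.4
print(rho("the dog sat on the mat", {"cat", "mat"}))  # 0.0, no "cat"
```

Note that `rho` can be evaluated for any single candidate sentence, but summing it over all possible sentences (the denominator above) is exactly the part we cannot do.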
Such examples are not rare; take text summarization. What is a summary? It is essentially a shorter text $\boldsymbol{x}$ that expresses as much of the meaning of the original text $\boldsymbol{c}$ as possible. In this case, we can define:
\begin{equation}\rho(\boldsymbol{x}, \boldsymbol{c}) = p(\boldsymbol{x})\cdot \text{sim}(\boldsymbol{x}, \boldsymbol{c})\cdot \chi(\boldsymbol{x}, \boldsymbol{c})\end{equation}Here, $\text{sim}(\boldsymbol{x}, \boldsymbol{c})$ is some text similarity function, and $\chi(\boldsymbol{x}, \boldsymbol{c})$ is an indicator function for length: it is 1 if the length of $\boldsymbol{x}$ falls within a certain range (which may depend on $\boldsymbol{c}$) and 0 otherwise. Once again we obtain an unnormalized probability distribution $\rho(\boldsymbol{x}, \boldsymbol{c})$ to be maximized or sampled from. Clearly, this objective says that we want a piece of text that is as semantically similar to the original as possible while satisfying certain length constraints; is that not exactly what a summary is? The core starting point of this approach, therefore, is to define clearly and quantitatively the target we want to generate, and only then proceed to the subsequent steps.
So, putting the previous background aside, the problem we face now is that we have a distribution $p(\boldsymbol{x})$ where we only know $p(\boldsymbol{x})\propto \rho(\boldsymbol{x})$, i.e.,
\begin{equation}p(\boldsymbol{x}) = \frac{\rho(\boldsymbol{x})}{\sum\limits_{\boldsymbol{x}} \rho(\boldsymbol{x})}\end{equation}where the denominator cannot be calculated explicitly. In this series of articles, $\boldsymbol{x}$ represents text—that is, a sequence of discrete elements—but the subsequent inferences also apply to scenarios where $\boldsymbol{x}$ is a continuous vector. Now we want to search for the maximum position $\mathop{\text{argmax}}\limits_{\boldsymbol{x}} p(\boldsymbol{x})$ or perform sampling $\boldsymbol{x}\sim p(\boldsymbol{x})$. As we will see later, searching for the maximum value can actually be seen as a special case of sampling, so we are mainly concerned with sampling methods.
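One standard way to make the "argmax as a special case of sampling" claim precise (my illustration, not from the original text) is via a temperature parameter: sampling from a distribution proportional to $\rho(\boldsymbol{x})^{1/T}$ reduces to ordinary sampling at $T=1$, and as $T\to 0$ the probability mass concentrates on the maximum point. A sketch over a tiny enumerable space:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized scores rho(x) over a tiny, fully enumerable space.
rho = np.array([1.0, 2.0, 5.0, 2.0])

def sample(rho, T=1.0, size=10000):
    """Draw indices from the distribution proportional to rho**(1/T)."""
    logits = np.log(rho) / T
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    return rng.choice(len(rho), size=size, p=p)

print(np.bincount(sample(rho, T=1.0), minlength=4) / 10000)
# ~ [0.1, 0.2, 0.5, 0.2], i.e. rho / rho.sum()
print(np.bincount(sample(rho, T=0.01), minlength=4) / 10000)
# ~ [0, 0, 1, 0], i.e. essentially the argmax
```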
As mentioned before, the reason we need to design special algorithms is that direct sampling from $p(\boldsymbol{x})$ is difficult, and we need to understand this difficulty to truly appreciate the key points of the sampling algorithms designed later. Where does the difficulty lie? If the candidate space of $\boldsymbol{x}$ were not too large, say even 1 million candidate values, we could compute every $p(\boldsymbol{x})$ and then perform ordinary categorical sampling. However, the candidate space of $\boldsymbol{x}$ is usually far larger than that. For example, if $\boldsymbol{x}$ has 10 components and each component has 10,000 choices (corresponding to the vocabulary size), then there are $10^{40}$ possible sequences in total. It is impossible to pre-calculate the probability of every sequence and sample accordingly.
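For contrast, here is what ordinary categorical sampling looks like when the space actually is small enough to enumerate (the scoring function is an arbitrary toy of mine, purely for illustration):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

vocab = ["a", "b", "c"]  # toy vocabulary; real ones have ~10^4 tokens
length = 4               # toy length; 3**4 = 81 sequences, easily enumerable

# Enumerate the entire space and score each sequence with some
# unnormalized rho (here an arbitrary toy score).
space = ["".join(s) for s in itertools.product(vocab, repeat=length)]
scores = np.array([1.0 + s.count("ab") for s in space])

# Normalize and sample categorically.
p = scores / scores.sum()
print(space[rng.choice(len(space), p=p)])

# With a 10^4-word vocabulary and length 10, this table would need
# 10^40 entries, which is exactly why enumeration is hopeless in general.
```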
What should be done then? As the saying goes, "a journey of a thousand miles begins with a single step." We have to take it one step at a time. That is to say, if I cannot directly achieve a "1 out of $10^{40}$" selection, can I do a "1 out of $10^4$" selection 10 times? This corresponds to the so-called "autoregressive generation":
\begin{equation}p(\boldsymbol{x})=p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) \cdots p(x_n|x_1,\cdots,x_{n-1}) = \prod_{t=1}^n p(x_t|\boldsymbol{x}_{< t})\end{equation}In this way, we can first sample an $x_1$ from $p(x_1)$, then sample an $x_2$ from $p(x_2|x_1)$, and so on recursively. However, autoregressive generation only corresponds to unconditional language models or supervised Seq2Seq models. If we want to add some constraints to the generation process of an unconditional language model, as in the examples mentioned earlier, the resulting model is no longer autoregressive and cannot be sampled using this recursive method.
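For the unconditional case, the recursive procedure is a short loop in code. Below, `next_token_probs` is a hypothetical stand-in for a trained autoregressive language model returning $p(x_t|\boldsymbol{x}_{<t})$; here it is just a toy lookup table keyed on the previous token.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["<eos>", "the", "cat", "sat"]

def next_token_probs(prefix):
    """Hypothetical stand-in for p(x_t | x_{<t}) from an autoregressive LM:
    a toy table that only looks at the previous token."""
    last = prefix[-1] if prefix else None
    table = {
        None:  [0.0, 0.7, 0.2, 0.1],
        "the": [0.0, 0.1, 0.6, 0.3],
        "cat": [0.1, 0.1, 0.1, 0.7],
        "sat": [0.7, 0.1, 0.1, 0.1],
    }
    return np.array(table[last])

def sample_sentence(max_len=10):
    """Ancestral sampling: draw x_1, then x_2 given x_1, and so on."""
    prefix = []
    for _ in range(max_len):
        p = next_token_probs(prefix)
        token = VOCAB[rng.choice(len(VOCAB), p=p)]
        if token == "<eos>":
            break
        prefix.append(token)
    return " ".join(prefix)

print(sample_sentence())  # e.g. "the cat sat"
```

Each step is a "1 out of $10^4$" choice (here, 1 out of 4), which is exactly the decomposition above.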
Therefore, we inevitably need the various sampling algorithms introduced later. They also use the "one step at a time" idea, but the distribution forms they use are more general.
In articles such as "Optimization from a Sampling Perspective: A Unified View of Differentiable and Non-differentiable Optimization" and "How to Partition a Validation Set Closer to the Test Set?", we introduced the concept of "Importance Sampling." That is, if we want to estimate the expectation $\mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[f(\boldsymbol{x})]$, but $p(\boldsymbol{x})$ is not a distribution that is easy to sample from, we can find a distribution $q(\boldsymbol{x})$ that is close to $p(\boldsymbol{x})$ and easy to sample from. Then, according to the following transformation:
\begin{equation} \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[f(\boldsymbol{x})] = \sum_{\boldsymbol{x}} p(\boldsymbol{x}) f(\boldsymbol{x}) = \sum_{\boldsymbol{x}} q(\boldsymbol{x})\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}f(\boldsymbol{x}) = \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\left[\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}f(\boldsymbol{x})\right] \end{equation}it is converted into sampling from $q(\boldsymbol{x})$ to calculate the expectation of $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}f(\boldsymbol{x})$, which means weighting each sample by $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$. Therefore, it is called "Importance Sampling." If we only know $p(\boldsymbol{x})\propto \rho(\boldsymbol{x})$, importance sampling can still be performed because
\begin{equation}1 = \sum_{\boldsymbol{x}} p(\boldsymbol{x}) = \sum_{\boldsymbol{x}} q(\boldsymbol{x})\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})} = \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\left[\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}\right]\end{equation}So
\begin{equation} \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\left[\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}f(\boldsymbol{x})\right] = \mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\left[\frac{p(\boldsymbol{x}) / q(\boldsymbol{x})}{\mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}[p(\boldsymbol{x}) / q(\boldsymbol{x})]}f(\boldsymbol{x})\right] \end{equation}As a result, we find that the above formula only depends on the relative value of $p(\boldsymbol{x})$, not its absolute value. Therefore, it is also okay to replace $p(\boldsymbol{x})$ with $\rho(\boldsymbol{x})$, which is proportional to it, ultimately simplifying to:
\begin{equation} \mathbb{E}_{\boldsymbol{x}\sim p(\boldsymbol{x})}[f(\boldsymbol{x})] \approx \frac{\sum\limits_{i=1}^N \rho(\boldsymbol{x}_i) / q(\boldsymbol{x}_i) \cdot f(\boldsymbol{x}_i)}{\sum\limits_{i=1}^N \rho(\boldsymbol{x}_i) / q(\boldsymbol{x}_i)},\quad \boldsymbol{x}_1,\cdots,\boldsymbol{x}_N\sim q(\boldsymbol{x})\end{equation}

Importance sampling converts expectations under a complex distribution into expectations under a simple one, but that is not our real goal here. What we want is to draw samples from the distribution $p(\boldsymbol{x})$ itself, not to estimate some expectation under it. The idea, however, is the same as in importance sampling: introduce a distribution $q(\boldsymbol{x})$ that is easy to sample from, and then randomly filter out some of its samples so that the remaining samples follow the distribution $p(\boldsymbol{x})$.
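Before spelling out this filtering procedure, here is a quick numerical sanity check of the self-normalized estimator above, a sketch of mine over a toy discrete space where the exact answer can be computed for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete space: target known only up to a constant (rho),
# plus an easy-to-sample proposal q and a function f of interest.
rho = np.array([1.0, 4.0, 3.0, 2.0])
q   = np.array([0.25, 0.25, 0.25, 0.25])
f   = np.array([0.0, 1.0, 2.0, 3.0])

# Exact E_{x~p}[f(x)] for reference (possible only because the space is tiny).
p = rho / rho.sum()
print("exact:   ", (p * f).sum())  # 1.6

# Self-normalized importance sampling: weight each draw by rho(x)/q(x),
# then normalize by the sum of the weights.
idx = rng.choice(4, size=100_000, p=q)
w = rho[idx] / q[idx]
print("estimate:", (w * f[idx]).sum() / w.sum())  # ~ 1.6
```

Now, back to the filtering idea.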
Specifically, assuming there is a function $\alpha(\boldsymbol{x})\in [0, 1]$, we carry out sampling according to the following process, known as "Rejection Sampling":
Rejection Sampling: Draw a sample $\boldsymbol{x}$ from $q(\boldsymbol{x})$ and a random number $\varepsilon$ from $U[0,1]$. If $\varepsilon \leq \alpha(\boldsymbol{x})$, accept the sample; otherwise, reject it and repeat the process.
So, what is the true probability distribution of the $\boldsymbol{x}$ obtained this way? It is not hard to work out: since the probability of sample $\boldsymbol{x}$ being retained is $\alpha(\boldsymbol{x})$, its relative probability is $q(\boldsymbol{x})\alpha(\boldsymbol{x})$, and we only need to re-normalize it:
\begin{equation}\frac{q(\boldsymbol{x})\alpha(\boldsymbol{x})}{\sum\limits_{\boldsymbol{x}} q(\boldsymbol{x})\alpha(\boldsymbol{x})}\end{equation}to get the true probability distribution corresponding to rejection sampling. From this form, it can also be seen that multiplying the acceptance rate by a number between 0 and 1 theoretically does not change the distribution corresponding to rejection sampling.
This process inspires us that rejection sampling allows us to sample from a distribution proportional to $q(\boldsymbol{x})\alpha(\boldsymbol{x})$. Then, according to $p(\boldsymbol{x})=q(\boldsymbol{x})\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$, we can let $\alpha(\boldsymbol{x})=\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$ be the acceptance probability to perform rejection sampling starting from $q(\boldsymbol{x})$. The result is equivalent to sampling from $p(\boldsymbol{x})$. Of course, it is not that simple. According to the normalization property of probability, unless $q(\boldsymbol{x})$ is identical to $p(\boldsymbol{x})$, $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$ cannot always be within $[0, 1]$. But this doesn't matter; as long as $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$ has an upper bound, we can choose a sufficiently large constant $M$ such that $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})\cdot M}\in [0, 1]$. At this point, using $\alpha(\boldsymbol{x})=\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})\cdot M}$ as the acceptance probability works. As we just said, multiplying by a constant will not affect the distribution corresponding to rejection sampling. In other words, this process also does not depend on a perfectly accurate $p(\boldsymbol{x})$; $p(\boldsymbol{x})$ can be replaced with $\rho(\boldsymbol{x})$, which is proportional to it.
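Putting the pieces together, here is a minimal sketch of rejection sampling on a toy discrete space, using $\alpha(\boldsymbol{x})=\frac{\rho(\boldsymbol{x})}{q(\boldsymbol{x})\cdot M}$ with the smallest valid $M$, so that $\max\limits_{\boldsymbol{x}}\alpha(\boldsymbol{x}) = 1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete space: unnormalized target rho and easy proposal q.
rho = np.array([1.0, 4.0, 3.0, 2.0])      # known only up to a constant
q   = np.array([0.25, 0.25, 0.25, 0.25])  # trivial to sample from

# Smallest valid M, so alpha(x) = rho(x) / (q(x) * M) lies in [0, 1]
# and the acceptance rate is as high as possible.
M = (rho / q).max()

def rejection_sample():
    while True:
        x = rng.choice(4, p=q)           # propose from q
        alpha = rho[x] / (q[x] * M)      # acceptance probability
        if rng.uniform() <= alpha:       # accept with probability alpha
            return x                     # otherwise reject and retry

samples = np.array([rejection_sample() for _ in range(20_000)])
print(np.bincount(samples, minlength=4) / len(samples))
# ~ rho / rho.sum() = [0.1, 0.4, 0.3, 0.2]
```

The choice of the smallest $M$ matters only for efficiency: any larger $M$ would yield the same distribution, just with more rejections.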
Regarding the acceptance rate $\alpha(\boldsymbol{x})$, although theoretically it is only required that $\alpha(\boldsymbol{x})\in[0, 1]$, in practice it is better to have $\max\limits_{\boldsymbol{x}}\alpha(\boldsymbol{x}) = 1$. This is because an excessively low acceptance rate will lead to too many rejections (almost every sample is rejected), resulting in low sampling efficiency and an excessive cost to generate one reasonable sample. Similarly, although theoretically the only requirements for $q(\boldsymbol{x})$ are that it is easy to sample from and that $\frac{p(\boldsymbol{x})}{q(\boldsymbol{x})}$ has an upper bound, in practice $q(\boldsymbol{x})$ and $p(\boldsymbol{x})$ should still be as similar as possible. Otherwise, it may still cause the acceptance rate to be too low, making the sampling cost unacceptably high. Therefore, although rejection sampling seems to provide a solution for sampling from almost any distribution $p(\boldsymbol{x})$, the design of the approximate distribution $q(\boldsymbol{x})$ remains a significant challenge in practical applications.
Starting from this article, we have embarked on a new series, attempting to complete certain text generation tasks (constrained text generation) from the perspective of discrete optimization: identify a quantitative target, then obtain the desired output by maximizing that target or sampling from it, without requiring labeled data to supervise the training of a new model. The main tools in this process are sampling algorithms. This article introduced the very basic Importance Sampling and Rejection Sampling. We will continue to develop this series in future articles, so stay tuned.