By 苏剑林 | June 16, 2020
In the decoding process of Seq2Seq, we generate tokens recursively one by one until the <eos> tag appears; this is the so-called "autoregressive" generation model. However, readers who have studied Seq2Seq may have noticed that this autoregressive decoding occasionally exhibits a phenomenon where it "simply cannot stop." The typical symptom is the repeated appearance of a certain fragment, such as "The weather today is great great great great great..." or "Don't you think what I said is right right right right right...", while the model stubbornly refuses to produce the <eos> tag. The ICML 2020 paper "Consistency of a Recurrent Language Model With Respect to Incomplete Decoding" systematically discusses this phenomenon and proposes some countermeasures. This article briefly introduces the main content of that paper.
Decoding Algorithms
For autoregressive models, we build a conditional language model as follows:
\begin{equation}p(y_t|y_{\lt t}, x)\label{eq:p}\end{equation}
The decoding algorithm then outputs the corresponding $y=(y_1,y_2,\dots,y_T)$ for a given $x$ based on the known model. Decoding algorithms can be roughly divided into two categories: Deterministic Decoding and Stochastic Decoding. The original paper discusses the "unending generation" problem for both categories, so we need to understand these two types of decoding algorithms.
Deterministic Decoding
Deterministic decoding algorithms mean that once the input text is fixed, the decoded output text is also fixed. This category includes Greedy Search and Beam Search. In fact, Greedy Search is a special case of Beam Search, so we only need to discuss Beam Search.
In Beam Search, we need to fix a "beam size" $k$ and decode token by token from left to right, keeping only the $k$ sequences with the highest total scores at each step. For example, if $k=2$ and the token space is $V=\{a,b,c,d\}$, the decoding process would look like this:
- First step: Calculate $p(y_1|y_0,x)$ ($y_0$ is the fixed starting tag <bos>), and keep the two highest-probability tokens, e.g., $\{a,b\}$, recording their scores (log-probabilities).
- Second step: Calculate $p(y_2|y_0,a,x)$ and $p(y_2|y_0,b,x)$. At this point, there are a total of $k\times |V|=8$ candidate sequences. Keep the two with the highest total scores (the current token score plus the scores of $a$ and $b$ themselves), e.g., $\{(a,c),(b,d)\}$, and record their respective total scores.
- Third step: Calculate $p(y_3|y_0,a,c,x)$ and $p(y_3|y_0,b,d,x)$. There are again $k\times |V|=8$ candidate sequences. Keep the two with the highest total scores (the current token score plus the scores of $(a,c)$ and $(b,d)$), e.g., $\{(a,c,d),(a,c,c)\}$, and record their total scores.
...And so on. Each sequence stops only when <eos> appears. Finally, the best one is selected from the $k$ completed sequences. There are generally two choices: one is to output the one with the highest total score, and the other is to output the one with the highest average score (divided by the number of tokens). Sometimes length penalties are added according to actual needs.
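To make the procedure concrete, here is a minimal NumPy sketch of beam search. The callable `step_probs(x, prefix)` is a hypothetical stand-in for the model's conditional distribution $p(y_t|y_{\lt t}, x)$, not something from the paper, and real implementations add length penalties, batching, and other details.

```python
import numpy as np

def beam_search(step_probs, x, k=2, bos=0, eos=1, max_len=50):
    """Minimal beam search sketch.

    step_probs(x, prefix) -> 1D array of next-token probabilities,
    a stand-in for the model p(y_t | y_<t, x).
    """
    beams = [([bos], 0.0)]            # (prefix, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            probs = step_probs(x, prefix)
            for token, p in enumerate(probs):
                candidates.append((prefix + [token], score + np.log(p + 1e-12)))
        # keep only the k candidate sequences with the highest total score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:k]:
            if prefix[-1] == eos:
                finished.append((prefix, score))   # this sequence has stopped
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)            # sequences that never produced <eos>
    # select by highest total score; dividing by len(prefix) instead gives
    # the highest-average-score variant mentioned above
    best, _ = max(finished, key=lambda c: c[1])
    return best
```

This variant moves a sequence out of the beam as soon as it emits <eos> and keeps expanding the rest, which is one common way to implement the rule that "each sequence stops only when <eos> appears."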
Stochastic Decoding
Stochastic decoding algorithms, as the name suggests, mean that even if the input text is fixed, the decoded output text is not fixed. For example, random sampling from a trained language model is such an algorithm (refer to "Now you can play with Chinese GPT2 using Keras"). For Seq2Seq, we often want deterministic results, so in most scenarios, we use Beam Search. However, the results of Beam Search can be overly homogeneous (i.e., "safe" responses like "Okay," "I don't know," "Thank you"), or sometimes we want to increase output diversity (such as the SimBERT model we open-sourced for generating similar sentences). In these cases, stochastic decoding algorithms are needed, which include three cases: Native Random Decoding, Top-k Sampling, and Nucleus Sampling.
Native Random Decoding is simple: at each step, a token is randomly sampled according to its probability. For example, the first step calculates $p(y_1|y_0,x)$ and samples a token, say $c$; then the second step calculates $p(y_2|y_0,c,x)$ and samples another token, say $a$; the third step calculates $p(y_3|y_0,c,a,x)$ and continues sampling; ...; this continues until <eos> is sampled.
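As a reference point for the truncated variants described next, here is a minimal sketch of native random decoding; `step_probs` is the same hypothetical stand-in for the model as in the beam search sketch above, and the optional `filter_fn` hook is where the top-k / top-p truncations below can be plugged in.

```python
import numpy as np

def random_decode(step_probs, x, bos=0, eos=1, max_len=50, filter_fn=None):
    """Ancestral sampling sketch: draw one token per step from p(y_t | y_<t, x).

    filter_fn (optional) truncates and renormalizes the distribution before
    sampling; with filter_fn=None this is native random decoding.
    """
    prefix = [bos]
    for _ in range(max_len):
        probs = step_probs(x, prefix)
        if filter_fn is not None:
            probs = filter_fn(probs)
        token = np.random.choice(len(probs), p=probs)  # sample one token
        prefix.append(token)
        if token == eos:
            break
    return prefix
```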
Top-k Sampling comes from the article "Hierarchical Neural Story Generation". It essentially adds a truncation to native random decoding: only the $k$ tokens with the highest probabilities are kept at each step, and then sampling is performed after re-normalization. This is done to strike a compromise between "high score" and "diversity." Obviously, when $k=1$, it is equivalent to Greedy Search.
Nucleus Sampling comes from the article "The Curious Case of Neural Text Degeneration". Similar to top-k sampling, it also truncates the sampling space. The truncation method is: fix $p\in(0, 1)$, and keep only the top tokens whose cumulative probability just exceeds $p$. Therefore, it is also called top-p sampling.
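Both truncation rules can be written as small filters applied to the next-token distribution before sampling. The sketch below assumes plain NumPy arrays of probabilities; passing one of these filters as `filter_fn` to the sampling loop above (e.g. `filter_fn=lambda q: top_p_filter(q, 0.9)`) gives top-k or Nucleus sampling.

```python
import numpy as np

def top_k_filter(probs, k=10):
    """Keep only the k most probable tokens, then renormalize."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]          # indices of the k largest probabilities
    out[top] = probs[top]
    return out / out.sum()

def top_p_filter(probs, p=0.9):
    """Nucleus (top-p) sampling: keep the smallest set of top tokens whose
    cumulative probability just reaches p, then renormalize."""
    order = np.argsort(probs)[::-1]       # tokens sorted by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # number of tokens kept
    out = np.zeros_like(probs)
    kept = order[:cutoff]
    out[kept] = probs[kept]
    return out / out.sum()
```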
In addition to top-k and top-p truncation, there are also adaptive truncation methods. For example, the paper "Sparse Sequence-to-Sequence Models" replaces the softmax of the final prediction distribution with a sparse variant of softmax, which automatically assigns exactly zero probability to most unlikely tokens, so there is no need to manually choose $k$ or $p$.
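For illustration, here is a sketch of sparsemax (Martins & Astudillo, 2016), the simplest of these sparse transformations; note that the Sparse Sequence-to-Sequence Models paper actually uses the more general entmax family, of which sparsemax is a special case.

```python
import numpy as np

def sparsemax(logits):
    """Sparsemax: Euclidean projection of the logits onto the probability
    simplex; most entries of the result are exactly 0."""
    logits = np.asarray(logits, dtype=float)
    z = np.sort(logits)[::-1]                 # logits in descending order
    cumsum = np.cumsum(z)
    ks = np.arange(1, len(z) + 1)
    support = ks[1 + ks * z > cumsum]         # all k with 1 + k*z_(k) > sum_{j<=k} z_(j)
    k = support[-1]                           # size of the support
    tau = (cumsum[k - 1] - 1.0) / k           # threshold
    return np.maximum(logits - tau, 0.0)      # probabilities, summing to 1
```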
Knowing When to Stop
From the model design of Seq2Seq and the decoding algorithms introduced above, there is no theoretical guarantee that the decoding process will definitely stop. That is to say, nothing ensures that the <eos> tag will necessarily appear; this can only be learned by the model itself. When the model hasn't learned well enough, the "unending generation" phenomenon occurs. The original paper analyzes different decoding algorithms and proposes corresponding strategies to make the decoding process "know when to stop."
Bounded Hidden Vectors
The classic way to model the probability $\eqref{eq:p}$ is:
\begin{equation}p(y_t|y_{\lt t}, x)=\text{softmax}(Wh_t+b),\quad h_t=f(y_{\lt t}, x)\end{equation}
In other words, a hidden vector $h_t$ is calculated first, followed by a fully connected layer and softmax activation. In this form, the original paper states that if for any $t$, $\Vert h_t\Vert$ is bounded, then native random decoding can "stop appropriately."
This sounds like a strong and practical conclusion, doesn't it? Making $\Vert h_t\Vert$ bounded is easy; adding Layer Norm, for example, does it. Does that mean adding Layer Norm solves everything? Not quite. The conclusion is theoretically correct, and the reasoning goes as follows: because $\Vert h_t\Vert$ is bounded, $p(y_t|y_{\lt t}, x)$ has a positive lower bound for every $t$ and every token (since $Wh_t$ cannot be arbitrarily large, $e^{Wh_t}$ cannot be arbitrarily large, and after normalization no probability can be arbitrarily close to 0). In particular, there exists $\epsilon > 0$ such that $p(\text{<eos>}|y_{\lt t}, x)\geq \epsilon$ always holds. Since this probability is bounded below by a positive constant, sampling for enough steps will eventually produce <eos>, so the process cannot stay stuck forever.
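To make the step from "positive lower bound" to "eventually stops" explicit: if $p(\text{<eos>}|y_{\lt t}, x)\geq \epsilon$ for every $t$, then under native random decoding
\begin{equation}P\big(\text{<eos> not sampled within the first } T \text{ steps}\big) \leq (1-\epsilon)^T \xrightarrow{\ T\to\infty\ } 0,\end{equation}
so the sampled sequence terminates with probability 1. But nothing forces $\epsilon$ to be large, and the expected number of steps before stopping can be on the order of $1/\epsilon$, which may be astronomically long.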
Does this reasoning process sound a bit ridiculous? Yes, it can stop, but you might have to sample a huge number of steps. It feels like saying "as long as you buy enough lottery tickets, you will eventually win the jackpot"—it doesn't have much definite practical value. After sampling enough steps, tokens that were meant to loop or repeat might have already repeated many times; even if it stops, the resulting output might be meaningless, perhaps even worse than simply truncating by length.
Actively Adding <eos>
Note that the above conclusion only holds for native random decoding. It does not necessarily hold for top-k or Nucleus sampling because, after truncation, <eos> might not appear in the sampling space. Of course, we can manually add <eos> into the sampling space, leading to the following conclusion:
If for any $t$, $\Vert h_t\Vert$ is bounded, and we also add <eos> to the sampling space, then top-k random decoding and Nucleus sampling can "stop appropriately."
However, this is somewhat like stating the obvious...
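In code, "actively adding <eos>" simply means putting the <eos> probability back into the truncated distribution before renormalizing; here is a sketch built on the top-k filter above (the same one-line change applies to the top-p filter).

```python
import numpy as np

def top_k_filter_with_eos(probs, k=10, eos=1):
    """Top-k truncation that always keeps <eos> in the sampling space."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]
    out[top] = probs[top]
    out[eos] = probs[eos]     # force <eos> back in even if it fell outside the top k
    return out / out.sum()
```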
Self-Truncation Design
Note that the two conclusions above can only be used for stochastic decoding. For deterministic decoding, since there is no randomness, we cannot guarantee hitting <eos>. To this end, the original paper proposes a self-truncation design: find a way to make $p(\text{<eos>}|y_{\lt t}, x)$ have a positive lower bound, and ensure this lower bound increases as $t$ increases, eventually tending toward 1.
This self-truncation design is not complex. It defines $p(\text{<eos>}|y_{\lt t}, x) = 1 - \alpha(h_t)$, where
\begin{equation}\begin{aligned}
\alpha(h_0)=&\,\sigma\left(w_{\text{<eos>}}^{\top} h_0 + b_{\text{<eos>}}\right)
\\
\alpha(h_t)=&\,\sigma\left(w_{\text{<eos>}}^{\top} h_t + b_{\text{<eos>}}\right)\left[1 - p(\text{<eos>}|y_{\lt {t-1}}, x)\right]
\end{aligned}\end{equation}
Here $\sigma(\cdot)$ is responsible for mapping $\mathbb{R}$ to $[0, 1-\epsilon]$; for example, one can use $\sigma(\cdot)=(1-\epsilon)\text{sigmoid}(\cdot)$. After designing $p(\text{<eos>}|y_{\lt t}, x)$, the remaining token probabilities are calculated using the original softmax method and then multiplied by $\alpha(h_t)$.
Now we have
\begin{equation}\begin{aligned}
p(\text{<eos>}|y_{\lt t}, x)=&\,1 - \sigma\left(w_{\text{<eos>}}^{\top} h_t + b_{\text{<eos>}}\right)\left[1 - p(\text{<eos>}|y_{\lt {t-1}}, x)\right]\\
=&\,1 - \prod_{i=0}^t\sigma\left(w_{\text{<eos>}}^{\top} h_i + b_{\text{<eos>}}\right)\\
\geq&\, 1 - (1 - \epsilon)^{t+1}
\end{aligned}\end{equation}
Clearly, once $t > -\ln 2/\ln (1-\epsilon)$, we have $p(\text{<eos>}|y_{\lt t}, x) > 0.5$; a probability greater than 0.5 is necessarily the largest one, so greedy search must stop within roughly $-\ln 2/\ln (1-\epsilon)$ steps. And since $p(\text{<eos>}|y_{\lt t}, x)$ tends to 1, Beam Search will clearly also stop within finitely many steps.
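Below is a minimal NumPy sketch of this self-truncating output layer, keeping the running product of $\sigma$ terms so that $p(\text{<eos>}|y_{\lt t}, x) = 1-\alpha(h_t)$. The class layout, and the choice of renormalizing the non-<eos> softmax before scaling it by $\alpha(h_t)$ so that the distribution sums to 1, are my own illustration rather than code from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SelfTruncatingHead:
    """Sketch of the self-truncating output layer:
    p(<eos> | y_<t) = 1 - alpha(h_t), with
    alpha(h_t) = sigma(w_eos . h_t + b_eos) * alpha(h_{t-1}),
    where sigma maps R into [0, 1 - eps]."""

    def __init__(self, W, b, w_eos, b_eos, eos=1, eps=1e-3):
        self.W, self.b = W, b                  # ordinary softmax parameters
        self.w_eos, self.b_eos = w_eos, b_eos  # parameters of the <eos> gate
        self.eos, self.eps = eos, eps
        self.alpha = 1.0                       # running product, reset per sequence

    def reset(self):
        self.alpha = 1.0

    def step(self, h_t):
        # sigma(.) = (1 - eps) * sigmoid(.) maps R into [0, 1 - eps]
        gate = (1.0 - self.eps) * sigmoid(self.w_eos @ h_t + self.b_eos)
        self.alpha *= gate                     # alpha(h_t) <= (1 - eps)^(t+1)
        logits = self.W @ h_t + self.b
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                   # ordinary softmax over all tokens
        probs[self.eos] = 0.0                  # non-<eos> tokens share alpha(h_t)...
        probs = probs / probs.sum() * self.alpha
        probs[self.eos] = 1.0 - self.alpha     # ...and <eos> gets 1 - alpha(h_t) >= 1 - (1-eps)^(t+1)
        return probs
```

For example, with $\epsilon=10^{-3}$ the bound $1-(1-\epsilon)^{t+1}$ exceeds $0.5$ once $t\approx 693$, so greedy search is guaranteed to emit <eos> within roughly 700 steps.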
Personal Evaluation
The main content of the original paper is roughly as described above. On the whole, it does offer some new perspectives on decoding algorithms and some effective strategies to alleviate the "unending generation" problem. However, for an ICML paper, I feel it does not aim very high and overall comes across as somewhat shallow.
Most of the paper is spent restating existing material in mathematical language: what decoding algorithms are, what top-k random decoding is, what Beam Search is, what "unending generation" is, and so on. Giving these things formal definitions is not meaningless, but it contributes little to the actual problem under discussion, and once that part is removed, not much content remains. Second, the conclusions are too weak. The strategies for stochastic decoding were critiqued above: they are correct but of essentially no practical value. As for the self-truncation design for deterministic decoding, it feels rather forced, more like a crude hard truncation than an elegant solution.
The most critical issue is that throughout the paper, regarding the "unending generation" problem, it answers "what it is" and "what to do," but it does not explore "why." It provides no useful information regarding the essence of why "unending generation" occurs, and thus fails to derive strategies that are closer to the essence of the problem. This is what I find quite difficult to accept.
Summary
This article introduced decoding algorithms for Seq2Seq, discussed the "unending generation" phenomenon that may occur during decoding, and introduced the strategies provided in an ICML 2020 paper to address it.