Playing with Keras: Automatic Title Generation via seq2seq

By 苏剑林 | September 01, 2018

Although I have claimed to be working in NLP for quite a while now, I had never actually run seq2seq, that classic masterpiece of NLP combined with deep learning. Recently my interest was piqued, so I decided to study and practice seq2seq, naturally ending with a Keras implementation.

There are many things seq2seq can do. I have chosen a relatively simple task: generating titles (in Chinese) based on article content, which can also be understood as a form of automatic summarization. I selected this task primarily because "article-title" corpus pairs are relatively easy to find, allowing for quick experimentation.

Introduction to seq2seq

The so-called seq2seq refers to general sequence-to-sequence transformation tasks, such as machine translation, automatic summarization, etc. The characteristic of such tasks is that the input sequence and the output sequence are not aligned. If they were aligned, we would call it sequence labeling, which is much simpler than seq2seq. Therefore, although sequence labeling tasks can also be understood as sequence-to-sequence transformations, we generally do not include sequence labeling when discussing seq2seq.

The key to implementing seq2seq yourself is understanding its principles and architecture. Once those are clear, implementation is not complicated regardless of the framework. Early on, there was a third-party Keras seq2seq library, but the author has since abandoned updates, perhaps feeling that such a simple thing no longer needed a dedicated library. Another useful reference is the article "A ten-minute introduction to sequence-to-sequence learning in Keras" written on the official Keras blog last year.

Basic Structure

Suppose the source sentence is $X=(a,b,c,d,e,f)$ and the target output is $Y=(P,Q,R,S,T)$. A basic seq2seq structure is shown in the figure below.

Basic seq2seq Architecture

Although the diagram has many lines and might look confusing, the structure is actually quite simple. The left side is the encoder for the input. It is responsible for encoding the input (which may be of variable length) into a fixed-size vector. Many models can be chosen for this, such as RNN structures like GRU or LSTM, or CNN+Pooling, or Google's pure Attention. Theoretically, this fixed-size vector contains all the information of the input sentence.

The decoder is responsible for decoding the vector we just encoded into our desired output. Unlike the encoder, we emphasize that the decoder is "unidirectionally recursive" because the decoding process is carried out recursively. The specific process is:

1. All output sequences start with a special <start> token and end with an <end> token; these two tokens are themselves treated as ordinary words/characters;

2. Input <start> into the decoder to get a hidden layer vector. Mix this vector with the output of the encoder, then feed it into a classifier. The output of the classifier should be $P$;

3. Input $P$ into the decoder to get a new hidden layer vector. Mix it again with the encoder output and feed it into the classifier. The classifier should output $Q$;

4. Continue this recursion until the classifier outputs the <end> token.

This is the decoding process of a basic seq2seq model. During decoding, the result of each step is fed into the next step until the <end> token is output.

Training Process

In fact, the diagram above also illustrates the general training process of seq2seq. Since we have labeled data pairs during training, we can preconfigure the input and output for each step of the decoder. Thus, the process actually involves "inputting $X$ and $Y[:-1]$ to predict $Y[1:]$," essentially shifting the target $Y$ by one position for training. This training method is called Teacher-Forcing.
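To make the shift concrete, here is a minimal numpy sketch of how the decoder input and target can be built from a batch of padded titles (the token ids are hypothetical):

```python
import numpy as np

# Hypothetical ids: 0 = padding/mask, 2 = <start>, 3 = <end>.
# Y holds padded titles already wrapped as <start> ... <end>.
Y = np.array([[2, 15, 27, 8, 3, 0],
              [2, 41,  9, 3, 0, 0]])

decoder_input  = Y[:, :-1]   # Y[:-1]: what the decoder reads at each step
decoder_target = Y[:, 1:]    # Y[1:]:  what the classifier must predict

# Training then amounts to: feed (X, decoder_input) to the model and fit it
# against decoder_target, i.e. "input X and Y[:-1], predict Y[1:]".
```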

The decoder can also use GRU, LSTM, or CNN structures. However, it is crucial to emphasize that this ability to "see the future" exists only during training, because the full target sequence is given; it does not exist during the inference (prediction) phase. Therefore, the decoder must not use inputs from future steps when executing each step. Consequently, if an RNN structure is used, it is generally a unidirectional RNN; if CNNs or pure Attention are used, the future positions must be masked out (for convolution, this means multiplying the kernel by a $0/1$ matrix so that the convolution can only read the current and "leftward" inputs; for Attention, a similar mask is applied to the attention weights so that each position can only attend to itself and earlier positions).
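For the CNN/pure-Attention case, the mask is essentially a lower-triangular $0/1$ matrix applied to the score matrix before the softmax. A minimal numpy sketch of the idea, not tied to any particular framework:

```python
import numpy as np

def apply_causal_mask(scores):
    """scores[i, j] is the raw attention score of output position i on
    input position j; every j > i is pushed to -1e9 before the softmax,
    so each step only reads the current and earlier positions."""
    seq_len = scores.shape[-1]
    allowed = np.tril(np.ones((seq_len, seq_len)))   # lower-triangular 0/1 matrix
    return np.where(allowed > 0, scores, -1e9)

masked = apply_causal_mask(np.random.randn(5, 5))    # row i keeps columns 0..i
```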

Attentive readers might notice that this training scheme is "local" rather than truly end-to-end. For example, when predicting $R$ we assume $Q$ is already known, i.e. that $Q$ was predicted correctly in the previous step, but this is not guaranteed during inference. If an earlier step goes wrong, the error can set off a chain reaction that renders the subsequent training or prediction steps meaningless.

Some scholars have considered this issue. For instance, the paper "Sequence-to-Sequence Learning as Beam-Search Optimization" incorporates the entire decoding search process into the training process, and uses pure gradient descent (without reinforcement learning). This is a very worthy reference. However, the computational cost of local training is lower, and in most cases, we only use local training for seq2seq.

Beam Search

We have mentioned the decoding process several times, but it is not yet complete. In fact, for seq2seq, we are modeling: $$p(Y|X)=p(Y_1|X)p(Y_2|X,Y_1)p(Y_3|X,Y_1,Y_2)p(Y_4|X,Y_1,Y_2,Y_3)p(Y_5|X,Y_1,Y_2,Y_3,Y_4)$$ Clearly, during decoding, we hope to find the $Y$ with the maximum probability. How do we do that?

If at the first step $p(Y_1|X)$, we directly choose the one with the highest probability (which we expect to be target $P$), then substitute it into the second step $p(Y_2|X,Y_1)$, and again choose the highest probability $Y_2$, and so on—selecting the current maximum probability output at each step—this is called greedy search. It is the lowest-cost decoding scheme. However, note that this scheme may not yield the optimal result. Suppose we didn't choose the $Y_1$ with the highest probability in the first step; substituting a different $Y_1$ into the second step might yield a very large conditional probability $p(Y_2|X,Y_1)$, such that the product of the two exceeds the result of the greedy algorithm.

However, if we were to enumerate all paths to find the optimum, the computational load would be unacceptably large (this is not a Markov process, so dynamic programming cannot be used). Therefore, seq2seq uses a compromise method: beam search.

This algorithm is similar to dynamic programming, but it is simpler even in cases where dynamic programming is applicable. Its idea is: at each calculation step, only keep the topk candidate results with the highest current probabilities. For example, if topk=3, in the first step, we only keep the three $Y_1$ values that maximize $p(Y_1|X)$. Then we substitute them respectively into $p(Y_2|X,Y_1)$ and take the top three $Y_2$ for each. This gives us $3^2=9$ combinations. At this point, we calculate the total probability for each combination and again only keep the top three. This repeats recursively until the first <end> token appears. Essentially, it still falls within the scope of greedy search, but it preserves more possibilities during the greedy process. Ordinary greedy search is equivalent to topk=1.
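A minimal sketch of beam search (framework-agnostic; predict_next is a hypothetical function that returns the log-probabilities of the next token given the partial output, with the source article assumed to be captured in a closure):

```python
import numpy as np

def beam_search(predict_next, start_id, end_id, topk=3, maxlen=50):
    """Keep only the topk partial sequences with the highest total log-probability."""
    beams = [([start_id], 0.0)]                       # (token ids, summed log-prob)
    for _ in range(maxlen):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_id:                     # finished beams stay as they are
                candidates.append((seq, score))
                continue
            log_probs = predict_next(seq)             # shape (vocab_size,)
            for tok in np.argsort(log_probs)[-topk:]:
                candidates.append((seq + [int(tok)], score + float(log_probs[tok])))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:topk]
        if all(seq[-1] == end_id for seq, _ in beams):
            break
    return beams[0][0]                                # best-scoring sequence
```

With topk=1 this reduces to ordinary greedy search.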

Improving seq2seq

The seq2seq model shown earlier is standard, but it encodes the entire input into a fixed-size vector and then decodes using that vector. This means the vector must theoretically contain all the information from the original input, placing extremely high demands on the encoder and decoder, especially in tasks like machine translation where information must remain invariant. This model is equivalent to asking us to "write the corresponding English translation immediately after reading a Chinese passage once," requiring powerful memory and decoding capabilities. In reality, humans don't have to do it this way; we repeatedly flip back to refer to the original text. This leads to the following two techniques.

Attention

Attention is now basically a "standard" module for seq2seq models. Its idea is: at each decoding step, don't just combine the fixed-size vector encoded by the encoder (reading the whole text); also look back at every individual character or word from the original source (detailed local reading). The two work together to determine the output of the current step.

seq2seq with Attention

As for the specific implementation of Attention, I have previously written an article introducing it; please refer to "A Shallow Reading of 'Attention is All You Need' (Introduction + Code)". Attention is generally divided into multiplicative and additive. I introduced the multiplicative Attention systematically presented by Google. Interested readers can look up additive Attention. As long as you grasp the three elements—query, key, and value—Attention is not difficult to understand.
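For reference, here is a minimal numpy sketch of multiplicative (scaled dot-product) attention; query, key, and value are assumed to have already been projected to compatible dimensions:

```python
import numpy as np

def dot_product_attention(query, key, value):
    """query: (m, d), key: (n, d), value: (n, d_v) -> output: (m, d_v)."""
    d = query.shape[-1]
    scores = query.dot(key.T) / np.sqrt(d)             # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over the keys
    return weights.dot(value)                          # weighted sum of values
```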

Prior Knowledge

Returning to the task of generating article titles with seq2seq, the model can be simplified, and some prior knowledge can be introduced. For example, since both the input and output languages are Chinese, the Embedding layers of the encoder and decoder can share parameters (i.e., using the same set of word vectors). This significantly reduces the number of model parameters.
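In Keras, this kind of sharing simply means reusing a single Embedding layer object for both inputs, roughly as follows (a sketch with hypothetical sizes):

```python
from keras.layers import Input, Embedding

vocab_size, embed_dim = 10000, 128            # hypothetical sizes

x_in = Input(shape=(None,))                   # article characters (encoder input)
y_in = Input(shape=(None,))                   # shifted title characters (decoder input)

embedding = Embedding(vocab_size, embed_dim)  # one set of word vectors
x_embed = embedding(x_in)                     # the same layer object is reused,
y_embed = embedding(y_in)                     # so encoder and decoder share it
```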

Furthermore, there is a very useful piece of prior knowledge: most words in a title also appear in the article (note: they merely appear, not necessarily contiguously, so we cannot say the title is contained in the article; otherwise this would reduce to a standard sequence labeling problem). We can therefore use the set of words appearing in the article as a prior distribution and add it to the classifier used during decoding, making the model more inclined to select words that are already present in the article.

Specifically, at each prediction step we obtain a total vector $x$ (as mentioned above, this should be the concatenation of the decoder's current hidden state, the encoder's encoding vector, and the Attention encoding between the current decoder state and the encoder). It is passed through a fully connected layer to obtain a vector $y=(y_1, y_2, \dots, y_{|V|})$ of size $|V|$, where $|V|$ is the vocabulary size. Passing $y$ through a softmax gives the original probabilities: $$p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$$ This is the original classification scheme. The scheme with the prior distribution is as follows: for each article we build a $0/1$ vector $\chi=(\chi_1, \chi_2, \dots, \chi_{|V|})$ of size $|V|$, where $\chi_i=1$ means word $i$ appears in the article and $\chi_i=0$ otherwise. This $0/1$ vector is passed through a scale-and-shift layer to get: $$\hat{y}=s \otimes \chi+t=(s_1\chi_1+t_1, s_2\chi_2+t_2,\dots,s_{|V|}\chi_{|V|}+t_{|V|})$$ where $s, t$ are trainable parameters. This vector is then averaged with the original $y$ before the softmax: $$y \leftarrow \frac{y + \hat{y}}{2}, \quad p_i = \frac{e^{y_i}}{\sum_j e^{y_j}}$$ Experiments showed that introducing this prior distribution speeds up convergence and yields titles of higher and more stable quality.
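One possible Keras implementation of the scale-and-shift prior is sketched below (a custom layer of my own naming, not the repository code); chi is the $0/1$ indicator vector of size $|V|$, and the layer's weights play the roles of $s$ and $t$:

```python
from keras import backend as K
from keras.layers import Layer, Lambda, Activation

class ScaleShift(Layer):
    """Trainable elementwise scale and shift: y_hat = s * chi + t."""
    def build(self, input_shape):
        dim = int(input_shape[-1])            # vocabulary size |V|
        self.s = self.add_weight(name='s', shape=(dim,), initializer='ones')
        self.t = self.add_weight(name='t', shape=(dim,), initializer='zeros')
        super(ScaleShift, self).build(input_shape)
    def call(self, chi):
        return self.s * chi + self.t

# y:   (batch, |V|) raw scores from the fully connected layer
# chi: (batch, |V|) 0/1 indicator of which words occur in the article
# y_hat = ScaleShift()(chi)
# y     = Lambda(lambda ab: (ab[0] + ab[1]) / 2)([y, y_hat])
# probs = Activation('softmax')(y)
```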

Keras Reference

It's time for the happy hour of open source~

Basic Implementation

Based on the descriptions above, I collected a corpus of over 800,000 news articles and tried to train an automatic title-generation model. For simplicity, I chose characters as the basic unit and introduced four additional markers representing mask, unk, start, and end. The encoder is a two-layer bidirectional LSTM, and the decoder is a two-layer unidirectional LSTM. Specific details can be found in the source code (Python 2.7 + Keras 2.2.4 + TensorFlow 1.8):

https://github.com/bojone/seq2seq/blob/master/seq2seq.py
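For orientation only, here is a heavily simplified sketch of that kind of encoder/decoder stack (shared embedding, two-layer bidirectional LSTM encoder, two-layer unidirectional LSTM decoder). It omits masking and the prior layer and uses a crude dot-product attention, so it is not the repository model:

```python
from keras import backend as K
from keras.layers import (Input, Embedding, LSTM, Bidirectional,
                          Dense, Lambda, concatenate)
from keras.models import Model

vocab_size, embed_dim, hidden = 10000, 128, 256           # hypothetical sizes

x_in = Input(shape=(None,))                                # article characters
y_in = Input(shape=(None,))                                # shifted title characters
embedding = Embedding(vocab_size, embed_dim)               # shared word vectors

# Encoder: two bidirectional LSTM layers over the article.
enc = Bidirectional(LSTM(hidden // 2, return_sequences=True))(embedding(x_in))
enc = Bidirectional(LSTM(hidden // 2, return_sequences=True))(enc)

# Decoder: two unidirectional LSTM layers over the (shifted) title.
dec = LSTM(hidden, return_sequences=True)(embedding(y_in))
dec = LSTM(hidden, return_sequences=True)(dec)

# Crude dot-product attention: each decoder step attends to the encoder outputs.
def attend(args):
    d, e = args                                            # (b, Ty, h), (b, Tx, h)
    scores = K.batch_dot(d, e, axes=[2, 2])                # (b, Ty, Tx)
    weights = K.softmax(scores)
    return K.batch_dot(weights, e)                         # (b, Ty, h)

context = Lambda(attend)([dec, enc])
probs = Dense(vocab_size, activation='softmax')(concatenate([dec, context]))

model = Model([x_in, y_in], probs)
# Targets are the shifted titles Y[:, 1:] with a trailing axis of size 1.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```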

Using 64,000 articles as one epoch, after training for 50 epochs (over an hour), the model could generate titles that looked decent:

Article Content: On August 28, online reports claimed that user data from chain hotels under the Huazhu Group were suspected of being leaked. From the content posted by the seller, the data includes guest information from more than 10 hotel brands under Huazhu, such as Hanting, Joyue, Orange, and Ibis. The leaked information includes Huazhu official website registration profiles, identity information from hotel check-in registration, and hotel room records, as well as guest names, mobile numbers, emails, ID numbers, login account passwords, etc. The seller is selling this package of about 500 million records. Threat Hunter, a third-party security platform, verified the 30,000 pieces of data provided by the seller and believes the authenticity is very high. On the same afternoon, Huazhu Group issued a statement saying it had quickly launched an internal verification and reported it to the police immediately. That evening, Shanghai police announced that they had received a report from Huazhu Group and had intervened in the investigation.
Generated Title: 《Hotel User Data Suspected of Leakage》

Article Content: Sina Sports News: On October 16 Beijing time, the NBA China Games Guangzhou stop kicked off as scheduled, and the Rockets won again, defeating the Nets 95-85. Yao Ming gradually found his rhythm, playing 18 minutes and 39 seconds, shooting 5-of-8, scoring 10 points and 5 rebounds, and recording 1 block. The Rockets concluded their China trip successfully with a two-game winning streak.
Generated Title: 《Live: Rockets Win Two in a Row, Rockets Win Again, Yao Ming 10 Points 5 Rebounds in Guangzhou Stop》

Of course, these are just two of the better examples. There are plenty of bad ones, so it is certainly not ready for direct engineering use; many further tricks and optimizations would be required.

Masking

In seq2seq, proper masking is vital. Masking means hiding information that should not be read or is useless, usually by multiplying it with a $0/1$ vector. Keras's built-in masking mechanism is rather unfriendly: some layers do not support masking at all, and an ordinary LSTM runs almost twice as slowly once masking is enabled. So I now simply use 0 as the mask marker and write my own Lambda layers to apply the mask. This leaves the speed essentially unaffected and can be embedded into any layer. Refer to the code linked above for details; the idea is sketched below.
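Roughly, the idea looks like this (a sketch of the approach, not the exact repository code):

```python
from keras import backend as K
from keras.layers import Input, Lambda

x_in = Input(shape=(None,))                   # integer character ids, 0 = mask

# 0/1 mask of shape (batch, seq_len, 1): 1 for real tokens, 0 for padding.
x_mask = Lambda(lambda x: K.cast(K.greater(K.expand_dims(x, 2), 0), 'float32'))(x_in)

# After any layer whose padded positions should be erased, multiply by the mask:
# h = SomeLayer(...)(h)                       # hypothetical intermediate output
# h = Lambda(lambda args: args[0] * args[1])([h, x_mask])
```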

Note that in the past, we usually did not distinguish between mask and unk (unknown words). However, if adopting my scheme, it is better to distinguish them, because although we don't know the specific meaning of an unknown word, it is still a real word and has at least a placeholder role, whereas mask is information we wish to erase completely.

Decoding Side

Beam search decoding is implemented in the code. Readers can test the impact of different topk values on the decoding results.

One thing worth mentioning is that the decoding in the reference code is implemented rather "lazily," which slows decoding down considerably. In theory, after obtaining the output at the current time step, we only need to feed it into the LSTM for one more step to get the output of the next time step. However, that requires rewriting the decoder-side LSTM (distinguishing the training phase from the testing phase while sharing weights), which is relatively complex and not very beginner-friendly. So I used a very crude solution: re-running the entire model for every single prediction step. As a result, the code is minimal, but decoding gets slower as the sequence grows, turning the computational complexity from $O(n)$ into $O(n^2)$.
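In its greedy form, that crude loop looks roughly like this (a sketch with hypothetical token ids; the repository actually wraps the same re-run-everything idea inside beam search):

```python
import numpy as np

def crude_greedy_decode(model, x, start_id=2, end_id=3, maxlen=50):
    """Re-run the full model for every generated character.

    model maps [article_ids, partial_title_ids] to per-step softmax
    probabilities; token ids are hypothetical (2 = <start>, 3 = <end>)."""
    target = [start_id]
    for _ in range(maxlen):
        probs = model.predict([x[None, :], np.array([target])])[0]
        next_id = int(probs[-1].argmax())     # only the last time step is new
        target.append(next_id)
        if next_id == end_id:
            break
    return target
```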

Final Words

Another example run with Keras; not bad, not bad. I will keep firmly holding the Keras banner high~

Corpora for the automatic-title task are easy to find, and it is one of the lower-difficulty seq2seq tasks, which makes it well suited for practice. Friends who want to get into this field, hurry up and join in!