From Language Models to Seq2Seq: Transformer is Like a Play, It All Depends on the Mask

By 苏剑林 | September 18, 2019

I believe that over the past year (especially the last six months), everyone has frequently seen reports on various Transformer-based works (such as BERT, GPT, XLNet, etc.), along with news of benchmark results on various basic tasks being refreshed again and again. At the same time, many blogs and columns have offered accessible explanations and interpretations of these models.

As the saying goes, "laymen watch the spectacle, while insiders look at the mechanism." We not only need to understand these works at the "what it is" level, but we also need to think about "why." This "why" is not just "why do it this way," but also "why can it be done this way." For instance, when discussing XLNet's Permutation Language Model, we might have understood its benefits from various introductions. However, it's worth thinking a step further:

Why can the Transformer implement a Permutation Language Model? How is it implemented? Can an RNN implement it?

This article analyzes the fundamental reasons why many Transformer models can be played so "brilliantly" from the perspective of masking the Attention matrix. As the title suggests, "Transformer is like a play, it all depends on the mask"—this is one of the important technical "inner workings" of various fancy Transformer models.

By reading this article, you may learn about:

  1. The relationship between Attention matrix masking methods and various pre-training schemes;
  2. How to directly use a pre-trained BERT model to perform Seq2Seq tasks.

Background

Since "Attention is All You Need", Transformer-like models based on pure Attention have gradually become popular, and the emergence of BERT pushed this trend to a new height. Subsequently, various works based on large-scale pre-trained Transformer models have emerged continuously. Some use existing models for applications, some attempt to better explain and visualize these models, and others improve architectures or pre-training methods to achieve better results. Overall, these works based on pre-training are emerging in an endless stream, creating a dazzling array of options. To some extent, if you haven't fine-tuned BERT yet, you are already considered to be lagging behind mainstream NLP technology.

Fancy Pre-training

As is well known, the traditional means of model pre-training is the language model. For example, the ELMo model is based on the BiLSTM architecture and uses two directional language models to pre-train two directions of LSTMs respectively. Later, OpenAI's GPT and GPT-2 also unswervingly adhered to the ancestral (standard, unidirectional) language model for pre-training.

However, there are even more varied ways to play with pre-training. For example, BERT used a so-called "Masked Language Model" for pre-training, though this is just a variant of the ordinary language model. Then there is XLNet, which proposed the more thorough Permutation Language Model; and UniLM, which uses a single BERT architecture to do Seq2Seq directly, a scheme that can serve either as a pre-training method or simply as a way to carry out Seq2Seq tasks...

With such a variety of tricks, one cannot help but wonder: why is it only in the era of Transformer popularity that this phenomenon of "a hundred flowers blooming and a hundred schools of thought contending" in various large-scale pre-training models has appeared?

Transformer Exclusive

In fact, besides the unidirectional language model and its simple variant, the masked language model, the Seq2Seq pre-training of UniLM and the permutation language model pre-training of XLNet can basically be said to be specifically customized for the Transformer architecture. To put it plainly, if it were an RNN architecture, it simply could not be pre-trained using a permutation language model. As for the Seq2Seq pre-training method, it would require introducing two models (encoder and decoder) simultaneously, rather than being handled by a single model like the Transformer architecture.

The secret lies primarily in the Attention matrix. Attention essentially computes pairwise similarities between inputs, yielding an $n \times n$ similarity matrix (i.e., the Attention matrix, where $n$ is the sentence length; in this article, Attention always refers to Self-Attention). This means its memory footprint is of order $\mathcal{O}(n^2)$, whereas RNN and CNN models are only $\mathcal{O}(n)$, so Attention usually consumes more GPU memory. But every disadvantage has a matching advantage: a larger memory footprint also means more possibilities. We can impose various prior constraints on this $\mathcal{O}(n^2)$-sized Attention matrix so that it handles tasks more flexibly. Simply put, only pure Attention models have enough "capacity" to carry so many "tricks."

The way to add these prior constraints is to apply different forms of masks to the Attention matrix, and that is the focus of this article.

Analysis

In the article "A Brief Reading of 'Attention is All You Need' (Introduction + Code)", I already provided a basic introduction to Attention; here is just a brief review. The mathematical form of Attention is:

\begin{equation}Attention(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d_k}}\right)\boldsymbol{V}\end{equation}

Here $\boldsymbol{Q} \in \mathbb{R}^{l_q \times d_q}, \boldsymbol{K} \in \mathbb{R}^{l_k \times d_q}, \boldsymbol{V} \in \mathbb{R}^{l_k \times d_v}$ represent the query, key, and value vector sequences, respectively. We can consider key and value to be in a one-to-one correspondence, while $\boldsymbol{Q}\boldsymbol{K}^{\top}$ calculates the inner product of each pair of query and key vectors. After normalization with $\text{softmax}$, we obtain an $l_q \times l_k$ Attention matrix. It describes the correlation strength between any two elements of the query and the key. All the stories we will tell next take place within this Attention matrix. Finally, multiplying it by $\boldsymbol{V}$ is equivalent to performing a weighted sum of the various vectors of $\boldsymbol{V}$ according to this correlation strength, eventually outputting an $l_q \times d_v$ vector sequence.
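As a concrete illustration of the formula (not the actual Transformer implementation), here is a minimal NumPy sketch of scaled dot-product attention; the optional `mask` argument is my own addition in anticipation of the discussion below and is not part of the formula itself.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    # Q: (l_q, d_q), K: (l_k, d_q), V: (l_k, d_v); mask: (l_q, l_k) matrix of 0/1
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (l_q, l_k) pairwise similarities
    if mask is not None:
        scores = np.where(mask > 0, scores, -1e9)    # masked-out entries get ~zero weight
    A = softmax(scores, axis=-1)                     # the Attention matrix
    return A @ V                                     # (l_q, d_v) weighted sum of the value vectors

# toy check: 5 query positions, 5 key/value positions
Q, K, V = np.random.randn(5, 8), np.random.randn(5, 8), np.random.randn(5, 16)
out = attention(Q, K, V)                             # out.shape == (5, 16)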

Currently, the most commonly used Attention method is Self-Attention, where $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}$ are all transformed linearly from the same vector sequence, and the Transformer is a combination of Self-Attention and Position-Wise Feed-Forward layers (equivalent to a 1D convolution with a kernel size of 1). Thus, the Transformer is a sequence-to-sequence transformation based on Attention.

In this section, we will analyze the masking methods of the Attention matrix in detail, which correspond to the implementation principles of unidirectional language models, permutation language models, and Seq2Seq.

Unidirectional Language Model

A language model can be said to be an unconditional text generation model. If readers do not yet understand text generation models, they can check relevant materials and refer to the article "Playing with Keras: Seq2Seq Automatic Title Generation". A unidirectional language model is equivalent to "remembering" the training corpus through the following conditional probability distribution:

\begin{equation}p(x_1, x_2, x_3, \dots, x_n) = p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) \dots p(x_n|x_1, \dots, x_{n-1})\end{equation}

What we generally call a "language model" refers to a unidirectional (more narrowly, just forward) language model. The key point of a language model is to prevent seeing "future information." As shown in the equation above, when predicting $x_1$, there is no external input; when predicting $x_2$, only $x_1$ can be input; when predicting $x_3$, only $x_1, x_2$ can be input; and so on.

RNN models are naturally suitable for language models because they are inherently recursive. If a CNN is used, the convolution kernel needs to be masked, meaning the parts of the kernel corresponding to the right side must be set to zero. What about the Transformer? It requires an Attention matrix in the form of a lower triangular matrix:

Unidirectional Mask

As shown in the figure, each row of the Attention matrix actually represents the output, while each column represents the input, and the Attention matrix represents the correlation between output and input. Assuming the white squares all represent 0, the first row indicates that "Bei" can only be associated with the start token <S>, and the second row indicates that "Jing" can only be associated with <S> and "Bei," and so on. Therefore, one only needs to introduce a lower triangular mask into the Transformer's Attention matrix and train the input and output with a one-position shift to implement a unidirectional language model.
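Continuing the NumPy sketch above, the lower triangular mask is simply `np.tril` applied to an all-ones matrix, so that row $i$ may only attend to columns $0$ through $i$:

import numpy as np

n = 5                                  # sequence length, e.g. <S> 北 京 欢 迎
lm_mask = np.tril(np.ones((n, n)))     # lower triangular: position i sees positions <= i
print(lm_mask)
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]
# out = attention(Q, K, V, mask=lm_mask)   # each output then depends only on earlier inputs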

Permutation Language Model

The Permutation Language Model is a concept proposed by XLNet, mainly used for XLNet's pre-training. Speaking of XLNet, I think the pre-training method of Permutation Language Modeling is very interesting, but I don't like its replacement of the basic architecture with Transformer-XL. I think anyone with resources could try the combination of "BERT + Permutation Language Model pre-training"; perhaps there would be surprising discoveries.

Like a standard language model, a Permutation Language Model performs conditional probability decomposition, but the decomposition order of the Permutation Language Model is random:

\begin{equation} \begin{aligned} p(x_1, x_2, x_3, \dots, x_n) &= p(x_1) p(x_2|x_1) p(x_3|x_1, x_2) \dots p(x_n|x_1, x_2, \dots, x_{n-1}) \\ &= p(x_3) p(x_1|x_3) p(x_2|x_3, x_1) \dots p(x_n|x_3, x_1, \dots, x_{n-1}) \\ &= \dots \\ &= p(x_{n-1}) p(x_1|x_{n-1}) p(x_n|x_{n-1}, x_1) \dots p(x_2|x_{n-1}, x_1, \dots, x_3) \end{aligned} \end{equation}

In short, any "appearance order" of $x_1, x_2, \dots, x_n$ is possible. In principle, each order corresponds to a model, so in theory, there are $n!$ language models. Models based on the Transformer, however, can integrate all these orders into a single model!

How is this achieved? Taking the generation of "Beijing welcomes you" (北京欢迎你) as an example, suppose a random generation order is "<S> → Ying (迎) → Jing (京) → Ni (你) → Huan (欢) → Bei (北)". We only need to mask the Attention matrix as shown in the middle sub-figure below to achieve the goal:

Different Language Model Masks

Similar to the unidirectional language model mentioned earlier, the 4th row has only one blue square, indicating "Ying" can only correlate with the start token <S>, while the 2nd row has two blue squares, indicating "Jing" can only correlate with <S> and "Ying," and so on. Visually, this is like "shuffling" the lower triangular mask of the unidirectional language model.

In other words, implementing a language model of a specific order is equivalent to shuffling the original lower triangular mask in a certain way. Precisely because Attention provides such an $n \times n$ Attention matrix, we have enough degrees of freedom to mask this matrix in different ways to achieve diverse effects.
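To make "shuffling the lower triangular mask" concrete, here is a small sketch that builds the mask for the example order above: token $i$ may attend to token $j$ exactly when $j$ is generated strictly earlier (the <S> column is omitted for brevity).

import numpy as np

tokens = [u'北', u'京', u'欢', u'迎', u'你']
order  = [u'迎', u'京', u'你', u'欢', u'北']      # the random generation order from the example

rank = {t: k for k, t in enumerate(order)}       # generation step of each token
n = len(tokens)
perm_mask = np.zeros((n, n))
for i, ti in enumerate(tokens):
    for j, tj in enumerate(tokens):
        if rank[tj] < rank[ti]:                  # j was generated strictly before i
            perm_mask[i, j] = 1

print(perm_mask.astype(int))
# [[0 1 1 1 1]    北 is generated last, so it may attend to all the others
#  [0 0 0 1 0]    京 may attend only to 迎
#  [0 1 0 1 1]    欢 may attend to 京, 迎, 你
#  [0 0 0 0 0]    迎 is generated first, so it attends only to <S> (omitted here)
#  [0 1 0 1 0]]   你 may attend to 京 and 迎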

Speaking of this, readers might have an implementation doubt: the shuffled mask doesn't seem to follow any pattern; does one have to randomly generate such an arbitrary-looking mask matrix every time? In fact, there is a simpler, mathematically equivalent training scheme. This training scheme stems from the fact that pure Attention models are essentially unordered; the word order in them is actually added through Position Embeddings. That is to say, what we input is not just the tokens themselves, but also the position IDs where the tokens are located; in other words, you think you entered the sequence "[Bei, Jing, Huan, Ying, Ni]", but you actually entered the set "{(Bei, 1), (Jing, 2), (Huan, 3), (Ying, 4), (Ni, 5)}".

Since it is just a set and independent of order, we are perfectly free to feed it in a different order. For the "<S> → Ying → Jing → Ni → Huan → Bei" order just mentioned, we would input "(Ying, 4), (Jing, 2), (Ni, 5), (Huan, 3), (Bei, 1)". That is, we shuffle the tokens into "Ying, Jing, Ni, Huan, Bei" before feeding them to the Transformer, but the position id of the first token is now 4 rather than 1, and so on. After this conversion, the mask matrix reverts to an ordinary lower triangular matrix, so the shuffling happens purely at the input level, which makes the operation much simpler.
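This "shuffle the inputs, keep the original position ids" trick can be sketched in a few lines; once the tokens are fed in generation order together with their original position ids, a plain lower triangular mask realizes the chosen order:

import numpy as np

tokens    = [u'北', u'京', u'欢', u'迎', u'你']
positions = [1, 2, 3, 4, 5]                      # original position ids
order     = [3, 1, 4, 2, 0]                      # indices of 迎, 京, 你, 欢, 北 in `tokens`

shuffled_tokens    = [tokens[i] for i in order]       # ['迎', '京', '你', '欢', '北']
shuffled_positions = [positions[i] for i in order]    # [4, 2, 5, 3, 1]

# Feed (shuffled_tokens, shuffled_positions) into the model. Since order information
# comes only from the position embeddings, this is the same "set" as before, and an
# ordinary lower triangular mask now implements the chosen generation order.
lm_mask = np.tril(np.ones((len(tokens), len(tokens))))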

Seq2Seq

Now we come to our "highlight":
Combining Transformer architectures like BERT with Seq2Seq.
Why call it a highlight? Because, in principle, any NLP problem can be transformed into Seq2Seq, which is a truly universal model. So if Seq2Seq can be achieved, any task can theoretically be implemented.

There are two well-known works that combine BERT with Seq2Seq: MASS and UniLM. Both come from Microsoft, and both were published in the same month! MASS still follows the standard Seq2Seq architecture, using BERT-like Transformer models as the encoder and decoder respectively; its main contribution is a pre-training scheme built on the Seq2Seq idea. What is truly interesting is UniLM, which offers an elegant way to complete Seq2Seq tasks with a single BERT model, without distinguishing between encoder and decoder. And achieving this takes almost no effort: all that is needed is a special mask.

(Interlude: The actual sequence of events is that I independently thought of the idea of using a single BERT model for Seq2Seq two weeks ago, and then I searched for materials and found that this idea had already been implemented—it was UniLM.)

UniLM directly treats Seq2Seq as sentence completion. If the input is "What do you want to eat?" and the target sentence is "White cut chicken," UniLM concatenates these two sentences into one: [CLS] What do you want to eat [SEP] White cut chicken [SEP]. After this transformation, the simplest scheme is to train a language model and input "[CLS] What do you want to eat [SEP]" to predict "White cut chicken" word by word until "[SEP]" appears, as shown in the left figure below:

Seq2Seq Mask Variants

However, the left figure is just the most naive scheme. It also includes "What do you want to eat" in the prediction range (making Attention in this part unidirectional, i.e., the corresponding part of the Mask matrix is lower triangular). In fact, this is unnecessary and constitutes an extra constraint. What really needs to be predicted is the "White cut chicken" part, so we can remove the mask for the "What do you want to eat" part, resulting in the mask in the right figure above.

In this way, the Attention for the input part is bidirectional, while the Attention for the output part is unidirectional, satisfying the requirements of Seq2Seq without extra constraints. This is the idea provided in UniLM for using a single BERT model to complete Seq2Seq tasks. One only needs to add a mask of the aforementioned shape without modifying the model architecture, and the BERT Masked Language Model pre-training weights can be directly used, leading to faster convergence. This fits the original intention of a universal model—"With BERT in hand, I have the world"—and I personally think this is a very elegant solution.

UniLM Seq2Seq Diagram
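As a sketch of what this mask looks like, it can be derived directly from the `segment_ids` used later in the code: every position may attend to the whole input segment (segment 0), while within the target segment (segment 1) attention is restricted to earlier positions. bert4keras builds an equivalent mask internally when `application='unilm'` is used; the lines below are only an illustration.

import numpy as np

# segment_ids for "[CLS] 你 想 吃 啥 [SEP] 白 切 鸡 [SEP]"
segment_ids = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
idx = np.arange(len(segment_ids))

# position i may attend to position j if j lies in the input segment,
# or if j <= i (unidirectional attention over the target segment)
unilm_mask = ((segment_ids[None, :] == 0) | (idx[None, :] <= idx[:, None])).astype(float)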

Experiment

In fact, the masks mentioned above have basically been integrated into bert4keras, which I wrote. Readers can directly use bert4keras to load BERT pre-training weights and call the aforementioned mask schemes to perform corresponding tasks. Below, we provide an example of using the UniLM idea to build a fast-converging Seq2Seq model.

Open Source Code

The test task for this code is still the previous title generation. The code is adapted from the code in "Playing with Keras: Seq2Seq Automatic Title Generation." Thanks to the encapsulation of bert4keras, the implementation of the model part is very simple and clean. This time, the original THUCNews dataset was used directly. Readers can download the dataset and source code themselves to test and reproduce it.

Details can be found at: task_seq2seq_autotitle.py

How good is the result? In my experiments on the title generation task, the model can already produce basically readable titles after the first epoch (1000 iterations); by comparison, the earlier LSTM version typically needed dozens of times more iterations to reach the same level.

Training Progress

Brief Explanation

Below is a brief explanation of the key parts of the code.

First, the input format is still `token_ids` plus `segment_ids`. For example:

tokens = ['[CLS]', u'你', u'想', u'吃', u'啥', '[SEP]', u'白', u'切', u'鸡', '[SEP]']
token_ids = [token_dict[t] for t in tokens]
segment_ids = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

`segment_ids` distinguish the input sentence from the target sentence: 0 marks the input sentence and 1 marks the target sentence. Both `token_ids` and `segment_ids` can be generated directly with the built-in `tokenizer.encode`.
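For reference, a minimal sketch of that call (assuming a recent bert4keras and that `dict_path` points to BERT's vocab file):

from bert4keras.tokenizers import Tokenizer

tokenizer = Tokenizer(dict_path, do_lower_case=True)    # dict_path: path to BERT's vocab.txt
token_ids, segment_ids = tokenizer.encode(u'你想吃啥', u'白切鸡')
# token_ids covers "[CLS] 你 想 吃 啥 [SEP] 白 切 鸡 [SEP]";
# segment_ids is 0 over the first sentence and 1 over the second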

As for building the model, it takes only a few lines:

from bert4keras.backend import K
from bert4keras.models import build_transformer_model

model = build_transformer_model(
    config_path,
    checkpoint_path,
    application='unilm',
    keep_tokens=keep_tokens  # only keep this subset of tokens (see below)
)

model.summary()

y_in = model.input[0][:, 1:]    # target tokens (the inputs shifted left by one position)
y_mask = model.input[1][:, 1:]  # segment_ids, reused as the loss mask (1 on the target sentence)
y = model.output[:, :-1]        # predictions, offset by one position from the targets

# Cross-entropy as loss, masking out the prediction of the input part
cross_entropy = K.sparse_categorical_crossentropy(y_in, y)
cross_entropy = K.sum(cross_entropy * y_mask) / K.sum(y_mask)

model.add_loss(cross_entropy)   # attach the masked loss to the model before compiling

Note that in `build_transformer_model`, as long as `application='unilm'` is set, it will automatically load BERT's MLM part and pass in the corresponding mask. The rest is just writing the loss function. Additionally, there is a `keep_tokens` option, which is used to streamline the Embedding layer. For Chinese BERT, the total number of tokens is about 20,000, which means that predicting a generated token is a 20,000-class classification problem. But in fact, nearly half of the tokens will never be predicted (theoretically), so these 20,000 classes waste some calculation. Therefore, an option is provided here where we can maintain a token list ourselves and pass in the corresponding IDs to keep only that part of the tokens, thereby reducing the amount of calculation (after streamlining, it is generally only half the original size, or even less).
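As a sketch of how such a `keep_tokens` list can be produced (again assuming a recent bert4keras), `load_vocab` with `simplified=True` returns a pruned vocabulary together with the ids of the retained tokens:

from bert4keras.tokenizers import Tokenizer, load_vocab

# keep only the "simplified" subset of the vocabulary, plus the special tokens
token_dict, keep_tokens = load_vocab(
    dict_path=dict_path,
    simplified=True,
    startswith=['[PAD]', '[UNK]', '[CLS]', '[SEP]'],
)
tokenizer = Tokenizer(token_dict, do_lower_case=True)
# keep_tokens is then passed to build_transformer_model as above, so the
# Embedding and output softmax only cover the retained tokens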

The rest involves decoding steps such as beam search, which are no different from general Seq2Seq. I will not go into detail here; you can refer to "Playing with Keras: Seq2Seq Automatic Title Generation" and the code.

Summary

This article systematically summarizes the masking techniques for the Attention matrix in Transformer and provides a Seq2Seq implementation using the UniLM scheme. For monolingual Seq2Seq text generation tasks, adopting the UniLM approach and loading BERT's MLM pre-training weights can effectively and quickly implement and improve the generation results. It is worth a try.