By 苏剑林 | May 25, 2020
The black box of deep learning is far blacker than we imagine.
Preface
The physicist Richard Feynman is credited with saying: "If you think you understand quantum mechanics, you don't understand quantum mechanics." I increasingly feel that "quantum mechanics" in this sentence could be replaced with "deep learning." Although deep learning has proven its effectiveness in more and more fields, our ability to explain it remains quite weak. Of course, in recent years much effort has been devoted to opening up this black box, but unfortunately most of this work amounts to "hindsight" explanation: explanations pieced together from existing experimental results that barely convince even their authors, rather than a top-down construction and understanding of the model's principles, let alone forward-looking predictions.
This article focuses on the self-attention mechanism. Intuitively, self-attention is considered one of the more interpretable models: it automatically captures correlations between tokens by attending a sequence to itself. Indeed, the paper "Attention is All You Need" itself provides a seemingly reasonable visualization:
Example of Attention visualization in "Attention is All You Need"
But is this really how self-attention works? Is this "token-to-token" attention necessary? Recently, Google's new paper "Synthesizer: Rethinking Self-Attention in Transformer Models" conducted some "whimsical" explorations of the self-attention mechanism, and the results therein might overturn our understanding of self-attention.
Self-Attention
The popularity of self-attention models began with "Attention is All You Need", published by Google in 2017. For an accessible introduction, readers can also refer to my earlier post "A Light Read of 'Attention is All You Need' (Introduction + Code)". Its foundation is Scaled Dot-Product Attention, defined as follows:
\begin{equation}Attention(\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V}) = softmax\left(\frac{\boldsymbol{Q}\boldsymbol{K}^{\top}}{\sqrt{d_k}}\right)\boldsymbol{V}\end{equation}
Where $\boldsymbol{Q}\in\mathbb{R}^{n\times d_k}, \boldsymbol{K}\in\mathbb{R}^{m\times d_k}, \boldsymbol{V}\in\mathbb{R}^{m\times d_v}$, and softmax is normalized along the $m$ dimension. As for self-attention, for the same $\boldsymbol{X}\in \mathbb{R}^{n\times d}$, projection matrices $\boldsymbol{W}_q,\boldsymbol{W}_k,\boldsymbol{W}_v\in\mathbb{R}^{d\times d'}$ are used to obtain $\boldsymbol{Q}=\boldsymbol{X}\boldsymbol{W}_q,\boldsymbol{K}=\boldsymbol{X}\boldsymbol{W}_k,\boldsymbol{V}=\boldsymbol{X}\boldsymbol{W}_v$, and then Attention is computed:
\begin{equation}\begin{aligned}
SelfAttention(\boldsymbol{X}) =&\, Attention(\boldsymbol{X}\boldsymbol{W}_q, \boldsymbol{X}\boldsymbol{W}_k, \boldsymbol{X}\boldsymbol{W}_v)\\
=&\, softmax\left(\frac{\boldsymbol{X}\boldsymbol{W}_q \boldsymbol{W}_k^{\top}\boldsymbol{X}^{\top}}{\sqrt{d_k}}\right)\boldsymbol{X}\boldsymbol{W}_v
\end{aligned}\end{equation}
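To make the shapes concrete, here is a minimal NumPy sketch of single-head self-attention following the formula above (the variable names and toy dimensions are my own illustrative choices, not anything from the paper):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Standard token-to-token self-attention:
    softmax(X Wq (X Wk)^T / sqrt(d_k)) X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv       # (n, d_k), (n, d_k), (n, d_v)
    B = Q @ K.T / np.sqrt(Q.shape[-1])     # (n, n) token-to-token scores
    A = softmax(B, axis=-1)                # normalize over the key dimension
    return A @ V                           # (n, d_v)

# Toy shapes: n tokens of dimension d, projected to d'.
n, d, d_prime = 6, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_prime)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (6, 8)
```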
As for Multi-Head Attention, it is simply the Attention operation repeated several times with different parameters, with the outputs concatenated; a fairly straightforward enhancement. For further generalizations, see "Breaking Barriers, Building a More Powerful Transformer".
Unconstrained Imagination
Essentially, self-attention uses an $n\times n$ matrix $\boldsymbol{A}$ and a $d\times d'$ matrix $\boldsymbol{W}_v$ to transform an originally $n\times d$ matrix $\boldsymbol{X}$ into an $n\times d'$ matrix $\boldsymbol{A}\boldsymbol{X}\boldsymbol{W}_v$. Here, the matrix $\boldsymbol{A}$ is generated dynamically:
\begin{equation}\boldsymbol{A}=softmax\left(\boldsymbol{B}\right),\quad\boldsymbol{B}=\frac{\boldsymbol{X}\boldsymbol{W}_q \boldsymbol{W}_k^{\top}\boldsymbol{X}^{\top}}{\sqrt{d_k}}\end{equation}
In other words, matrix $\boldsymbol{B}$ is built from pairwise inner products between the vectors in $\boldsymbol{X}$, which is why we call this "token-to-token" Attention.
Comparison between Synthesizer self-attention and standard self-attention
So, we come to the question raised earlier: Is "token-to-token" necessary? Can matrix $\boldsymbol{B}$ be generated in other ways? This paper from Google explores several "whimsical" new forms and conducts experiments, collectively calling these forms "Synthesizer."
Dense Form
The first form is called Dense in the original paper: $\boldsymbol{B}$ needs to be of size $n\times n$, and $\boldsymbol{X}$ is $n\times d$, so only a $d\times n$ transformation matrix $\boldsymbol{W}_a$ is needed to turn it into $n\times n$:
\begin{equation}\boldsymbol{B}=\boldsymbol{X}\boldsymbol{W}_a\end{equation}
This is actually equivalent to fixing $\boldsymbol{K}$ as a constant matrix $\boldsymbol{W}_a^{\top}$. Of course, the original paper makes it a bit more complex by using two Dense layers:
\begin{equation}\boldsymbol{B}=\text{relu}\left(\boldsymbol{X}\boldsymbol{W}_1 + \boldsymbol{b}_1\right)\boldsymbol{W}_2 + \boldsymbol{b}_2\end{equation}
But the underlying idea remains the same.
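To see how little of the standard machinery survives, here is a minimal sketch of the two-layer Dense form above; the hidden width and all variable names are illustrative assumptions of mine:

```python
import numpy as np

def dense_synthesizer_scores(X, W1, b1, W2, b2):
    """Dense Synthesizer scores: B = relu(X W1 + b1) W2 + b2.
    X: (n, d), W1: (d, hidden), W2: (hidden, n)  ->  B: (n, n)."""
    return np.maximum(X @ W1 + b1, 0) @ W2 + b2

n, d, hidden = 6, 16, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, n)), np.zeros(n)
B = dense_synthesizer_scores(X, W1, b1, W2, b2)  # (n, n)
```

Note that row $i$ of $\boldsymbol{B}$ depends only on token $i$ itself, not on any pair of tokens, which is exactly what makes this form no longer "token-to-token".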
Random Form
We just said the Dense form is equivalent to fixing $\boldsymbol{K}$ as a constant matrix. Can we be even more "fanciful": fix $\boldsymbol{Q}$ as a constant matrix? In this case, the entire $\boldsymbol{B}$ is equivalent to a constant matrix:
\begin{equation}\boldsymbol{B}=\boldsymbol{R}\end{equation}
The original paper actually experimented with this form, calling it Random. As the name suggests, $\boldsymbol{B}$ is randomly initialized and may or may not be updated during training. According to the original paper, fixed-pattern Attention first appeared in "Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation"; the difference is that the Attention matrix there was computed by a function, whereas Google's paper uses pure random initialization. Formally, Random is essentially equivalent to a Depthwise Separable Convolution operation.
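The Random form is input-independent: the score matrix is just a parameter tensor shared across all inputs. A minimal sketch, with names and shapes of my own choosing:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_v = 6, 8
rng = np.random.default_rng(0)
R = rng.normal(size=(n, n))    # B = R: randomly initialized; trainable or frozen
V = rng.normal(size=(n, d_v))  # V = X Wv, as in standard attention
out = softmax(R, axis=-1) @ V  # the attention pattern does not depend on X at all
```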
Low-rank Factorization
The two new forms mentioned above often face the problem of having too many parameters, so it is natural to think of using low-rank factorization to reduce the number of parameters. For Dense and Random, the original paper also proposed and verified corresponding low-rank factorization forms, called Factorized Dense and Factorized Random respectively.
Factorized Dense uses the Dense method to generate two matrices $\boldsymbol{B}_1\in\mathbb{R}^{n\times a}$ and $\boldsymbol{B}_2\in\mathbb{R}^{n\times b}$, where $ab=n$; then $\boldsymbol{B}_1$ is tiled $b$ times and $\boldsymbol{B}_2$ is tiled $a$ times along the column dimension, giving two $n\times n$ matrices $\tilde{\boldsymbol{B}}_1, \tilde{\boldsymbol{B}}_2$. Finally, they are multiplied element-wise (personally, I feel $\tilde{\boldsymbol{B}}_2$ should probably be transposed before the multiplication to be more reasonable, but the original paper does not mention this) to synthesize an $n\times n$ matrix:
\begin{equation}\boldsymbol{B}=\tilde{\boldsymbol{B}}_1 \otimes \tilde{\boldsymbol{B}}_2\end{equation}
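Here is a sketch of the tiling step as I read it from the description above (the paper does not fully pin down how the repeats are laid out, so treat this as one plausible reading with illustrative shapes):

```python
import numpy as np

def factorized_dense_scores(B1, B2):
    """Factorized Dense: B1 is (n, a), B2 is (n, b) with a*b = n.
    Tile B1 b times and B2 a times along the columns, then multiply element-wise."""
    n, a = B1.shape
    _, b = B2.shape
    assert a * b == n
    B1_tiled = np.tile(B1, (1, b))  # (n, n)
    B2_tiled = np.tile(B2, (1, a))  # (n, n)
    return B1_tiled * B2_tiled      # element-wise product

n, a, b = 6, 2, 3
rng = np.random.default_rng(0)
B1 = rng.normal(size=(n, a))  # in the paper these come from Dense layers applied to X
B2 = rng.normal(size=(n, b))
B = factorized_dense_scores(B1, B2)  # (6, 6)
```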
Factorized Random is very easy to understand. Instead of a single $n\times n$ matrix $\boldsymbol{R}$, it becomes two $n\times k$ matrices $\boldsymbol{R}_1, \boldsymbol{R}_2$, and then:
\begin{equation}\boldsymbol{B}=\boldsymbol{R}_1\boldsymbol{R}_2^{\top} \end{equation}
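Correspondingly, a Factorized Random sketch only needs to store two thin matrices (shapes here are illustrative):

```python
import numpy as np

n, k = 6, 2
rng = np.random.default_rng(0)
R1 = rng.normal(size=(n, k))
R2 = rng.normal(size=(n, k))
B = R1 @ R2.T  # (n, n), but with only 2*n*k parameters instead of n*n
```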
Mixed Mode
Up to now, including standard self-attention, we have 5 different schemes for generating matrix $\boldsymbol{B}$, and they can also be mixed together:
\begin{equation}\boldsymbol{B}=\sum_{i=1}^N \alpha_i \boldsymbol{B}_i\end{equation}
Where $\boldsymbol{B}_i$ denotes the score matrix produced by each of the different forms, and the $\alpha_i$ are learnable parameters satisfying $\sum\limits_{i=1}^N \alpha_i=1$.
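One simple way to keep the mixing weights summing to 1 is to parameterize them as a softmax over unconstrained logits; this is an assumption about the parameterization on my part, not something the paper spells out:

```python
import numpy as np

def mix_scores(Bs, logits):
    """Mix N score matrices with weights alpha = softmax(logits), so sum(alpha) = 1."""
    e = np.exp(logits - logits.max())
    alpha = e / e.sum()                           # (N,)
    return sum(a * B for a, B in zip(alpha, Bs))  # weighted sum of (n, n) matrices

n, N = 6, 3
rng = np.random.default_rng(0)
Bs = [rng.normal(size=(n, n)) for _ in range(N)]  # e.g. standard, Dense, and Random scores
B = mix_scores(Bs, np.zeros(N))                   # equal weights when all logits are zero
```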
Result Analysis
The preceding sections introduced several new forms of self-attention collectively known as Synthesizer. Their common feature is that they do not maintain the "token-to-token" form, especially Random, which completely abandons the dynamic characteristics of original attention and becomes a static matrix.
So, how effective are these new forms of self-attention? And how do they challenge our understanding of the self-attention mechanism?
Machine Translation
The first evaluation task is machine translation, which provides a detailed comparison of the effects of various self-attention forms:
Performance comparison of Synthesizer on machine translation tasks
I don't know how readers feel, but for me these Synthesizer results challenge my existing understanding of self-attention. The table shows that, apart from the fixed Random variant, all of the self-attention forms perform essentially the same, and even the fixed Random variant works reasonably well. This suggests that our past understanding and explanations of self-attention were too narrow and failed to reveal the true reason for its effectiveness.
Summarization and Dialogue
Next are the results on summarization and dialogue generation tasks:
Performance comparison of Synthesizer on summarization and dialogue tasks
On automatic summarization, standard attention works better, but on dialogue generation the outcome is reversed: standard self-attention is the worst, while Dense (D) and Random (R) are the best. Furthermore, when Dense and Random are mixed with standard attention (i.e., D+V and R+V), performance actually degrades. This indicates that standard attention has no absolute advantage: although the Synthesizer variants look like "degenerate" versions of standard attention, they are in fact independent mechanisms, each with its own strengths.
Pre-training + Fine-tuning
Finally, as general readers, we might be more concerned about the effect of "pre-training + fine-tuning," that is, how models like BERT perform after replacing self-attention. The original paper did conduct this experiment, though the baseline was T5 rather than BERT, and the results are as follows:
Performance comparison of Synthesizer in "Pre-training + Fine-tuning"
In these results, Dense and Random compare unfavorably with standard self-attention. This suggests that while Dense and Random may perform better on specific tasks, their transferability is likely weaker. Still, it cannot be denied that variants like Random improve computational efficiency considerably, since they bypass the $\boldsymbol{Q}\boldsymbol{K}^{\top}$ matrix multiplication. So if a way can be found to resolve this transferability issue, the Transformer family of models may well be in for a major overhaul.
Summary
This article introduced Google's new work, Synthesizer, which is a reflection and exploration of the currently popular self-attention mechanism. The paper proposes several new self-attention mechanisms and conducts fairly comprehensive experiments. The experimental results will likely challenge our existing understanding of the self-attention mechanism; it is well worth reading.