By 苏剑林 | June 07, 2022
Position encoding is a crucial part of the Transformer. In "Transformer Position Encodings That Make Researchers Rack Their Brains", we summarized several common designs for position encoding. Broadly speaking, we categorize Transformer position encodings into two types: "absolute position encoding" and "relative position encoding," with the latter generally performing better in many NLP/CV experiments.
However, it is worth noting that almost all current relative position encodings act on the Attention matrix before the Softmax. This way of applying them has a theoretical defect that prevents the Transformer from being a "universal approximator." This article analyzes the problem and explores some solutions.
Simple Probe
As the name suggests, position encoding is used to supplement the model with positional information. So, how do we determine if a model has sufficient capability to recognize positions? I previously conceived a simple probe experiment:
A model with the capability to identify positions should be able to accurately realize the following mapping:
\begin{equation}
\begin{array}{lc}
\text{Input: } & [0, 0, \cdots, 0, 0] \\
& \downarrow \\
\text{Output: } & [1, 2, \cdots, n-1, n]
\end{array}
\end{equation}
That is, given $n$ zeros as input, it should be able to output the position indices $1 \sim n$ in order. The idea behind this probe is simple: if a model can do this, it shows that identifying positions is an inherent capability of the model itself, independent of the input, which is exactly what we want. It is not hard to see that since absolute position encoding is added directly to the input, it can easily pass this probe.
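As a concrete illustration, here is a minimal NumPy sketch of what such a probe dataset looks like; the sequence length `n` and the `model` interface in the comments are placeholders, not part of any specific implementation:

```python
import numpy as np

# A minimal sketch of the probe data: constant inputs, position indices as targets.
# The sequence length n and the commented-out model call are placeholders.
n = 8
probe_input = np.zeros((1, n), dtype=np.int64)   # [0, 0, ..., 0]
probe_target = np.arange(1, n + 1)[None, :]      # [1, 2, ..., n]

# A model "passes" the probe if it can fit this mapping exactly, e.g.
#   preds = model(probe_input).argmax(-1)
#   passed = np.array_equal(preds, probe_target)
```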
Unable to Perform
However, when I examined Transformer models with relative position encoding against this simple probe, I found that almost all of them fail the task.
Specifically, except for the design proposed in "Self-Attention with Relative Position Representations", all other relative position encodings (including RoPE proposed by me) only modify the Attention matrix before the Softmax. Consequently, the Attention matrix with relative position information remains a stochastic matrix (i.e., each row sums to 1).
On the other hand, for a Transformer model, the sole source of interaction between tokens is the $\boldsymbol{A}\boldsymbol{V}$ step of Self-Attention, which can be written as $\boldsymbol{o}_i = \sum\limits_j a_{i,j}\boldsymbol{v}_j$. Identical inputs mean that every $\boldsymbol{v}_j$ is the same, therefore:
\begin{equation}\boldsymbol{o}_i = \sum_j a_{i,j}\boldsymbol{v}_j = \sum_j a_{i,j}\boldsymbol{v} = \left(\sum_j a_{i,j}\right)\boldsymbol{v} = \boldsymbol{v}\end{equation}
This means every $\boldsymbol{o}_i$ is also the same. In other words, every position in the model outputs the same result from beginning to end, so it is fundamentally impossible for the model to output distinct values like $[1, 2, \cdots, n-1, n]$.
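This argument is easy to verify numerically. In the NumPy sketch below, an arbitrary random matrix $\boldsymbol{B}$ stands in for the pre-Softmax scores with any relative position bias added; as long as the rows of $\boldsymbol{A}$ sum to 1 and all $\boldsymbol{v}_j$ coincide, every output row is identical:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4

# Identical inputs at every position, so every v_j is the same vector.
v = rng.normal(size=d)
V = np.tile(v, (n, 1))                                 # (n, d)

# Arbitrary pre-softmax scores, e.g. with any relative-position bias folded in.
B = rng.normal(size=(n, n))
A = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)   # rows sum to 1

O = A @ V
print(np.allclose(O, O[0]))                            # True: every o_i equals v
```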
Similar findings also appeared in the recent paper "Your Transformer May Not be as Powerful as You Expect", where the authors constructed slightly different examples to demonstrate the fitting capability defect of relative position encoding Transformers. The two ideas coincide perfectly. Furthermore, the beginning of this article mentions "universal approximation"; if this counterexample is resolved, can "universal approximation" be achieved? That paper also provides corresponding theoretical analysis to affirm this fact, which won't be detailed here.
Preliminary Solution
A little thought reveals that the problem mainly stems from the fact that each row of the Attention matrix sums to 1. To solve this, one needs to find a way to break this constraint. To this end, "Your Transformer May Not be as Powerful as You Expect" further proposed the following design based on their findings:
\begin{equation}\boldsymbol{O} = (\boldsymbol{A}\odot \boldsymbol{C})\boldsymbol{V}\quad \text{or equivalently}\quad\boldsymbol{o}_i = \sum_j a_{i,j}c_{i,j}\boldsymbol{v}_j\end{equation}
where $\boldsymbol{C}$ is a trainable parameter matrix, and $\odot$ is the element-wise product (Hadamard product). To ensure the entire model still only contains relative position information (as this article discusses the defects of relative position encoding Transformers), we must constrain $\boldsymbol{C}$ to be a Toeplitz matrix, i.e., $c_{i,j}=g(i-j)$.
With $\boldsymbol{C}$ added, the rows of $\boldsymbol{A}\odot \boldsymbol{C}$ clearly no longer necessarily sum to 1, which breaks the restriction and solves the problem (for more experimental results, refer to the original paper). However, this introduces a new parameter matrix, and since $\boldsymbol{C}$ itself has finite size, it does not handle variable-length inputs well (unless $\boldsymbol{C}$ takes the form $c_{i,j}=g(\text{clip}(i-j, p_{\min}, p_{\max}))$). Overall, it feels insufficiently concise and elegant.
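To make the shape of this construction concrete, here is a toy NumPy sketch in which a random per-offset table stands in for the trainable gates $g$; it is only meant to show that the rows of $\boldsymbol{A}\odot\boldsymbol{C}$ are no longer forced to sum to 1:

```python
import numpy as np

def toeplitz_gate(g, n):
    """Build C with c[i, j] = g(i - j), so C carries only relative positions."""
    idx = np.arange(n)
    return g(idx[:, None] - idx[None, :])              # (n, n)

rng = np.random.default_rng(0)
n, d = 6, 4

# A toy stand-in for the trainable per-offset gates in the paper.
offsets = rng.normal(size=2 * n - 1)
g = lambda r: offsets[r + n - 1]

B = rng.normal(size=(n, n))
A = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)
C = toeplitz_gate(g, n)

V = rng.normal(size=(n, d))
O = (A * C) @ V                                        # rows of A*C need not sum to 1
print((A * C).sum(axis=1))                             # generally != 1
```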
Removing the Denominator
Returning again to the source of the problem: the sum of each row in the Attention matrix equals 1. What operation leads to this phenomenon? The answer is obvious: Softmax.
\begin{equation}a_{i,j} = \frac{e^{b_{i,j}}}{\sum\limits_j e^{b_{i,j}}}\end{equation}
Here $\boldsymbol{B}=(b_{i,j})$ is the matrix before Softmax. Clearly, it is the step of "dividing by $\sum\limits_j e^{b_{i,j}}$" that results in $\sum\limits_j a_{i,j}=1$. So, a very direct thought is:
If I don't want $\sum\limits_j a_{i,j}=1$, why not just stop dividing by $\sum\limits_j e^{b_{i,j}}$?
In fact, it works! Experiments show that a Transformer that skips this denominator can pass the probe test above. One cannot help but admire the "foresight" of GAU here: the new attention it proposes uses a $\text{relu}^2$ activation and then simply divides by $n$ for normalization, avoiding $\sum\limits_j a_{i,j}=1$ and thereby enhancing the model's theoretical capability (though perhaps the authors did not think that far; this may just be my over-interpretation).
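For comparison, here is a minimal NumPy sketch of the two normalizations; the GAU scaling is simplified here (the real GAU also involves gating and per-dimension scale/offset terms, which are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

B = Q @ K.T / np.sqrt(d)

# Standard softmax: rows sum to 1, so the probe argument above applies.
A_softmax = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)

# GAU-style normalization: relu(B)^2 / n, rows no longer sum to 1.
A_relu2 = np.maximum(B, 0.0) ** 2 / n

print(A_softmax.sum(axis=1))   # all ones
print(A_relu2.sum(axis=1))     # generally not ones
```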
New Normalization
However, we found in "It seems Attention and Softmax go better together..." that attention designs without probability normalization, like those in GAU, might suffer from poor extrapolation capabilities. In other words, performing probability normalization leads to the theoretical defect mentioned earlier, while simply dividing by $n$ for normalization might result in poor extrapolation. Is there a scheme that can balance both?
Let's brainstorm further. From the perspective of norms, $\sum\limits_j e^{b_{i,j}}$ is actually the $l_1$ norm of the vector $e^{b_{i,:}}$. Thus, Softmax is essentially an $l_1$ normalization operation on the vector $e^{b_{i,:}}$. To avoid $\sum\limits_j a_{i,j}=1$ while retaining normalization, would switching to other normalization operations work? For example, $l_2$ normalization:
\begin{equation}a_{i,j} = \frac{e^{b_{i,j}}}{\sqrt{\sum\limits_j e^{2b_{i,j}}}}\end{equation}
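A minimal NumPy sketch of this $l_2$-normalized attention is given below; subtracting the row-wise maximum is only the usual numerical-stability trick and does not change the result:

```python
import numpy as np

def l2_normalized_attention(B):
    """a_{i,j} = exp(b_{i,j}) / sqrt(sum_j exp(2 b_{i,j})): rows have unit l2 norm."""
    E = np.exp(B - B.max(axis=1, keepdims=True))       # subtract max for stability
    return E / np.sqrt((E ** 2).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
B = rng.normal(size=(6, 6))
A = l2_normalized_attention(B)
print(np.linalg.norm(A, axis=1))   # each row has l2 norm 1
print(A.sum(axis=1))               # but row sums are no longer forced to 1
```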
According to my tests, this $l_2$-normalized Attention can indeed pass the probe experiment. So, does the change help in the NLP pre-training scenarios we care about more? I ran corresponding comparative experiments, and the results fall into two cases:
1. For the standard Attention + FFN combination, before applying $l_2$ normalization to Attention, one must first reduce the initialization variance of Attention's $\boldsymbol{W}_V$ and $\boldsymbol{W}_O$. The results were then slightly worse than with the conventional $l_1$-normalized Attention.
2. For the full-GAU architecture, $l_2$-normalized Attention can be applied directly without changing initialization. The experimental results were slightly better than the conventional $l_1$-normalized Attention.
The difference likely originates from their respective initializations. In the standard Attention + FFN combination, the initial Attention matrix is close to a uniform matrix (all entries equal), whereas in "Does Gated Attention Unit (GAU) still need Warmup?" we analyzed that the initial Attention matrix of GAU is closer to an identity matrix (up to a constant factor).
A Turn of Events
Looking back over the whole argument, the reason a model with $\sum\limits_j a_{i,j}=1$ cannot complete the probe experiment is that "every $\boldsymbol{v}_j$ is the same." But what if not all $\boldsymbol{v}_j$ are the same?
We know that starting from BERT, mainstream Transformer models have designed inputs like "[CLS] SENT [SEP]". That is, some marker tokens are attached before and after the input. If we consider these marker tokens as part of the model rather than the input (meaning the input is "[CLS] 0 0 ... 0 0 [SEP]" instead of all zeros), is it possible to complete the probe?
I experimented with this as well and found that, after supplementing the input with marker tokens, the probe experiment can indeed be completed without modifying any other part of the relative position encoding Transformer. The result is somewhat ironic: it turns out the authors of BERT also had great "foresight," since the added special tokens [CLS] and [SEP] also help with positioning. The theoretical defect we analyzed at such length is in fact resolved by two special tokens. This is reminiscent of the conclusion in "How Much Position Information Do Convolutional Neural Networks Encode?" that "CNNs identify absolute positions through padding"; the two share the same underlying logic.
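The effect of a single distinct boundary token is again easy to illustrate numerically. In the toy sketch below, all "content" positions share one value vector, a distinct [CLS] vector is prepended, and the pre-Softmax scores carry only a relative-position bias; even though each row of $\boldsymbol{A}$ still sums to 1, the outputs now differ across positions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4

# All "content" positions share one embedding; a distinct [CLS] vector is prepended.
v_tok = rng.normal(size=d)
v_cls = rng.normal(size=d)
V = np.vstack([v_cls, np.tile(v_tok, (n, 1))])          # (n+1, d)

# Pre-softmax scores carrying only relative position information: b_{i,j} = g(i - j).
idx = np.arange(n + 1)
g = rng.normal(size=2 * n + 1)
B = g[idx[:, None] - idx[None, :] + n]

A = np.exp(B) / np.exp(B).sum(axis=1, keepdims=True)    # still row-stochastic
O = A @ V
print(np.allclose(O[1], O[2]))                          # False: positions now differ
```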
Of course, this does not mean our earlier reflections were pointless. For example, for the GAU model, switching Attention to $l_2$ normalization does accelerate convergence and slightly improve performance. Furthermore, since $l_2$ normalization is acceptable, can $e^{b_{i,j}}$ be replaced by a general activation function (e.g., dropping the non-negativity constraint)? I also briefly experimented with "$\text{swish}(b_{i,j})$ + $l_2$ normalization" and found it to be workable. From this perspective, Attention under $l_2$ normalization actually has more room for extension.
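As a rough sketch of such a variant (not the exact configuration I tested), one could replace $e^{b_{i,j}}$ with $\text{swish}(b_{i,j})$ and keep the row-wise $l_2$ normalization, which also allows negative attention weights:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))                       # x * sigmoid(x)

def swish_l2_attention(B, eps=1e-12):
    """swish(b_{i,j}) followed by row-wise l2 normalization; entries may be negative."""
    S = swish(B)
    return S / (np.sqrt((S ** 2).sum(axis=1, keepdims=True)) + eps)

rng = np.random.default_rng(0)
A = swish_l2_attention(rng.normal(size=(6, 6)))
print(A.min() < 0)                                      # negative attention weights allowed
```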
Closing
This article analyzed an implicit defect of the relative position encoding Transformer and explored corresponding countermeasures, leading to reflections on the non-negativity and normalization methods of the Attention matrix.