By 苏剑林 | April 03, 2023
I never expected that the Bias term could be linked to the length extrapolation of Transformers!
Length extrapolation is an ideal property we hope Transformers possess. I have systematically introduced this issue in "Transformer Upgrade Road: 7. Length Extrapolation and Local Attention" and "Transformer Upgrade Road: 8. Length Extrapolation and Positional Robustness". As for the Bias term (offset term), the current mainstream view is that when the model is large enough, the Bias term does not play a special role. Therefore, many models choose to remove the Bias term, with representative examples being Google's T5 and PaLM. Our subsequent RoFormerV2 and GAU-α also followed this practice.
So, how are these two seemingly "completely unrelated" things connected? Can the Bias term really enhance the length extrapolation of Transformers? Let's dive in.
Hidden Easter Egg
First, why think about exploring the connection between the Bias term and length extrapolation? This is because while I was revisiting the GAU paper "Transformer Quality in Linear Time" a few days ago, I discovered a "hidden easter egg" that I hadn't noticed before—additive relative position encoding. The pseudo-code is:

[Figure: Pseudo-code for GAU's additive relative position encoding]
Here we mainly look at the part for $n \geq 512$. If written as a formula, it is roughly:
\begin{equation}\boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n \quad\to\quad \boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n+ \boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b}\label{eq:rel-bias}\end{equation}
where $\boldsymbol{\mathcal{R}}_m, \boldsymbol{\mathcal{R}}_n$ are the rotation matrices of RoPE, and $\boldsymbol{a}, \boldsymbol{b}$ are two learnable parameters.
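To make the extra term concrete, here is a minimal NumPy sketch of how $\boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b}$ can be computed. This is my own reconstruction rather than the paper's pseudo-code, and the random $\boldsymbol{a}, \boldsymbol{b}$ below are merely stand-ins for the learned parameters:

```python
import numpy as np

def rope(x, pos, base=10000):
    """Apply the RoPE rotation matrix R_pos to a vector x (even dimension)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # frequencies theta_i
    ang = pos * theta                           # rotation angles pos * theta_i
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]         # split dimensions into 2D pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

d = 64
rng = np.random.default_rng(0)
a, b = rng.normal(size=d), rng.normal(size=d)   # stand-ins for the learned a, b

m, n = 3, 10
extra = rope(a, m) @ rope(b, n)                 # a^T R_m^T R_n b
# R_m^T R_n = R_{n-m}, so the term depends only on the relative position n - m:
assert np.isclose(extra, rope(a, 0) @ rope(b, n - m))
```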
I had noticed this additive relative position encoding before, but my comment at the time was simply "I don't understand why several types of position encoding are used simultaneously." Recently, however, I have been thinking about the length extrapolation problem, so this form caught my eye. It can be proved that when $\boldsymbol{a} = \boldsymbol{b} = [\sqrt{\lambda}, 0, \sqrt{\lambda}, 0, \dots, \sqrt{\lambda}, 0]^{\top}$, the result is exactly the Sandwich method introduced in "Transformer Upgrade Road: 7", which improves length extrapolation. The principle is that $\boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b}$ exhibits a decreasing trend with respect to $|m-n|$; when added to the attention matrix, it localizes attention, and according to "Transformer Upgrade Road: 7", attention localization is the key to the length extrapolation of language models.
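To spell out this identification (my own one-line rederivation, using only the block-diagonal structure of RoPE): $\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n = \boldsymbol{\mathcal{R}}_{n-m}$ consists of $2\times 2$ rotation blocks by angles $(n-m)\theta_i$, and each block acts on a pair $(\sqrt{\lambda}, 0)$ of $\boldsymbol{b}$, so
\begin{equation}\boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b} = \boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_{n-m}\boldsymbol{b} = \lambda\sum_{i=1}^{d/2}\cos\big((n-m)\theta_i\big)\end{equation}
which is exactly the decaying bias used by Sandwich.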
So I couldn't help but guess: could this additive relative position encoding in the original paper have been intended to enhance length extrapolation? Were the GAU authors so prescient that they proposed a Sandwich-like idea to address length extrapolation long before Sandwich itself?
Replacing with Bias
However, to me, schemes that enhance length extrapolation by adding an extra term to the attention matrix do not seem elegant enough. So, regardless of the original authors' intention and the actual effect, I am not inclined to follow this route. Is there something similar but nearly "unobtrusive"? It occurred to me that if $\boldsymbol{a}$ and $\boldsymbol{b}$ were instead the Bias terms of $\boldsymbol{q}_m$ and $\boldsymbol{k}_n$ respectively, a similar effect might be achieved. That is, consider:
\begin{equation}(\boldsymbol{q}_m + \boldsymbol{a})^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n(\boldsymbol{k}_n + \boldsymbol{b})\end{equation}
Obviously, simply adding a Bias term is almost "unobtrusive" in terms of both form and computation. If this could enhance length extrapolation, it would undoubtedly be a very beautiful solution. Is it feasible? Let's look at the expanded result:
\begin{equation}\boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n + \boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n + \boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b} + \boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b} \label{eq:bias}\end{equation}
The first and fourth terms correspond exactly to formula $\eqref{eq:rel-bias}$, which are what we want. So we want to see what role the second and third terms play. If they do not have any obvious effect, then the approach of directly adding Bias terms has at least "hope" of achieving extrapolation effects similar to formula $\eqref{eq:rel-bias}$ or Sandwich.
My reasoning is as follows: as the Query and Key of Attention, $\boldsymbol{q}_m$ and $\boldsymbol{k}_n$ should be roughly "isotropic", i.e., their directions are fairly uniformly spread, close to uniform samples on a sphere. $\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n = \boldsymbol{\mathcal{R}}_{n-m}$ is merely an orthogonal transformation, so it does not change this isotropy. Thus, the terms $\boldsymbol{a}^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n$ and $\boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_m^{\top}\boldsymbol{\mathcal{R}}_n\boldsymbol{b}$ amount to inner products between a fixed vector and a vector drawn from an isotropic distribution. According to our discussion in "Distribution of the angle between two random vectors in n-dimensional space", the angle between such a pair of vectors concentrates near 90 degrees in high dimensions; in other words, the expectation of this inner product is 0. Therefore, the second and third terms should, in theory, be weaker than the first and fourth.
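As a sanity check of this reasoning, here is a toy NumPy experiment with Gaussian random vectors standing in for $\boldsymbol{q}_m, \boldsymbol{k}_n$ and for the learned $\boldsymbol{a}, \boldsymbol{b}$; it only illustrates the isotropy argument, not a trained model. The cross terms average out to roughly zero, while the fourth term is a deterministic function of $n-m$:

```python
import numpy as np

def rope(x, pos, base=10000):
    """Apply the RoPE rotation R_pos along the last axis (even dimension)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

d, N = 64, 100_000
rng = np.random.default_rng(0)
q = rng.normal(size=(N, d))                     # isotropic stand-ins for q_m
k = rng.normal(size=(N, d))                     # isotropic stand-ins for k_n
a, b = rng.normal(size=d), rng.normal(size=d)   # fixed stand-ins for the Bias vectors

m, n = 0, 300                        # only n - m matters, since R_m^T R_n = R_{n-m}
qr, kr = rope(q, m), rope(k, n)
ar, br = rope(a, m), rope(b, n)

# The full score of Eq. (2) equals the sum of the four terms of Eq. (3):
full = (rope(q + a, m) * rope(k + b, n)).sum(-1)
terms = (qr * kr).sum(-1) + (ar * kr).sum(-1) + (qr * br).sum(-1) + ar @ br
assert np.allclose(full, terms)

# Cross terms (2nd and 3rd): a fixed vector against an isotropic one, mean ~ 0.
print((ar * kr).sum(-1).mean(), (qr * br).sum(-1).mean())   # both close to 0
# 4th term: deterministic, depends only on n - m (the "Sandwich-like" part).
print(ar @ br)
```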
Of course, this is just a guess. How it actually trains can only be determined through experiments. So, without further delay, I conducted the experiment.
Experimental Results
For this experiment, I chose a language modeling task. The model architecture is still the previous GAU-α. The training length and batch_size are both 512, and the optimizer is Tiger. The only difference between the two models is whether the Bias for Q and K is enabled (other Biases are still removed).
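In code, the toggled difference amounts to nothing more than the following. This is a schematic NumPy sketch with illustrative names such as W_q and b_q, not the actual GAU-α/bert4keras implementation; in GAU, Q and K are in fact obtained from a shared representation via per-dimension scale-and-offset transforms, so the "Bias" being toggled is the offset of that transform, but its effect on the attention score is the same as in formula $\eqref{eq:bias}$:

```python
import numpy as np

def rope(x, pos, base=10000):
    """Apply RoPE rotations R_pos along the last axis; pos has shape (seq_len,)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    ang = pos[:, None] * theta                   # (seq_len, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def attention_logits(x, W_q, W_k, b_q=None, b_k=None):
    """RoPE attention logits; b_q / b_k are the Q/K Bias terms being toggled."""
    pos = np.arange(len(x))
    q = x @ W_q + (0 if b_q is None else b_q)    # "w/o Bias" vs "w/ Bias"
    k = x @ W_k + (0 if b_k is None else b_k)
    q, k = rope(q, pos), rope(k, pos)
    return q @ k.T / np.sqrt(q.shape[-1])
```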
Comparison of extrapolation effects:
$$\begin{array}{c}
\text{LM Accuracy at different test lengths} \\
{\begin{array}{c|cccc}
\hline
& 512 & 1024 & 2048 & 4096 \\
\hline
\text{w/o Bias} & 52.37\% & 33.15\% & 22.85\% & 17.87\% \\
\text{w/ Bias} & 52.75\% & 50.99\% & 45.25\% & 39.55\% \\
\hline
\end{array}}
\end{array}$$
As can be seen, the Bias term hardly affects performance at the training length (512), but it makes a dramatic difference in length extrapolation. A seemingly inconsequential Bias term turns out to have such a magical effect! Of course, rerunning the experiment several times might produce noticeably fluctuating extrapolation results, since length extrapolation here is a "free bonus" rather than something we explicitly optimized for.
To verify whether the remaining mechanism is as we guessed, I visualized the patterns of the four terms in formula $\eqref{eq:bias}$ for a certain layer of a single sample:

[Figure: Comparison of the four inner product terms after adding Bias]
As can be seen, the 4th term indeed shows an attenuation trend, and its magnitude is dominant. Comparing the superposition of these four terms with the model without Bias:

[Figure: Comparison of the Attention matrices with and without Bias]
In the model without Bias (blue), the Attention does show an attenuation trend within the training length (512), but beyond that length it rises again, with no obvious locality; this is why its extrapolation is not good enough. In contrast, consistent with the earlier guess, the attention matrix of the model with the Bias term (orange) shows a more pronounced attenuation trend; in other words, its localization is stronger, which leads to better extrapolation. It should be noted that not every layer of the model with Bias shows such an obvious attenuation trend in its Attention. Generally speaking, the attenuation is more pronounced in the earlier layers and weaker in the later ones, indicating that layers closer to the input focus more on local information, which is consistent with the conclusion of "The Devil in Linear Transformer".
[Note: Later, after repeated testing, it was found that the length extrapolation results in this article are somewhat unstable (potentially closely related to model architecture, hyperparameters, etc.). Please use with discretion.]
Further Thoughts
Now a question arises: haven't previous works on length extrapolation already verified that RoPE's extrapolation is not very good? Did they all omit the Bias? To find out, I checked: "as expected," the "pioneering work" ALIBI and the recent XPOS did not add the Bias term, while KERPLE and Sandwich did. When reading those papers, I always felt that the extrapolation of RoPE reported in KERPLE and Sandwich looked better than in ALIBI and XPOS; now I can be sure this was not an illusion. Since both KERPLE and Sandwich added the Bias, then by the conclusion of this article, their RoPE baselines were indeed capable of better length extrapolation.
Some readers might recall: didn't we say previously that the Bias for the Key in Attention could be removed? Can it be removed here too? Regarding this question, one can refer to the question on Zhihu "Why do some Vision Transformers not need bias in the Key?". In fact, the conclusion that "the Bias for the Key can be omitted" applies to Attention without RoPE. Due to the presence of Softmax, the added bias can be canceled out:
\begin{equation}\frac{e^{\boldsymbol{q}\cdot(\boldsymbol{k}_n + \boldsymbol{b})}}{\sum\limits_n e^{\boldsymbol{q}\cdot(\boldsymbol{k}_n + \boldsymbol{b})}} = \frac{e^{\boldsymbol{q}\cdot\boldsymbol{k}_n}e^{\boldsymbol{q}\cdot\boldsymbol{b}}}{\sum\limits_n e^{\boldsymbol{q}\cdot\boldsymbol{k}_n} e^{\boldsymbol{q}\cdot\boldsymbol{b}}}= \frac{e^{\boldsymbol{q}\cdot\boldsymbol{k}_n}}{\sum\limits_n e^{\boldsymbol{q}\cdot\boldsymbol{k}_n}}\end{equation}
However, this "cancellation" depends on $\boldsymbol{b}$ being independent of $n$. But from formula $\eqref{eq:bias}$, we see that after RoPE, $\boldsymbol{b}$ is effectively a function of $m$ and $n$, and it cannot actually be canceled out. Therefore, for models with RoPE, adding or removing the Bias term will result in different effects.
Another question: why go to the trouble of pursuing length extrapolation at all? Can't we simply fine-tune the model on longer samples? In fact, even if that is your plan, length extrapolation is still beneficial. Setting aside compute, better length extrapolation means a smaller gap between fine-tuning and pre-training, making fine-tuning less prone to catastrophic forgetting; this matters even more for today's LLMs. Of course, one can also think further: the ideal outcome is a model trained on short texts that can be switched to long-text scenarios with no loss in performance, or even with better performance.
Summary
In this article, I shared an "unexpected" yet interesting finding: the Bias term can enhance the length extrapolation of RoPE-based models! A seemingly insignificant Bias term can thus be linked to the length extrapolation of Transformers, which reminds us of the importance of details: sometimes a minor detail plays a critical role.