Road to Transformer Upgrade: 7. Length Extrapolation and Local Attention

By 苏剑林 | January 12, 2023

For Transformer models, length extrapolation is a desirable property we have long been pursuing. It refers to the ability of a model trained on short sequences to be applied to long sequences without fine-tuning while maintaining good performance. The pursuit of length extrapolation stems from two considerations: on one hand, theoretical completeness, since it feels like a property an ideal model should possess; on the other hand, training practicality, since it allows us to train a model usable for long sequences at a lower cost (on shorter sequences).

Below, we analyze the key ideas for strengthening the length extrapolation of Transformers, propose a "super baseline" solution, and then use this "super baseline" to evaluate some related research work.

Misconceptions

The first work to explicitly study the length extrapolation of Transformers was likely ALIBI, published in mid-2021, which isn't very long ago. Why did it take so long (compared to the Transformer's introduction in 2017) for someone to study this topic specifically? It's likely because we long assumed that length extrapolation was simply a matter of position encoding, and that finding a better position encoding would solve it.

In fact, by comparing the extrapolation behavior of existing position encodings, arguments can indeed be found to support this view. For example, several experimental results shared below show that relative position encodings generally extrapolate better than absolute ones, and that functional relative position encodings like RoPE perform better than trained relative position encodings. Thus, it seems that as long as we keep refining position encoding schemes, we can eventually give Transformers better length extrapolation. However, the situation is not that optimistic. Even RoPE, whose extrapolation is comparatively good, can only extrapolate by about 10% to 20% of the training length before performance drops significantly. This ratio is far from expectations: one would ideally want extrapolation by several multiples of the training length before it counts as valuable. Therefore, it is hard to imagine how long it would take to reach substantially longer extrapolation by improving position encodings alone.

Intuitively, many readers might feel that functional position encodings like Sinusoidal or RoPE, which have no trainable parameters, should extrapolate very well. In reality, this is not the case: this type of position encoding shows no particular advantage in length extrapolation. Why? Because when people assume that functional position encodings can extrapolate, they forget the basic premise of extrapolation: "smoothness."

In essence, extrapolation is about inferring the whole from the parts, a familiar idea; Taylor series approximation is a classic example. It needs only the values of a few orders of derivatives at a single point to make effective estimates in a neighborhood, and it relies on the high-order smoothness (the existence and boundedness of high-order derivatives) of the given function. But are Sinusoidal or RoPE such functions? Not really. They are combinations of sine and cosine functions whose phases are $k/10000^{2i/d}$. When $2i/d \approx 0$, the components are approximately $\sin k, \cos k$, which are high-frequency oscillatory functions of the position $k$, rather than straight lines or functions that asymptotically approach straight lines. Thus, the extrapolation behavior of models built on them is often hard to predict. Can we design non-oscillatory position encodings? It is difficult: if a position encoding function does not oscillate, it often lacks the capacity to encode enough position information. In a sense, the complexity of the position encoding function is itself a requirement for encoding positions.
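To make the oscillation concrete, here is a minimal NumPy sketch (the dimension $d$ and the position range are illustrative choices of mine, not values from any of the papers) contrasting the highest-frequency Sinusoidal component, which is essentially $\sin k$, with the lowest-frequency one:

```python
import numpy as np

d = 64                                   # illustrative encoding dimension
k = np.arange(512)                       # positions
i = np.arange(d // 2)                    # component indices
phase = k[:, None] / 10000 ** (2 * i[None, :] / d)   # phase k / 10000^(2i/d)
pe = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)

# The i = 0 component is essentially sin(k): it swings through a full cycle
# roughly every 6.28 positions, so its values at unseen positions cannot be
# read off from a smooth trend.
print(pe[:8, 0])
# The highest-index component has a period of tens of thousands of positions
# and looks almost linear over this range.
print(pe[:8, d // 2 - 1])
```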

Super Baseline

A more accurate positioning would be:

Length extrapolation is a problem of inconsistency between training and prediction lengths.

Specifically, there are two points of inconsistency:

1. Prediction uses unseen position encodings (whether absolute or relative);
2. During prediction, the number of tokens processed by the attention mechanism far exceeds the number during training.

The first point is easy to understand: if something has not been trained on, there is no guarantee it can be handled properly; this is a very realistic phenomenon in Deep Learning, and it applies even to functional encodings like Sinusoidal or RoPE. Regarding the second point, readers might be confused: theoretically, can't Attention handle sequences of any length? What difference does the inconsistency in length make? The answer is entropy. In "Analyzing Scale Operations in Attention from Entropy Invariance", we analyzed this: when attention is averaged over more tokens, the resulting distribution is relatively "flatter" (higher entropy), i.e. attention is more dispersed; conversely, a shorter training length means the attention's entropy is lower and attention is more concentrated. This difference between training and prediction also affects performance.
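The entropy effect is easy to verify numerically; the following rough sketch (mine, with i.i.d. Gaussian scores as a stand-in for real attention logits) shows that, with the same score statistics, a softmax over more tokens has higher entropy:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_attention_entropy(n, trials=1000):
    # i.i.d. scores of the same scale, softmax-ed over n tokens
    scores = rng.standard_normal((trials, n))
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return float(np.mean(-(p * np.log(p)).sum(axis=-1)))

for n in (128, 512, 2048):
    # entropy grows roughly like log(n): longer sequences => flatter attention
    print(n, round(mean_attention_entropy(n), 3))
```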

In fact, for Transformer models with relative position encoding, a very simple Attention Mask can solve both of the above problems at once and achieve near-SOTA performance:


[Figure: Super Baseline Model (Bidirectional)]

[Figure: Super Baseline Model (Unidirectional)]

Understandably, this changes the Attention during prediction into a local Attention, where each token can only see as many tokens as the training length allows. In this way, the number of tokens each token attends to is consistent with training, which solves the second problem. Meanwhile, because relative position encoding is used and positions are counted from the current token as the origin, such local Attention uses no position encodings beyond those seen in training, which solves the first problem. Thus, this simple Attention Mask resolves both major difficulties of length extrapolation without retraining the model. Even more surprisingly, various experimental results show that if this is taken as the baseline, the relative improvement of various related works becomes embarrassingly small; in other words, the baseline itself is already very close to SOTA. It truly is a "super baseline": both fast and effective.
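As a concrete illustration, here is a minimal PyTorch sketch of such a mask for the unidirectional case (my own implementation, not code from the original post): each query at position $m$ is only allowed to attend to keys at positions $n$ with $0 \leq m - n < \text{train\_len}$.

```python
import torch

def local_causal_mask(seq_len: int, train_len: int) -> torch.Tensor:
    """Additive mask: 0 where attention is allowed, -inf where it is blocked."""
    m = torch.arange(seq_len)[:, None]           # query positions
    n = torch.arange(seq_len)[None, :]           # key positions
    keep = (m >= n) & (m - n < train_len)        # causal window of width train_len
    mask = torch.full((seq_len, seq_len), float("-inf"))
    mask[keep] = 0.0
    return mask

# Usage: add the mask to the scaled scores q @ k^T / sqrt(d) before the softmax.
scores = torch.randn(8, 8)
masked = scores + local_causal_mask(8, train_len=4)
```

For the bidirectional version, the window would instead be centered on the current token (e.g. restricting $|m - n|$ rather than $m - n$), matching the first figure above.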

Paper Study

Since ALIBI, many works have been dedicated to research on Transformer length extrapolation. In this section, I have studied and organized some representative works. From these, we can see that they essentially have many similarities with the baseline model, and in some ways, they can even be considered variants of the baseline model. We will further realize the profound link between length extrapolation and the locality of attention.

ALIBI

As the "founding work," ALIBI cannot be ignored. It comes from the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation". In hindsight, the change made by ALIBI is very simple; it merely changes the Attention calculation from $\boldsymbol{q}_m^{\top}\boldsymbol{k}_n$ to \begin{equation}\boldsymbol{q}_m^{\top}\boldsymbol{k}_n - \lambda|m - n|\label{eq:alibi}\end{equation} before the Softmax, where $\lambda > 0$ is a hyperparameter set differently for each head. From this definition, one can see the similarity between ALIBI and the baseline model: both subtract a non-negative matrix before high Softmax, but the matrices being subtracted are different. ALIBI can be viewed as a "smooth version" of the baseline model:


[Figure: Matrix subtracted by the baseline]

[Figure: Matrix subtracted by ALIBI]

ALIBI is a very simple (and effective) smooth local attention trick, but interpreting it as a "position encoding" is perhaps inappropriate. If Eq. $\eqref{eq:alibi}$ were extended to bidirectional attention, then since $|m - n|=|n - m|$, the model would in principle be unable to distinguish "left" from "right"; it could only perceive relative distance, which is clearly insufficient for accurately identifying position information. Its good performance on unidirectional language models is because language models can achieve non-trivial results (significantly better than random) even without any position encoding, and the local attention imposed by ALIBI further strengthens locality, which fits the characteristics of the language modeling task itself.
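For reference, a short sketch of the bias in Eq. $\eqref{eq:alibi}$ for the unidirectional case (the geometric per-head slopes follow the recipe in the ALIBI paper for a power-of-two number of heads; treat the exact values here as an assumption of this sketch):

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    # Per-head slopes lambda: 2^-1, 2^-2, ..., 2^-8 when num_heads == 8
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    m = torch.arange(seq_len)[:, None]
    n = torch.arange(seq_len)[None, :]
    dist = (m - n).clamp(min=0)                  # causal case: only m >= n matters
    return -slopes[:, None, None] * dist         # shape (num_heads, seq_len, seq_len)

# Added per head to q @ k^T / sqrt(d) before the causal mask and the softmax.
bias = alibi_bias(16, num_heads=8)
```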

KERPLE

KERPLE comes from the paper "KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation". It is essentially a simple generalization of ALIBI, introducing two trainable parameters $r_1, r_2$ to generalize Eq. $\eqref{eq:alibi}$: \begin{equation}\left\{\begin{aligned}&\boldsymbol{q}_m^{\top}\boldsymbol{k}_n - r_1|m - n|^{r_2} ,& r_1 >0, 0 < r_2 \leq 2\\ &\boldsymbol{q}_m^{\top}\boldsymbol{k}_n - r_1\log(1+r_2|m - n|),& r_1, r_2 > 0 \end{aligned}\right.\label{eq:kerple}\end{equation} With the added generality and trainable parameters, it is not surprising that KERPLE achieves better results than ALIBI. However, a severe criticism must be made of KERPLE's "mystification." Judging from the layout, the third section of the original paper is meant as theoretical support, but it is clear that it was added solely to inflate the mathematical depth of the article; it provides no help in understanding KERPLE and may even diminish the reader's interest (to put it bluntly, it serves the reviewers, not the readers).
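A minimal sketch of the two biases in Eq. $\eqref{eq:kerple}$ (the softplus reparameterization used here to keep $r_1, r_2$ positive is my own choice for the sketch, not necessarily what the paper does):

```python
import torch
import torch.nn.functional as F

def kerple_bias(dist: torch.Tensor, r1_raw: torch.Tensor,
                r2_raw: torch.Tensor, kind: str = "log") -> torch.Tensor:
    """dist = |m - n| matrix; r1_raw, r2_raw are unconstrained trainable scalars."""
    r1 = F.softplus(r1_raw)                              # r1 > 0
    r2 = F.softplus(r2_raw)                              # r2 > 0
    if kind == "power":
        return -r1 * dist.float() ** r2.clamp(max=2.0)   # power variant, 0 < r2 <= 2
    return -r1 * torch.log1p(r2 * dist.float())          # logarithmic variant

m = torch.arange(16)
dist = (m[:, None] - m[None, :]).abs()
bias = kerple_bias(dist, torch.tensor(0.5), torch.tensor(0.5))
```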

Sandwich

Sandwich was also created by the authors of KERPLE in the paper "Receptive Field Alignment Enables Transformer Length Extrapolation", which appeared on Arxiv last month. It replaces Eq. $\eqref{eq:alibi}$ with \begin{equation}\boldsymbol{q}_m^{\top}\boldsymbol{k}_n + \lambda\boldsymbol{p}_m^{\top}\boldsymbol{p}_n\label{eq:sandwich}\end{equation} where $\boldsymbol{p}_m, \boldsymbol{p}_n$ are Sinusoidal position encodings, and $\lambda > 0$ is a hyperparameter. From "Road to Transformer Upgrade: 1. Sinusoidal Position Encoding", we know that $\boldsymbol{p}_m^{\top}\boldsymbol{p}_n$ is a scalar function of $m-n$, and on average, it is a monotonic decreasing function of $|m-n|$, so its role is similar to $-\lambda|m-n|$. The emphasis on "on average" is because $\boldsymbol{p}_m^{\top}\boldsymbol{p}_n$ as a whole is not strictly monotonic; it oscillates downwards, as shown in the figure:


[Figure: Plot of $\boldsymbol{p}_m^{\top}\boldsymbol{p}_n$ (with $d/2$ subtracted)]
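The shape of this curve is easy to reproduce; a quick NumPy check (with an illustrative dimension $d$ of my choosing) confirms that $\boldsymbol{p}_m^{\top}\boldsymbol{p}_n$ depends only on $m-n$, starts at $d/2$, and trends downward while oscillating:

```python
import numpy as np

def sinusoidal(pos: int, d: int = 128) -> np.ndarray:
    i = np.arange(d // 2)
    phase = pos / 10000 ** (2 * i / d)
    return np.concatenate([np.sin(phase), np.cos(phase)])

# p_m . p_n = sum_i cos((m - n) * theta_i): a function of m - n only
dots = [sinusoidal(0) @ sinusoidal(n) for n in range(256)]
print(dots[0])       # equals d/2 = 64 at m = n
print(dots[:8])      # decays as the relative distance grows
print(dots[60:68])   # at larger distances the decay is oscillatory, not monotone
```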

If necessary, we can also convert Sandwich into a form like RoPE, where "absolute position encodings implement relative ones." This only requires noting that: \begin{equation}\boldsymbol{q}_m^{\top}\boldsymbol{k}_n + \lambda\boldsymbol{p}_m^{\top}\boldsymbol{p}_n = \left[\boldsymbol{q}_m, \sqrt{\lambda}\boldsymbol{p}_m\right]^{\top}\left[\boldsymbol{k}_n, \sqrt{\lambda}\boldsymbol{p}_n\right]\end{equation} In other words, Sandwich supplements absolute position information through concatenation, and the Attention result corresponds to relative position encoding. However, it seems this conversion currently has only theoretical value because concatenation increases vector dimensions, which in turn increases the computational load of Attention.

XPOS

XPOS comes from the paper "A Length-Extrapolatable Transformer", which appeared on Arxiv on the same day as Sandwich. It is a direct generalization of RoPE. We know the basic solution for RoPE is: \begin{equation}\boldsymbol{q}_m\to \boldsymbol{\mathcal{R}}_m\boldsymbol{q}_m,\quad \boldsymbol{k}_n\to \boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n\end{equation} where $\boldsymbol{\mathcal{R}}_n=\begin{pmatrix}\cos n\theta & -\sin n\theta\\ \sin n\theta & \cos n\theta\end{pmatrix}$. When I derived RoPE, I assumed that "$Q$ and $K$ undergo the same transformation." In fact, from the perspective of "implementing relative positions via absolute ones," the two transformations need not be identical. For example, XPOS considers: \begin{equation}\boldsymbol{q}_m\to \boldsymbol{\mathcal{R}}_m\boldsymbol{q}_m \xi^m,\quad \boldsymbol{k}_n\to \boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n \xi^{-n}\end{equation} where $\xi$ is a scalar hyperparameter. In this way: \begin{equation}(\boldsymbol{\mathcal{R}}_m\boldsymbol{q}_m \xi^m)^{\top}(\boldsymbol{\mathcal{R}}_n\boldsymbol{k}_n \xi^{-n}) = \boldsymbol{q}_m^{\top}\boldsymbol{\mathcal{R}}_{n-m}\boldsymbol{k}_n \xi^{m-n}\end{equation} The result still depends only on the relative position $m-n$. However, the problem now is that the exponent is $m-n$ rather than $|m-n|$: as long as $\xi\neq 1$, one side will always diverge. XPOS cleverly chooses its scenario (just like many related works) and focuses only on unidirectional language models, so that only the $m \geq n$ part of attention is used. At this point, simply choosing $\xi\in(0,1)$ achieves decay as the relative distance increases.
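Here is a minimal 2D sketch of the XPOS transform for a single rotation block (the values of $\theta$, $\xi$, and the positions are illustrative, not the paper's settings), confirming that the score depends only on $m-n$:

```python
import numpy as np

def rot(pos: float, theta: float = 1.0) -> np.ndarray:
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]])           # the 2x2 rotation R_pos

def xpos_q(q, m, xi=0.98):
    return rot(m) @ q * xi ** m                  # q -> R_m q * xi^m

def xpos_k(k, n, xi=0.98):
    return rot(n) @ k * xi ** (-n)               # k -> R_n k * xi^(-n)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(2), rng.standard_normal(2)
# Shifting both positions by the same amount leaves the score unchanged,
# and the factor xi^(m - n) decays as the (causal) relative distance grows.
print(xpos_q(q, 5) @ xpos_k(k, 3))
print(xpos_q(q, 105) @ xpos_k(k, 103))           # same value (up to float error)
```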

In fact, the introduction of the exponential decay term $\xi^m$ is not original to XPOS. Among the literature I have read, the same term appeared earliest in PermuteFormer, although PermuteFormer was mainly concerned with linear attention. In terms of detail, XPOS assigns a different $\xi$ to each block, but in private communication the author mentioned having run an experiment with a single shared $\xi$, and the improvement from using different $\xi$ was almost negligible. Additionally, the value of $\xi$ must be controlled carefully to prevent $\xi^{-n}$ from overflowing when $n$ is large.

It is worth noting that the decay here acts directly on the pre-Softmax Attention Score: Scores at larger relative distances are pushed towards 0 rather than towards negative infinity as in the previous designs. Since $e^0 = 1$, this has no zeroing effect, so this design is not a variant of local attention, and accordingly its effect alone does not reach SOTA. To close this gap, XPOS designs a special local attention (called Blockwise Causal Attention, or BCA). In our discussion, the author indicated that BCA was chosen for implementation convenience, and that the baseline model's local attention actually works better. So, in the end, local attention is still the key to extrapolation.
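To see why multiplicative decay alone does not act as a mask, a tiny numerical comparison (mine): scaling a score towards 0 still leaves $e^0 = 1$ after exponentiation, whereas a large negative additive bias effectively removes the token:

```python
import numpy as np

scores = np.array([2.0, 2.0, 2.0, 2.0])
decay = 0.6 ** np.arange(4)                   # xi^(m-n)-style multiplicative decay

mult = np.exp(scores * decay); mult /= mult.sum()
add = np.exp(scores + np.array([0.0, -30.0, -30.0, -30.0])); add /= add.sum()

print(mult)   # distant tokens keep non-negligible weight (scores -> 0, e^0 = 1)
print(add)    # distant tokens are effectively masked out
```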

The original paper has very rich and valuable experiment results; I suggest everyone read it carefully~

Summary

This article has summarized work on enhancing the length extrapolation capability of Transformers, including a simple but powerful baseline solution and several papers dedicated to length extrapolation. From these, we can see that these works are essentially variants of the baseline solution, and that local attention is one of the keys to length extrapolation.