Transformer Upgrade Journey: 8. Length Extrapolation and Positional Robustness

By 苏剑林 | January 31, 2023

In the previous article "Transformer Upgrade Journey: 7. Length Extrapolation and Local Attention," we discussed the length extrapolation of Transformers. We concluded that length extrapolation is an issue of inconsistency between training and prediction, and that the main idea for resolving this inconsistency is to localize attention; many improvements with good extrapolation performance are, in a sense, variants of local attention. Admittedly, although various current language-model metrics suggest that local attention can indeed solve the length extrapolation problem, this "forced truncation" approach may not satisfy some readers' aesthetic preferences: the traces of manual craftsmanship are too strong, and it lacks a sense of naturalness. It also raises doubts about its effectiveness on tasks other than language modeling.

In this article, we revisit the problem of length extrapolation from the perspective of the model's robustness to positional encodings. This approach can improve the length extrapolation effect of Transformers without modifying the attention mechanism itself, and it is applicable to various positional encodings. Overall, the method is more elegant and natural, and it is also suitable for non-language modeling tasks.

Problem Analysis

In previous articles, we analyzed the reasons for length extrapolation and positioned it as "a problem of length inconsistency between training and prediction." Specifically, there are two points of inconsistency:

  1. Prediction uses positional encodings that were never seen during training (regardless of whether they are absolute or relative);
  2. During prediction, the number of tokens processed by the attention mechanism far exceeds the number during training.

The second point refers to the fact that more tokens lead to more dispersed attention (or increased entropy of attention), which causes inconsistency between training and prediction. We have already discussed and addressed this in "Looking at Attention Scaling from the Perspective of Entropy Invariance," where the answer was to modify the Attention from:

\begin{equation}Attention(Q,K,V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V\end{equation}

to:

\begin{equation}Attention(Q,K,V) = \text{softmax}\left(\frac{\log_{m} n}{\sqrt{d}}QK^{\top}\right)V\end{equation}

where $m$ is the training length and $n$ is the prediction length. With this modification (referred to below as "$\log n$ scaled attention"), the entropy of attention changes more smoothly with length, mitigating this inconsistency. Personal experimental results show that, at least in MLM tasks, the length extrapolation performance of "$\log n$ scaled attention" is better.
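For concreteness, here is a minimal numpy sketch of "$\log n$ scaled attention" (single head, no mask); the function name, the softmax helper, and the way the training length $m$ is passed in are my own choices for illustration, not code from the original articles.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_n_scaled_attention(Q, K, V, train_len):
    """Single-head attention with the log_m(n) / sqrt(d) scale described above.
    Q, K, V: arrays of shape (n, d); train_len: the training length m."""
    n, d = Q.shape
    scale = np.log(n) / np.log(train_len) / np.sqrt(d)  # log_m(n) / sqrt(d)
    return softmax(scale * Q @ K.T) @ V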

Therefore, we can consider the second point of inconsistency to be initially resolved, so we should next concentrate on solving the first point of inconsistency.

Random Positions

The first point of inconsistency is "predicting with positional encodings that were not trained." To solve it, we should "train, during the training phase, the positional encodings that will be used in prediction." The paper "Randomized Positional Encodings Boost Length Generalization of Transformers" (still under anonymous review when this article was written; it later appeared at ACL 2023) first considered this issue from this angle and proposed a solution.

The logic of the paper is simple: Random Position Training. Let $N$ be the training length (the paper uses $N=40$) and $M$ be the prediction length (the paper uses $M=500$), and select a sufficiently large $L > M$ (a hyperparameter; the paper uses $L=2048$). During the training phase, instead of using $[0,1,\cdots,N-2,N-1]$ as the position sequence for a sequence of length $N$, we randomly select $N$ non-repeating integers from $\{0,1,\cdots,L-2,L-1\}$ and sort them in ascending order to serve as the position sequence.

Reference code based on numpy is:

import numpy as np

def random_position_ids(N, L=2048):
    """Randomly pick N non-repeating integers from [0, L) and sort them in ascending order.
    """
    return np.sort(np.random.permutation(L)[:N])

During the prediction phase, one can either sample the position sequence randomly in the same way or directly pick $M$ evenly spaced points from the same range (personal experiments show that even spacing usually works better). This solves the problem of the positional encodings used at prediction time never having been trained. It is easy to see that this is a simple training trick (referred to below as "Random Position Training") aimed at making the Transformer more robust to the choice of positions; yet, as we will see, it brings a significant improvement in length extrapolation. I have also run experiments on MLM tasks, and the results show it is effective for MLM as well, especially when combined with "$\log n$ scaled attention," where the improvement is even more pronounced (the original paper did not include the "$\log n$ scaled attention" step).
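As a companion sketch for the prediction phase (my own reading of "pick points evenly from the same range," not code from the paper), the evenly spaced positions could be generated like this:

import numpy as np

def uniform_position_ids(M, L=2048):
    """Pick M roughly evenly spaced integer positions from [0, L) for prediction."""
    return np.round(np.linspace(0, L - 1, M)).astype(int)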

A New Benchmark

Many related works, including the various local attention schemes and variants mentioned in the previous article, build their evaluation metrics on language modeling tasks. However, whether it is unidirectional GPT or bidirectional MLM, language models rely heavily on local information (locality). Previous solutions might therefore show good extrapolation simply because of the locality of language modeling; on a non-local task, their effect might deteriorate. Perhaps for this reason, the evaluation in the Randomized Positional Encodings paper is not a conventional language modeling task, but rather a length generalization benchmark proposed by Google last year in the paper "Neural Networks and the Chomsky Hierarchy" (referred to below as the "CHE benchmark"). This provides a new perspective for understanding length extrapolation.

This benchmark includes several tasks categorized into three levels: R (Regular), DCF (Deterministic Context-Free), and CS (Context-Sensitive), with difficulty increasing at each level. A brief introduction to each task is as follows:

[Table from the original paper: brief descriptions of the CHE benchmark tasks, grouped by level (R / DCF / CS)]

As we can see, these tasks all share a common feature: their operations have fixed simple rules, and theoretically, the inputs are of unlimited length. Thus, we can train on short sequences and test whether the training results on short sequences can generalize to long sequences. In other words, it serves as a very strong testing benchmark for length extrapolation.

Experimental Results

First, let's look at the experimental results of the original paper "Neural Networks and the Chomsky Hierarchy," which compared several RNN models and Transformer models (the evaluation metric is the average per-character accuracy, not the exact-match rate over the entire sequence):

[Figure: Comparison of several models on several length extrapolation test tasks]

The results might be surprising: the currently "booming" Transformer has the worst length extrapolation effect (the Transformer here was tested with different positional encodings and the best value for each task was taken), while the best is Tape-RNN. The paper gave them the following ratings:

$$ \underbrace{\text{Transformer}}_{\text{R}^-} < \underbrace{\text{RNN}}_{\text{R}} < \underbrace{\text{LSTM}}_{\text{R}^+} < \underbrace{\text{Stack-RNN}}_{\text{DCF}} < \underbrace{\text{Tape-RNN}}_{\text{CS}} $$

The "Random Position Training" method proposed in "Randomized Positional Encodings Boost Length Generalization of Transformers" recovered some ground for the Transformer:

[Figure: Comparison of length extrapolation effects with and without random position training for Transformers with different positional encodings]

It can be seen that with Random Position Training, the length extrapolation of Transformers improved significantly for every kind of positional encoding, which further verifies the conclusion of the previous article: length extrapolation has little to do with the design of the positional encoding itself. Notably, Random Position Training achieved 100% accuracy for the first time on the Bucket Sort task. Although the overall performance is still lacking, this is a significant step forward compared to previous results (I wonder whether it could be improved further by combining it with "$\log n$ scaled attention"?). Also noteworthy is that ALiBi, which performs well on language modeling tasks, shows no advantage on the CHE benchmark; in particular, after adding Random Position Training, its average metric is worse than RoPE's. This initially confirms the earlier suspicion: the good performance of various local attention variants is likely due to the strong locality of language modeling evaluation tasks themselves, and these methods have no advantage on the non-local CHE benchmark.

Reflections on the Principle

Looking closer, "Random Position Training" is somewhat confusing. For simplicity, let's set $L=2048, N=64, M=512$. In this case, the average position sequence used in the training phase is roughly $[0, 32, 64, \cdots, 2016]$, while the average position sequence used in the prediction phase is $[0, 4, 8, \cdots, 2044]$. The difference between adjacent positions is different between training and prediction, which is also a kind of inconsistency, yet it still performs well. Why?
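The "average position sequence" figures above are easy to check numerically; the following throwaway snippet (my own, purely for illustration) averages many sorted random draws and confirms that the mean spacing is close to $L/N = 32$:

import numpy as np

L, N, trials = 2048, 64, 10000
draws = np.sort(np.stack([np.random.permutation(L)[:N] for _ in range(trials)]), axis=-1)
avg = draws.mean(axis=0)
print(avg[0], avg[-1], np.diff(avg).mean())  # roughly 31, 2016 and 31.5: spacing close to L / N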

We can understand it from the perspective of "Order." Since the position IDs are randomly sampled during the training phase, the difference between adjacent positions is also random. Consequently, whether it is relative or absolute position, it is unlikely the model can acquire position information through precise position IDs. Instead, it gets a fuzzy position signal—or more accurately, it encodes position through the "order" of the position sequence rather than the position ID itself. For example, position sequence $[1,3,5]$ is equivalent to $[2,4,8]$ because they are both sequences arranged from smallest to largest. Random Position Training "forces" the model to learn an equivalence class: all position sequences arranged from smallest to largest are equivalent and interchangeable. This is the true meaning of positional robustness.

However, my own experimental results on MLM show that learning this "equivalence class" is still somewhat difficult for the model. A more ideal method would still use random positions during training so that the positional encodings used in prediction are also trained, but the early part of the position sequence in prediction should be consistent with the average result of the random positions. In the previous example, if the position sequence used in prediction is $[0, 4, 8, \cdots, 2044]$, we would want the average result of the random positions in training to be $[0, 4, 8, \cdots, 252]$ (i.e., the first $N$ elements of the sequence $[0, 4, 8, \cdots, 2044]$) rather than $[0, 32, 64, \cdots, 2016]$. This way, the consistency between training and prediction is tighter.

Extensions and Generalizations

Therefore, I considered the following approach:

Equal-Mean Random Position Training: Suppose $n$ follows a distribution with a mean of $N$ and a sampling space of $[0, \infty)$. During training, randomly sample an $n$, then uniformly pick $N$ points from $[0, n]$ as the position sequence.

Reference code:

import numpy as np

def random_position_ids(N):
    """Sample n randomly, then pick N points uniformly from [0, n].
    """
    n = sample_from_xxx()  # draw n from a distribution with mean N (see below)
    return np.linspace(0, 1, N) * n

Note that the position sequence sampled this way consists of floating-point numbers. Therefore, it is not suitable for discrete trainable positional encodings and only applies to functional positional encodings like Sinusoidal or RoPE. Below, assume we only consider functional positional encodings.
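To make the "floating-point positions" point concrete, here is a small sketch (my own, for illustration) of a Sinusoidal encoding evaluated at non-integer position ids; RoPE can be evaluated at arbitrary real positions in the same way, since both are plain trigonometric functions of the position.

import numpy as np

def sinusoidal_encoding(position_ids, d):
    """Sinusoidal positional encodings evaluated at (possibly non-integer) positions.
    Returns an array of shape (len(position_ids), d); assumes d is even and uses a
    sin/cos-concatenated layout (an interleaved layout works equally well)."""
    position_ids = np.asarray(position_ids, dtype=float)
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)   # shape (d/2,)
    angles = position_ids[:, None] * freqs[None, :]   # shape (len(position_ids), d/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)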

The main problem with this idea is how to choose a suitable sampling distribution. My first reaction was the Poisson distribution, but since its mean and variance are both $N$, the "3$\sigma$ rule" says it can only extrapolate to a length of about $N+3\sqrt{N}$, which is clearly too short. After some selection and testing, I found two distributions that are quite suitable. One is the Exponential distribution, whose mean and standard deviation are both $N$; even by the "3$\sigma$ rule" it can extrapolate to a length of $4N$, which is a more ideal range (and in practice it goes even further). The other is the Beta distribution, which is defined on $[0,1]$: we can take the test length as 1, so the training length becomes $N/M \in (0,1)$. The Beta distribution has two parameters $\alpha, \beta$; after fixing the mean $\frac{\alpha}{\alpha+\beta}$ to $N/M$, we still have one degree of freedom left to control the probability mass near 1, which is suitable for scenarios where we want to further extend the extrapolation range.
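As a concrete instantiation of the sample_from_xxx placeholder above (the helper names and the default $\alpha$ are my own, hypothetical choices), the two distributions could be sampled as follows; the Beta case is rescaled from $[0,1]$ to $[0,M]$ so that its mean is $N$:

import numpy as np

def sample_n_exponential(N):
    """Draw n from an exponential distribution with mean (and standard deviation) N."""
    return np.random.exponential(N)

def sample_n_beta(N, M, alpha=2.0):
    """Draw n from a Beta distribution rescaled to [0, M] with mean N;
    alpha controls how much probability mass sits near M."""
    beta = alpha * (M - N) / N   # ensures alpha / (alpha + beta) == N / M
    return M * np.random.beta(alpha, beta)

Plugging either sampler into random_position_ids keeps the average position spacing close to 1 (since the mean of $n$ is $N$) while occasionally producing position sequences that stretch well past the training length.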

My experimental results show that "Equal-Mean Random Position Training" combined with "$\log n$ scaled attention" achieves the best extrapolation results on the MLM task (training length 64, test length 512, exponential sampling distribution). Since I have not worked with the CHE benchmark before, I have not yet had the chance to test its effect there; that will have to wait for a future opportunity.

Conclusion

This article looks at the length extrapolation of Transformers from the perspective of positional robustness, yielding new schemes like "Random Position Training" to enhance length extrapolation. At the same time, we introduced the new "CHE benchmark." Compared to conventional language modeling tasks, it possesses stronger non-locality and can more effectively evaluate work related to length extrapolation. Under this benchmark, previous attention localization methods did not show particularly outstanding performance, whereas "Random Position Training" was more effective. This reminds us to evaluate the effectiveness of related methods on a more comprehensive range of tasks, rather than being limited to language modeling tasks.