By 苏剑林 | October 8, 2023
As the mainstream architecture behind LLMs, the Transformer performs excellently on all kinds of tasks. In most cases, criticism of Transformers targets their quadratic complexity rather than their effectiveness, with one notable exception: a benchmark called Long Range Arena (hereafter LRA). For a long time, LRA has been the "home court" of linear-RNN-based models, with Transformers trailing them by a clear margin, which has led some to question whether this reflects an inherent flaw of the architecture.
However, the recent paper "Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors" has finally supplied the "missing link." The paper points out that a lack of pre-training is the primary reason why Transformers perform poorly on LRA, and while all architectures can benefit from pre-training, the improvements for Transformers are significantly more pronounced.
Old Background
Long Range Arena (LRA) is a benchmark for long-sequence modeling, introduced in the paper "Long Range Arena: A Benchmark for Efficient Transformers." As the title suggests, LRA was originally built to evaluate various efficient variants of the Transformer. It contains several types of data, with sequence lengths ranging from 1k to 16k, and many Efficient Transformer works have since been evaluated on it. Although its representativeness has drawn some controversy, LRA remains a classic benchmark for testing the long-sequence capabilities of Efficient Transformers.
LRA results in the MEGA paper
What might surprise some readers is that the standard Transformer (XFM) does not perform well on this benchmark, falling significantly behind a series of linear-RNN-based models such as the classic SSMs (S4, S4D, S5) or the LRU we introduced previously. Even MEGA, the former SOTA model, works by equipping GAU with a linear RNN module (called EMA in its paper). In short, the previous LRA rankings strongly signaled that "Attention is good, but RNN is essential."
(Note: The full LRA rankings can be checked at https://paperswithcode.com/sota/long-range-modeling-on-lra.)
New Conclusion
Clearly, the appearance of "Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors" has broken this impression. It points out that pre-training on the training set itself can greatly narrow the gap between the two, and goes further to argue that comparisons without pre-training are unfair.
The improvement of "Transformer + Pre-training" compared to Transformer and various Efficient versions
The pre-training recipe is very simple: the objective can be MLM or GPT-style language modeling, and the data is still the original training set. In this way, apart from extra compute, no additional source of knowledge is introduced, so the comparison remains fair. In fact, both Transformers and linear RNNs obtain significant gains from pre-training, but the improvement for Transformers is more pronounced (a minimal code sketch of this two-stage recipe follows the figures below):
"Transformer + Pre-training" vs "S4 + Pre-training"
Comparison with SOTA models
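To make the recipe described above concrete, here is a minimal sketch (my own, not the paper's exact setup) of the two-stage procedure: pre-train a small Transformer encoder with an MLM objective on the task's own training sequences, then reuse the same encoder for downstream classification. The vocabulary size, model width, masking rate, class count, and the random stand-in data are all illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID = 256, 256                        # e.g. byte-level tokens plus one [MASK] id
SEQ_LEN, D_MODEL, NUM_CLASSES = 1024, 128, 10    # all illustrative choices

class Encoder(nn.Module):
    """A small Transformer encoder with learned positional embeddings."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB + 1, D_MODEL)
        self.pos = nn.Embedding(SEQ_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, dim_feedforward=256,
                                           batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, x):                                   # x: (batch, seq)
        pos = torch.arange(x.size(1), device=x.device)
        return self.body(self.emb(x) + self.pos(pos))       # (batch, seq, d_model)

encoder = Encoder()
mlm_head = nn.Linear(D_MODEL, VOCAB)

# --- Stage 1: MLM pre-training on the task's own training set (random stand-in here) ---
opt = torch.optim.Adam(list(encoder.parameters()) + list(mlm_head.parameters()), lr=1e-3)
for step in range(10):                                      # a few steps, just to illustrate
    tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))          # replace with real LRA sequences
    mask = torch.rand(tokens.shape) < 0.15                  # mask ~15% of positions
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = mlm_head(encoder(corrupted))
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: fine-tune the same encoder on the downstream classification task ---
clf_head = nn.Linear(D_MODEL, NUM_CLASSES)
opt = torch.optim.Adam(list(encoder.parameters()) + list(clf_head.parameters()), lr=1e-4)
tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))              # again, stand-in data
labels = torch.randint(0, NUM_CLASSES, (8,))
logits = clf_head(encoder(tokens).mean(dim=1))              # mean-pool, then classify
loss = nn.functional.cross_entropy(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```

The key point is that stage 1 only ever sees the training split itself, so no external knowledge enters the comparison; the only extra cost is compute.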
In hindsight, the paper's conclusion is not surprising and even feels somewhat obvious, yet nobody seemed to have pursued this direction before (or perhaps they did, but didn't consider it critical?). The authors therefore deserve real credit for being the first to recognize and demonstrate the importance of pre-training on LRA.
The importance of pre-training in turn points to the importance of inductive bias on LRA. LRA deliberately uses very fine token granularity to make its sequences long enough (for example, text tasks use characters as tokens, and image tasks use pixels as tokens, directly flattening 2D images into 1D sequences), so these tasks both require long-range dependencies and exhibit strong locality. Linear RNNs happen to fit these characteristics very well. Transformers, by contrast, carry a much weaker inductive bias: they need extra positional encoding just to obtain position information, and even then they have no built-in notion of locality. They therefore need more pre-training to adapt to the data characteristics, or, put another way, to acquire the missing inductive bias through pre-training.
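As a small, self-contained illustration (mine, not from the paper or the LRA codebase) of why this token granularity produces both long sequences and strong locality: a 32x32 image becomes a 1024-token pixel sequence, and spatially adjacent pixels can end up a whole row-width apart once the image is flattened.

```python
import numpy as np

# Toy illustration of LRA-style token granularity: a 32x32 grayscale image is
# flattened into a 1024-token pixel sequence, so a pixel's vertical neighbour,
# though spatially adjacent, sits a whole row-width away in the 1D sequence.
H = W = 32
image = np.random.randint(0, 256, size=(H, W))   # stand-in for a grayscale image
sequence = image.reshape(-1)                     # flatten 2D -> 1D: 1024 pixel tokens
print(sequence.shape)                            # (1024,)

r, c = 10, 10                                    # pick an arbitrary pixel
idx = r * W + c
dist_horizontal = (r * W + (c + 1)) - idx        # left/right neighbour: 1 position away
dist_vertical = ((r + 1) * W + c) - idx          # up/down neighbour: W = 32 positions away
print(dist_horizontal, dist_vertical)            # 1 32

# Character-level text behaves similarly: even a short passage becomes a long sequence.
text = "never train from scratch " * 50
print(len(text))                                 # 1250 character tokens
```

A model whose prior favors nearby (or fixed-stride) positions matches this structure out of the box, whereas a vanilla Transformer has to learn that locality from data, which is exactly where pre-training helps.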
The End
In this post, I have quickly shared a fairly recent experimental finding: pre-training effectively improves the scores of all kinds of models on LRA. In particular, after pre-training, Transformer results essentially reach the SOTA tier, which breaks my long-held impression that LRA must be handled by supplementing attention with linear RNNs.
If you still have any doubts or suggestions, you are welcome to continue the discussion in the comments section below.