GAU-α: Experiencing the "Faster, Better, and More Efficient" Next-Generation Attention

By 苏剑林 | April 22, 2022

In "FLASH: Probably the Most Interesting Efficient Transformer Design Recently", we introduced the GAU (Gated Attention Unit). I am personally inclined to call it the "most promising next-generation Attention design" because it truly achieves the characteristics of being "faster (speed), better (performance), and more efficient (memory)."

However, some readers have reported the opposite in their own tests, such as slower convergence or worse performance, which differs greatly from my own results. This article shares my training experience and releases a "taster" version, "GAU-α," for everyone to try.

Open Source Address: https://github.com/ZhuiyiTechnology/GAU-alpha

GAU-α

First, here is the scorecard of the open-sourced "GAU-α" on the CLUE tasks:

\[\small{\begin{array}{c|ccccccccccc} \hline & \text{iflytek} & \text{tnews} & \text{afqmc} & \text{cmnli} & \text{ocnli} & \text{wsc} & \text{csl} & \text{cmrc2018} & \text{c3} & \text{chid} & \text{cluener}\\ \hline \text{BERT} & 60.06 & 56.80 & 72.41 & 79.56 & 73.93 & 78.62 & 83.93 & 56.17 & 60.54 & 85.69 & 79.45 \\ \text{RoBERTa} & 60.64 & \textbf{58.06} & 74.05 & 81.24 & 76.00 & \textbf{87.50} & 84.50 & 56.54 & 67.66 & 86.71 & 79.47\\ \text{RoFormer} & 60.91 & 57.54 & 73.52 & 80.92 & \textbf{76.07} & 86.84 & 84.63 & 56.26 & 67.24 & 86.57 & 79.72\\ \text{RoFormerV2}^\ast & 60.87 & 56.54 & 72.75 & 80.34 & 75.36 & 80.92 & 84.67 & 57.91 & 64.62 & 85.09 & \textbf{81.08}\\ \hline \text{GAU-}\alpha & \textbf{61.41} & 57.76 & \textbf{74.17} & \textbf{81.82} & 75.86 & 79.93 & \textbf{85.67} & \textbf{58.09} & \textbf{68.24} & \textbf{87.91} & 80.01\\ \hline \end{array}}\]

All models are Base versions, and the table above reports results on the validation sets of the CLUE tasks; every model was evaluated in the same way, so this is a fair relative comparison. Additionally, the RoFormerV2* here is not the multi-task version described in "RoFormerV2: Exploring the Limits of Natural Language Understanding", but a version trained only with MLM pre-training (this specific version was not open-sourced). It is included because GAU-α was likewise pre-trained only with MLM.

As can be seen from the table, except for the "outlier" WSC, which has an extremely small amount of data, GAU-α holds an advantage on most tasks and has the best average score when WSC is excluded. The fairest comparison is between RoFormerV2* and GAU-α, since their training scripts, training data, and overall structures are identical; the only difference is that GAU-α replaces each Attention+FFN combination in RoFormerV2* with two GAU layers. This comparison clearly demonstrates the "better" aspect of the GAU design.

Furthermore, as introduced in "RoFormerV2: Exploring the Limits of Natural Language Understanding", RoFormerV2 simplified its structure to achieve faster speed, and GAU-α shares that same overall structure. As a result, GAU-α is faster than the BERT, RoBERTa, and RoFormer models in the table while delivering better average performance. Further testing shows that once the sequence length exceeds 512, GAU-α becomes faster than the similarly streamlined RoFormerV2 and uses less memory; the longer the sequence, the larger GAU-α's advantage.

Training

Now, let's go over the training details of the model. The complete code has been open-sourced on GitHub; if anything is unclear, you can read this section alongside the code.

Model Architecture: GAU-α simply replaces each Attention+FFN combination in RoFormerV2 with two GAU layers. In a previous article we estimated that two GAU layers roughly match one Attention+FFN combination in both compute and parameter count, so this is a like-for-like replacement. RoFormerV2's characteristics, namely retaining the Post Norm structure, removing all bias terms, and replacing Layer Norm with the simplest variant of RMS Norm, are all kept in GAU-α.
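To make the block structure concrete, below is a minimal NumPy sketch of a single GAU layer in the spirit of the FLASH formulation, with softmax attention in place of relu², bias-free dense projections, and a Post Norm residual using the simplest RMS Norm variant. RoPE, masking, and dropout are omitted, and all names and shapes (an expanded dimension e and a shared Q/K dimension s) are illustrative assumptions rather than the exact open-source implementation.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Simplest RMS Norm variant: divide by the RMS, no learnable gain or bias.
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def silu(x):
    # SiLU/Swish activation used for the gate and value branches in FLASH.
    return x / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gau_layer(x, Wu, Wv, Wz, Wo, gq, bq, gk, bk):
    """One GAU layer followed by a Post Norm residual (dense layers bias-free).

    x              : (n, d) input sequence
    Wu, Wv         : (d, e) projections for the gate U and the value V
    Wz             : (d, s) shared low-dimensional projection for Q and K
    Wo             : (e, d) output projection
    gq, bq, gk, bk : (s,) per-dimension scale/offset that turn Z into Q and K
    """
    s = Wz.shape[1]

    u = silu(x @ Wu)                 # gate branch
    v = silu(x @ Wv)                 # value branch
    z = x @ Wz                       # shared (n, s) representation

    q = z * gq + bq                  # cheap scale/offset instead of two full
    k = z * gk + bk                  # projection matrices (RoPE omitted here)

    attn = softmax(q @ k.T / np.sqrt(s))   # softmax attention (relu^2 in FLASH)
    o = (u * (attn @ v)) @ Wo              # gating: U ⊙ (A V), then project back

    return rms_norm(x + o)           # Post Norm residual with RMS Norm
```

An encoder block then simply stacks two such layers wherever the original model had an Attention+FFN pair.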

Normalization: In "I Heard Attention and Softmax Go Better Together~", we discussed how attention weights should be normalized. For GAU-α, I chose the entropy-invariant Softmax that I proposed there (provisionally named softmax_plus in bert4keras), which extrapolates well to sequences longer than those seen in training.
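The sketch below is meant only to convey the idea, not to reproduce the exact softmax_plus code in bert4keras: the attention logits are rescaled by a length-dependent factor, assumed here to be log n / log 512, so that the attention entropy stays roughly stable as the sequence grows, and the formula reduces to the ordinary softmax at the pre-training length of 512.

```python
import numpy as np

def entropy_invariant_softmax(scores, seq_len, train_len=512):
    """Hedged sketch of a length-scaled ("entropy-invariant") attention softmax.

    scores    : (n, n) attention logits, already divided by sqrt(head_dim)
    seq_len   : current sequence length n
    train_len : reference length at which this reduces to plain softmax
                (512 assumed here, matching the pre-training length)
    """
    # Scale the logits by log(n) / log(train_len): longer sequences get sharper
    # logits, keeping the attention entropy roughly constant and helping the
    # model extrapolate beyond the training length.
    scaled = scores * (np.log(seq_len) / np.log(train_len))
    scaled = scaled - scaled.max(axis=-1, keepdims=True)
    e = np.exp(scaled)
    return e / e.sum(axis=-1, keepdims=True)
```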

Training Method: For initialization, I followed the adjustments in "What are the Difficulties in Training a 1000-layer Transformer?", which lets training start directly without Warmup. The optimizer is LAMB with a piecewise linear learning-rate decay. The pre-training task is whole-word MLM, and the word segmentation tool is Baidu's LAC. All of this is aligned with RoFormerV2.
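As an illustration of the schedule shape only, here is a generic piecewise-linear learning-rate function; the breakpoints and values in the example are hypothetical, not the actual GAU-α settings.

```python
import numpy as np

def piecewise_linear_lr(step, schedule):
    """Piecewise-linear learning rate: interpolate linearly between breakpoints.

    schedule: dict mapping step -> learning rate. For example, the hypothetical
              {0: 1e-3, 100000: 1e-3, 1000000: 1e-4} holds the rate flat for the
              first 100k steps, then decays it linearly to 1e-4 by step 1M.
              With no Warmup, the schedule can start right at the peak rate.
    """
    steps = sorted(schedule)
    lrs = [schedule[s] for s in steps]
    return float(np.interp(step, steps, lrs))

# Usage with made-up numbers:
schedule = {0: 1e-3, 100_000: 1e-3, 1_000_000: 1e-4}
for s in (0, 50_000, 550_000, 1_000_000):
    print(s, piecewise_linear_lr(s, schedule))
```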

There isn't much else worth mentioning; indeed, few major changes were made. Apart from spending a little time testing different normalization methods, little extra effort was needed, and training the model directly already gave good results.

Summary

GAU is what I consider the "most promising next-generation Attention design at present." This article has shared some training experience with GAU and open-sourced a "taster" version, "GAU-α."