By 苏剑林 | March 21, 2022
About a year ago, we proposed Rotary Position Embedding (RoPE) and released the corresponding pre-trained model RoFormer. Over time, RoFormer has fortunately received increasing attention and recognition. For example, EleutherAI's newly released 6 billion and 20 billion parameter GPT models utilize RoPE position embeddings. Furthermore, Google's newly proposed FLASH model paper explicitly points out that RoPE has a significant positive effect on Transformer performance.
At the same time, we have been continuously trying to further strengthen the RoFormer model, attempting to take RoFormer's performance "to the next level." After nearly half a year of effort, we believe we have achieved quite good results, which we are now officially releasing as "RoFormerV2":
Github: https://github.com/ZhuiyiTechnology/roformer-v2
Since the rise of pre-trained models, many researchers have been quite interested in one question: Where is the limit of pre-trained models? Of course, the word "limit" has many meanings. A series of works represented by GPT-3 aims to explore the limits of parameter and data volume, while Microsoft's recently proposed DeepNet explores the limit of depth. For us, we are more interested in the performance limit under the same parameter count, attempting to "squeeze" the performance of pre-trained models to the fullest; RoFormerV2 is a product of this philosophy.
In simple terms, RoFormerV2 first makes appropriate simplifications to the model structure on the basis of RoFormer, thereby obtaining a certain speed increase. In terms of training, in addition to performing regular unsupervised MLM pre-training, we also collected more than 20GB of labeled data to perform supervised multi-task pre-training. Under supervised training, the model's performance improved significantly, basically achieving the optimal solution for speed and effectiveness at the same parameter scale.
Notably, the 300-million-parameter RoFormerV2 Large surpassed several models with over 1 billion parameters on the CLUE leaderboard, reaching 5th place. It is also the model with the fewest parameters among the top five on the leaderboard.

Compared to RoFormer, the main changes in RoFormerV2 are the simplified model structure, increased training data, and the addition of supervised training. These changes allow RoFormerV2 to ultimately achieve a "win-win" in terms of speed and performance.
In terms of structure, RoFormerV2 primarily removes all Bias terms from the model, replaces Layer Norm with a simpler RMS Norm, and removes the gamma parameter of the RMS Norm. The inspiration for these changes mainly comes from Google's T5 model.
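To make the structural change concrete, here is a minimal NumPy sketch contrasting a standard Layer Norm with the parameter-free RMS Norm described above (no mean subtraction, no beta, and the gamma scale removed as well). The function names and epsilon value are illustrative, not taken from the released code.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-12):
    # Standard Layer Norm: center, rescale by the standard deviation,
    # then apply the learned gamma/beta parameters.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm_no_gamma(x, eps=1e-12):
    # RMS Norm with gamma removed: no centering and no learned
    # parameters at all, just x / sqrt(mean(x^2)).
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.randn(2, 8).astype("float32")
print(rms_norm_no_gamma(x).shape)  # (2, 8); the layer itself adds zero parameters
```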
One might instinctively assume that the cost of computing the Bias terms and the beta and gamma parameters of Layer Norm is tiny, or at least negligible for speed. But the facts exceeded our expectations: simply removing these seemingly "trivial" parameters gave RoFormerV2 a noticeable improvement in training speed!
Some reference data are as follows (since RoFormer and RoBERTa have similar speeds, they are not listed separately; the Base version was tested on a 3090, and the Large version was tested on an A100):
\[\begin{array}{c|cc|cc} \hline & \text{Sequence Length} & \text{Training Speed} & \text{Sequence Length} & \text{Training Speed} \\ \hline \text{RoBERTa base} & 128 & 1.0\text{x} & 512 & 1.0\text{x} \\ \text{RoFormerV2 base} & 128 & 1.3\text{x} & 512 & 1.2\text{x}\\ \hline \text{RoBERTa large} & 128 & 1.0\text{x} & 512 & 1.0\text{x} \\ \text{RoFormerV2 large} & 128 & 1.3\text{x} & 512 & 1.2\text{x} \\ \hline \end{array}\]

Like RoFormer, RoFormerV2 is first pre-trained using the MLM task. The main differences are two-fold:
1. RoFormer was trained on top of RoBERTa weights, whereas RoFormerV2 is trained from scratch;
2. RoFormer's unsupervised training used only 30+ GB of data, while RoFormerV2 used 280 GB of data.
Training from scratch is more difficult than continuing from existing weights, mainly because the Post Norm structure is harder to train to convergence. To address this, we proposed a new training technique: design the residual as

\begin{equation}\boldsymbol{x}_{t+1} = \text{Norm}(\boldsymbol{x}_t + \alpha F(\boldsymbol{x}_t))\end{equation}

where $\alpha$ is initialized to 0 and increased linearly and slowly to 1. Related discussion can also be found in "A Brief Discussion on the Initialization, Parameterization, and Normalization of Transformers". This scheme is similar to ReZero, with the difference that in ReZero $\alpha$ is a trainable parameter and the $\text{Norm}$ operation is removed. Experiments show that our modification yields better final results than ReZero, and before DeepNet it was close to the optimal solution.
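As a rough sketch of this warm-up trick (the warm-up length below is an arbitrary placeholder, not the actual training setting), $\alpha$ can be treated as a non-trainable scalar tied to the training step: it starts at 0, so each block initially behaves like a normalized identity mapping, and grows linearly to 1 as training progresses.

```python
import numpy as np

def alpha_schedule(step, warmup_steps=10000):
    # alpha grows linearly from 0 to 1 over warmup_steps, then stays at 1.
    # warmup_steps=10000 is a placeholder value, not the released setting.
    return min(1.0, step / float(warmup_steps))

def rms_norm(x, eps=1e-12):
    return x / np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)

def post_norm_residual(x, sublayer, step, warmup_steps=10000):
    # x_{t+1} = Norm(x_t + alpha * F(x_t)): at step 0 the block reduces to
    # Norm(x_t), close to an identity path, which makes a from-scratch
    # Post Norm stack easier to get converging; once alpha reaches 1 it is
    # an ordinary Post Norm residual block.
    alpha = alpha_schedule(step, warmup_steps)
    return rms_norm(x + alpha * sublayer(x))

# Toy usage: a linear "sublayer" standing in for attention or FFN.
x = np.random.randn(2, 8)
out = post_norm_residual(x, lambda h: 0.1 * h, step=500)
print(out.shape)  # (2, 8)
```

Unlike ReZero, the Norm is kept and $\alpha$ is not learned, so the fully warmed-up model is exactly a standard Post Norm Transformer.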
As mentioned earlier, the structure of RoFormerV2 was simplified to gain speed, but because there is "no free lunch," RoFormerV2's performance would slightly decrease compared to RoBERTa and RoFormer under the same training settings. To compensate for this decline and more effectively exploit the model's potential, we added supervised multi-task pre-training.
Specifically, we collected 77 labeled datasets totaling 20GB and constructed 92 tasks for multi-task training. These datasets cover common natural language understanding tasks such as text classification, text matching, reading comprehension, information extraction, and coreference resolution, so that the model can acquire relatively comprehensive natural language understanding capabilities. To complete the training, we further developed a multi-task training framework based on bert4keras, which flexibly supports hybrid training of tasks in different formats and integrates techniques such as gradient normalization (refer to "Multi-task Learning Discussion (II): The Matter of Gradients") to ensure that each task achieves as optimal an effect as possible.
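As a rough illustration only (not the released bert4keras-based framework), hybrid multi-task training can be organized by sampling one task per training step and routing the batch to that task's output head and loss. The sampling temperature, batch format, and helper names below are assumptions.

```python
import random

def multi_task_batches(task_datasets, batch_size=32, temperature=0.5):
    # task_datasets: dict mapping task name -> list of examples (the example
    # format is a placeholder; real tasks carry text, labels, spans, etc.).
    # Each step samples one task, with weights smoothed by a temperature so
    # that small datasets are not drowned out, then yields a batch plus the
    # task name so the matching output head and loss can be selected.
    names = list(task_datasets)
    weights = [len(task_datasets[n]) ** temperature for n in names]
    while True:
        name = random.choices(names, weights=weights, k=1)[0]
        data = task_datasets[name]
        yield name, random.sample(data, k=min(batch_size, len(data)))

# Toy usage with dummy datasets of different sizes.
gen = multi_task_batches({"tnews": list(range(1000)), "wsc": list(range(100))})
task, batch = next(gen)
print(task, len(batch))
```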
RoFormerV2 is not the first model to attempt multi-task pre-training. Before it, MT-DNN, T5, and the more recent ZeroPrompt have already affirmed the value of multi-task pre-training. Our main contribution is to have fully verified this on Chinese data and to be the first to open-source it.
We mainly compare results on the CLUE benchmark:
\[\small{\begin{array}{c|ccccccccccc} \hline & \text{iflytek} & \text{tnews} & \text{afqmc} & \text{cmnli} & \text{ocnli} & \text{wsc} & \text{csl} & \text{cmrc2018} & \text{c3} & \text{chid} & \text{cluener}\\ \hline \text{BERT base} & 61.19 & 56.29 & 73.37 & 79.37 & 71.73 & 73.85 & 84.03 & 72.10 & 61.33 & 85.13 & 78.68\\ \hline \text{RoBERTa base} & 61.12 & 58.35 & 73.61 & 80.81 & 74.27 & 82.28 & \textbf{85.33} & 75.40 & 67.11 & 86.04 & 79.38\\ \text{RoBERTa large} & 60.58 & 55.51 & 75.14 & \textbf{82.16} & 75.47 & 81.97 & 85.07 & 78.85 & 76.74 & \textbf{88.65} & \textbf{80.19}\\ \hline \text{RoFormer base} & 61.08 & 56.74 & 73.82 & 80.97 & 73.10 & 80.57 & 84.93 & 73.50 & 66.29 & 86.30 & 79.69\\ \hline \text{RoFormerV2 small} & 60.46 & 51.46 & 72.39 & 76.93 & 67.70 & 69.11 & 83.00 & 71.80 & 64.49 & 77.35 & 78.20\\ \text{RoFormerV2 base} & 62.50 & \textbf{58.74} & 75.63 & 80.62 & 74.23 & 82.71 & 84.17 & 77.00 & 75.57 & 85.95 & 79.87\\ \text{RoFormerV2 large} & \textbf{62.65} & 58.06 & \textbf{76.95} & 81.20 & \textbf{75.83} & \textbf{88.03} & 84.97 & \textbf{80.50} & \textbf{78.34} & 87.68 & \textbf{80.17}\\ \hline \end{array}}\]

As can be seen, the improvement from multi-task training is quite substantial. On most tasks, RoFormerV2 not only reclaims the performance lost to structural simplification but also shows a further improvement; on average, it can be considered the best-performing model of its class. The exceptions are CMNLI and CHID, where RoFormerV2 falls short of RoBERTa. Both tasks have very large training sets (hundreds of thousands of samples), and when the training data is abundant enough, performance depends mainly on model capacity, so the gain from multi-task pre-training is relatively small.
So, in summary: if your task type is relatively common and the data volume is not particularly large, RoFormerV2 is often a good choice; if you want to speed up training a bit, you can also choose RoFormerV2; but if your task data volume is especially large, RoFormerV2 usually has no advantage.
This article provides a basic introduction to our newly released RoFormerV2 model. It mainly improves speed through structural simplification and improves performance through the combination of unsupervised pre-training and supervised pre-training, thereby achieving a "win-win" in speed and effectiveness.