By 苏剑林 | October 29, 2020
In the world of pre-trained language models, ALBERT and ELECTRA can be considered two "rising stars" that followed BERT. They improved upon BERT from different perspectives and ultimately enhanced performance (at least on many public benchmark datasets), thus earning a certain reputation. However, in daily exchanges and learning, I have found that many friends have misunderstandings about these two models, leading to unnecessary time wasted during use. Here, I attempt to summarize some key points of these two models for your reference, hoping you can avoid detours when using them.
ALBERT and ELECTRA
(Note: In this article, the word "BERT" refers to both the initially released BERT model and its subsequent improved version, RoBERTa. We can think of BERT as an insufficiently trained RoBERTa, and RoBERTa as a more fully trained BERT. This article focuses on their comparison with ALBERT and ELECTRA, so it does not distinguish between BERT and RoBERTa.)
ALBERT
ALBERT comes from the paper "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." As the name suggests, it considers its main characteristic to be "Lite." So, what is the specific meaning of this "Lite"? Many friends have the impression that ALBERT is "small, fast, and good." Is this really the case?
Characteristics
Simply put, ALBERT is essentially a BERT with cross-layer parameter sharing. It is equivalent to changing the function $y=f_n(f_{n-1}(\cdots(f_1(x))))$ into $y=f(f(\cdots(f(x))))$, where $f$ denotes one layer of the model. What used to be $n$ layers' worth of parameters thus becomes a single layer's worth, so the parameter count drops sharply, or more precisely, the saved weight file becomes very small. This is the first meaning of "Lite." Then, because the total number of parameters is smaller, the time and memory needed for training drop accordingly; this is the second meaning of "Lite." Furthermore, when the model is very large, parameter sharing acts as a strong regularizer, so ALBERT is less prone to overfitting than BERT. As a result, its large models do show a certain performance gain, which is the real highlight of ALBERT.
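To make the parameter-sharing point concrete, here is a minimal PyTorch sketch. It only illustrates the idea of $y=f(f(\cdots(f(x))))$ versus $y=f_n(f_{n-1}(\cdots(f_1(x))))$; it is not ALBERT's actual implementation, and the sizes are just BERT-base-like placeholders.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """ALBERT-style sketch: one Transformer layer applied n times."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.num_layers = num_layers
        # a single layer's parameters, reused at every depth
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True)

    def forward(self, x):
        for _ in range(self.num_layers):   # y = f(f(...f(x)))
            x = self.layer(x)
        return x

class UnsharedEncoder(nn.Module):
    """BERT-style sketch: n independent layers, n times the parameters."""
    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_size, nhead=num_heads, batch_first=True)
            for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:          # y = f_n(f_{n-1}(...f_1(x)))
            x = layer(x)
        return x

shared = sum(p.numel() for p in SharedLayerEncoder().parameters())
unshared = sum(p.numel() for p in UnsharedEncoder().parameters())
print(shared, unshared)  # roughly a 12x difference in stored Transformer parameters
```

Note that the shared encoder stores far fewer parameters but still performs the same twelve layer-forward passes, which is exactly the point made in the prediction section below.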
Prediction
Note that we did not mention prediction speed. Obviously, parameter sharing brings no speedup at prediction time: the model still runs the forward pass layer by layer, and it does not matter whether the current layer's parameters equal the previous layer's; even if they do, nothing can be skipped, because the inputs to each layer differ. Therefore, ALBERT and BERT of the same specification predict at the same speed. In fact, if we want to be pedantic, ALBERT might even be slightly slower, because it uses a matrix factorization for the Embedding layer, which adds an extra projection, though the extra cost is generally imperceptible.
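That "extra computation" comes from ALBERT's factorized embedding. A minimal PyTorch sketch follows; the sizes are illustrative placeholders (the real configs vary by version), not claims about any released checkpoint.

```python
import torch.nn as nn

# Hypothetical sizes for illustration only.
vocab_size, hidden_size, emb_size = 30000, 768, 128

# BERT-style embedding: one big vocab_size x hidden_size table.
bert_embedding = nn.Embedding(vocab_size, hidden_size)

# ALBERT-style factorized embedding: a small vocab_size x emb_size table
# followed by a projection up to hidden_size. Far fewer parameters, but the
# projection is one extra matrix multiplication at prediction time.
albert_embedding = nn.Sequential(
    nn.Embedding(vocab_size, emb_size),
    nn.Linear(emb_size, hidden_size, bias=False),
)

n_bert = sum(p.numel() for p in bert_embedding.parameters())      # 23,040,000
n_albert = sum(p.numel() for p in albert_embedding.parameters())  # 3,938,304
print(n_bert, n_albert)
```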
Training
As for training speed, there is an improvement, but it is not as large as people imagine. Reducing the parameter count to $1/n$ of the original does not make training $n$ times faster. In my earlier experiments, a base-version ALBERT trained only about 10%–20% faster than a base-version BERT, and the reduction in memory usage was of a similar magnitude. If the model is smaller still (tiny/small versions), the gap narrows further. In other words, ALBERT's training advantage only becomes obvious for large models; for models that are not large, it is hard to feel any significant benefit.
Performance
Regarding effectiveness, the original ALBERT paper is quite clear on this point (see its results table): parameter sharing limits the model's expressive power, so the ALBERT-xlarge version can only match BERT-large, and stably surpassing it requires the xxlarge version. In other words, for any specification smaller than xlarge, ALBERT performs worse than BERT of the same specification. Evaluations on Chinese tasks show the same pattern (refer here and here). I have even done a more extreme experiment: loading ALBERT's weights but dropping the parameter-sharing constraint, that is, treating ALBERT as a BERT, and performance improved! (Refer to "Drop the Constraints, Enhance the Model: Improving ALBERT Performance with One Line of Code".) So it is basically a fact that small-spec ALBERT is inferior to BERT.

Conclusion
So, the summary advice is: if you are not using the xlarge version, there is no need to use ALBERT. At the same speed, ALBERT is less effective than BERT; at the same effectiveness, ALBERT is slower than BERT. BERT now also has tiny/small versions, such as the ones open-sourced by our company, which are just as fast and perform better, unless you truly need the one selling point of a small weight file on disk.
What does "xlarge" mean? Some readers haven't tried BERT because their machines can't run it; most readers have limited VRAM and have only run the base version of BERT, and haven't run or can't afford to run the large version. xlarge is even bigger than large and has higher equipment requirements. So frankly, for most readers, there is no need to use ALBERT.
Why did the idea that ALBERT is "fast and good" spread? Besides improper promotion by some media, I think it is largely due to the promotion by brightmart. It must be said that brightmart made an indelible contribution to the popularization of ALBERT in China. Long before the English version of ALBERT was released, brightmart trained and open-sourced the Chinese version of ALBERT (albert_zh) and released tiny, small, base, large, and xlarge versions in one go. At that time, BERT only had base and large versions, while ALBERT had tiny and small versions. When people tried them, they found them significantly faster than BERT, so many were left with the impression that ALBERT is very fast. In fact, ALBERT being fast has nothing to do with "ALBERT" specifically; the point is the tiny/small architecture. Corresponding BERT tiny/small models are also very fast...
Of course, you can ponder the deeper reasons why ALBERT's parameter sharing works, or research how to improve prediction speed after parameter sharing. These are valuable questions, but it is not recommended that you use a version of ALBERT lower than xlarge.
ELECTRA
ELECTRA comes from the paper "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators." To be honest, ELECTRA is a complicated model to talk about. It excited many people when it first came out, then disappointed many after its official release. Its current practical performance cannot be called bad, but it isn't exceptionally good either.
Characteristics
ELECTRA's starting point is that BERT's MLM task, which randomly selects a portion of tokens to mask and predicts them, is too simple, and it wanted to raise the difficulty. So it borrowed ideas from GANs: an MLM model (the Generator) is trained in the usual way, and its predictions at the masked positions are sampled to replace the corresponding tokens in the input sentence; the corrupted sentence is then fed into another model (the Discriminator), which must judge, for each token, whether it was replaced or not. The generator and discriminator are trained simultaneously, so as the generator improves, the discrimination task becomes gradually harder, which in theory pushes the model to learn more valuable content. In the end, only the Discriminator's Encoder is kept for downstream use, and the generator is generally discarded.
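A schematic training step may make the joint objective clearer. The module interfaces, the [MASK] id, and the loss weight below are assumptions for illustration; this is a sketch of the idea, not the official ELECTRA implementation.

```python
import torch
import torch.nn.functional as F

def electra_step(generator, discriminator, token_ids,
                 mask_prob=0.15, mask_id=103, disc_weight=50.0):
    """Schematic ELECTRA objective (illustration only).
    generator(ids)     -> (batch, seq, vocab) MLM logits
    discriminator(ids) -> (batch, seq) logits for "was this token replaced?"
    """
    # 1. Randomly choose positions to mask, as in ordinary MLM.
    masked_pos = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    gen_input = token_ids.masked_fill(masked_pos, mask_id)

    # 2. The generator (an MLM) predicts the masked tokens; sample replacements.
    gen_logits = generator(gen_input)
    mlm_loss = F.cross_entropy(gen_logits[masked_pos], token_ids[masked_pos])
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(masked_pos, sampled, token_ids)

    # 3. The discriminator labels every token as replaced (1) or original (0).
    #    A sampled token that happens to equal the original counts as original.
    labels = (corrupted != token_ids).float()
    rtd_loss = F.binary_cross_entropy_with_logits(discriminator(corrupted), labels)

    # 4. Both models are trained jointly; no adversarial gradient flows back
    #    through the sampling step, which is what separates this from a true GAN.
    return mlm_loss + disc_weight * rtd_loss
```

The last comment captures the design choice: the discriminator's loss is not fed back adversarially into the generator; the two are simply trained side by side.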
Because this progressive mode makes the training process more targeted, ELECTRA's main highlight is higher training efficiency. According to the paper, it can achieve BERT-level results of the same spec in 1/4 of the time or less. This is ELECTRA's primary advantage.
Theory
However, in my view, ELECTRA is a model that is theoretically questionable. Why? ELECTRA's idea stems from GANs, but in computer vision, are there examples of taking a trained GAN discriminator and fine-tuning it on downstream tasks? At least, I haven't seen any. And in fact there is a theoretical problem here. Take the basic GAN: its optimal discriminator is $D(x)=\frac{p(x)}{p(x)+q(x)}$, where $p(x)$ and $q(x)$ are the distributions of real and fake samples, respectively. Assuming training is stable and the generator has sufficient capacity, the fake samples gradually approach the real ones as training progresses, so $q(x)$ tends to $p(x)$ and $D(x)$ tends to the constant $1/2$. In other words, the final discriminator is, in theory, just a constant function; how can you expect it to have extracted good features?
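For completeness, here is the standard one-step derivation behind that claim (a textbook GAN result, restated here rather than anything new):

$$\max_D\ \mathbb{E}_{x\sim p(x)}[\log D(x)]+\mathbb{E}_{x\sim q(x)}[\log(1-D(x))]=\max_D\int\big[p(x)\log D(x)+q(x)\log(1-D(x))\big]\,dx.$$

Maximizing the integrand pointwise in $D(x)$, i.e. setting $\frac{p(x)}{D(x)}-\frac{q(x)}{1-D(x)}=0$, gives $D^*(x)=\frac{p(x)}{p(x)+q(x)}$; once the generator succeeds and $q(x)\to p(x)$, we get $D^*(x)\to 1/2$ everywhere.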
While ELECTRA is not exactly a GAN, it shares this problem. That is why ELECTRA stresses that the MLM model acting as the generator must not be too strong (otherwise, as argued above, the discriminator degrades toward a constant). The paper reports that performance is best when the generator is between $1/4$ and $1/2$ the size of the discriminator. This is where things start to smell of alchemy: the argument only says the generator should not be too good, but it cannot tell us why a somewhat weaker generator works, how much weaker it should be, or why training the two jointly is better. All of these are settled purely by experiment.
Performance
Of course, saying it is theoretically questionable does not mean its performance is bad, nor that the evaluations were falsified. Because constraints were placed on the generator's capacity, ELECTRA's training results still have significance, and its performance is decent. It's just that it put us through a process of "the greater the expectation, the greater the disappointment."
The ELECTRA paper first appeared as a submission to ICLR 2020, and the results at that time shocked everyone: roughly, the small version of ELECTRA far exceeded the small version of BERT and even approached the base version, while the base version of ELECTRA reached the level of BERT-large. However, when the code and weights were released, the scores shown on GitHub came as a rude shock: they were about 2 percentage points lower. The authors later clarified that the paper reported dev-set results while GitHub reported test-set results. That made things somewhat more understandable, but as a consequence ELECTRA's advantage over BERT became much less of a highlight. (Compare "ELECTRA: Surpassing BERT, the best pre-trained NLP model of 2019" with "My thoughts on the release of ELECTRA source code".)


In fact, evaluation on Chinese tasks more accurately reflects this point. For example, in Chinese-ELECTRA released by HFL, ELECTRA's performance across various tasks is nearly identical to BERT of the same level; it has advantages in a few tasks, but "crushing" results did not appear.
Losing One for Another
Some readers might think: even if the downstream performance is only on par, faster pre-training is still a merit. I don't deny this. However, a recent arXiv paper suggests that ELECTRA's "comparable performance" may be an illusion that only holds on simple tasks; on suitably constructed harder tasks, it can still be clearly beaten by BERT.
The paper is titled "Commonsense knowledge adversarial dataset that challenges ELECTRA." The authors built a new dataset, QADS, based on SQuAD 2.0 by using synonym substitution. According to the authors' tests, an ELECTRA-large model that can reach 88% on SQuAD 2.0 gets only 22% on QADS, while interestingly, BERT can still manage over 60%. Of course, this paper seems a bit unpolished and hasn't received authoritative validation yet, so it can't be fully trusted, but its results already prompt reflection on ELECTRA. Previously, the paper "Probing Neural Network Comprehension of Natural Language Arguments" knocked BERT off its pedestal with just the word "not"; it seems ELECTRA might have even more severe issues of this kind.
Setting aside other evidence, I feel ELECTRA's eventual abandonment of MLM itself is a "losing one for another" operation. If you say your starting point is that MLM is too simple, then you should find ways to increase the difficulty of MLM. Why replace MLM with a discriminator? Directly using a generator network to improve the MLM model (instead of swapping it for a discriminator) is possible. Microsoft's paper "Variance-reduced Language Pretraining via a Mask Proposal Network" recently provided such a reference scheme. It lets a generator choose the positions to be masked instead of choosing randomly. Although I haven't replicated its experiments, the entire reasoning process feels very convincing, unlike the "shooting from the hip" feel of ELECTRA. Additionally, I want to emphasize that MLM is very useful. It is not just a pre-training task; it is also a model with practical value—for example, "Do we need GPT3? No, BERT's MLM can also do few-shot learning."
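To illustrate just the "let a network choose where to mask" idea in code, here is a loose sketch. The `scorer` module is hypothetical, and this is only my reading of the general direction, not necessarily how the Microsoft paper implements its mask proposal network.

```python
import torch

def propose_masks(scorer, token_ids, mask_ratio=0.15):
    """Pick mask positions with a scoring network instead of at random.
    `scorer` is a hypothetical module mapping (batch, seq) ids to per-position
    scores of the same shape ("how informative would masking this token be?")."""
    scores = scorer(token_ids)                          # (B, L)
    k = max(1, int(mask_ratio * token_ids.shape[1]))
    top_positions = scores.topk(k, dim=1).indices       # (B, k) chosen positions
    masked_pos = torch.zeros_like(token_ids, dtype=torch.bool)
    masked_pos.scatter_(1, top_positions,
                        torch.ones_like(top_positions, dtype=torch.bool))
    return masked_pos  # then train an ordinary MLM on these positions, as in BERT
```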
Conclusion
So, after all that, the conclusion is: ELECTRA's pre-training speed has indeed increased, but current experimental evidence suggests it has no prominent advantage in downstream tasks over BERT of the same level. You can try it, but don't be too disappointed if performance drops.
Furthermore, if you need the weights of the MLM part (for example, to do text generation in the UniLM style, refer here), you cannot use ELECTRA, because the body of ELECTRA is a discriminator, not an MLM model. The MLM model used as ELECTRA's generator is deliberately scaled down relative to the discriminator and may be underfitted or undertrained, which makes it a poor pre-trained MLM model.
As for the idea behind ELECTRA—improving upon the simplicity of BERT's random masking—the direction seems correct. However, the validity of swapping a generative model for a discriminative one still requires further verification. Readers interested in deep analysis can certainly explore this further.
Article Summary
This article records my views and thoughts on ALBERT and ELECTRA, synthesizing my own experimental results and referring to several sources. I hope to objectively express the pros and cons of these two models so that readers feel more confident when choosing models. Both models have their merits in specific scenarios but also certain limitations; understanding these limitations and their origins will help readers use these models better.
I have no intention of maliciously disparaging any model. If there are any misunderstandings, everyone is welcome to leave a comment for discussion.
Address for reprinting: https://kexue.fm/archives/7846
If you have any further questions or suggestions, please continue the discussion in the comment section below. If you found this article helpful, feel free to share or tip. Tipping is not about profit but to know how much sincere attention "Scientific Space" has received. Of course, ignoring it will not affect your reading. Thanks again!
Su Jianlin. (Oct. 29, 2020). "Before using ALBERT and ELECTRA, make sure you really understand them" [Blog post]. Retrieved from https://kexue.fm/archives/7846