Leave Constraints Behind, Enhance the Model: One Line of Code to Improve ALBERT's Performance

By 苏剑林 | January 29, 2020

The title of this article might seem a bit like "clickbait," but when applied within the bert4keras framework, it truly is a one-line code change. As for whether it provides an improvement, I cannot guarantee it for every case, but testing on several representative tasks has shown performance that is either on par with or better than the original, so the title is essentially a statement of fact.

What exactly is this change? It can be explained in one sentence:

In downstream tasks, abandon ALBERT's weight-sharing constraint—essentially, use ALBERT as if it were BERT.

For more details on the logic, please read on.

What is ALBERT?

This modification is specifically designed for ALBERT. To understand it, you first need to know what ALBERT is. I will spend some space on a brief introduction to ALBERT here. I assume readers already have some understanding of BERT, so the focus will be on comparing ALBERT with BERT.

Low-rank Decomposition

First is the Embedding layer. Taking the Chinese version of BERT-base as an example, the total number of tokens is approximately 20,000 and the embedding dimension is 768, so the Embedding layer has about 15 million parameters, roughly 1/6 of the model's total. The Embedding layer is the first part ALBERT targets: it reduces the embedding dimension to 128 and then uses a \(128 \times 768\) transformation matrix to project back up to 768 dimensions. This cuts the Embedding layer's parameter count to about 1/6 of the original. This is the so-called low-rank decomposition.
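To make the arithmetic concrete, here is a minimal numpy sketch of the factorization. The weight matrices below are random stand-ins chosen purely for illustration, not real ALBERT weights: a small \(20000 \times 128\) lookup table followed by a \(128 \times 768\) projection yields 768-dimensional vectors with roughly 1/6 of the parameters of a full embedding table.

```python
import numpy as np

vocab_size, embed_size, hidden_size = 20000, 128, 768

# Hypothetical weights, randomly initialized for illustration only.
small_table = np.random.randn(vocab_size, embed_size) * 0.02   # token id -> 128-dim vector
projection  = np.random.randn(embed_size, hidden_size) * 0.02  # 128 -> 768 projection

token_ids = np.array([101, 2769, 102])            # some arbitrary token ids
vectors = small_table[token_ids] @ projection     # shape (3, 768), the size the blocks expect

# Parameter comparison: full 20000x768 table vs. factorized version (~1/6 of the former).
print(vocab_size * hidden_size)                                   # 15,360,000
print(vocab_size * embed_size + embed_size * hidden_size)         #  2,658,304
```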

Parameter Sharing

Next is the Transformer section. In Transformer-based models like BERT, the core consists of modules composed of self-attention, layer normalization, and fully connected layers (1D convolution with a kernel size of 1), which we call a "transformer block." The BERT model is a stack of multiple transformer blocks. For example, the left figure below is a schematic of BERT-base, which stacks 12 transformer blocks:

Left: simplified schematic of BERT-base; right: simplified schematic of ALBERT-base

Note that in BERT's design, the input and output shapes of each transformer block are the same, which means it is perfectly reasonable to feed a block's output back into that same block. In other words, the same block can be applied iteratively rather than stacking a new block for every layer. ALBERT takes the simplest and most direct approach: all layers share the same transformer block (as shown in the right figure above)! Consequently, in ALBERT-base, the transformer-block parameters are reduced to 1/12 of those in the original BERT-base.
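The contrast can be sketched in a few lines of Keras. This is only an illustrative sketch: plain Dense layers stand in for a real transformer block, and the numbers mirror the "base" configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

hidden_size, num_layers = 768, 12
inputs = tf.keras.Input(shape=(None, hidden_size))

# BERT-style stacking: a fresh block per layer -> num_layers independent sets of weights.
x = inputs
for _ in range(num_layers):
    x = layers.Dense(hidden_size, activation='relu')(x)   # new weights every layer
bert_like = tf.keras.Model(inputs, x)

# ALBERT-style sharing: one block applied num_layers times -> a single set of weights.
shared_block = layers.Dense(hidden_size, activation='relu')
y = inputs
for _ in range(num_layers):
    y = shared_block(y)                                    # same weights reused each layer
albert_like = tf.keras.Model(inputs, y)

print(bert_like.count_params(), albert_like.count_params())  # 12:1 ratio in this toy case
```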

Brief Review

Beyond the two points above, one significant difference between ALBERT and BERT is that during the pre-training stage, the NSP (Next Sentence Prediction) task was changed to the SOP (Sentence-Order Prediction) task. However, this is not related to the model architecture itself, so it is not the focus of this article. Readers can find relevant information on their own.

Overall, ALBERT is a model designed to reduce parameter counts, with the hope that this reduction will act as a form of regularization, thereby lowering the risk of overfitting and improving final performance. Did the results live up to that hope? Judging by its "track record," ALBERT's largest version topped the GLUE leaderboard when it was released, so it seems to have met the authors' expectations. However, ALBERT is not always ideal, and it is not the "small model" we might imagine.

For a model, we are generally concerned with two metrics: speed and effectiveness. From the two BERT and ALBERT diagrams above, it is evident that during the prediction stage (forward pass), ALBERT is essentially the same as BERT. Therefore, ALBERT and BERT of the same specification (e.g., both "base" versions) have the same prediction speed. More strictly speaking, ALBERT is actually slightly slower because its Embedding section involves an extra matrix operation. In other words, ALBERT does not provide an increase in prediction speed!

Between ALBERT and BERT of the same specification, which performs better? The ALBERT paper provides the answer: for versions up to "large," ALBERT performs worse than BERT. Only at "xlarge" and "xxlarge" versions does ALBERT start to outperform BERT. However, the pre-training method of RoBERTa mitigated some of BERT's shortcomings, so the only ALBERT version that can be significantly said to outperform BERT/RoBERTa is "xxlarge." Yet, ALBERT-xxlarge is a very massive model, making it difficult for many people to run.

Therefore, it can essentially be said that, within the range of models most people can actually run, at the same prediction speed ALBERT performs worse, and at the same level of performance ALBERT is slower.

What about the training stage? One point not mentioned earlier is that ALBERT's parameter-sharing design has a strong regularizing effect, so dropout was removed in ALBERT. Parameter sharing and the removal of dropout do indeed save some VRAM and increase training speed, but my benchmarks show the improvement is only around 10% to 20%. That is, even if ALBERT's parameters are reduced to 1/10th or less of BERT's, it does not mean its VRAM usage drops to 1/10th, nor does it mean training speed increases 10x. On the contrary, the improvements are only marginal.

Dropping the Sharing Constraint

From the previous discussion, we can understand a few facts:

1. Regarding prediction only, ALBERT and BERT are basically identical;
2. ALBERT's parameter sharing generally has a negative impact on performance.

Since that is the case, we can try a fresh approach: when fine-tuning for downstream tasks, what if we remove the parameter-sharing constraint? In other words, during fine-tuning, treat ALBERT like BERT. This is equivalent to a BERT model where every transformer block is initialized with the same weights.

Performance Evaluation

Let's let the results speak for themselves. I selected four tasks for testing. To make the results more reliable, each experiment below was run three times, and the tables report the average of the three runs. The "unshared" suffix denotes the model with parameter sharing removed. The "Training Speed" column is the training time per epoch on a single TITAN RTX card, provided only for relative comparison.

The experiments were conducted with bert4keras. For the unshared version, one only needs to load the ALBERT weights via build_transformer_model with model='albert_unshared'. This is the "one line of code" mentioned in the title.
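Concretely, the loading code looks something like the sketch below, assuming a bert4keras version in which build_transformer_model lives in bert4keras.models; the config and checkpoint paths are placeholders for wherever your ALBERT files live.

```python
from bert4keras.models import build_transformer_model

config_path = 'albert_small_zh/albert_config.json'     # placeholder path
checkpoint_path = 'albert_small_zh/albert_model.ckpt'  # placeholder path

# Original ALBERT: all transformer blocks share one set of weights.
albert = build_transformer_model(config_path, checkpoint_path, model='albert')

# The "one line of code" change: load the same ALBERT weights, but build a BERT-like
# stack in which every block gets its own (identically initialized) copy of the weights.
albert_unshared = build_transformer_model(config_path, checkpoint_path,
                                          model='albert_unshared')
```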

First, a relatively simple text sentiment classification task.

\[\begin{array}{c|c|c|c|c} \hline \text{Model} & \text{Valid Set (valid)} & \text{Training Speed} & \text{Metric after 1st Epoch} & \text{Test Set (test)} \\ \hline \text{small\_unshared} & 94.66\% & 38s & 90.75\% & 94.35\% \\ \text{small} & 94.57\% & 33s & 91.02\% & 94.52\% \\ \hline \text{tiny\_unshared} & 94.02\% & 23s & 88.09\% & 94.13\% \\ \text{tiny} & 94.14\% & 20s & 90.18\% & 93.78\% \\ \hline \end{array}\]

After removing parameter sharing, the training time increased slightly, as expected. As for model performance, the results are mixed. Considering that the accuracy for this task is already quite high, it might not highlight the gap between models, so let's continue with more complex tasks.

Next, we try the CLUE IFLYTEK long-form text classification. Results are as follows:

\[\begin{array}{c|c|c|c} \hline \text{Model} & \text{Valid Set (dev)} & \text{Training Speed} & \text{Metric after 1st Epoch} \\ \hline \text{small\_unshared} & 57.73\% & 27s & 49.35\% \\ \text{small} & 57.14\% & 24s & 48.21\% \\ \hline \text{tiny\_unshared} & 55.91\% & 16s & 47.89\% \\ \text{tiny} & 56.42\% & 14s & 43.84\% \\ \hline \end{array}\]

Here, the advantage of the unshared version begins to emerge, mainly reflected in faster overall convergence (see the metric after the first epoch). The "small" unshared version's best performance is significantly better. The "tiny" unshared version's best performance is slightly worse, but by fine-tuning the learning rate, the tiny_unshared version can actually outperform the tiny version (though doing so would introduce too many variables, so the table shows results with strictly controlled variables).

Then we try a more comprehensive task: Information Extraction. Results are as follows:

\[\begin{array}{c|c|c|c} \hline \text{Model} & \text{Valid Set (dev)} & \text{Training Speed} & \text{Metric after 1st Epoch} \\ \hline \text{small\_unshared} & 77.89\% & 375s & 61.11\% \\ \text{small} & 77.69\% & 335s & 46.58\% \\ \hline \text{tiny\_unshared} & 76.44\% & 235s & 49.74\% \\ \text{tiny} & 75.94\% & 215s & 31.66\% \\ \hline \end{array}\]

As can be seen, in more complex comprehensive tasks, the unshared version consistently outperforms the original model of the same scale.

Finally, we use Seq2Seq for reading comprehension style Question Answering. Results are as follows:

\[\begin{array}{c|c|c|c} \hline \text{Model} & \text{Valid Set (dev)} & \text{Training Speed} & \text{Metric after 1st Epoch} \\ \hline \text{small\_unshared} & 68.80\% & 607s & 57.02\% \\ \text{small} & 66.66\% & 582s & 50.93\% \\ \hline \text{tiny\_unshared} & 66.15\% & 455s & 48.64\% \\ \text{tiny} & 63.41\% & 443s & 37.47\% \\ \hline \end{array}\]

The main purpose of this task is to test the model's text generation capability. It can be seen that on this task, the unshared versions have significantly outperformed the original models; the tiny unshared model even approaches the performance of the original small model.

Analysis and Reflection

The experiments above were conducted on ALBERT-tiny/small. In fact, I have also experimented with the "base" version, and the conclusion matches those of tiny and small. However, experiments on the base version (and naturally large/xlarge versions) take too long, so I haven't completed a full set (nor repeated them three times), which is why they aren't listed. But overall, it feels like the results from tiny/small are already quite representative.

The results show that removing parameter sharing from ALBERT during downstream tasks results in performance that is basically equal to or better than the original ALBERT. This suggests that for many NLP tasks, parameter sharing might not be a good constraint. One might wonder: "Why does ALBERT with parameter sharing start to outperform BERT without sharing only when the model scale reaches xlarge or even xxlarge?" I will try to offer an explanation.

Theoretically, BERT's methods for preventing overfitting are dropout and weight decay. Weight decay is also used in ALBERT, but dropout is not, so we can think about it from the perspective of dropout. Many experiments show that dropout is indeed an effective strategy for reducing overfitting risk, but those experiments rely on models that are much smaller than BERT-xlarge or BERT-xxlarge, so the effectiveness of dropout in ultra-large models remains questionable. In fact, dropout has a training/inference inconsistency issue: strictly speaking, the model used for training and the model used for prediction are not the same model. I personally suspect that as models become larger and deeper, this inconsistency is further amplified. Thus, I believe dropout is not an effective way to prevent overfitting for ultra-large models. What if we simply remove dropout from BERT? That does not work well either: without dropout there are few remaining ways to suppress overfitting, and with such a large parameter count the overfitting would be severe. ALBERT, in contrast, removes dropout and instead introduces implicit regularization through parameter sharing, which keeps the model from degrading as it grows larger and deeper and allows it to perform even better.

Conversely, for ALBERT's parameter sharing to yield better performance, the condition is that it must be sufficiently large and deep. Thus, if we are using "base," "small," or especially "tiny" versions, we probably shouldn't use parameter sharing because, for small models, it acts as an unnecessary restriction on model representation capacity. In these cases, removing the sharing constraint leads to better performance.

Conclusion

This article explored a fresh approach: removing the parameter-sharing constraint of ALBERT during the fine-tuning stage and treating ALBERT like BERT. In several tasks, it was found that this approach yields performance that is equal to or better than the original ALBERT. Finally, I provided my personal interpretation of ALBERT and this phenomenon.