By 苏剑林 | June 20, 2022
If large-scale pre-trained models are the "Zhang Liang's stratagems" of natural language processing, then what is the matching "wall-climbing ladder"? (The allusion is to the Chinese saying "you may have Zhang Liang's stratagems, but I have my wall-climbing ladder.") In my view, it is the various techniques for efficiently fine-tuning these large models on downstream tasks. Besides directly fine-tuning all parameters, there are many parameter-efficient fine-tuning techniques such as Adapter and P-tuning, which reach performance close to full fine-tuning while updating only a small fraction of the parameters. However, these techniques are usually only "parameter-efficient," not necessarily "training-efficient": they still require backpropagation through the entire model to obtain gradients for the few trainable parameters. In other words, although the number of trainable parameters drops sharply, training speed does not improve markedly.
A recent paper, "LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning," proposes a new training technique called "Ladder Side-Tuning (LST)." It claims to achieve both parameter efficiency and training efficiency simultaneously. Is there truly such an ideal "wall-climbing ladder"? Let's dive in and learn about it.
The structure of LST can be clearly explained using "Figure 2" from the original paper:
[Figure 2: Comparison of LST with Adapter and P-tuning]
Backpropagation, the process of computing the model's gradients, proceeds step by step from the output layer back to the input layer. The depth (and hence the cost) of backpropagation therefore depends on where the trainable parameters closest to the input sit; it is not strictly tied to the total number of trainable parameters. For Adapter, small trainable layers are inserted after every layer of the original model; although the original parameters are frozen, new layers exist at every depth, so backpropagation must still travel almost all the way back to the input layer. For P-tuning, the few trainable parameters live in the Embedding layer, and the Embedding layer is the input layer, so backpropagation must traverse the entire model. Consequently, neither scheme significantly improves training efficiency.
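To make this concrete, here is a minimal PyTorch sketch (toy layer sizes, not a real pre-trained model) of the P-tuning situation: even with every backbone weight frozen, a trainable tensor at the input forces the backward pass through the whole stack.

```python
import torch
import torch.nn as nn

# Toy 12-layer "backbone" whose weights are all frozen.
backbone = nn.Sequential(*[nn.Linear(64, 64) for _ in range(12)])
for p in backbone.parameters():
    p.requires_grad = False

# P-tuning-style setup: the only trainable tensor sits at the very input.
prompt = nn.Parameter(torch.randn(8, 64))

loss = backbone(prompt).sum()
loss.backward()

# Although only `prompt` is trainable, autograd still had to walk back
# through all 12 frozen layers to reach it, so the backward pass is as
# deep as the forward pass.
print(prompt.grad.shape)  # torch.Size([8, 64])
```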
As for LST, it builds a "side branch" (a ladder) alongside the original large model. It takes the outputs from specific layers of the large model as inputs for the side branch model. All trainable parameters reside solely within the side branch. Since the original large model only provides inputs and does not require gradient updates, the complexity of backpropagation depends only on the scale of the side branch model. It does not require backpropagation through the original large model, which leads to a significant increase in training efficiency.
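In code, the wiring might look like the following PyTorch sketch. Everything here (the layer sizes, the sigmoid gates, the ReLU rungs) is my own simplification for illustration; the paper's side branch is a slimmed-down Transformer. The key point is that the backbone's hidden states are consumed under `no_grad()`, so gradients only ever flow through the side branch.

```python
import torch
import torch.nn as nn

class LadderSideTuning(nn.Module):
    """Minimal sketch of the LST wiring (hypothetical sizes and gating)."""
    def __init__(self, backbone_layers, d_big=768, d_small=96):
        super().__init__()
        self.backbone = nn.ModuleList(backbone_layers)
        for p in self.backbone.parameters():
            p.requires_grad = False            # the big model stays frozen
        n = len(backbone_layers)
        # Trainable pieces: down-projections from the backbone's hidden
        # size, per-rung side layers, and learned mixing gates.
        self.down = nn.ModuleList(nn.Linear(d_big, d_small) for _ in range(n))
        self.rung = nn.ModuleList(nn.Linear(d_small, d_small) for _ in range(n))
        self.gate = nn.Parameter(torch.zeros(n))

    def forward(self, x, side):
        # Forward-only pass through the big model: no_grad() detaches the
        # hidden states, so backprop never enters the backbone.
        hidden = []
        with torch.no_grad():
            for layer in self.backbone:
                x = layer(x)
                hidden.append(x)
        # The "ladder": each rung mixes the side state with the
        # corresponding backbone hidden state, then transforms it.
        for i, h in enumerate(hidden):
            g = torch.sigmoid(self.gate[i])
            side = torch.relu(self.rung[i](g * side + (1 - g) * self.down[i](h)))
        return side

# Usage with a toy frozen "backbone" of four linear layers:
layers = [nn.Linear(768, 768) for _ in range(4)]
model = LadderSideTuning(layers)
out = model(torch.randn(2, 768), torch.randn(2, 96))
print(out.shape)  # torch.Size([2, 96])
```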
The original paper performed several experiments with LST across both NLP and CV domains. Below are the results for LST on the GLUE dataset:
[Figure: LST experimental results on GLUE]
It can be seen that LST is indeed both parameter-efficient and training-efficient: it achieves decent fine-tuning performance with fewer trainable parameters and lower training costs. In particular, the experimental results in the last two rows demonstrate the possibility of fine-tuning large models with limited training resources via LST.
I also attempted a simple implementation on the Chinese CLUE tasks. The reference code can be found here:
Github: https://github.com/bojone/LST-CLUE
Note that while the "ladder" in the original paper uses MLP-style layers (similar to those in Adapters), my implementation directly uses "Attention + FFN" combinations, mirroring the Transformer architecture (a sketch of such a ladder block is given below). The number of trainable parameters was kept at roughly 1 million, about 1.2% of the base version or 0.4% of the large version, and the ladder was initialized randomly.
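Concretely, each rung in my version is an ordinary Transformer block rather than an Adapter-style MLP. A minimal PyTorch sketch of such a block (the repo itself is written with bert4keras, and the hidden size and head count here are placeholders):

```python
import torch.nn as nn

class LadderBlock(nn.Module):
    """One rung of the ladder as an "Attention + FFN" mini-Transformer
    block; sizes are illustrative, not the repo's actual settings."""
    def __init__(self, d=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):                    # x: (batch, seq_len, d)
        attn_out, _ = self.attn(x, x, x)     # self-attention
        x = self.norm1(x + attn_out)         # residual + post-norm
        return self.norm2(x + self.ffn(x))   # FFN with residual
```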
The final results on the validation sets are as follows:

\[\small{\begin{array}{c|ccccccccccc} \hline & \text{iflytek} & \text{tnews} & \text{afqmc} & \text{cmnli} & \text{ocnli} & \text{wsc} & \text{csl} & \text{cmrc2018} & \text{c3} & \text{chid} & \text{cluener}\\ \hline \text{BERT base} & 60.06 & 56.80 & 72.41 & 79.56 & 73.93 & 78.62 & 83.93 & 56.17 & 60.54 & 85.69 & 79.45 \\ \text{RoBERTa base} & 60.64 & 58.06 & 74.05 & 81.24 & 76.00 & 87.50 & 84.50 & 56.54 & 67.66 & 86.71 & 79.47\\ \hline \text{RoBERTa base + LST} & 59.29 & 56.82 & 70.37 & 76.27 & 71.02 & 68.09 & 82.63 & 42.50 & 56.97 & 69.35 & 78.30\\ \text{RoBERTa large + LST} & 60.41 & 57.12 & 72.36 & 75.80 & 72.07 & 75.00 & 84.23 & 39.98 & 60.19 & 72.55 & 77.80\\ \hline \end{array}}\]

The results are not as optimistic as the English experiments in the original paper (though this may be due to my implementation being suboptimal). However, training efficiency did increase significantly (roughly doubled on average). My overall impression after these experiments is that for conventional classification tasks of ordinary difficulty, LST can achieve comparable results; for harder tasks, however, such as reading comprehension, it suffers a very significant performance drop.
Of course, this issue is likely not unique to LST. Most fine-tuning methods that claim to be parameter-efficient probably face this problem, as these methods are mostly tested on GLUE, which consists entirely of relatively simple classification tasks...
With the benefit of hindsight, LST is not a particularly "sophisticated" approach. In essence, it involves freezing the pre-trained model and using its output layer and some intermediate layer results as supplementary inputs to train a new small model. Once this is understood, many readers may already have similar schemes brewing in their minds. However, the true significance of LST lies in telling us that we can do this, providing a feasible reference scheme, and proving through experiments that it is an effective way to utilize large models.
Readers with similar research experience will realize that the initialization of the new "ladder" branch in LST is a critical issue. If it is completely randomly initialized, there may be difficulties in training, leading to suboptimal results. The original paper also mentions this and provides a scheme for taking slices of the large model's weight matrices to initialize the small model's matrices, thereby improving LST's final performance. The details can be found in the paper. As for my own implementation, I simply wanted to verify the effectiveness of LST, so I was lazy and did not implement this step.
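For intuition, the simplest possible form of that slicing idea might look like the sketch below. Note that taking the leading block is my own simplifying assumption; the paper selects which rows and columns to keep via a structural-pruning criterion rather than just taking the first ones.

```python
import torch

def slice_init(w_big: torch.Tensor, d_out: int, d_in: int) -> torch.Tensor:
    """Naive slicing initialization: seed a small weight matrix with the
    leading block of the corresponding big one (a simplification of the
    paper's pruning-based selection)."""
    return w_big[:d_out, :d_in].clone()

# e.g. initialize a 96x96 side-branch projection from a 768x768 backbone matrix
small = slice_init(torch.randn(768, 768), 96, 96)
```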
Taking this a step further: since the initialization of the added "ladder" branch is difficult, yet LST is indeed an effective way to fine-tune large models, could we pre-reserve this "ladder" when training new large models in the future? That is, we could directly include this "ladder" as part of the pre-trained model during large-scale pre-training. Then, during fine-tuning, we would only fine-tune the "ladder." This would allow for efficient fine-tuning of large models without worrying about initialization issues.
In terms of form, I find LST quite similar to BERT-of-Theseus, a module-replacement-based model compression method described previously. The difference is that the latter aims at distilling a small model and still requires backpropagation through the large model, whereas LST aims to improve training efficiency by avoiding backpropagation through the large model, even though the large model is still needed for the forward pass. You could say the two are somewhat complementary.
This article introduced Ladder Side-Tuning, a fine-tuning method for large models that is both parameter-efficient and training-efficient.