By 苏剑林 | March 20, 2023
Last week, I wrote "Why are modern LLMs all Decoder-only architectures?", summarizing some of my experimental conclusions and conjectures on this question. As expected, hot topics draw traffic: the PaperWeekly repost passed 10,000 views in a short time, and the post received many likes on Zhihu. Across several platforms I received a variety of comments and questions from readers, and I have collected the most representative ones into this FAQ in the hope of clearing up any remaining doubts.
Review
In "Why are modern LLMs all Decoder-only architectures?", I compared GPT and UniLM architectures through experiments and, combined with previous research experience, conjectured the following conclusions:
- Changing the attention of the input part to bidirectional does not bring benefits; the advantage of the Encoder-Decoder architecture is likely merely derived from doubling the parameters.
- The reason bidirectional attention fails to bring benefits may be that the low-rank problem of the bidirectional attention matrix degrades performance (a small numerical sketch of this rank argument appears at the end of this section).
Therefore, based on these two conjectures, we arrived at the conclusion:
Under the same number of parameters and the same inference cost, the Decoder-only architecture is the optimal choice.
For details regarding the experiments and reasoning, please refer to the original article; I will not repeat them here.
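To make the rank argument concrete, here is a minimal numerical sketch, assuming randomly drawn Gaussian $Q$ and $K$ and purely illustrative sizes rather than the weights of any actual model: the pre-softmax score matrix $QK^{\top}$ has rank at most $d \ll n$, while the causally masked attention matrix is lower-triangular with a strictly positive diagonal, so its determinant is positive and it is always full rank.

```python
import numpy as np

# Illustrative sizes (not taken from any particular model); the point is n >> d.
n, d = 512, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

scores = Q @ K.T / np.sqrt(d)  # (n, n) attention scores

# The score matrix factors through d dimensions, so its rank is at most d << n.
print("rank of QK^T:", np.linalg.matrix_rank(scores))  # at most 64

# Causal (lower-triangular) mask: each row's softmax puts strictly positive weight
# on the diagonal, so the attention matrix is triangular with a positive diagonal;
# its determinant (the product of the diagonal entries) is positive, hence full rank.
mask = np.tril(np.ones((n, n), dtype=bool))
A_causal = softmax(np.where(mask, scores, -np.inf))
sign, logdet = np.linalg.slogdet(A_causal)
print("min diagonal entry:", A_causal.diagonal().min())  # > 0
print("sign of det(A_causal):", sign)                    # +1.0, i.e. full rank
```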
Q&A
Here are my answers to some of the readers' doubts.
Question 1: Does $n \gg d$ really hold?
Answer: $n$ is the sequence length, and $d$ is the head_size, not the hidden_size. In multi-head attention, head_size = hidden_size / heads. For example, in BERT base, head_size = 768 / 12 = 64, while the pre-training length $n$ is generally 512, so $n \gg d$ roughly holds in most cases.
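As a worked instance of these numbers, write $Q, K \in \mathbb{R}^{n\times d}$ for a single head's query and key matrices; the reason $n \gg d$ matters is the rank bound on the score matrix:

$$d = \frac{\text{hidden\_size}}{\text{heads}} = \frac{768}{12} = 64 \ll 512 = n, \qquad \operatorname{rank}\big(QK^{\top}\big) \le \min(n, d) = d.$$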
Question 2: BERT and the original GPT have the same parameter count; why is BERT better at understanding tasks?
Answer: BERT and GPT differ not only in architecture but also in their pre-training tasks, so a head-to-head comparison of the two is not a fair one. At the end of the original article, I offered an idea for improving BERT using GPT's principles, and preliminary experiments suggest it would likely outperform BERT; that is the experiment in which the variables were strictly controlled.
Question 3: "Performance degradation caused by the low-rank problem of bidirectional attention" seems like a bug. Since the vast majority of models in the industry today use bidirectional attention, wouldn't the impact be too widespread?
Answer: We did not conclude that "bidirectional attention is disastrous for any task." The fact that "most models in the industry use bidirectional attention" does not actually conflict with the conclusions of the original article. Our experimental conclusion was that "introducing bidirectional attention in the Encoder of a generation task does not seem to yield benefits," and the condition attached to it is very specific: "in the Encoder of a generation task."
Question 4: I don't think so... Decoder models are simply better suited to dialogue. Inside Google, Encoder, Decoder, and Encoder-Decoder LLMs all exist; they simply target different application scenarios, and each performs better on different tasks.
Answer: The answer to this is similar to the previous one. The fact that "Encoder, Decoder, and Encoder-Decoder models all exist" does not contradict the original conclusion. We only tentatively speculated that "introducing bidirectional attention in the Encoder for generation tasks does not seem to yield benefits"; we did not say that the doubling of parameters brought by the Encoder would not yield benefits.
Question 5: Isn't your conclusion in conflict with the conclusions of T5 and UL2?
Answer: First, the original conclusion is not in conflict with UL2. The original article speculated that "under the same number of parameters and same inference cost, Decoder-only is the optimal choice," whereas UL2's conclusion was that Encoder-Decoder performed better, but Encoder-Decoder and Decoder-only were not compared under the same number of parameters. Second, the conclusion of the original article does indeed conflict somewhat with the experimental results in the T5 paper (Table 2). However, I also have doubts about the T5 experimental results:
- Whether the variables were strictly controlled between Decoder-only and UniLM in that table is unclear; the gap between them seems implausibly large. Even if Decoder-only were inferior to UniLM, the gap should not be that big.
- This article compares UniLM and Decoder-only models trained from scratch on the same task and data (comparing the pre-training results directly, without fine-tuning on other tasks), whereas the T5 paper compares results obtained by pre-training on various tasks and then fine-tuning on downstream tasks. The procedures are different; could that account for the discrepancy in results?
Question 6: Does the faster drop in loss in the final experiment prove the model is better?
Answer: For the number of steps I have trained so far, the mixed bidirectional/unidirectional attention has consistently performed better, and I can only conjecture that this trend will continue. This is the limit of what I can experiment with at present; I hope interested readers with sufficient resources will run further experiments to confirm or refute this conclusion.
Question 7: Regarding your statement "Comparing GPT with UniLM is considered a strict control of variables," I don't think that is quite accurate. The Google UL2 paper points out that for pre-trained language models, both the model architecture and the pre-training tasks play a key role in model quality.
Answer: In this article, UniLM and GPT refer to two model architectures that differ only in their Attention Mask. In the comparative experiments, everything other than the Attention Mask was kept identical.
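To make "only the Attention Mask differs" concrete, here is a minimal sketch of the two masks; the sequence length and input-prefix length are made-up toy values. GPT uses a fully causal lower-triangular mask, while UniLM additionally lets the input (prefix) segment attend to itself bidirectionally and keeps the causal mask on the output segment.

```python
import numpy as np

def gpt_mask(n: int) -> np.ndarray:
    """Fully causal mask: position i may attend only to positions j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def unilm_mask(n: int, prefix_len: int) -> np.ndarray:
    """UniLM-style (prefix-LM) mask: the first `prefix_len` positions (the input)
    attend to each other bidirectionally; the remaining positions stay causal."""
    mask = gpt_mask(n)
    mask[:prefix_len, :prefix_len] = True  # bidirectional attention inside the prefix
    return mask

# Toy example: a sequence of 6 tokens whose first 3 tokens form the input segment.
n, prefix_len = 6, 3
print(gpt_mask(n).astype(int))
print(unilm_mask(n, prefix_len).astype(int))
# Everything else (layer structure, parameter count, training task and data)
# can be held fixed, which is what makes the comparison a controlled one.
```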
Question 8: Could there be another reason: that a lower-triangular or upper-triangular mask handles positional encoding information better?
Answer: This is a very novel perspective that I had not considered. Indeed, beyond raising the rank, the triangular mask also brings an advantage in positional identification: it breaks the permutation invariance of the Transformer and directly imposes a left-to-right order, so it works even without adding positional encodings. Perhaps both are contributing factors.
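A small sketch of this point, using a single random attention head with no positional encodings and made-up toy sizes: with bidirectional attention, shuffling the tokens merely shuffles the outputs (permutation equivariance), whereas with a causal mask the same tokens produce different outputs, i.e. the mask itself injects order information.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16  # toy sizes, for illustration only

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv, causal=False):
    """One self-attention head with no positional encodings."""
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(Wq.shape[1])
    if causal:
        mask = np.tril(np.ones((len(X), len(X)), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ (X @ Wv)

X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
perm = np.arange(n)[::-1]  # reverse the token order

# Bidirectional attention: reversing the tokens just reverses the outputs,
# so without positional encodings the layer cannot tell the orders apart.
out, out_rev = attention(X, Wq, Wk, Wv), attention(X[perm], Wq, Wk, Wv)
print(np.allclose(out_rev, out[perm]))  # True: permutation-equivariant

# Causal attention: the fixed lower-triangular mask depends on position,
# so the same tokens yield different outputs once reordered.
out, out_rev = attention(X, Wq, Wk, Wv, causal=True), attention(X[perm], Wq, Wk, Wv, causal=True)
print(np.allclose(out_rev, out[perm]))  # False: the mask injects order information
```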
Summary
This article has answered some of the questions raised by readers regarding the previous post.