Why are current LLMs all using a Decoder-only architecture?

By 苏剑林 | March 17, 2023

LLM is an abbreviation for "Large Language Model," which currently generally refers to language models with more than 10 billion parameters, primarily oriented towards text generation tasks. Unlike the "hundred flowers blooming" diversity of smaller models (roughly 1 billion parameters or fewer), current LLM research is largely concentrated on the Decoder-only architecture. Aside from OpenAI, which has always insisted on the Decoder-only GPT series, even companies like Google, which have not placed all their bets on Decoder-only, have invested significant energy into researching Decoder-only models such as PaLM. So why has the Decoder-only architecture become the mainstream choice for LLMs?

There is a similar question on Zhihu: "Why are current LLMs all using a Decoder-only architecture?" Most answers there focus on the advantages of Decoder-only in terms of training efficiency and engineering implementation. But does it have theoretical advantages? This article attempts to provide a simple analysis from that perspective.

Unified Perspective

It should be noted that the largest models I have trained are only at the billion-parameter level, so from the perspective of LLMs, I might not be qualified to answer this question. The following content is just my attempt to provide an answer based on some research experience and a theoretical perspective. Many inferences are based on my own experimental results, and some parts may conflict with the results of certain literature; please use your own judgment.

We know that general NLP tasks involve predicting an output based on a given input; completely unconditional random generation is rare. In other words, any NLP task can be decomposed into an "input" part and an "output" part. We can call the model that processes the "input" the Encoder and the model that generates the "output" the Decoder. Thus, all tasks can be understood from the "Encoder-Decoder" perspective. The difference between different models lies in the attention patterns of the Encoder and Decoder, and whether parameters are shared:

$$\begin{array}{c|ccc} \hline & \text{Encoder Attention} & \text{Decoder Attention} & \text{Shared Parameters} \\ \hline \text{GPT} & \text{Unidirectional} & \text{Unidirectional} & \text{Yes} \\ \text{UniLM} & \text{Bidirectional} & \text{Unidirectional} & \text{Yes} \\ \text{T5} & \text{Bidirectional} & \text{Unidirectional} & \text{No} \\ \hline \end{array}$$

Here, GPT is the representative of Decoder-only; UniLM is a Decoder architecture similar to GPT but uses a mixed attention pattern; T5 is the representative of the Encoder-Decoder architecture, which Google is particularly interested in.

[Figures: the attention patterns Bidirectional, Mixed, Unidirectional (Forward), and Unidirectional (Backward)]
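
To make these attention patterns concrete, here is a minimal NumPy sketch that constructs the corresponding boolean masks (True means "may attend") for a sequence of n_in input tokens followed by n_out output tokens. The function names are my own and purely illustrative:

```python
import numpy as np

def causal_mask(n):
    """GPT-style unidirectional mask: token i attends to tokens 0..i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n_in, n_out):
    """UniLM-style mixed mask: bidirectional over the input prefix,
    unidirectional over the output suffix."""
    n = n_in + n_out
    mask = np.tril(np.ones((n, n), dtype=bool))  # start from the causal mask
    mask[:, :n_in] = True                        # every token may attend to the whole input
    return mask

def encoder_decoder_masks(n_in, n_out):
    """T5-style masks: a fully bidirectional encoder self-attention mask,
    a causal decoder self-attention mask, and a full cross-attention mask."""
    enc = np.ones((n_in, n_in), dtype=bool)
    dec = np.tril(np.ones((n_out, n_out), dtype=bool))
    cross = np.ones((n_out, n_in), dtype=bool)
    return enc, dec, cross

print(causal_mask(4).astype(int))
print(prefix_lm_mask(2, 2).astype(int))
```

Note that the UniLM mask is just the GPT mask with the input columns opened up, which is why the GPT-versus-UniLM comparison below isolates the effect of bidirectional attention over the input.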

Google conducted extensive comparative experiments in the T5 and UL2 papers. The results consistently showed the superiority of the Encoder-Decoder architecture over Decoder-only. However, since the model scales in these two papers are not large from an LLM perspective, and since most LLMs are indeed Decoder-only, whether this advantage carries over to larger scales and the reason for the advantage itself remain unanswered.

Comparative Experiments

From the table above, we can see that comparing GPT with UniLM is actually more of a controlled variable study. If GPT is directly compared with T5, two variables are actually introduced: the attention for the input part is changed to bidirectional, and the parameters are doubled. The reason they are compared together is that their inference costs are roughly the same.

Compared to GPT, T5 introduces two variables, so we cannot determine whether the advantage of the Encoder-Decoder architecture comes from the bidirectional attention over the input part or from the doubling of parameters. To this end, I conducted comparative experiments between GPT and UniLM at the 1-billion-parameter scale. The results showed that, when training from scratch on the same input-output data (the loss is computed only on the output part, and the only difference is the attention pattern over the input part), UniLM showed no advantage over GPT and was even worse on some tasks.
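
To make the training setup concrete: "the loss is computed only on the output part" simply means masking out the input positions when computing the cross-entropy. Here is a minimal PyTorch-style sketch with illustrative tensor names; it is not my actual training code:

```python
import torch
import torch.nn.functional as F

def output_only_loss(logits, targets, is_output):
    """Cross-entropy averaged over output positions only.

    logits:    (batch, seq_len, vocab) next-token predictions
    targets:   (batch, seq_len)        target token ids (already shifted by one)
    is_output: (batch, seq_len)        bool, True where the token belongs to the output part
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    mask = is_output.reshape(-1).float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```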

Assuming this conclusion is representative, we can tentatively conclude:

Changing the attention of the input part to bidirectional does not bring benefits; the advantage of the Encoder-Decoder architecture likely stems merely from the doubling of parameters.

In other words, given the same number of parameters and the same inference cost, the Decoder-only architecture is likely the optimal choice. Of course, to fully verify this guess, more supplemental experiments are needed, such as keeping Encoder and Decoder parameters non-shared but changing the Encoder to unidirectional attention, or to the mixed forward-backward attention introduced in the next section, and then comparing it with the standard Encoder-Decoder architecture. However, due to my limited computing power, I leave these experiments to interested readers.

Low-rank Issue

Why does "changing the attention of the input part to bidirectional not bring benefits"? Since the input part doesn't need to consider autoregressive generation, wouldn't a complete attention matrix intuitively be better? I suspect this is likely due to the performance degradation caused by the low-rank issue of bidirectional attention.

As is well known, an Attention matrix is generally obtained by applying a softmax to the product of a low-rank factorization: an $n \times d$ matrix multiplied by a $d \times n$ matrix, with $n \gg d$. An Attention matrix of this form suffers from a decline in expressive power due to the low-rank issue; for a detailed analysis, please refer to "Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth". In contrast, the Attention matrix of a Decoder-only architecture is a lower triangular matrix. Note that the determinant of a triangular matrix is the product of its diagonal elements, and because of the softmax the diagonal elements are all positive, so the determinant must be positive. This means the Attention matrix of a Decoder-only architecture is necessarily full-rank, and full rank implies theoretically stronger expressive power. In other words, the Attention matrix of the Decoder-only architecture theoretically has stronger expressive power, and switching it to bidirectional attention may instead leave the expressive power insufficient.
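
This full-rank argument is easy to check numerically. The NumPy sketch below (random $Q$ and $K$ are merely stand-ins for real activations) verifies that the causal softmax Attention matrix is exactly lower triangular with a strictly positive diagonal, hence has positive determinant and rank $n$, while the unmasked pre-softmax score matrix has rank at most $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 64, 8                                   # sequence length n >> head dimension d
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

scores = Q @ K.T / np.sqrt(d)                  # (n, n) pre-softmax scores, rank <= d
masked = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)

A = np.exp(masked - masked.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)             # causal softmax Attention matrix

print(np.allclose(np.triu(A, k=1), 0))         # True: A is exactly lower triangular
print(A.diagonal().min() > 0)                  # True: the softmax makes every diagonal entry positive
print(np.prod(A.diagonal()) > 0)               # True: det(A) = product of the diagonal > 0, so rank(A) = n
print(np.linalg.matrix_rank(scores))           # 8 (= d): the unmasked score matrix is genuinely low-rank
```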

There is also an indirect phenomenon supporting this view: the performance gap between Linear Attention and standard Attention on language modeling tasks (unidirectional attention) is smaller than the gap on MLM tasks (bidirectional attention); that is, Linear Attention falls further behind on bidirectional-attention tasks. This is because, on language modeling tasks, the Attention matrix of Linear Attention is a full-rank lower triangular matrix, just like that of standard Attention. On MLM tasks, however, the rank of the Linear Attention matrix is lower than that of the standard Attention matrix: Linear Attention is simply an $n \times d$ matrix multiplied by a $d \times n$ matrix without a softmax, so its rank never exceeds $d$, whereas standard Attention applies a softmax to that $n \times d$ times $d \times n$ product, and the softmax has a certain rank-boosting effect (refer to the "Low-rank Issue" section and the comment area of "Transformer Upgrade Road: 3. From Performer to Linear Attention").
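
The rank gap in the bidirectional case is also easy to see numerically. Below is a sketch comparing a row-normalized Linear Attention matrix, built with the commonly used non-negative feature map $\phi(x)=\mathrm{elu}(x)+1$, against a standard softmax Attention matrix; again, the random matrices are only stand-ins for real activations:

```python
import numpy as np

def elu_plus_one(x):
    """A non-negative feature map, a common choice for Linear Attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 64, 8
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Bidirectional (MLM-style) attention matrices: no causal mask.
linear = elu_plus_one(Q) @ elu_plus_one(K).T
linear /= linear.sum(axis=-1, keepdims=True)    # row normalization does not change the rank bound

scores = Q @ K.T / np.sqrt(d)
standard = np.exp(scores - scores.max(axis=-1, keepdims=True))
standard /= standard.sum(axis=-1, keepdims=True)

print(np.linalg.matrix_rank(linear))            # <= d (= 8): bounded by the feature dimension
print(np.linalg.matrix_rank(standard))          # typically much larger than d: softmax boosts the rank
```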

Conversely, can this conclusion be used to improve bidirectional attention models like BERT? The idea is not hard to conceive. For example, in Multi-Head Attention, half of the heads' Attention matrices could be restricted to lower triangular matrices (forward attention) and the other half to upper triangular matrices (backward attention). Or, the Attention matrices of odd-numbered layers could be restricted to lower triangular matrices (forward attention) and those of even-numbered layers to upper triangular matrices (backward attention). Both designs keep the model's overall interaction bidirectional (unlike GPT, where an earlier token cannot attend to a later one) while gaining the full-rank advantage of unidirectional attention.
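
As an illustration of the first variant, here is a minimal sketch that builds per-head masks with half of the heads attending forward and the other half backward; this is my own illustrative construction, not the exact setup used in the experiment below:

```python
import numpy as np

def mixed_head_masks(n, num_heads):
    """Per-head attention masks (True = may attend): the first half of the heads use
    forward (lower-triangular) attention, the second half backward (upper-triangular)."""
    forward = np.tril(np.ones((n, n), dtype=bool))   # token i attends to tokens <= i
    backward = np.triu(np.ones((n, n), dtype=bool))  # token i attends to tokens >= i
    masks = [forward if h < num_heads // 2 else backward for h in range(num_heads)]
    return np.stack(masks)                           # (num_heads, n, n)

masks = mixed_head_masks(n=6, num_heads=4)
print(masks[0].astype(int))    # a forward head
print(masks[-1].astype(int))   # a backward head
```

Pairing a forward head with a backward head in every layer means any two positions can still exchange information within a single layer, which is the sense in which the model remains bidirectional overall.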

I also conducted simple comparative experiments and found that the mixed forward-backward attention performs slightly better in MLM tasks than full bidirectional attention models like BERT:

[Figure: comparison of training curves between full bidirectional attention and mixed forward-backward attention]

The good news is that there is a slight advantage, providing indirect support for the previous guess. The bad news is that this experiment was only conducted on a base version model (100 million parameters); the effect on larger models is still unknown.

Article Summary

Therefore, the answer I offer is: LLMs primarily use the Decoder-only architecture not only because of its advantages in training efficiency and engineering implementation but also because, theoretically, the bidirectional attention of the Encoder suffers from a low-rank issue that may weaken the model's expressive power. For generation tasks, introducing bidirectional attention offers no substantial benefit. The reason the Encoder-Decoder architecture performs better in certain scenarios is likely just because it has double the parameters. Thus, for the same number of parameters and the same inference cost, the Decoder-only architecture is the optimal choice.