The Road to Transformer Upgrades: 20. What Makes MLA So Good? (Part 1)

By 苏剑林 | May 04, 2025

Since DeepSeek shot to popularity, its proposed Attention variant MLA (Multi-head Latent Attention) has drawn growing attention. Through an ingenious design, MLA can switch freely between MHA (Multi-Head Attention) and MQA (Multi-Query Attention), letting the model take whichever form best suits the differing characteristics of training and inference (Compute-Bound vs. Memory-Bound) and thereby maximize efficiency.

Admittedly, while MLA is highly effective, some argue it is not "elegant" enough, and efforts to find alternatives to MLA have persisted, including our own attempts. However, after a period of experimentation, we found that many Attention variants with the same or even larger KV Cache sizes ultimately do not perform as well as MLA. This has forced us to reflect: What exactly is the key reason behind MLA's outstanding performance?

In this article, I will detail my thought process and the related experimental results surrounding this question.

Observations

MLA was introduced in DeepSeek-V2. This article assumes the reader is already familiar with MLA, or at least understands the content introduced in the previous blog post "The Tug-of-War Between Cache and Performance: From MHA, MQA, GQA to MLA". Therefore, the specific details of MLA itself will not be overly expanded upon.

The main characteristics of MLA are as follows:

1. In the training phase, MLA is an MHA with qk_head_dims=(128+64) and v_head_dims=128;

2. In the decoding phase, MLA is a KV-Shared MQA with qk_head_dims=(512+64) and v_head_dims=512;

3. The splicing of $[\boldsymbol{q}_c, \boldsymbol{q}_r]$ and $[\boldsymbol{k}_c, \boldsymbol{k}_r]$ in MLA can be understood as a form of Partial RoPE (a shape-level sketch of the two views above follows the list).
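To make the dimension bookkeeping concrete, here is a minimal sketch in Python of MLA's two equivalent views; the variable names (d_c, d_r, d_kv, etc.) are our own labels for the quantities above, not DeepSeek's code.

```python
# Per-layer, per-token dimension bookkeeping for MLA's two views.
# d_c / d_v / d_r: per-head NoPE ("content"), value, and RoPE dims in the training (MHA) view;
# d_kv: the latent dim that serves as both K and V in the decoding (MQA) view.
d_c, d_v, d_r, d_kv = 128, 128, 64, 512

# Training view: an MHA whose Q and K heads are the concatenations [q_c, q_r] and [k_c, k_r].
qk_head_dims_train = d_c + d_r       # 128 + 64 = 192
v_head_dims_train = d_v              # 128

# Decoding view: a KV-Shared MQA in which the cached latent acts as both K and V.
qk_head_dims_decode = d_kv + d_r     # 512 + 64 = 576
v_head_dims_decode = d_kv            # 512

# What is actually cached: the 512-dim latent plus the 64 shared RoPE dims,
# i.e. the "576" that appears in the Cache column of the tables below.
kv_cache_per_token = d_kv + d_r      # 576
print(qk_head_dims_train, qk_head_dims_decode, kv_cache_per_token)
```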

Conjectures

MHA and GQA commonly use head_dims=128. MLA's head_dims, whether the 128+64 seen from the training side or the 512+64 seen from the inference side, is larger than 128. Combined with the experience from "Breaking Bottlenecks: Building a Stronger Transformer", we have:

Conjecture 1: Increasing head_dims is a key reason for MLA's success.

Additionally, the KV-Shared feature allows for increasing the head_dims or num_groups of GQA under the same KV Cache size. Thus:

Conjecture 2: KV-Shared is a key reason for MLA's success.

Finally, previous theories and experiments have shown that Partial RoPE might positively impact performance (refer to "The Road to Transformer Upgrades: 18. Principles for Choosing the RoPE Base"), so:

Conjecture 3: Partial RoPE is a key reason for MLA's success.

Experiments

We will now test the above conjectures one by one through experiments.

Settings

The hyperparameters common to all experiments are as follows (a minimal config sketch appears after the list):

1. Dense model similar to LLAMA3;

2. hidden_size=2048, num_layers=12, num_heads=16;

3. The optimizer is Muon, with per-head updates for the Attention part;

4. Training length of 4096, total tokens of 16B, total training steps of 16k;

5. All experiments only change the Attention, so parameter counts are not strictly aligned.
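For reference, the shared setup can be gathered into a small config sketch; the field names below are our own labels, loosely following HuggingFace-style conventions, and are not the actual training code.

```python
# Common experimental settings described above, collected in one place.
common_config = dict(
    architecture="LLAMA3-like dense",
    hidden_size=2048,
    num_hidden_layers=12,
    num_attention_heads=16,
    seq_length=4096,            # training length
    total_tokens=16 * 10**9,    # 16B tokens
    total_steps=16_000,
    optimizer="Muon",           # with per-head updates for the Attention matrices
)
```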

Part I

MLA's KV Cache size is 512+64, which is approximately equal to GQA2-128 (the first number is num_groups, the second is head_dims). Therefore, the baselines used for comparison are GQA2-128 and GQA1-256. To verify Partial RoPE, we added GQA1-256-PR, which splits the 256 dimensions of Q and K into 192+64, adds RoPE to the 64, and leaves the 192 without it.
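As a reference for what GQA1-256-PR does to Q and K, here is a minimal Partial RoPE sketch in PyTorch; the 192/64 split comes from the description above, while the helper names and the GPT-NeoX-style rotate_half convention are our assumptions rather than the exact training code.

```python
import torch

def rotate_half(x):
    # Pair the two halves of the last dim, as in the GPT-NeoX RoPE convention.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # x: [batch, seq, heads, dim]; cos/sin: broadcastable, e.g. [1, seq, 1, dim].
    return x * cos + rotate_half(x) * sin

def partial_rope_192_64(q, k, cos64, sin64):
    """GQA1-256-PR: rotate only the last 64 of the 256 head dims of Q and K,
    leaving the first 192 dims as NoPE. A sketch, not the exact training code."""
    q_nope, q_rope = q.split([192, 64], dim=-1)
    k_nope, k_rope = k.split([192, 64], dim=-1)
    q_rope = apply_rope(q_rope, cos64, sin64)
    k_rope = apply_rope(k_rope, cos64, sin64)
    return (torch.cat([q_nope, q_rope], dim=-1),
            torch.cat([k_nope, k_rope], dim=-1))
```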

The results are as follows:

\[\begin{array}{c|ccc} \hline & \text{Params} & \text{Loss} & \text{Cache} \\ \hline \text{MLA} & 894M & 2.721 & 576 \\ \text{GQA2-128} & 842M & 2.75 & 512 \\ \text{GQA1-256} & 943M & 2.72 & 512 \\ \text{GQA1-256-PR} & 943M & 2.711 & 512 \\ \hline \end{array}\]

That is, ordered by performance from worst to best: $$\text{GQA2-128} < \text{MLA} \lesssim \text{GQA1-256} < \text{GQA1-256-PR}$$ This provides preliminary verification of the roles of increased head_dims and of Partial RoPE. From this perspective, the seemingly "compromised" design of splicing RoPE and NoPE in MLA is likely the key reason for its superior performance! The original paper claims MLA even outperforms MHA, which is likely because the MHA it was compared against had head_dims of only 128.

Part II

To further verify the effect of increasing head_dims, we ran additional experiments for MHA, GQA2-192, and MLA-256. MHA is the conventional MHA with head_dims=128. GQA2-192 directly increases the head_dims of GQA2 to 192. MLA-256 increases MLA's 128+64 to 192+64. The comparison is as follows:

\[\begin{array}{c|ccc} \hline & \text{Params} & \text{Loss} & \text{Cache} \\ \hline \text{MHA} & 931M & 2.721 & 4096 \\ \text{MLA} & 894M & 2.721 & 576 \\ \text{MLA-256} & 989M & 2.705 & 576 \\ \text{GQA2-128} & 842M & 2.75 & 512 \\ \text{GQA2-192} & 899M & 2.729 & 768 \\ \text{GQA1-256} & 943M & 2.72 & 512 \\ \text{GQA1-256-PR} & 943M & 2.711 & 512 \\ \hline \end{array}\]

As can be seen, although MHA has a higher total parameter count and a KV Cache 7 times larger than MLA, its loss only barely matches MLA. This is close to the conclusion in DeepSeek-V2. Furthermore, GQA2-192 outperforms GQA2-128 but is inferior to GQA1-256. When MLA's head_dims is increased to (192+64), it shows further improvement compared to (128+64). All these phenomena indicate that increasing head_dims is far more effective than increasing num_groups.

Part III

Next, we verify KV-Shared, meaning K and V share all or most dimensions. Here, we mainly consider GQA variants with head_dims not exceeding 256 and control the total KV Cache size to be close to MLA. Thus, with KV-Shared, we can consider up to GQA2-256.

Since KV-Shared is not fully compatible with RoPE, following the approach of MLA, we divide the 256 dimensions into 192+64 parts, where:

1. The 192 part does not have RoPE and is shared between K and V;

2. The 64 part has RoPE and is used only for K;

3. V additionally projects 64 dims, which are concatenated to the shared 192 dims.

In this setup, the head_dims for both K and V are 256. The total KV Cache size is (192+64+64) dims per group, times num_groups=2, i.e. 640, slightly larger than MLA's 512+64=576. We denote this version as "GQA2-(192+64)-S1", where "S1" stands for "Shared-1".
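Below is a minimal sketch of how a single S1 group assembles its K and V heads from the cached pieces, reusing apply_rope from the Partial RoPE sketch above; the function name and tensor shapes are our assumptions.

```python
import torch

def build_kv_s1(c_shared, k_rope, v_extra, cos64, sin64):
    """GQA2-(192+64)-S1, per KV group (apply_rope as defined in the earlier sketch).
    c_shared: [b, s, g, 192]  NoPE part, shared between K and V (cached)
    k_rope:   [b, s, g, 64]   RoPE part, used only by K (cached)
    v_extra:  [b, s, g, 64]   extra projection, used only by V (cached)
    """
    k = torch.cat([c_shared, apply_rope(k_rope, cos64, sin64)], dim=-1)  # 192+64 = 256 dims
    v = torch.cat([c_shared, v_extra], dim=-1)                           # 192+64 = 256 dims
    return k, v

# Cached per token per group: 192 + 64 + 64 = 320 dims; with num_groups=2, 640 in total.
```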

Part IV

Another KV-Shared scheme is:

1. The 192 part does not have RoPE and is shared between K and V;

2. The 64 part has RoPE and is also shared between K and V;

3. Perform Attention; since V carries RoPE, this results in an absolute position encoding effect;

4. To ensure relative position encoding, the output is divided into 192+64 parts, and the 64 part undergoes an inverse RoPE transform.

In this approach, K and V are fully shared. The KV Cache size is (192+64) dims per group, times num_groups=2, i.e. 512, slightly smaller than MLA's 576. We call this version "GQA2-(192+64)-S2", where "S2" stands for "Shared-2". The underlying principle is the VO-RoPE recently proposed by the author; refer to "The Road to Transformer Upgrades: 19. The Second Type of Rotary Position Encoding".
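To make the S2 construction concrete, here is a sketch of one attention step with fully shared K/V and the inverse RoPE (VO-RoPE) applied to the output. We apply the inverse rotation at the query position, which is our reading of how relativity is restored; GQA head broadcasting and causal masking are omitted, and apply_rope is reused from the Partial RoPE sketch above.

```python
import torch

def gqa_s2_attention(q, kv, cos64, sin64):
    """GQA2-(192+64)-S2, with the shared KV already broadcast to the query heads.
    q:  [b, s, h, 256]   queries (192 NoPE dims + 64 RoPE dims)
    kv: [b, s, h, 256]   fully shared K = V (192 NoPE dims + 64 RoPE dims)
    """
    kv_nope, kv_rope = kv.split([192, 64], dim=-1)
    k = v = torch.cat([kv_nope, apply_rope(kv_rope, cos64, sin64)], dim=-1)

    q_nope, q_rope = q.split([192, 64], dim=-1)
    q = torch.cat([q_nope, apply_rope(q_rope, cos64, sin64)], dim=-1)

    scores = torch.einsum("bmhd,bnhd->bhmn", q, k) / 256 ** 0.5
    o = torch.einsum("bhmn,bnhd->bmhd", scores.softmax(dim=-1), v)

    # V carried the absolute rotation R_n; rotating the output's 64-dim slice by -m
    # at the query position turns it into R_{n-m}, i.e. a relative encoding (VO-RoPE).
    o_nope, o_rope = o.split([192, 64], dim=-1)
    o_rope = apply_rope(o_rope, cos64, -sin64)   # inverse rotation
    return torch.cat([o_nope, o_rope], dim=-1)
```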

Part V

Additionally, based on the same logic, we supplemented the experiments with GQA4 and GQA1. All experimental results are summarized as follows:

\[\begin{array}{c|ccc|c} \hline & \text{Params} & \text{Loss} & \text{Cache} & \text{Note} \\ \hline \text{MLA} & 894M & 2.721 & 576 & \\ \text{MLA-256} & 989M & 2.705 & 576 & \\ \text{GQA2-(192+64)-S1} & 946M & 2.714 & 640 & \\ \text{GQA2-(192+64)-S2} & 943M & 2.708 & 512 & \text{Introduces VO-RoPE} \\ \text{GQA4-(64+64)-S2} & 842M & 2.738 & 512 & \\ \text{GQA4-(128+64)-S2} & 899M & 2.713 & 768 & \text{Largest KV Cache} \\ \text{GQA1-(512+64)-S3} & 1171M & 2.677 & 576 & \text{Largest head\_dims} \\ \hline \end{array}\]

Here "GQA1-(512+64)-S3" is an MQA implemented according to MLA's inference form, with a structure caught between S1 and S2. Its main feature is the large head_dims.

Interpretation of results:

1. KV-Shared GQA inherently includes Partial RoPE;

2. KV-Shared GQA2-256 can also outperform MLA;

3. The introduction of VO-RoPE seems beneficial (S1 ≲ S2);

4. For the same KV Cache, larger head_dims are better;

5. GQA2-(192+64)-S2 slightly outperforms GQA1-256-PR;

6. GQA4-(128+64)-S2 has the largest KV Cache, but its performance is not optimal, again indicating head_dims is more critical.

Regarding KV-Shared, there are two additional observations:

1. During training, GQA1-256-PR leads GQA2-(192+64)-S2 by a clear margin in the early stages, but is caught up with or even slightly overtaken later on, suggesting that GQA1-256-PR may lack "stamina";

2. Without KV-Shared, GQA is at most GQA1-256, meaning head_dims is capped at 256. But with KV-Shared, GQA can reach GQA1-512-S. Purely from a head_dims perspective, KV-Shared has a higher ceiling.

Part VI

Since parameter counts were not strictly aligned, readers might wonder "whether increasing parameters or increasing head_dims is more fundamental." Therefore, we supplement with several experiments aligning parameter counts.

We consider three ways to align parameters:

1. double-heads: Using "GQA2-128 vs GQA1-256" as an example, doubling the num_heads of GQA2-128 allows its parameter count to match GQA1-256;

2. Reducing MLP: Shrinking the intermediate_size of the MLP (SwiGLU) can make the parameter count of GQA1-256 roughly equivalent to GQA2-128;

3. Q&O LoRA: The main parameter count of GQA comes from the projection matrices of Query and Output. Using LoRA for these two matrices can reduce the parameter count of GQA1-256.

The experimental results are as follows:

\[\begin{array}{c|ccc|ccc} \hline & \text{Params} & \text{Loss} & \text{Cache} & \text{num\_heads} & \text{intermediate\_size} & \text{qo\_lora} \\ \hline \text{MLA} & 894M & 2.721 & 576 & 16 & 5456 & \text{No}\\ \hline \text{GQA2-128} & 842M & 2.75 & 512 & 16 & 5456 & \text{No}\\ \text{GQA1-256} & 943M & 2.72 & 512 & 16 & 5456 & \text{No}\\ \hline \text{GQA2-128} & 943M & 2.723 & 512 & \color{red}{32} & 5456 & \text{No} \\ \text{GQA1-256} & 843M & 2.747 & 512 & 16 & \color{red}{4096} & \text{No} \\ \text{GQA1-256} & 842M & 2.726 & 512 & 16 & 5456 & \color{red}{\text{Yes}} \\ \hline \text{GQA4-(64+64)-S2} & 842M & 2.738 & 512 & 16 & 5456 & \text{No} \\ \text{GQA2-(192+64)-S2} & 943M & 2.708 & 512 & 16 & 5456 & \text{No} \\ \hline \text{GQA4-(64+64)-S2} & 943M & 2.711 & 512 & \color{red}{32} & 5456 & \text{No} \\ \text{GQA2-(192+64)-S2} & 843M & 2.733 & 512 & 16 & \color{red}{4096} & \text{No} \\ \text{GQA2-(192+64)-S2} & 842M & 2.708 & 512 & 16 & 5456 & \color{red}{\text{Yes}} \\ \hline \end{array}\]

The results fall into three main points:

1. At equal parameter count, doubling head_dims beats doubling num_heads, with a loss gap of about 0.003 in both comparisons;

2. Also at equal parameter count, shrinking the MLP while keeping the larger head_dims is consistently better than halving head_dims, by a loss margin of about 0.004;

3. Q&O LoRA has the smallest performance loss; it can achieve doubling of head_dims without increasing parameter count, and the loss decreases significantly.

Conclusion: if we are going to spend parameters somewhere, increasing head_dims may be the direction with the largest performance gain; combined with Q&O LoRA, the parameter count can be kept almost unchanged while still reaping a significant benefit.

Summary

The preliminary conclusions are:

1. Increasing head_dims provides the greatest benefit;

2. Partial RoPE also helps with Loss;

3. KV-Shared likely also plays a role.

In this light, our previous attempts to find an MLA alternative while keeping head_dims=128 were structurally disadvantaged from the start; no wonder we couldn't match MLA. To catch up with MLA, head_dims should probably start at 192, supplemented by Partial RoPE. As for KV-Shared, it likely helps, but probably requires larger-scale verification.

Significance

How significant this is depends on how determined we are to replace MLA.

Suppose GQA2-(192+64)-S2 could replace MLA. But MLA's own head_dims can also be raised to 256, and GQA2-(192+64)-S2 currently does not match MLA-256. In that case, the only two benefits of replacing MLA would be:

1. A simpler structure, making it convenient to add QK-Norm;

2. In the decoding stage, head_dims drops from 512+64 to 256 and num_groups becomes 2, which makes Tensor Parallelism (TP) possible.