Attention Residuals Memoirs

By 苏剑林 | March 19, 2026

This article introduces one of our latest works, Attention Residuals (AttnRes). As the name suggests, this involves using the concept of Attention to improve Residuals.

Many readers have likely heard of the Pre-Norm/Post-Norm debate, but that is ultimately just "infighting" within Residuals itself, as are the many subsequent variations of Normalization. A more interesting change was HC (Hyper-Connections), which started the trend of expanding the residual flow, though perhaps due to unstable results it did not gain much traction. The rest of the story is well known: late last year, DeepSeek's mHC improved upon HC and verified its effectiveness through larger-scale experiments.

Instead of further expanding the residual flow, we chose a more radical route: directly performing Attention between layers to replace Residuals. Of course, making the entire process work required many details and effort. Here, I will briefly recount the journey and thought process behind it.

AttnRes Diagram

Inter-layer Attention

Following tradition, let's start with Residuals, which everyone is likely familiar with. Its form is:

\begin{equation}\boldsymbol{x}_t = \boldsymbol{x}_{t-1} + \boldsymbol{f}_t(\boldsymbol{x}_{t-1})\end{equation}

Let's rewrite this in another way that allows us to see something more profound. First, let $\boldsymbol{y}_t=\boldsymbol{f}_t(\boldsymbol{x}_{t-1})$, then we have $\boldsymbol{x}_t=\boldsymbol{x}_{t-1}+\boldsymbol{y}_t$. By defining $\boldsymbol{y}_0=\boldsymbol{x}_0$, it is easy to see that $\boldsymbol{x}_t=\boldsymbol{y}_0+\boldsymbol{y}_1+\cdots+\boldsymbol{y}_t$. Thus, it can be equivalently written as:

\begin{equation}\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}(\boldsymbol{y}_0+\boldsymbol{y}_1+\cdots+\boldsymbol{y}_t)\label{eq:res-sum}\end{equation}

That is, from the perspective of $\boldsymbol{y}$, Residuals take the equal-weighted sum of $\boldsymbol{y}_0,\boldsymbol{y}_1,\cdots,\boldsymbol{y}_t$ as the input to $\boldsymbol{f}_{t+1}$ to obtain $\boldsymbol{y}_{t+1}$. A natural extension is to change this to a weighted sum:

\begin{equation}\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}\left(\sum_{s=0}^t a_{t+1,s}\boldsymbol{y}_s\right)\qquad \text{where}\qquad a_{t+1,s}\geq 0,\quad\sum_{s=0}^t a_{t+1,s}=1\label{eq:attnres-gen}\end{equation}

This is the seed of AttnRes. The equation above places two constraints on $a_{t+1,s}$; let us first discuss why they are needed:

1. The constraint $a_{t+1,s}\geq 0$ ensures that the contribution of a given $\boldsymbol{y}_s$ to different layers always points in the same direction. This avoids the inconsistency of one layer wanting to add $\boldsymbol{y}_s$ while another wants to subtract it, which is intuitively friendlier for model learning.

2. The function $\boldsymbol{f}$ we use includes In-Norm, which applies $\newcommand{RMSNorm}{\mathop{\text{RMSNorm}}}\RMSNorm$ to its input first. Since $\RMSNorm(\boldsymbol{x})=\RMSNorm(c\boldsymbol{x})$ holds for all $c > 0$, weighted averages and weighted sums are completely equivalent here, so the constraint $\sum_{s=0}^t a_{t+1,s}=1$ does not reduce expressivity.
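Since point 2 leans entirely on RMSNorm's invariance to positive rescaling, here is a minimal numpy sketch (toy vectors, illustrative `eps`) checking both that invariance and the resulting equivalence between the equal-weighted sum used by plain Residuals and the equal-weighted average allowed by the sum-to-1 constraint:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm is invariant to positive rescaling: RMSNorm(c*x) ≈ RMSNorm(x)
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
assert np.allclose(rms_norm(x), rms_norm(3.7 * x), atol=1e-4)

# Plain Residuals feed f the equal-weighted *sum* y_0 + ... + y_t; with the
# normalized weights a_{t+1,s} = 1/(t+1) the input is the *average*. After
# In-Norm (RMSNorm) both give the same thing, so the sum-to-1 constraint
# costs no expressivity.
ys = [rng.normal(size=8) for _ in range(5)]
plain = rms_norm(np.sum(ys, axis=0))      # Residuals' input to f
weighted = rms_norm(np.mean(ys, axis=0))  # equal weights in the weighted-sum form
assert np.allclose(plain, weighted, atol=1e-4)
```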

Hyper-Connections

Before proceeding with AttnRes, let's briefly review HC (Hyper-Connections) and prove that it can also be understood as inter-layer Attention, thereby showing that inter-layer Attention is indeed a more fundamental path. HC changes Residuals to:

\begin{equation}\boldsymbol{X}_t = \boldsymbol{H}_t^{res}\boldsymbol{X}_{t-1} + \boldsymbol{H}_t^{post} \boldsymbol{f}_t(\boldsymbol{H}_t^{pre}\boldsymbol{X}_{t-1})\end{equation}

Where $\boldsymbol{X}\in\mathbb{R}^{k\times d},\boldsymbol{H}^{res}\in\mathbb{R}^{k\times k},\boldsymbol{H}^{pre}\in\mathbb{R}^{1\times k},\boldsymbol{H}^{post}\in\mathbb{R}^{k\times 1}$, with the classic choice being $k=4$. Simply put, the state variable is expanded $k$-fold. Before entering $\boldsymbol{f}_t$, the matrix $\boldsymbol{H}_t^{pre}$ shrinks it back to $1\times$; after the output, $\boldsymbol{H}_t^{post}$ expands it back to $k\times$, and the result is added to $\boldsymbol{X}_{t-1}$ transformed by $\boldsymbol{H}_t^{res}$. If we do not restrict the forms of $\boldsymbol{H}_t^{res},\boldsymbol{H}_t^{pre},\boldsymbol{H}_t^{post}$, then schemes such as Post Norm and Highway networks are special cases of HC.

Similarly, let $\boldsymbol{y}_t=\boldsymbol{f}_t(\boldsymbol{H}_t^{pre}\boldsymbol{X}_{t-1})$, then $\boldsymbol{X}_t = \boldsymbol{H}_t^{res}\boldsymbol{X}_{t-1} + \boldsymbol{H}_t^{post} \boldsymbol{y}_t$. Defining $\boldsymbol{X}_0 = \boldsymbol{H}_0^{post}\boldsymbol{y}_0$, it can be expanded as $\boldsymbol{X}_t = \boldsymbol{H}_{t\leftarrow 1}^{res}\boldsymbol{H}_0^{post}\boldsymbol{y}_0 + \boldsymbol{H}_{t\leftarrow 2}^{res}\boldsymbol{H}_1^{post}\boldsymbol{y}_1 + \cdots + \boldsymbol{H}_{t\leftarrow t}^{res}\boldsymbol{H}_{t-1}^{post}\boldsymbol{y}_{t-1} + \boldsymbol{H}_t^{post}\boldsymbol{y}_t$, where $\boldsymbol{H}_{t\leftarrow s}^{res}$ is defined as $\boldsymbol{H}_t^{res}\boldsymbol{H}_{t-1}^{res}\cdots \boldsymbol{H}_{s+1}^{res}\boldsymbol{H}_s^{res}$. Further defining $\boldsymbol{H}_{t\leftarrow t+1}^{res} = \boldsymbol{I}$, we can write:

\begin{equation}\boldsymbol{y}_{t+1} = \boldsymbol{f}_{t+1}(\boldsymbol{H}_{t+1}^{pre}\boldsymbol{X}_t) = \boldsymbol{f}_{t+1}\bigg(\sum_{s=0}^t\underbrace{\boldsymbol{H}_{t+1}^{pre}\boldsymbol{H}_{t\leftarrow s+1}^{res}\boldsymbol{H}_s^{post}}_{a_{t+1,s}}\boldsymbol{y}_s\bigg)\end{equation}

Note that each $\boldsymbol{H}_{t+1}^{pre}\boldsymbol{H}_{t\leftarrow s+1}^{res}\boldsymbol{H}_s^{post}$ is a $1\times 1$ matrix, equivalent to a scalar. Thus, it is also a form of the inter-layer Attention in Equation $\eqref{eq:attnres-gen}$. Readers familiar with Linear Attention should quickly understand this result; HC is essentially a DeltaNet "rotated 90 degrees." In practice, the three $\boldsymbol{H}$ matrices are calculated by simple linear layers with $\tanh$ activation, which leads to risks of explosion or collapse when multiplying $\boldsymbol{H}_{t\leftarrow s}^{res}$ and fails to guarantee the non-negativity of $a_{t+1,s}$.
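The unrolled expansion above is easy to check numerically. The following numpy sketch (random toy matrices, scalar $\boldsymbol{y}_t$ for simplicity; all shapes illustrative) verifies that the recursive HC update matches the sum $\sum_s \boldsymbol{H}_{t\leftarrow s+1}^{res}\boldsymbol{H}_s^{post}\boldsymbol{y}_s$:

```python
import numpy as np

rng = np.random.default_rng(1)
k, T = 4, 6
H_res = [rng.uniform(size=(k, k)) for _ in range(T + 1)]   # H_t^res (t >= 1 used)
H_post = [rng.uniform(size=(k, 1)) for _ in range(T + 1)]  # H_t^post
ys = rng.normal(size=T + 1)                                # scalar y_t for simplicity

# Recursive HC state: X_t = H_t^res X_{t-1} + H_t^post y_t, with X_0 = H_0^post y_0
X = H_post[0] * ys[0]
for t in range(1, T + 1):
    X = H_res[t] @ X + H_post[t] * ys[t]

# Unrolled form: X_T = sum_s H_{T<-s+1}^res H_s^post y_s, with H_{T<-T+1}^res = I
X_unrolled = np.zeros((k, 1))
for s in range(T + 1):
    M = np.eye(k)
    for r in range(s + 1, T + 1):   # accumulate H_T^res ... H_{s+1}^res
        M = H_res[r] @ M
    X_unrolled += M @ H_post[s] * ys[s]

assert np.allclose(X, X_unrolled)
```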

Later, mHC introduced two improvements: it changed all three $\boldsymbol{H}$ matrices to use Sigmoid activation, ensuring the non-negativity of $a_{t+1,s}$, and it alternately normalized the rows and columns of $\boldsymbol{H}_t^{res}$ to make it doubly stochastic. The stability of $\boldsymbol{H}_{t\leftarrow s}^{res}$ is then guaranteed by the closure of doubly stochastic matrices under multiplication. Larger-scale experiments verified the effectiveness of these changes. However, some newer experiments, such as "Your deepseek mHC might not need the 'm'", suggest that simply setting $\boldsymbol{H}_t^{res}$ to the identity matrix is already sufficient.
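The closure property that stabilizes mHC is easy to check numerically. Below is a small numpy sketch; the Sinkhorn-style helper is my own paraphrase of "alternately normalized", not mHC's exact procedure:

```python
import numpy as np

def sinkhorn(M, iters=200):
    # Alternately normalize rows and columns so M becomes (approximately)
    # doubly stochastic -- one reading of mHC's alternate normalization
    for _ in range(iters):
        M = M / M.sum(axis=1, keepdims=True)
        M = M / M.sum(axis=0, keepdims=True)
    return M

rng = np.random.default_rng(2)
A = sinkhorn(rng.uniform(0.1, 1.0, size=(4, 4)))
B = sinkhorn(rng.uniform(0.1, 1.0, size=(4, 4)))

# Closure: the product of doubly stochastic matrices is doubly stochastic,
# so the repeated products H_{t<-s}^res can neither explode nor collapse
P = A @ B
assert np.allclose(P.sum(axis=0), 1.0, atol=1e-6)
assert np.allclose(P.sum(axis=1), 1.0, atol=1e-6)
```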

Strength in Numbers

Let's return to AttnRes. Once we realized the feasibility of AttnRes, the next question was: what form should $a_{t+1,s}$ take? A natural idea is to follow standard Scaled Dot-Product Attention, but at the time, I wanted to try something quickly, so I chose a simpler form:

\begin{equation}a_{t+1,s} \propto \exp(\boldsymbol{w}_{t+1}\cdot \boldsymbol{y}_s)\end{equation}

Where $\boldsymbol{w}_t$ is a trainable parameter vector. That is, a data-independent static vector serves as the Query ($Q$), while the Key ($K$) and Value ($V$) are both $\boldsymbol{y}_s$, combined via Softmax Attention. This was the first version of AttnRes. To our surprise, even this simple design yielded a significant improvement over Residuals!

When I shared the preliminary results within the group, @Zhang Yu and @Nathan showed great interest and joined in to validate it on larger-scale models. We found the results to be very pleasing. During the process, we also tried more complex designs, but found most were not as good as the simple version, except for adding an $\RMSNorm$ operation to $K$, which brought stable gains. This formed the final version of AttnRes:

\begin{equation}a_{t+1,s} \propto \exp(\boldsymbol{w}_{t+1}\cdot \RMSNorm(\boldsymbol{y}_s))\end{equation}
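Putting the pieces together, here is a minimal numpy sketch of this recurrence, with a random linear map standing in for the real Transformer block $\boldsymbol{f}_t$ (all shapes, initializations, and the toy $\boldsymbol{f}_t$ are illustrative, not the paper's):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d, L = 16, 6
W = rng.normal(size=(L + 1, d)) * 0.1              # static queries w_t
fs = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L + 1)]  # toy f_t

ys = [rng.normal(size=d)]                          # y_0: the Embedding output
for t in range(L):
    logits = np.array([W[t + 1] @ rms_norm(y) for y in ys])  # w_{t+1} . RMSNorm(y_s)
    a = softmax(logits)                            # a_{t+1,s} over s = 0..t
    mixed = sum(ai * y for ai, y in zip(a, ys))    # weighted sum of past states
    ys.append(fs[t + 1] @ rms_norm(mixed))         # f_{t+1} with In-Norm
```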

However, AttnRes is ultimately a dense inter-layer connection scheme: are training and inference still feasible at K2 scale or larger? Fortunately, @V-ge, through an exquisite analysis, first confirmed the feasibility of inference. The "finishing touch" was precisely the static-$Q$ design I had originally chosen for convenience! It allows the unnormalized attention scores behind $a_{t,s}$ for all $t > s$ to be pre-computed as soon as $\boldsymbol{y}_s$ is available, giving the infrastructure plenty of room to maneuver.
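To illustrate why a static $Q$ helps, the sketch below (a hypothetical cache layout with illustrative names) fills in the score $\boldsymbol{w}_t\cdot\text{RMSNorm}(\boldsymbol{y}_s)$ for every later layer the moment $\boldsymbol{y}_s$ is produced; each layer later only normalizes its own cached row:

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

rng = np.random.default_rng(4)
d, L = 16, 6
W = rng.normal(size=(L + 1, d))    # static queries, known before inference

logit_cache = {}                   # (t, s) -> w_t . RMSNorm(y_s)
ys = []
for s in range(L):
    y_s = rng.normal(size=d)       # stand-in for layer s's output
    ys.append(y_s)
    k_s = rms_norm(y_s)            # the key is computed once per layer
    for t in range(s + 1, L + 1):  # scores for every later layer, right away
        logit_cache[(t, s)] = W[t] @ k_s

# Layer L later just softmaxes its cached row; nothing is recomputed
row_L = np.array([logit_cache[(L, s)] for s in range(L)])
a_L = np.exp(row_L - row_L.max())
a_L /= a_L.sum()
```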

But unfortunately, the engineers responsible for training, like @Wang-ge, judged after careful analysis that AttnRes was not quite feasible in our current training environment (to put it bluntly, we are poor). A solution was needed to further reduce communication and memory overhead, resulting in the Block version below. Accordingly, the previous version is called the Full version.

The Block Version

The transition from Full AttnRes to Block AttnRes is analogous to the earlier efforts to linearize quadratic-complexity Attention, and various existing Efficient Attention ideas can be applied. For example, the first thing we tried was SWA (Sliding Window Attention), but its actual performance was terrible, even worse than plain Residuals.

Reflecting on this, I believe it can be understood as follows: Residuals are already a very strong baseline, corresponding to the equal-weighted sum of all state vectors. For any new design to surpass it, it must at least be able to express Residuals as a special case. Full AttnRes clearly satisfies this, but SWA does not: it discards some states and therefore cannot express the special case of "an equal-weighted sum of all state vectors."

Consequently, we realized that for AttnRes, "compression" might be more effective than "sparsity," and the compression doesn't necessarily need to be very fine; a simple weighted sum might suffice. After some conceptualizing and refining, @Zhang Yu and @Nathan proposed the Block AttnRes design used in the paper, which combines block processing with summation compression, achieving results close to the Full version.

The idea of Block AttnRes is roughly this: first, the Embedding layer is treated as a block of its own. This is because, by inspecting the Attention matrix of the Full version (a benefit of the Attention viewpoint is that patterns can be visualized), we found the model tends to place significant Attention on the Embedding layer, so it is worth keeping it independent. The remaining layers are grouped into blocks of $m$ layers each; within a block, states are compressed by summation, and Attention is computed over these block sums.
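As a sketch of the block compression just described (a hypothetical helper with $m=3$; normalizing the softmax over block sums is one plausible reading, since in the matrix view each block coefficient is simply repeated across its member columns):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x**2) + eps)

def block_states(ys, m):
    # y_0 (Embedding) is its own block; the rest are grouped m at a time and
    # compressed by summation (the last, possibly partial, block included)
    blocks = [ys[0]]
    for start in range(1, len(ys), m):
        blocks.append(np.sum(ys[start:start + m], axis=0))
    return blocks

rng = np.random.default_rng(5)
d = 8
ys = [rng.normal(size=d) for _ in range(7)]  # y_0 .. y_6
blocks = block_states(ys, m=3)               # [y_0, y_{1:3}, y_{4:6}]

w = rng.normal(size=d)                       # static query of some layer
logits = np.array([w @ rms_norm(b) for b in blocks])
a = np.exp(logits - logits.max())
a /= a.sum()                                 # attention over block sums
mixed = sum(ai * b for ai, b in zip(a, blocks))
```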

Experiments showed that fixing the budget to about 8 blocks retains most of the gains of AttnRes. After evaluation, the training and inference teams agreed that the extra overhead of Block AttnRes is very small and entirely worth the performance boost (for a detailed analysis, see the posts by @Wang-ge and @V-ge; if you want numbers, it is roughly less than 5% overhead for a 25% gain). Thus, everyone pushed hard to get it into the main line, which was another productive and enjoyable experience.

Matrix Perspective

It is worth mentioning that Residuals, HC/mHC, Full AttnRes, and Block AttnRes can all be unified through the lens of the Attention matrix, which is a rather interesting perspective. In the examples below, $\phi(\boldsymbol{q},\boldsymbol{k}) = \exp(\boldsymbol{q}\cdot \RMSNorm(\boldsymbol{k}))$, the Block AttnRes example uses $m=3$, and $\boldsymbol{y}_{s:t}=\sum_{i=s}^t \boldsymbol{y}_i$, a notation we also used in "Making Alchemical Refining More Scientific (Part 4): New Identities, New Learning Rates".

Residuals

\[\boldsymbol{A}=\left(\begin{array}{ccccccc} 1 & & & & & & \\ 1 & 1 & & & & & \\ 1 & 1 & 1 & & & & \\ 1 & 1 & 1 & 1 & & & \\ 1 & 1 & 1 & 1 & 1 & & \\ 1 & 1 & 1 & 1 & 1 & 1 & \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ \end{array}\right)\]

HC/mHC

\[\boldsymbol{A}=\left(\begin{array}{ccccccc} \boldsymbol{H}_1^{pre} \boldsymbol{H}_0^{post} & & & & & & \\ \boldsymbol{H}_2^{pre}\boldsymbol{H}_{1\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_2^{pre}\boldsymbol{H}_1^{post} & & & & & \\ \boldsymbol{H}_3^{pre}\boldsymbol{H}_{2\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_3^{pre}\boldsymbol{H}_{2\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_3^{pre}\boldsymbol{H}_2^{post} & & & & \\ \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_{3\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_4^{pre}\boldsymbol{H}_3^{post} & & & \\ \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_{4\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_5^{pre}\boldsymbol{H}_4^{post} & & \\ \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_{5\leftarrow 5}^{res}\boldsymbol{H}_4^{post} & \boldsymbol{H}_6^{pre}\boldsymbol{H}_5^{post} & \\ \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 1}^{res}\boldsymbol{H}_0^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 2}^{res}\boldsymbol{H}_1^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 3}^{res}\boldsymbol{H}_2^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 4}^{res}\boldsymbol{H}_3^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 5}^{res}\boldsymbol{H}_4^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_{6\leftarrow 6}^{res}\boldsymbol{H}_5^{post} & \boldsymbol{H}_7^{pre}\boldsymbol{H}_6^{post} \\ \end{array}\right)\]

Full AttnRes

\[\boldsymbol{A}=\left(\begin{array}{ccccccc} \phi(\boldsymbol{w}_1, \boldsymbol{y}_0) & & & & & & \\ \phi(\boldsymbol{w}_2, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_2, \boldsymbol{y}_1) & & & & & \\ \phi(\boldsymbol{w}_3, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_2) & & & & \\ \phi(\boldsymbol{w}_4, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_3) & & & \\ \phi(\boldsymbol{w}_5, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_4) & & \\ \phi(\boldsymbol{w}_6, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_4) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_5) & \\ \phi(\boldsymbol{w}_7, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_1) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_2) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_3) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_4) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_5) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_6) \\ \end{array}\right)\]

Block AttnRes

\[\boldsymbol{A}=\left(\begin{array}{c|ccc|ccc} \phi(\boldsymbol{w}_1, \boldsymbol{y}_0) & & & & & & \\ \hline \phi(\boldsymbol{w}_2, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_2, \boldsymbol{y}_1) & & & & & \\ \phi(\boldsymbol{w}_3, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_{1:2}) & \phi(\boldsymbol{w}_3, \boldsymbol{y}_{1:2}) & & & & \\ \phi(\boldsymbol{w}_4, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_4, \boldsymbol{y}_{1:3}) & & & \\ \hline \phi(\boldsymbol{w}_5, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_5, \boldsymbol{y}_4) & & \\ \phi(\boldsymbol{w}_6, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{4:5}) & \phi(\boldsymbol{w}_6, \boldsymbol{y}_{4:5}) & \\ \phi(\boldsymbol{w}_7, \boldsymbol{y}_0) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{1:3}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) & \phi(\boldsymbol{w}_7, \boldsymbol{y}_{4:6}) \\ \end{array}\right)\]

Related Work

Since the planning of AttnRes began, the team has been immersed in refining, validating, and accelerating it. As some readers may know, my research style is to push as far as I can with my own derivations and solutions, and to search the literature only when I get stuck or have solved the problem completely. It so happened that I had a group of like-minded partners, and this exploration of AttnRes went quite smoothly, so it was not until all tests had basically passed and we were preparing the technical report that we began surveying related work.

"You don't know until you check"—we were startled to find that work related to Dense Connection and Depth Attention was already very extensive. In addition to the classic DenseNet, we investigated DenseFormer, ANCRe, MUDDFormer, MRLA, Dreamer, and even found that ELMo, from before BERT, partially used similar designs. We have included these in the references.

After the technical report was released, we received comments from readers pointing out other related works we hadn't included, such as SKNets, LIMe, DCA, etc. We apologize for these omissions and thank the readers; we promise to include them in subsequent revisions. However, we ask both readers and ourselves to remain rational; literature review is a difficult task, and some omissions are inevitable. We hold all related work in high regard.

At the same time, we call on everyone to pay attention to the engineering work behind AttnRes, beyond the concept of "Depth Attention." We fully agree that in 2026, "Depth Attention" or "Layer Attention" is not a novel idea. However, applying it to a sufficiently large model as a strong replacement for Residuals, while meeting the efficiency requirements of training and inference, is no easy feat. To our knowledge, AttnRes is the first work to achieve this.

Summary

This article introduced our latest result in model architecture, Attention Residuals (AttnRes), which replaces naive Residuals with inter-layer Attention. Through refined design, it meets the efficiency requirements for training and inference, successfully scaling to sufficiently large models.