By 苏剑林 | June 20, 2025
In the Chinese-speaking community, this site was among the earliest to focus on Linear Attention. When I wrote the first related blog post in 2020, "Exploration of Linear Attention: Must Attention Have a Softmax?", the primary discussion was still centered on Softmax Attention related to BERT. In hindsight, considering Linear Attention during the BERT era wasn't the wisest move because training lengths were relatively short, and models were primarily Encoders—where Linear Attention offers no real advantage. At the time, I even wrote "Linear Transformers May Not Be the Model You Are Waiting For" to express this viewpoint.
It wasn't until the emergence of ChatGPT that everyone was pushed toward Decoder-only generative models, which are highly compatible with the RNN form of Linear Attention. At the same time, the pursuit of ever longer training sequences made the quadratic complexity bottleneck of Softmax Attention increasingly apparent. In this new context, Linear Attention has grown increasingly competitive, even showing signs of "nourishing back" into Softmax Attention, i.e., feeding ideas back to it.
Quadratic Complexity
First, let's introduce some notation:
\begin{equation}\begin{gathered}
\boldsymbol{q}_i,\boldsymbol{k}_i,\boldsymbol{v}_i,\boldsymbol{o}_i \in \mathbb{R}^{d\times 1} \\[6pt]
\boldsymbol{Q}=[\boldsymbol{q}_1,\boldsymbol{q}_2,\cdots,\boldsymbol{q}_n]^{\top}\in\mathbb{R}^{n\times d} \\[6pt]
\boldsymbol{K}=[\boldsymbol{k}_1,\boldsymbol{k}_2,\cdots,\boldsymbol{k}_n]^{\top}\in\mathbb{R}^{n\times d} \\[6pt]
\boldsymbol{V}=[\boldsymbol{v}_1,\boldsymbol{v}_2,\cdots,\boldsymbol{v}_n]^{\top}\in\mathbb{R}^{n\times d} \\[6pt]
\boldsymbol{O}=[\boldsymbol{o}_1,\boldsymbol{o}_2,\cdots,\boldsymbol{o}_n]^{\top}\in\mathbb{R}^{n\times d}
\end{gathered}\end{equation}
An Attention model is essentially a mapping from $\boldsymbol{Q},\boldsymbol{K},\boldsymbol{V} \to \boldsymbol{O}$. This article mainly focuses on the Causal scenario, which means $\boldsymbol{o}_t$ depends at most on $\boldsymbol{Q}_{[:t]},\boldsymbol{K}_{[:t]},\boldsymbol{V}_{[:t]}$. In principle, the dimension of $\boldsymbol{Q},\boldsymbol{K}$ and that of $\boldsymbol{V},\boldsymbol{O}$ can differ, as in GAU and MLA, but taking them all to be the same $d$ does not change the essence of the problem.
Standard Softmax Attention usually refers to the mechanism proposed in "Attention is All You Need":
\begin{equation}\boldsymbol{O} = \mathop{\text{softmax}}(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log \boldsymbol{M})\boldsymbol{V}\end{equation}
The scaling factor $1/\sqrt{d}$ is omitted here because it can always be absorbed into $\boldsymbol{Q}$ and $\boldsymbol{K}$. The $\mathop{\text{softmax}}$ normalization is performed along the second dimension, and $\boldsymbol{M}\in\mathbb{R}^{n\times n}$ is a lower triangular matrix called the mask matrix, defined as:
\begin{equation}M_{i,j} = \left\{\begin{aligned} &1, &i \geq j \\ &0, &i < j\end{aligned}\right.\end{equation}
$\log\boldsymbol{M}$ refers to taking the $\log$ component-wise, where $\log 0 = -\infty$. Written in component form, Softmax Attention is:
\begin{equation}\boldsymbol{o}_t = \frac{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j) \boldsymbol{v}_j}{\sum_{j=1}^t \exp(\boldsymbol{q}_t^{\top}\boldsymbol{k}_j) }\end{equation}
The role of the denominator is primarily to maintain numerical stability. Additionally, if we add RMSNorm to $\boldsymbol{O}$, the denominator is effectively canceled out. Therefore, the core of Softmax Attention is the numerator part, i.e.,
\begin{equation}\boldsymbol{O} = \exp(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log \boldsymbol{M})\boldsymbol{V} = (\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V}\end{equation}
Where $\odot$ is the Hadamard product and $\exp$ is applied element-wise. It's easy to see that the denominator corresponds to replacing $\boldsymbol{V}$ with an $n \times 1$ matrix of ones, so it can be restored later if needed. A standard implementation of Softmax Attention requires computing the $n \times n$ matrix $\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})$, so space and time complexity are both proportional to $n^2$. The advent of Flash Attention reduced the memory requirement, but the quadratic time complexity remains unavoidable.
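To make the cost concrete, here is a minimal NumPy sketch of this masked-exp form (the function name and random test data are mine, and a production implementation would subtract the row-wise max before exponentiating): it materializes the full $n \times n$ matrix, which is exactly the $\mathcal{O}(n^2)$ cost that Flash Attention reorganizes but cannot remove.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Causal Softmax Attention: masked-exp numerator plus the all-ones
    denominator; time and memory are both O(n^2)."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))        # causal mask
    A = np.exp(Q @ K.T) * M             # exp(QK^T) masked to the lower triangle
    return (A @ V) / (A @ np.ones((n, 1)))

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = rng.standard_normal((3, n, d)) / d**0.5   # scaling absorbed into Q, K
O = softmax_attention(Q, K, V)                       # shape (n, d)
```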
The Original Form
The earliest ideas for Linear Attention focused on imitating and approximating Softmax Attention. The simplest approach was to remove the $\exp$ directly, obtaining:
\begin{equation}\boldsymbol{O} = ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V}\label{eq:linear-attn}\end{equation}
Why is this form "Linear" Attention? To quickly understand this, let's first consider the non-Causal version without $\odot \boldsymbol{M}$. Then $\boldsymbol{O} = (\boldsymbol{Q}\boldsymbol{K}^{\top})\boldsymbol{V} = \boldsymbol{Q}(\boldsymbol{K}^{\top}\boldsymbol{V})$. Note that the complexity of computing $\boldsymbol{K}^{\top}\boldsymbol{V}$ is $\mathcal{O}(nd^2)$, resulting in a $d \times d$ matrix; multiplying this with $\boldsymbol{Q}$ also has a complexity of $\mathcal{O}(nd^2)$. Thus, the complexity depends linearly on $n$.
As for the Causal version in $\eqref{eq:linear-attn}$, we can understand it through its component form:
\begin{equation}\boldsymbol{o}_t = \sum_{j=1}^t \boldsymbol{v}_j (\boldsymbol{k}_j^{\top} \boldsymbol{q}_t) = \sum_{j=1}^t (\boldsymbol{v}_j \boldsymbol{k}_j^{\top}) \boldsymbol{q}_t = \left(\sum_{j=1}^t \boldsymbol{v}_j \boldsymbol{k}_j^{\top}\right) \boldsymbol{q}_t\end{equation}
If we denote the part in the parentheses as $\boldsymbol{S}_t$, we have:
\begin{equation}\boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t, \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\label{eq:linear-attn-rnn}\end{equation}
Thus, causal Linear Attention can be written as a linear RNN with $\boldsymbol{S}_t$ as the state. Consequently, the cost of each step is constant, and the total complexity is proportional to the sequence length $n$. Note that "linear RNN" is a broader concept: Linear Attention is one type of linear RNN. Linear RNNs developed independently for a while (e.g., LRU, SSM), but the most competitive linear architectures recently have taken the Linear Attention form.
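The equivalence between the masked matrix form $\eqref{eq:linear-attn}$ and the RNN form $\eqref{eq:linear-attn-rnn}$ is easy to check numerically; here is a small NumPy sketch (function names and random data are mine):

```python
import numpy as np

def linear_attention_matrix(Q, K, V):
    """O = ((Q K^T) * M) V, the quadratic-time reference."""
    n = Q.shape[0]
    M = np.tril(np.ones((n, n)))
    return ((Q @ K.T) * M) @ V

def linear_attention_recurrent(Q, K, V):
    """o_t = S_t q_t with S_t = S_{t-1} + v_t k_t^T, the linear-time RNN form."""
    n, d = Q.shape
    S = np.zeros((d, d))                 # state: running sum of v_j k_j^T
    O = np.zeros_like(V)
    for t in range(n):
        S += np.outer(V[t], K[t])        # v_t k_t^T
        O[t] = S @ Q[t]                  # o_t = S_t q_t
    return O

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 64, 8))
assert np.allclose(linear_attention_matrix(Q, K, V),
                   linear_attention_recurrent(Q, K, V))
```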
Early Linear Attention had obvious characteristics of imitating Softmax Attention, such as adding a denominator for normalization in Equation $\eqref{eq:linear-attn}$. To ensure normalization, $\boldsymbol{k}_j^{\top} \boldsymbol{q}_t$ had to be non-negative, so non-negative activation functions were applied to $\boldsymbol{Q}$ and $\boldsymbol{K}$. Works like Performer and RFA were built specifically to approximate $\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})$.
However, later research such as "The Devil in Linear Transformer" discovered that normalizing along the sequence dimension cannot completely avoid numerical instability. It's often better to just use post-normalization, such as:
\begin{equation}\boldsymbol{O} = \mathop{\text{RMSNorm}}(((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V})\end{equation}
Since explicit normalization isn't required, applying non-negative activations to $\boldsymbol{Q}$ and $\boldsymbol{K}$ to ensure $\boldsymbol{k}_j^{\top} \boldsymbol{q}_t \geq 0$ is no longer necessary. Is there still a point in adding activation functions (not necessarily non-negative) to $\boldsymbol{Q}$ and $\boldsymbol{K}$? My view is that adding activation functions is a matter of choice—it's possible that a specific activation might yield better results, but it doesn't change the Linear Attention form. Furthermore, existing results suggest that not adding them is often good enough.
Fancy Forget Gates
From Equation $\eqref{eq:linear-attn-rnn}$, we can see that the Linear Attention above is essentially a $\mathop{\text{cumsum}}$: all historical information is stacked with equal weight. It's easy to imagine that once enough tokens are stacked, the share of information per token becomes extremely small, and a fixed-size $\boldsymbol{S}_t$ matrix can then hardly reconstruct any single token accurately; the memory of every token becomes blurred.
To alleviate this, RetNet introduced decay effects to Linear Attention:
\begin{equation}\boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t, \qquad \boldsymbol{S}_t = \gamma\boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\label{eq:linear-attn-retnet}\end{equation}
Where the decay factor $\gamma \in (0, 1)$. RetNet sets it to a constant, while other works make it a trainable parameter or even a diagonal matrix. The Linear Attention used in MiniMax-01 also follows this logic. Note that decay factors existed before RetNet, mostly in linear RNNs such as LRU and SSM; RetNet was likely the first to combine them with the Linear Attention formulation. With the decay factor, the model tends to forget more distant history, ensuring higher resolution for recent tokens. This is essentially an implementation of the "Recency Bias" inherent in language modeling, and it usually leads to better performance.
Additionally, it's worth noting that RetNet added RoPE to $\boldsymbol{Q}$ and $\boldsymbol{K}$, which is equivalent to generalizing the decay factor to a complex number $\gamma e^{\text{i}\theta}$. From the perspective of LRU, this corresponds to considering complex eigenvalues. Although adding position embeddings to an RNN might seem counterintuitive, some experiments like the recent TransXSSM show that adding RoPE to Linear Attention can have a positive effect. Of course, this likely depends on specific model variants and experimental setups.
A simple extension of Equation $\eqref{eq:linear-attn-retnet}$ is to replace $\gamma$ with a function of position $t$, $\gamma_t$, which was already evident in SSM. Later, works like DFW, Mamba, and Mamba2 generalized this to be input-dependent, forming a series of works related to "data-dependent decay." This is very similar to the "forget gates" in traditional non-linear RNNs like GRU and LSTM, except that to maintain the linearity of the model, the dependence of the forget gate on the state (like $\boldsymbol{S}_t$) is removed.
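A sketch of the decayed recurrence follows. A constant decay array recovers the RetNet-style update $\eqref{eq:linear-attn-retnet}$; the input-dependent gate shown at the end (a sigmoid of a learned projection) is only an illustrative parametrization of "data-dependent decay", not the exact form used by any particular paper.

```python
import numpy as np

def decayed_linear_attention(Q, K, V, gammas):
    """o_t = S_t q_t,  S_t = gamma_t S_{t-1} + v_t k_t^T,
    with per-step scalar decay factors gammas in (0, 1)."""
    n, d = Q.shape
    S = np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        S = gammas[t] * S + np.outer(V[t], K[t])
        O[t] = S @ Q[t]
    return O

rng = np.random.default_rng(0)
Q, K, V, X = rng.standard_normal((4, 64, 8))
# constant decay (RetNet-style):
O1 = decayed_linear_attention(Q, K, V, np.full(64, 0.95))
# illustrative input-dependent decay (my own parametrization):
w = rng.standard_normal(8)
O2 = decayed_linear_attention(Q, K, V, 1.0 / (1.0 + np.exp(-(X @ w))))
```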
Why do we favor Linear RNNs? Because they can almost always be trained in parallel, making them highly competitive with Softmax Attention in both training and inference efficiency. The "universal solution" for parallelization is to transform it into a Prefix Sum problem and use Associative Scan. We briefly introduced this idea in the "Parallelization" section of "Google's New Work Attempts to 'Revive' RNNs: Can RNNs Shine Again?".
However, the "universal solution" is not the most GPU-efficient. GPU efficiency is highest with matrix multiplications. Finding parallel algorithms that rely heavily on matrix multiplication is ideal; even without full parallelism, recursive chunk-by-chunk formats that utilize matrix multiplication can significantly improve training efficiency. This, in turn, imposes requirements on the model: only forget gates in outer-product form can achieve this. A typical counterexample is Mamba, which uses a non-outer-product forget gate and cannot fully leverage GPU performance, leading to subsequent changes in Mamba2 and GLA.
Test-Time Training
So far, Linear Attention has evolved from simple imitation of Softmax Attention to incorporating static decay factors and even "data-dependent decay." It has formed its own identity and proven its value in many tasks. However, most of these advances were designed manually based on experience. We can't help but ask: Is there a more high-level principle to guide the design of Linear Attention or even general sequence models (Token-Mixers)?
To this question, TTT (Test Time Training) provides its own answer. It views sequence modeling as an "Online Learning" problem and proposes constructing RNNs (not necessarily linear) using optimizers. Specifically, it treats $\boldsymbol{K}, \boldsymbol{V}$ as a corpus of pairs $(\boldsymbol{k}_1, \boldsymbol{v}_1), (\boldsymbol{k}_2, \boldsymbol{v}_2), \cdots, (\boldsymbol{k}_t, \boldsymbol{v}_t)$. Based on these, a model $\boldsymbol{v} = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{k})$ is trained, and the output is $\boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t)$, where $\boldsymbol{S}_t$ are the model parameters. The structure of the model $\boldsymbol{f}$ is largely arbitrary.
What does this have to do with RNNs? Simple: optimizers like SGD and Adam are inherently RNNs acting on model parameters! This viewpoint isn't new; back in the Meta Learning craze of 2017, researchers already suggested and utilized this, though the goal then was often to use an RNN (LSTM) to simulate a better optimizer (see "Optimization as a Model for Few-Shot Learning").
As the saying goes, "what goes around comes around." Years later, TTT proposes building RNNs via optimizers. The process is this: given the current model parameters $\boldsymbol{S}_{t-1}$, the optimizer (SGD) receives new data $(\boldsymbol{k}_t, \boldsymbol{v}_t)$ and updates the parameters to $\boldsymbol{S}_t$, and the prediction for $\boldsymbol{q}_t$ is then returned as $\boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t)$, and so on. Thus, the RNN implemented by TTT can be written uniformly as:
\begin{equation}\boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t), \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t\nabla_{\boldsymbol{S}_{t-1}}\mathcal{L}(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{k}_t), \boldsymbol{v}_t)\label{eq:ttt-rnn}\end{equation}
Where $\mathcal{L}(\boldsymbol{f}(\boldsymbol{S}_{t-1};\boldsymbol{k}_t), \boldsymbol{v}_t)$ is the loss function of the current data under the current parameters, and $\eta_t$ is the learning rate (which can be data-dependent, as discussed previously). This form can cover many RNN models; for example, Equations $\eqref{eq:linear-attn-rnn}$ and $\eqref{eq:linear-attn-retnet}$ are special cases:
$$\begin{array}{c|cc|ccc}
\hline
& \text{RNN} & \boldsymbol{o}_t & \boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}) & \mathcal{L}(\boldsymbol{f}(\boldsymbol{S};\boldsymbol{k}),\boldsymbol{v}) & \eta_t \\
\hline
\eqref{eq:linear-attn-rnn} & \boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} & \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t & \boldsymbol{S}\boldsymbol{k} & -\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}) & 1 \\
\eqref{eq:linear-attn-retnet} & \boldsymbol{S}_t = \gamma\boldsymbol{S}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top} & \boldsymbol{o}_t = \boldsymbol{S}_t \boldsymbol{q}_t & \boldsymbol{S}\boldsymbol{k} & -\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k}) + \frac{1-\gamma}{2}\Vert\boldsymbol{S}\Vert_F^2 & 1 \\
\hline
\end{array}$$
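Both rows of the table are easy to verify: substituting each loss into $\eqref{eq:ttt-rnn}$ with the listed learning rate recovers the updates of $\eqref{eq:linear-attn-rnn}$ and $\eqref{eq:linear-attn-retnet}$. A small numerical check (gradients written out by hand, random data):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
S = rng.standard_normal((d, d))
k, v = rng.standard_normal((2, d))
gamma = 0.9

# Row 1: L(S) = -v^T (S k), eta = 1.  grad_S L = -v k^T,
# so the TTT step S - eta * grad gives S + v k^T.
S1 = S - 1.0 * (-np.outer(v, k))
assert np.allclose(S1, S + np.outer(v, k))

# Row 2: L(S) = -v^T (S k) + (1 - gamma)/2 * ||S||_F^2, eta = 1.
# grad_S L = -v k^T + (1 - gamma) S, so the step gives gamma * S + v k^T.
S2 = S - 1.0 * (-np.outer(v, k) + (1 - gamma) * S)
assert np.allclose(S2, gamma * S + np.outer(v, k))
```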
The original TTT paper explored non-linear RNNs under mini-batch settings. Later, Titans added momentum to TTT's SGD, and "Test-Time Training Done Right" explored large-batch TTT and "TTT + Muon" combinations. Note that TTT only uses the optimizer to construct the RNN; parameters outside the RNN (like the trainable weights for $\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}$) are still trained using a global optimizer once the whole model is built.
A deeper question: why can TTT serve as a "guiding principle" for building RNNs? The core goal of an RNN is to effectively compress historical data into a fixed-size State. Model parameters are themselves fixed-size; training a model is essentially compressing training data into weights. TTT leverages this inherent alignment. Put simply, if the RNN is a compression task, TTT views model $\boldsymbol{f}$ as the "decompressor," its weights as the "compressed archive," SGD as the compression algorithm, and loss $\mathcal{L}$ as the compression rate.
In this way, we don't have to worry about designing recurrence formats; instead, we design $\boldsymbol{f}$ and $\mathcal{L}$. We can judge the strength and reliability of an RNN just by looking at its corresponding $\boldsymbol{f}$ and $\mathcal{L}$.
Furthermore, because TTT uses Online Learning to build the RNN, the resulting RNN is naturally suited for In-Context Learning (ICL) tasks. This is an advantage of TTT as a guiding principle. Previously, "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers" even worked backward, using Linear Attention (Softmax Attention without the Softmax) to explain ICL capability; from today's perspective, it essentially constructed a corresponding TTT.
Discarding the Old for the New
For example, the loss function for the earliest Linear Attention was $-\boldsymbol{v}^{\top}(\boldsymbol{S}\boldsymbol{k})$. At a glance, this objective is unreliable because it is unbounded below, which may drive $\boldsymbol{S}$ toward infinity. In contrast, RetNet added an L2 regularization term to the loss, avoiding this risk and mitigating overfitting, thus producing a better RNN.
However, while using an inner product as a loss is simple and logical, it doesn't directly encourage $\boldsymbol{S}\boldsymbol{k} = \boldsymbol{v}$, so it's not an ideal regression loss. A better target would be the squared loss, $\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2$. Substituting this into the TTT formula $\eqref{eq:ttt-rnn}$ gives:
\begin{equation}\boldsymbol{o}_t = \boldsymbol{f}(\boldsymbol{S}_t; \boldsymbol{q}_t), \qquad \boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t \underbrace{(\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top}}_{\nabla_{\boldsymbol{S}_{t-1}}\frac{1}{2}\Vert\boldsymbol{S}_{t-1}\boldsymbol{k}_t - \boldsymbol{v}_t\Vert^2}\end{equation}
This is DeltaNet, though the idea was earlier proposed by "Linear Transformers Are Secretly Fast Weight Programmers". Notice that $\eta_t (\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} = (\boldsymbol{S}_{t-1} (\sqrt{\eta_t}\boldsymbol{k}_t) - (\sqrt{\eta_t}\boldsymbol{v}_t))(\sqrt{\eta_t}\boldsymbol{k}_t)^{\top}$, meaning $\eta_t$ can always be absorbed into the definition of $\boldsymbol{k}_t, \boldsymbol{v}_t$. We continue the analysis with $\eta_t=1$:
\begin{equation}\begin{aligned}
\boldsymbol{S}_t =&\, \boldsymbol{S}_{t-1} -(\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} \\
=&\, \boldsymbol{S}_{t-1} -(\boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top} \\
=&\, \boldsymbol{S}_{t-1} (\boldsymbol{I} - \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \boldsymbol{v}_t\boldsymbol{k}_t^{\top}
\end{aligned}\label{eq:linear-attn-deltanet}\end{equation}
Comparing this to the earliest Linear Attention in $\eqref{eq:linear-attn-rnn}$, the difference in DeltaNet is the subtraction of $(\boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top}$ before adding $\boldsymbol{v}_t\boldsymbol{k}_t^{\top}$. $\boldsymbol{S}_{t-1} \boldsymbol{k}_t$ can be understood as the prediction for the new input under the old model.
Intuitively, "subtracting then adding" means first removing the model's old knowledge of $\boldsymbol{k}_t$, and then supplementing new knowledge based on $(\boldsymbol{k}_t, \boldsymbol{v}_t)$, achieving an effect of "replacing the old with the new." This rule is called the "Delta Rule," which is the source of "Delta" in DeltaNet. The Delta Rule is not new; it's also known as Least Mean Square or the Widrow-Hoff Algorithm, dating back to the 1960s. In fact, very little in this field is entirely new; most modifications can be traced back to "ancient" work. Current efforts mostly focus on finding the segments that are Scalable.
It should also be noted that, in chronological order, DeltaNet came before TTT. Understanding RNNs through Online Learning appeared sporadically before TTT, but TTT systematically proposed this "guiding principle" and used it to build new models. I placed TTT first to make the introduction flow better.
Some readers might wonder: Is DeltaNet still a Linear RNN? Yes. We say an RNN is linear when the recurrence formula's dependence on the state variable is linear, even if the dependence on the input or $\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}$ is non-linear (though different forms of dependence affect parallel efficiency). In Equation $\eqref{eq:linear-attn-deltanet}$, the right side only contains the first power of $\boldsymbol{S}_{t-1}$, so it satisfies the definition.
Help from the Inverse
As mentioned, the most ideal (GPU-efficient) parallel algorithms for Linear RNNs are those that fully utilize matrix multiplications. To achieve this, let's write DeltaNet as:
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + (\boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t)\boldsymbol{k}_t^{\top}\end{equation}
Let $\boldsymbol{u}_t = \boldsymbol{v}_t - \boldsymbol{S}_{t-1} \boldsymbol{k}_t$, then $\boldsymbol{S}_t = \boldsymbol{S}_{t-1} + \boldsymbol{u}_t\boldsymbol{k}_t^{\top}$. This means it simply replaces $\boldsymbol{V}$ with $\boldsymbol{U}=[\boldsymbol{u}_1,\boldsymbol{u}_2,\cdots,\boldsymbol{u}_n]^{\top}$ in the original Linear Attention. Iterating $t-1$ times gives:
\begin{equation}\boldsymbol{S}_{t-1} = \sum_{j=1}^{t-1} \boldsymbol{u}_j\boldsymbol{k}_j^{\top}\qquad\Rightarrow\qquad \boldsymbol{u}_t = \boldsymbol{v}_t - \left(\sum_{j=1}^{t-1} \boldsymbol{u}_j\boldsymbol{k}_j^{\top}\right)\boldsymbol{k}_t = \boldsymbol{v}_t - \sum_{j=1}^{t-1} \boldsymbol{u}_j(\boldsymbol{k}_j^{\top}\boldsymbol{k}_t)\end{equation}
The last equation can be written in matrix form as $\boldsymbol{U} = \boldsymbol{V} - ((\boldsymbol{K}\boldsymbol{K}^{\top})\odot (\boldsymbol{M} - \boldsymbol{I}))\boldsymbol{U}$. This is a system of linear equations whose solution is:
\begin{equation}\boldsymbol{U} = (\boldsymbol{I} + \underbrace{(\boldsymbol{K}\boldsymbol{K}^{\top})\odot (\boldsymbol{M} - \boldsymbol{I})}_{\text{denoted as }\boldsymbol{B}})^{-1}\boldsymbol{V}\end{equation}
Here we see the inverse of an $n \times n$ matrix $(\boldsymbol{I}+\boldsymbol{B})^{-1}$. Standard complexity is $\mathcal{O}(n^3)$, which is higher than Softmax Attention! Fortunately, we don't need the explicit inverse, only $\boldsymbol{U}$. This can be transformed into solving the equation system $(\boldsymbol{I}+\boldsymbol{B})\boldsymbol{U}=\boldsymbol{V}$, reducing complexity to $\mathcal{O}(n^2)$. Furthermore, leveraging the lower-triangular structure of $\boldsymbol{I}+\boldsymbol{B}$ and the low-rank structure of $\boldsymbol{B}$, the complexity can be reduced to linear. By writing it as block matrix multiplication, we can fully utilize the GPU. These details are in the original paper; this article aims to clear up the mathematical principles.
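Here is a small NumPy/SciPy check of this step (random data, with `solve_triangular` standing in for the paper's block-wise algorithm): solving the unit lower-triangular system $(\boldsymbol{I}+\boldsymbol{B})\boldsymbol{U}=\boldsymbol{V}$ reproduces the recurrence $\boldsymbol{u}_t = \boldsymbol{v}_t - \sum_{j<t} \boldsymbol{u}_j(\boldsymbol{k}_j^{\top}\boldsymbol{k}_t)$ without forming any explicit inverse.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, d = 32, 8
K, V = rng.standard_normal((2, n, d)) * 0.1

M = np.tril(np.ones((n, n)))
B = (K @ K.T) * (M - np.eye(n))                    # strictly lower triangular
U_solve = solve_triangular(np.eye(n) + B, V, lower=True)

# reference: the recurrence u_t = v_t - sum_{j<t} u_j (k_j . k_t)
U_rec = np.zeros_like(V)
for t in range(n):
    U_rec[t] = V[t] - sum(U_rec[j] * (K[j] @ K[t]) for j in range(t))

assert np.allclose(U_solve, U_rec)
```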
Following DeltaNet, Gated DeltaNet (GDN) further introduced a forget gate into DeltaNet. Its original recursion is:
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-1} (\alpha_t (\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top})) + \beta_t\boldsymbol{v}_t\boldsymbol{k}_t^{\top}\end{equation}
However, I believe this explicitly breaks the Delta Rule. A better formulation would be like Comba, where the gate is only multiplied by the first $\boldsymbol{S}_{t-1}$:
\begin{equation}\boldsymbol{S}_t = \gamma_t\boldsymbol{S}_{t-1} + \eta_t(\boldsymbol{v}_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t)\boldsymbol{k}_t^{\top}\end{equation}
This corresponds to setting the loss function as $\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2 + \frac{1-\gamma}{2\eta}\Vert\boldsymbol{S}\Vert_F^2$. Of course, mathematically, the two formulations are equivalent:
\begin{equation}\boldsymbol{S}_{t-1} (\alpha_t (\boldsymbol{I} - \beta_t\boldsymbol{k}_t\boldsymbol{k}_t^{\top})) + \beta_t\boldsymbol{v}_t\boldsymbol{k}_t^{\top} = \alpha_t \boldsymbol{S}_{t-1} + \alpha_t \beta_t (\boldsymbol{v}_t/\alpha_t - \boldsymbol{S}_{t-1}\boldsymbol{k}_t)\boldsymbol{k}_t^{\top}\end{equation}
By setting $\gamma_t = \alpha_t, \eta_t = \alpha_t \beta_t$ and absorbing $1/\alpha_t$ into $\boldsymbol{v}_t$, you get the latter. Thus, mathematically they are the same. Since most $\alpha_t$ will be close to 1, performance is likely similar (though Comba argues the latter is slightly better), but the latter more clearly preserves the "Delta Rule" appearance.
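The equivalence is a one-line algebraic identity, but a quick numerical check never hurts (random $\boldsymbol{S}, \boldsymbol{k}, \boldsymbol{v}$, arbitrary $\alpha_t, \beta_t$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
S = rng.standard_normal((d, d))
k, v = rng.standard_normal((2, d))
alpha, beta = 0.95, 0.5

# Gated DeltaNet form: S_{t-1} (alpha (I - beta k k^T)) + beta v k^T
gdn = S @ (alpha * (np.eye(d) - beta * np.outer(k, k))) + beta * np.outer(v, k)

# Comba-style form with gamma = alpha, eta = alpha * beta, and v rescaled by 1/alpha
gamma, eta, v2 = alpha, alpha * beta, v / alpha
comba = gamma * S + eta * np.outer(v2 - S @ k, k)

assert np.allclose(gdn, comba)
```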
Nourishing Back in Action
As mentioned at the beginning, current Linear Attention models not only rival Softmax Attention but have also started to "nourish" or provide feedback to Softmax Attention. This might seem surprising, but it makes sense upon reflection. In some sense, Softmax Attention has been "regressing" over the years—moving from MHA to GQA and MQA to compress the KV Cache. Linear Attention doesn't have a KV Cache problem, so it has continued to move forward.
To see this, let's write out the Attention mechanisms mentioned so far in matrix form:
$$\begin{array}{c|c}
\hline
& \text{Formula} \\[4pt]
\hline
\text{Softmax Attention} & (\exp(\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V} \\[4pt]
\text{Earliest Linear Attention} & ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})\boldsymbol{V} \\[4pt]
\text{With Forget Gate} & ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{\Gamma})\boldsymbol{V} \\[4pt]
\text{DeltaNet} & ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{M})(\boldsymbol{I} + (\boldsymbol{K}\boldsymbol{K}^{\top})\odot (\boldsymbol{M} - \boldsymbol{I}))^{-1}\boldsymbol{V} \\[4pt]
\text{Gated DeltaNet} & ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot \boldsymbol{\Gamma})(\boldsymbol{I} + (\boldsymbol{K}\boldsymbol{K}^{\top})\odot (\boldsymbol{\Gamma} - \boldsymbol{I}))^{-1}\boldsymbol{V} \\[4pt]
\hline
\end{array}$$
Where
\begin{equation}\Gamma_{i,j} = \left\{\begin{aligned} &\prod_{\tau=j+1}^i \gamma_{\tau}, &i > j \\[6pt] &\qquad 1, &i = j \\[6pt] &\qquad 0, &i < j\end{aligned}\right.\end{equation}
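The "With Forget Gate" row of the table and the recurrence $\boldsymbol{S}_t = \gamma_t\boldsymbol{S}_{t-1} + \boldsymbol{v}_t\boldsymbol{k}_t^{\top}$ are two views of the same computation; the sketch below builds $\boldsymbol{\Gamma}$ from cumulative products (via cumulative log-sums, an implementation detail of mine) and checks the match:

```python
import numpy as np

def gamma_matrix(gammas):
    """Gamma[i, j] = prod_{tau=j+1..i} gamma_tau for i > j, 1 on the diagonal, 0 above."""
    logc = np.concatenate([[0.0], np.cumsum(np.log(gammas))])
    return np.tril(np.exp(logc[1:, None] - logc[None, 1:]))

rng = np.random.default_rng(0)
n, d = 32, 8
Q, K, V = rng.standard_normal((3, n, d))
gammas = rng.uniform(0.8, 1.0, n)

O_matrix = ((Q @ K.T) * gamma_matrix(gammas)) @ V      # ((QK^T) * Gamma) V

S, O_rec = np.zeros((d, d)), np.zeros_like(V)          # recurrent form
for t in range(n):
    S = gammas[t] * S + np.outer(V[t], K[t])
    O_rec[t] = S @ Q[t]

assert np.allclose(O_matrix, O_rec)
```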
Looking at it this way, Softmax Attention's form is still stuck at the level of the earliest Linear Attention (which also proves its power). So how is the "nourishing back" achieved? First, we need a method to convert Softmax Attention into Linear Attention. This isn't hard; we summarized three schemes in "Transformer Upgrade Road: 5. Linear Attention as Infinite Dimensional".
In short, there exists a mapping $\phi$ that maps $\boldsymbol{Q}, \boldsymbol{K}$ from $n \times d$ to $n \times \infty$, satisfying $\exp(\boldsymbol{Q}\boldsymbol{K}^{\top}) = \phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top}$. This is the "kernel trick." Now the rest is simple: replace the $\boldsymbol{Q}, \boldsymbol{K}$ in the table with $\phi(\boldsymbol{Q}), \phi(\boldsymbol{K})$, then find a way to restore the $\exp$ and normalize it to get a new variant of Softmax Attention. For example, substituting into the forget gate formula:
\begin{equation}((\phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top})\odot \boldsymbol{\Gamma})\boldsymbol{V} = \exp(\boldsymbol{Q}\boldsymbol{K}^{\top} + \log\boldsymbol{\Gamma})\boldsymbol{V}\end{equation}
If $\gamma_t$ is constant, this is exactly ALiBi. If $\gamma_t$ is input-dependent, it is FoX.
An even more interesting result is DeltaFormer. As the name suggests, it is the DeltaNet version of Softmax Attention. Replacing $\boldsymbol{Q}, \boldsymbol{K}$ in DeltaNet with $\phi(\boldsymbol{Q}), \phi(\boldsymbol{K})$:
\begin{equation}\begin{aligned}
&\,((\phi(\boldsymbol{Q})\phi(\boldsymbol{K})^{\top})\odot \boldsymbol{M})(\boldsymbol{I} + (\phi(\boldsymbol{K})\phi(\boldsymbol{K})^{\top})\odot (\boldsymbol{M} - \boldsymbol{I}))^{-1}\boldsymbol{V} \\[8pt]
=&\, \underbrace{\exp(\boldsymbol{Q} \boldsymbol{K}^{\top} + \log\boldsymbol{M})}_{\text{denoted as }\boldsymbol{A}}(\boldsymbol{I} + \underbrace{\exp(\boldsymbol{K} \boldsymbol{K}^{\top} + \log(\boldsymbol{M} - \boldsymbol{I}))}_{\text{denoted as }\boldsymbol{B}})^{-1}\boldsymbol{V}
\end{aligned}\end{equation}
If normalization is needed, replace $\exp$ with $\text{softmax}$. Compared to Softmax Attention, DeltaFormer changes $\boldsymbol{A}\boldsymbol{V}$ to $\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}$. Notice that:
\begin{equation}\begin{aligned}
\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V} =&\, \boldsymbol{A}(\boldsymbol{I}-\boldsymbol{B}+\boldsymbol{B}^2- \boldsymbol{B}^3 + \cdots)\boldsymbol{V} \\
=&\, \boldsymbol{A}(\boldsymbol{V}-\boldsymbol{B}\boldsymbol{V}+\boldsymbol{B}^2\boldsymbol{V}- \boldsymbol{B}^3\boldsymbol{V} + \cdots)
\end{aligned}\end{equation}
Thus, DeltaFormer is equivalent to first calculating Attention multiple times with $\boldsymbol{K}, \boldsymbol{K}, \boldsymbol{V}$, stacking the results as a new $\boldsymbol{V}$, and then calculating Attention once with $\boldsymbol{Q}, \boldsymbol{K}$. This characteristic makes it highly effective for Multi-Hop tasks (like Code). Furthermore, this feature means it pairs exceptionally well with MQA, because in $(\boldsymbol{I}+\boldsymbol{B})^{-1}\boldsymbol{V}$, only $\boldsymbol{K}, \boldsymbol{V}$ are involved. For MQA, $\boldsymbol{K}, \boldsymbol{V}$ have a Single-Head, significantly reducing computation compared to MHA.
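Since $\log(\boldsymbol{M}-\boldsymbol{I})$ puts $-\infty$ on and above the diagonal, $\boldsymbol{B}$ is strictly lower triangular and hence nilpotent, so the series above terminates exactly after at most $n$ terms. A small unnormalized NumPy check (random data, scaled down to keep the exponentials tame):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8
Q, K, V = rng.standard_normal((3, n, d)) * 0.3

M = np.tril(np.ones((n, n)))
A = np.exp(Q @ K.T) * M                    # exp(QK^T + log M)
B = np.exp(K @ K.T) * (M - np.eye(n))      # exp(KK^T + log(M - I)), strictly lower triangular

O_solve = A @ np.linalg.solve(np.eye(n) + B, V)

# Neumann series: (I + B)^{-1} V = V - B V + B^2 V - ...  (finite, since B^n = 0)
Y, term = np.zeros_like(V), V.copy()
for _ in range(n):
    Y += term
    term = -B @ term
O_series = A @ Y

assert np.allclose(O_solve, O_series)
```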
However, from my perspective, this fixed-coefficient stacking might not be a "free lunch." For instance, my experimental results show that DeltaFormer's language modeling loss doesn't change much, meaning if losses on certain tasks decrease significantly, losses on others must have increased.
Hardcore Encoding
Another noteworthy "nourishing back" work is PaTH Attention, from "PaTH Attention: Position Encoding via Accumulating Householder Transformations." It uses DeltaNet to inform Softmax Attention from the perspective of position encoding.
In "Transformer Upgrade Road: 6. Completeness Analysis of Rotary Position Encoding," we noted that for any orthogonal matrix $\boldsymbol{\Omega}$, $\boldsymbol{R}_m = \boldsymbol{\Omega}^m$ is a generalized RoPE. Besides rotation matrices, what other orthogonal matrices are easy to build? PaTH uses Householder matrices: let $\boldsymbol{w}$ be any column vector with length $\sqrt{2}$, then $\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}$ is an orthogonal matrix (as derived in "Orthogonal Matrix Transforming One Unit Vector to Another"). Geometrically, this is a specular reflection.
It's easy to see that this is identical to the $\boldsymbol{I}-\boldsymbol{k}_t\boldsymbol{k}_t^{\top}$ factor in DeltaNet. PaTH simply adopts this part—abandoning the $\boldsymbol{\Omega}^m$ form and the $\Vert\boldsymbol{w}\Vert=\sqrt{2}$ constraint—using a series of $\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}$ multiplications to express position information:
\begin{equation}\boldsymbol{q}_i^{\top}\boldsymbol{k}_j \qquad\to\qquad \boldsymbol{q}_i^{\top}\underbrace{(\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})(\boldsymbol{I}-\boldsymbol{w}_{i-1}\boldsymbol{w}_{i-1}^{\top})\cdots(\boldsymbol{I}-\boldsymbol{w}_{j+1}\boldsymbol{w}_{j+1}^{\top})}_{\text{denoted as }\boldsymbol{R}_{i,j}}\boldsymbol{k}_j \end{equation}
Writing $\boldsymbol{R}_{i,j}$ recursively gives $\boldsymbol{R}_{i,j} = (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})\boldsymbol{R}_{i-1,j}$ with $\boldsymbol{R}_{j,j} = \boldsymbol{I}$. Comparing to DeltaNet's $\eqref{eq:linear-attn-deltanet}$, this is equivalent to setting $\boldsymbol{v}_t$ to zero but having a non-zero initial $\boldsymbol{S}_0$. Following the same process as in the "Help from the Inverse" section, we get:
\begin{equation}\boldsymbol{R}_{i,j} = \boldsymbol{I} - \boldsymbol{W}_{[j:i]}^{\top}(\boldsymbol{I} + (\boldsymbol{W}_{[j:i]}\boldsymbol{W}_{[j:i]}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}\boldsymbol{W}_{[j:i]}\end{equation}
Where $\boldsymbol{W}=[\boldsymbol{w}_1,\boldsymbol{w}_2,\cdots,\boldsymbol{w}_n]^{\top}$ and slices are understood in the NumPy sense (e.g., $\boldsymbol{W}_{[j:i]}=[\boldsymbol{w}_{j+1},\boldsymbol{w}_{j+2},\cdots,\boldsymbol{w}_i]^{\top}$). Slicing takes precedence over transposition. Note that we are inverting a lower triangular matrix. A key property of triangular matrices is that the diagonal elements of the inverse are the reciprocals of the original diagonal elements. For block triangular matrices, this holds for the diagonal blocks. Thus we can write:
\begin{equation}(\boldsymbol{I} + (\boldsymbol{W}_{[j:i]}\boldsymbol{W}_{[j:i]}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1} = (\underbrace{(\boldsymbol{I} + (\boldsymbol{W}\boldsymbol{W}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}}_{\text{denoted as }\boldsymbol{T}})_{[j:i,j:i]}\end{equation}
The following expansion in component form makes it easier to understand:
\begin{equation}\begin{aligned}
A_{i,j} =&\, \boldsymbol{q}_i^{\top} \boldsymbol{R}_{i,j} \boldsymbol{k}_j \\[6pt]
=&\, \boldsymbol{q}_i^{\top}\boldsymbol{k}_j - \boldsymbol{q}_i^{\top}\boldsymbol{W}_{[j:i]}^{\top}\boldsymbol{T}_{[j:i,j:i]}\boldsymbol{W}_{[j:i]}\boldsymbol{k}_j \\
=&\, \boldsymbol{q}_i^{\top}\boldsymbol{k}_j - \sum_{p=1}^d \sum_{l=j+1}^i \sum_{r=j+1}^i \sum_{s=1}^d Q_{i,p} W_{l,p} T_{l,r} W_{r,s} K_{j,s} \\
=&\, \boldsymbol{q}_i^{\top}\boldsymbol{k}_j - \sum_{p=1}^d \sum_{l=1}^i \sum_{r=j+1}^n \sum_{s=1}^d Q_{i,p} W_{l,p} T_{l,r} W_{r,s} K_{j,s} \\
=&\, \boldsymbol{q}_i^{\top}\boldsymbol{k}_j - \sum_{p=1}^d \sum_{l=1}^n \sum_{r=1}^n \sum_{s=1}^d Q_{i,p} W_{l,p} \chi_{l \leq i} T_{l,r} \chi_{r \geq j+1}W_{r,s} K_{j,s} \\
=&\, \boldsymbol{q}_i^{\top}\boldsymbol{k}_j - \sum_{l=1}^n \sum_{r=1}^n \underbrace{\left(\chi_{l \leq i}\sum_{p=1}^d Q_{i,p} W_{l,p}\right)}_{(\boldsymbol{Q}\boldsymbol{W}^{\top})\odot\boldsymbol{M}} T_{l,r} \underbrace{\left(\chi_{r \geq j+1} \sum_{s=1}^d W_{r,s} K_{j,s}\right)}_{(\boldsymbol{W}\boldsymbol{K}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I})} \\
\end{aligned}\end{equation}
Key points: the cleverest step is the 4th equality, which uses the fact that $\boldsymbol{T}$ is lower triangular, so $T_{l,r} = 0$ when $l < r$. The 5th equality introduces the indicator function $\chi$. In the 6th equality, carrying out the sums over $p$ and $s$ yields $\boldsymbol{Q}\boldsymbol{W}^{\top}$ and $\boldsymbol{W}\boldsymbol{K}^{\top}$ respectively; multiplying by $\chi_{l \leq i}$ keeps the lower triangular part of $\boldsymbol{Q}\boldsymbol{W}^{\top}$ (including the diagonal), while $\chi_{r \geq j+1}$ keeps the strictly lower triangular part of $\boldsymbol{W}\boldsymbol{K}^{\top}$ (excluding the diagonal).
Finally, we can write the entire attention matrix (before Softmax):
\begin{equation}\boldsymbol{A} = (\boldsymbol{Q}\boldsymbol{K}^{\top})\odot\boldsymbol{M} - ((\boldsymbol{Q}\boldsymbol{W}^{\top})\odot\boldsymbol{M})(\boldsymbol{I} + (\boldsymbol{W}\boldsymbol{W}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}((\boldsymbol{W}\boldsymbol{K}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I})) \label{eq:path-attn}\end{equation}
Shocked? There’s more. Direct inversion is $\mathcal{O}(n^3)$, which is unacceptable. One must use the low-rank property of $\boldsymbol{W}\boldsymbol{W}^{\top}$ to reduce complexity to $\mathcal{O}(n^2)$, derive backpropagation, and write a high-performance implementation similar to Flash Attention. The original paper is incredibly dense and hardcore.
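For small $n$, Equation $\eqref{eq:path-attn}$ can be checked by brute force against the explicit product of $(\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top})$ factors; the sketch below does exactly that (random data, direct $n\times n$ inversion, no norm constraint on $\boldsymbol{w}$, and zero-based indexing of the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 4
Q, K, W = rng.standard_normal((3, n, d)) * 0.5

M = np.tril(np.ones((n, n)))
T = np.linalg.inv(np.eye(n) + (W @ W.T) * (M - np.eye(n)))
A_formula = (Q @ K.T) * M - ((Q @ W.T) * M) @ T @ ((W @ K.T) * (M - np.eye(n)))

# brute force: A[i, j] = q_i^T (I - w_i w_i^T) ... (I - w_{j+1} w_{j+1}^T) k_j for i >= j
A_direct = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1):
        R = np.eye(d)
        for t in range(j + 1, i + 1):
            R = (np.eye(d) - np.outer(W[t], W[t])) @ R
        A_direct[i, j] = Q[i] @ R @ K[j]

assert np.allclose(A_formula, A_direct)
```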
From the position encoding perspective, PaTH is a type of CoPE (Contextual Position Encoding). Its positions are not simple indices $1, 2, 3, \cdots$, but position signals automatically generated based on content. Similarly, FoX can be seen as a Contextual version of ALiBi. Context-dependent position information is a major feature of current Linear Attention and likely a key direction for future feedback into Softmax Attention.
The Joy of Simplification
Let's dive a bit deeper into PaTH. This both helps us understand PaTH itself and makes us more familiar with DeltaNet, since the two are closely related. Specifically, we will look at two special cases of PaTH that reveal its connection to DeltaNet.
The first special case is $\boldsymbol{W}=\boldsymbol{K}$. Substituting into $\eqref{eq:path-attn}$:
\begin{equation}\begin{aligned}
\boldsymbol{A} =&\, ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot\boldsymbol{M})(\boldsymbol{I} - (\boldsymbol{I} + (\boldsymbol{K}\boldsymbol{K}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}((\boldsymbol{K}\boldsymbol{K}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))) \\[6pt]
=&\, ((\boldsymbol{Q}\boldsymbol{K}^{\top})\odot\boldsymbol{M})(\boldsymbol{I} + (\boldsymbol{K}\boldsymbol{K}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1} \qquad (\text{using } \boldsymbol{I} - (\boldsymbol{I} + \boldsymbol{B})^{-1} \boldsymbol{B} = (\boldsymbol{I}+\boldsymbol{B})^{-1})
\end{aligned}\end{equation}
Look familiar? This is exactly the Attention matrix of DeltaNet! From this special case, the difference between PaTH and DeltaFormer is that DeltaFormer uses the kernel trick to add $\exp$ to DeltaNet's $\boldsymbol{Q}\boldsymbol{K}^{\top}$ and $\boldsymbol{K}\boldsymbol{K}^{\top}$, while PaTH directly adds $\exp$ to the DeltaNet Attention matrix.
The second case is reintroducing the constraint $\Vert\boldsymbol{w}\Vert=\sqrt{2}$, making $\boldsymbol{I}-\boldsymbol{w}\boldsymbol{w}^{\top}$ orthogonal. Let's introduce:
\begin{equation}\begin{aligned}
\boldsymbol{R}_i \triangleq&\, (\boldsymbol{I}-\boldsymbol{w}_i\boldsymbol{w}_i^{\top})(\boldsymbol{I}-\boldsymbol{w}_{i-1}\boldsymbol{w}_{i-1}^{\top})\cdots(\boldsymbol{I}-\boldsymbol{w}_1\boldsymbol{w}_1^{\top}) \\
=&\, \boldsymbol{I} - \boldsymbol{W}_{[:i]}^{\top}(\boldsymbol{I} + (\boldsymbol{W}_{[:i]}\boldsymbol{W}_{[:i]}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}\boldsymbol{W}_{[:i]} \\
=&\, \boldsymbol{R}_{i,0}
\end{aligned}\end{equation}
Then $\boldsymbol{R}_{i,j} = \boldsymbol{R}_i \boldsymbol{R}_j^{\top}$. This equality means we can implement PaTH using absolute position to represent relative positions, like RoPE. We just multiply each $\boldsymbol{q}_i^{\top}, \boldsymbol{k}_i^{\top}$ by $\boldsymbol{R}_i$ and use a standard Softmax Attention implementation. What is the operation of multiplying by $\boldsymbol{R}_i$? Repeating the expansion from the previous section:
\begin{equation}\begin{aligned}
(\boldsymbol{q}_i^{\top} \boldsymbol{R}_{i})_s =&\, Q_{i,s} - \sum_{l=1}^n \underbrace{\chi_{l\leq i} \sum_{p=1}^d Q_{i,p} W_{l,p}}_{(\boldsymbol{Q}\boldsymbol{W}^{\top})\odot\boldsymbol{M}}\, \underbrace{\sum_{r=1}^n T_{l,r} W_{r,s}}_{\boldsymbol{T}\boldsymbol{W}}
\end{aligned}\end{equation}
In matrix form, this is:
\begin{equation}\boldsymbol{Q} - ((\boldsymbol{Q}\boldsymbol{W}^{\top})\odot\boldsymbol{M})(\boldsymbol{I} + (\boldsymbol{W}\boldsymbol{W}^{\top})\odot(\boldsymbol{M} - \boldsymbol{I}))^{-1}\boldsymbol{W}\end{equation}
Familiar again? The second part is $\text{DeltaNet}(\boldsymbol{Q},\boldsymbol{W},\boldsymbol{W})$! So in this case, PaTH is equivalent to:
\begin{equation}\mathop{\text{SoftmaxAttention}}(\underbrace{\boldsymbol{Q}-\mathop{\text{DeltaNet}}(\boldsymbol{Q},\boldsymbol{W},\boldsymbol{W})}_{\tilde{\boldsymbol{Q}}},\underbrace{\boldsymbol{K}-\mathop{\text{DeltaNet}}(\boldsymbol{K},\boldsymbol{W},\boldsymbol{W})}_{\tilde{\boldsymbol{K}}},\boldsymbol{V})\end{equation}
That is, DeltaNet is used to add position encoding to $\boldsymbol{Q}, \boldsymbol{K}$. In this view, PaTH (under the $\Vert\boldsymbol{w}\Vert=\sqrt{2}$ constraint) is a kind of intra-layer mix of Softmax Attention and DeltaNet. Even if we discard the constraint, this approach is similar to Canon Layers, which use convolutions to add position information to $\boldsymbol{Q}, \boldsymbol{K}$; the difference is that here the "convolution" is a long convolution like DeltaNet rather than a short one.
The Unconventional Path
Finally, let's look at a recent and equally noteworthy Linear Attention model—MesaNet (and its similar contemporary work, Atlas). TTT's Online Learning perspective tells us that DeltaNet is using SGD to optimize the objective $\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k} - \boldsymbol{v}\Vert^2$. However, we notice that $\boldsymbol{S}\boldsymbol{k}$ is just a linear function of $\boldsymbol{k}$, making this a linear regression problem, which has an analytic solution!
\begin{equation}\boldsymbol{S}_t = \boldsymbol{G}_t \boldsymbol{H}_t^{-1},\quad \boldsymbol{G}_t = \sum_{j=1}^t \boldsymbol{v}_j \boldsymbol{k}_j^{\top},\quad \boldsymbol{H}_t = \sum_{j=1}^t \boldsymbol{k}_j \boldsymbol{k}_j^{\top}\end{equation}
MesaNet utilizes this analytic solution to build sequence models. This idea originated in "Uncovering mesa-optimization algorithms in Transformers," and its efficient training was realized by "MesaNet: Sequence Modeling by Locally Optimal Test-Time Training." MesaNet adds forget gates to $\boldsymbol{G}_t, \boldsymbol{H}_t$ and adds a diagonal matrix $\boldsymbol{\Lambda}_t$ to the inverse term for numerical stability:
\begin{equation}\boldsymbol{o}_t = \boldsymbol{G}_t (\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t,\quad \boldsymbol{G}_t = \gamma_t \boldsymbol{G}_{t-1} + \boldsymbol{v}_t \boldsymbol{k}_t^{\top},\quad\boldsymbol{H}_t = \gamma_t \boldsymbol{H}_{t-1} + \boldsymbol{k}_t \boldsymbol{k}_t^{\top}\end{equation}
Obviously, the cost of maintaining $\boldsymbol{G}_t, \boldsymbol{H}_t$ is linear in the sequence length, as is the computation of $\boldsymbol{o}_t$, so MesaNet is still a Linear Attention model. Thanks to the analytic solution, it usually outperforms DeltaNet and Gated DeltaNet. From a signal processing standpoint, the difference between MesaNet and DeltaNet is that between Recursive Least Squares (RLS) and Least Mean Squares (LMS).
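Here is a minimal recurrent sketch of MesaNet as written above; for simplicity I take $\boldsymbol{\Lambda}_t$ to be a constant multiple of the identity and use a dense solve in place of the paper's efficient chunked / conjugate-gradient computation.

```python
import numpy as np

def mesanet_recurrent(Q, K, V, gammas, lam=1.0):
    """o_t = G_t (H_t + lam*I)^{-1} q_t,
    G_t = gamma_t G_{t-1} + v_t k_t^T,  H_t = gamma_t H_{t-1} + k_t k_t^T."""
    n, d = Q.shape
    G, H = np.zeros((d, d)), np.zeros((d, d))
    O = np.zeros_like(V)
    for t in range(n):
        G = gammas[t] * G + np.outer(V[t], K[t])
        H = gammas[t] * H + np.outer(K[t], K[t])
        O[t] = G @ np.linalg.solve(H + lam * np.eye(d), Q[t])
    return O

rng = np.random.default_rng(0)
n, d = 64, 8
Q, K, V = rng.standard_normal((3, n, d))
O = mesanet_recurrent(Q, K, V, gammas=np.full(n, 0.95))
```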
While full of advantages, why did I categorize it as "unconventional"? In my view, MesaNet "lives by the analytic solution and dies by the analytic solution." The analytic solution usually makes it superior to DeltaNet but gives it a "dead end" feel, because even a small change to the model makes an analytic solution nearly impossible to find. Looking at the history of mathematics, branches that rely on analytic solutions have mostly declined because such solutions are too rare and unrepresentative.
In terms of implementation, the matrix $\boldsymbol{H}_t + \boldsymbol{\Lambda}_t$ that MesaNet needs to invert is not triangular. Although $(\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t$ can still be transformed into solving equations without an explicit inverse, the non-triangular nature increases complexity significantly. Efficiently computing all $(\boldsymbol{H}_t + \boldsymbol{\Lambda}_t)^{-1} \boldsymbol{q}_t$ in parallel will be a long-term challenge for MesaNet. Currently, the paper uses the "Conjugate Gradient Method" for approximate solutions—useful but not perfect.
Furthermore, theoretically, MesaNet is not strictly superior to DeltaNet. MesaNet's update rules for $\boldsymbol{G}_t, \boldsymbol{H}_t$ are simple sliding averages, and its inversion doesn't involve Token interaction, so its upper bound for capacity is likely lower than that of DeltaNet, which has the Delta Rule. Intuitively, MesaNet tries to remember all $\boldsymbol{k}, \boldsymbol{v}$, which is usually good but can lead to "blurring." DeltaNet's principle is to "remove the old to welcome the new"; by removing the old, it can achieve long-term, precise memory of specific content.
Overall, MesaNet is an aesthetically pleasing model, but the analytic solution also brings complexity and limits flexibility, leaving much room for exploration. For readers wanting to learn more about sequence modeling based on linear regression, TTR provides a detailed discussion of sequence models under various linear regression objectives.
The Road is Still Ahead
This article has briefly summarized the development of Linear Attention and introduced some of the mathematical principles of these models. Linear Attention started by imitating Softmax Attention but has gradually developed its own characteristics. It has become a highly competitive sequence modeling solution and has even provided new ideas for the development of Softmax Attention—a process full of interest and inspiration.