By 苏剑林 | Dec 23, 2025
In the article "A Brief History of Linear Attention: From Imitation and Innovation to Feedforward", we introduced DeltaNet, which brought the Delta Rule into linear attention, making it a powerful tool and forming the foundation for subsequent works like GDN and KDA. However, that article focused primarily on the overall idea of DeltaNet and did not delve into many technical details. In this post, we will discuss one of them: Why do DeltaNet and its subsequent works apply L2 Normalize to $\boldsymbol{Q}$ and $\boldsymbol{K}$?
Of course, explaining this operation directly from the perspective of eigenvalues is not difficult, but I personally feel that explanation is slightly lacking. A few days ago, I learned a new interpretation from the paper "Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics", which I found quite valuable and would like to share.
Basic Analysis
The recursive format of DeltaNet is
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-1} - \eta_t (\boldsymbol{S}_{t-1} \boldsymbol{k}_t - \boldsymbol{v}_t)\boldsymbol{k}_t^{\top} = \boldsymbol{S}_{t-1}(\boldsymbol{I} - \eta_t \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \eta_t \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\label{eq:delta}\end{equation}
Viewed along the time index $t$, this can be read as using SGD with learning rate $\eta_t$ to perform online optimization of the instantaneous loss $\frac{1}{2}\Vert\boldsymbol{S}\boldsymbol{k}_t - \boldsymbol{v}_t\Vert^2$ (with $\boldsymbol{S}$ as the trainable parameter). We know that optimizers are often sensitive to the learning rate, especially non-adaptive ones like SGD. In DeltaNet, this sensitivity manifests as additional requirements on the transition matrix $\boldsymbol{I} - \eta_t \boldsymbol{k}_t\boldsymbol{k}_t^{\top}$.
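Before turning to those requirements, here is a minimal NumPy sketch (dimensions and variable names are my own, purely for illustration) confirming that one step of the DeltaNet recursion \eqref{eq:delta} is exactly one SGD step on the instantaneous loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 8, 8
S = rng.standard_normal((d_v, d_k))   # current state ("weights" being optimized)
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
eta = 0.3                             # learning rate / step size

# One SGD step on the instantaneous loss 0.5 * ||S k - v||^2
grad = np.outer(S @ k - v, k)         # gradient of the loss w.r.t. S
S_sgd = S - eta * grad

# One DeltaNet update, written exactly as the recursion above
S_delta = S @ (np.eye(d_k) - eta * np.outer(k, k)) + eta * np.outer(v, k)

assert np.allclose(S_sgd, S_delta)
```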
Specifically, since the transition matrices of different time steps are multiplied together during the recursion, their eigenvalues must not exceed 1 in absolute value, or the recursion risks numerical explosion. For the matrix $\boldsymbol{I} - \eta_t \boldsymbol{k}_t\boldsymbol{k}_t^{\top}$, one eigenvalue is $1 - \eta_t\Vert\boldsymbol{k}_t\Vert^2$ (with eigenvector $\boldsymbol{k}_t$) and the remaining ones are all 1 (with eigenvectors orthogonal to $\boldsymbol{k}_t$); the reader is invited to verify this. From this, we derive the constraint
\begin{equation}-1 \leq 1 - \eta_t\Vert\boldsymbol{k}_t\Vert^2 \leq 1\label{eq:cond}\end{equation}
To satisfy this constraint, a common practice is to apply L2 Normalize to $\boldsymbol{k}_t$ and pass $\eta_t$ through a Sigmoid function, so that all eigenvalues fall within $(0, 1]$. This is the origin of applying L2 Normalize to $\boldsymbol{K}$. As for applying L2 Normalize to $\boldsymbol{Q}$, it is not strictly necessary but is often added for symmetry, similar to the case of Short Conv, where adding it to $\boldsymbol{K}$ is the most critical part [Ref].
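As a quick numerical sanity check (again just an illustrative sketch with arbitrary dimensions), one can verify the eigenvalue claim and confirm that L2 Normalize plus Sigmoid keeps the spectrum inside $(0, 1]$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
k = rng.standard_normal(d)
eta = 0.7

# Eigenvalues of I - eta * k k^T: one is 1 - eta * ||k||^2, the rest are 1
eigs = np.linalg.eigvalsh(np.eye(d) - eta * np.outer(k, k))
assert np.isclose(eigs.min(), 1 - eta * (k @ k))
assert np.allclose(np.sort(eigs)[1:], 1.0)

# With L2-normalized k and a sigmoid gate, the spectrum lies in (0, 1]
k_hat = k / np.linalg.norm(k)
eta_sig = 1 / (1 + np.exp(-rng.standard_normal()))   # sigmoid of a raw logit, in (0, 1)
eigs = np.linalg.eigvalsh(np.eye(d) - eta_sig * np.outer(k_hat, k_hat))
assert np.all((eigs > 0) & (eigs <= 1 + 1e-9))
```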
Supplementary Statement
It's worth mentioning that for a long time, everyone was accustomed to keeping eigenvalues within $(0, 1]$, hence choosing Sigmoid for $\eta_t$. Later, "Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues" pointed out that negative eigenvalues can enhance the state-tracking capability of DeltaNet, proposing to change DeltaNet to
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-1}(\boldsymbol{I} - 2\eta_t \boldsymbol{k}_t\boldsymbol{k}_t^{\top}) + \eta_t \boldsymbol{v}_t \boldsymbol{k}_t^{\top}\end{equation}
They still apply L2 Normalize to $\boldsymbol{k}_t$ and Sigmoid to $\eta_t$, which expands the eigenvalue range of the transition matrix $\boldsymbol{I} - 2\eta_t \boldsymbol{k}_t\boldsymbol{k}_t^{\top}$ to $(-1, 1]$. However, state tracking is a capability geared more toward specific formal structures (such as code), so if we make this modification but train and test only on natural language, the observable difference may not be significant.
Another detail worth noting is that when $\eta_t=1$, the transition matrix $\boldsymbol{I} - 2\boldsymbol{k}_t\boldsymbol{k}_t^\top$ is an orthogonal matrix. In theory this is fine, but in practice it can cause trouble because we typically use BF16 for efficiency. BF16 has low precision, which may occasionally push an eigenvalue of $\boldsymbol{I} - 2\boldsymbol{k}_t\boldsymbol{k}_t^\top$ slightly below -1, and over a long product of transition matrices this still carries a risk of explosion, so $\eta_t$ should be kept from getting too close to 1.
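To get a feel for this, here is a rough PyTorch sketch (assuming a recent PyTorch; the exact amount of drift depends on dimension, sequence length and seed) that accumulates products of the reflections $\boldsymbol{I} - 2\boldsymbol{k}_t\boldsymbol{k}_t^\top$ in BF16 and watches how far the product drifts from orthogonality:

```python
import torch

torch.manual_seed(0)
d, steps = 64, 2000

P_bf16 = torch.eye(d, dtype=torch.bfloat16)
P_fp64 = torch.eye(d, dtype=torch.float64)
for _ in range(steps):
    k = torch.randn(d, dtype=torch.float64)
    k = k / k.norm()
    H = torch.eye(d, dtype=torch.float64) - 2 * torch.outer(k, k)  # exact reflection (eta_t = 1)
    P_fp64 = P_fp64 @ H
    P_bf16 = P_bf16 @ H.to(torch.bfloat16)

# A product of reflections is orthogonal, so its spectral norm should stay at 1
print(torch.linalg.matrix_norm(P_fp64, ord=2).item())           # stays ~1.0
print(torch.linalg.matrix_norm(P_bf16.float(), ord=2).item())   # typically drifts away from 1 in BF16
```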
In fact, the explanation above is fairly complete and not complicated. My quibble with it is mainly aesthetic: the way to satisfy condition \eqref{eq:cond} is not unique. For example, one could instead introduce a Squash operation similar to that of Capsule networks, as Longhorn does. So we cannot claim that L2 Normalize is derived "naturally"; we can only say it is one viable choice.
Continuous Perspective
Next, let's introduce the approach of the paper "Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics". I find this a rather elegant line of derivation, though that is ultimately a matter of taste. It views equation \eqref{eq:delta} as the Euler discretization, over the interval $[t-\eta_t, t]$, of the following differential equation:
\begin{equation}\frac{d}{dt}\boldsymbol{S}_t = \boldsymbol{S}_t\underbrace{(-\boldsymbol{k}_t\boldsymbol{k}_t^{\top})}_{\boldsymbol{A}_t} + \underbrace{\boldsymbol{v}_t \boldsymbol{k}_t^{\top}}_{\boldsymbol{B}_t}\label{eq:ode}\end{equation}
It then points out that the numerical explosion arises because the discretization scheme is not accurate enough, and proposes building the recursion from the exact solution of the differential equation rather than from an approximate discretization. Since $\boldsymbol{A}_t$ and $\boldsymbol{B}_t$ are constant within the interval $[t-\eta_t, t]$, evolving the state from $t-\eta_t$ to $t$ amounts to solving a linear differential equation with constant coefficients, whose general solution is
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-\eta_t} e^{\eta_t \boldsymbol{A}_t} + \boldsymbol{B}_t \boldsymbol{A}_t^{-1}(e^{\eta_t \boldsymbol{A}_t} - \boldsymbol{I})\label{eq:S-t-eta}\end{equation}
Relabeling $\boldsymbol{S}_{t-\eta_t}$ back to the notation $\boldsymbol{S}_{t-1}$ and substituting the expressions for $\boldsymbol{A}_t, \boldsymbol{B}_t$, we simplify to obtain
\begin{equation}\boldsymbol{S}_t = \boldsymbol{S}_{t-1} \left(\boldsymbol{I} - \frac{1 - e^{-\eta_t\Vert\boldsymbol{k}_t\Vert^2}}{\Vert\boldsymbol{k}_t\Vert^2}\boldsymbol{k}_t\boldsymbol{k}_t^{\top}\right) + \frac{1 - e^{-\eta_t\Vert\boldsymbol{k}_t\Vert^2}}{\Vert\boldsymbol{k}_t\Vert^2}\boldsymbol{v}_t \boldsymbol{k}_t^{\top}\label{eq:ode-deltanet}\end{equation}
This is the final result we wished to derive; the original paper calls it "EFLA (Error-Free Linear Attention)". Essentially, it replaces $\eta_t$ with $\frac{1 - e^{-\eta_t\Vert\boldsymbol{k}_t\Vert^2}}{\Vert\boldsymbol{k}_t\Vert^2}$: the factor $\Vert\boldsymbol{k}_t\Vert^2$ appears naturally in the denominator, and combined with $\boldsymbol{k}_t\boldsymbol{k}_t^\top$ it gives exactly the outer product of the L2-normalized $\boldsymbol{k}_t$, which is where the L2 Normalize of $\boldsymbol{K}$ comes from.
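For concreteness, here is a short NumPy sketch of the update (my own reference implementation of \eqref{eq:ode-deltanet}, not the authors' code), compared with the plain Euler/DeltaNet step; note how the EFLA transition matrix keeps its eigenvalues in $(0, 1]$ no matter how large $\Vert\boldsymbol{k}_t\Vert$ is:

```python
import numpy as np

def deltanet_step(S, k, v, eta):
    """Euler-discretized update, i.e. the original DeltaNet recursion."""
    return S @ (np.eye(len(k)) - eta * np.outer(k, k)) + eta * np.outer(v, k)

def efla_step(S, k, v, eta):
    """Exact-solution update: eta is replaced by (1 - exp(-eta*||k||^2)) / ||k||^2."""
    k2 = k @ k
    coef = (1 - np.exp(-eta * k2)) / k2
    return S @ (np.eye(len(k)) - coef * np.outer(k, k)) + coef * np.outer(v, k)

rng = np.random.default_rng(2)
d = 8
k, v, eta = 3.0 * rng.standard_normal(d), rng.standard_normal(d), 0.9
k2 = k @ k

# With an un-normalized (large) k, the Euler transition matrix can have an
# eigenvalue far below -1, while the EFLA one always stays inside (0, 1]
T_euler = np.eye(d) - eta * np.outer(k, k)
T_efla = np.eye(d) - (1 - np.exp(-eta * k2)) / k2 * np.outer(k, k)
print(np.linalg.eigvalsh(T_euler).min())  # = 1 - eta*||k||^2, here << -1
print(np.linalg.eigvalsh(T_efla).min())   # = exp(-eta*||k||^2), in (0, 1]

S0 = rng.standard_normal((d, d))
print(np.linalg.norm(deltanet_step(S0, k, v, eta)))  # amplified by the |1 - eta*||k||^2| eigenvalue
print(np.linalg.norm(efla_step(S0, k, v, eta)))      # stays moderate
```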
Mathematical Details
In the previous section, we quickly introduced the result of EFLA, omitting many mathematical details. Here, we supplement the discussion. Due to space constraints, I can only briefly mention the key points of the derivation.
The core result of the previous section is equation \eqref{eq:S-t-eta}, which is the solution to the differential equation $d\boldsymbol{S}_t/dt = \boldsymbol{S}_t \boldsymbol{A} + \boldsymbol{B}$. To avoid confusion, I've omitted the subscripts of $\boldsymbol{A}, \boldsymbol{B}$ here since they are indeed constants within the integration interval. If $\boldsymbol{B}=\boldsymbol{0}$, the solution is simply $\boldsymbol{S}_t = \boldsymbol{S}_0 e^{t\boldsymbol{A}}$, where $e^{t\boldsymbol{A}}$ is the matrix exponential. When $\boldsymbol{B} \neq \boldsymbol{0}$, rewrite the equation as $d(\boldsymbol{S}_t + \boldsymbol{B}\boldsymbol{A}^{-1})/dt = (\boldsymbol{S}_t + \boldsymbol{B}\boldsymbol{A}^{-1})\boldsymbol{A}$. Using the solution for the $\boldsymbol{B}=\boldsymbol{0}$ case, we get:
\begin{equation}\boldsymbol{S}_t = (\boldsymbol{S}_0 + \boldsymbol{B}\boldsymbol{A}^{-1})e^{t\boldsymbol{A}} - \boldsymbol{B}\boldsymbol{A}^{-1} = \boldsymbol{S}_0 e^{t\boldsymbol{A}} + \boldsymbol{B}\boldsymbol{A}^{-1}(e^{t\boldsymbol{A}} - \boldsymbol{I})\end{equation}
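As a numerical cross-check (a sketch with small random matrices), this closed form agrees with a fine-grained Euler integration of $d\boldsymbol{S}_t/dt = \boldsymbol{S}_t \boldsymbol{A} + \boldsymbol{B}$:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d, t = 5, 1.0
S0 = rng.standard_normal((d, d))
A = 0.5 * rng.standard_normal((d, d))   # a generic (invertible) A for this check
B = rng.standard_normal((d, d))

# Closed form: S(t) = S0 e^{tA} + B A^{-1} (e^{tA} - I)
S_closed = S0 @ expm(t * A) + B @ np.linalg.inv(A) @ (expm(t * A) - np.eye(d))

# Fine Euler integration of dS/dt = S A + B
S, n = S0.copy(), 100_000
dt = t / n
for _ in range(n):
    S = S + dt * (S @ A + B)

print(np.abs(S - S_closed).max())   # small, and shrinks further as n grows
```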
Changing the starting point to $t-\eta_t$, the end point to $t$, and restoring the subscripts of $\boldsymbol{A}, \boldsymbol{B}$ yields equation \eqref{eq:S-t-eta}. Note that the inverse matrix $\boldsymbol{A}^{-1}$ appears in the last term, but $\boldsymbol{A}$ is not actually required to be invertible: the expression is understood as substituting $x = \boldsymbol{A}$ into the power series expansion of $(e^{tx}-1)/x$. Now, returning to equation \eqref{eq:S-t-eta}, for DeltaNet $\boldsymbol{A}_t = -\boldsymbol{k}_t\boldsymbol{k}_t^\top$ is a rank-1 matrix, which allows for further simplification:
\begin{equation}f(\boldsymbol{x}\boldsymbol{y}^{\top}) = \sum_{n=0}^{\infty} a_n (\boldsymbol{x}\boldsymbol{y}^{\top})^n = a_0\boldsymbol{I} + \sum_{n=1}^{\infty} a_n (\boldsymbol{x}\boldsymbol{y}^{\top})^n = f(0)\boldsymbol{I} + \boldsymbol{x}\underbrace{\left(\sum_{n=1}^{\infty} a_n(\boldsymbol{y}^{\top}\boldsymbol{x})^{n-1}\right)}_{\frac{f(\boldsymbol{y}^{\top}\boldsymbol{x})-f(0)}{\boldsymbol{y}^{\top}\boldsymbol{x}}}\boldsymbol{y}^{\top}\end{equation}
Note that $\boldsymbol{y}^{\top}\boldsymbol{x}$ is a scalar, so the key to the simplification is converting a matrix function into a scalar function. From this, we obtain:
\begin{equation}e^{\eta_t \boldsymbol{A}_t} = \boldsymbol{I} - \frac{1 - e^{-\eta_t\Vert\boldsymbol{k}_t\Vert^2}}{\Vert\boldsymbol{k}_t\Vert^2}\boldsymbol{k}_t\boldsymbol{k}_t^{\top},\qquad \boldsymbol{B}_t \boldsymbol{A}_t^{-1}(e^{\eta_t \boldsymbol{A}_t} - \boldsymbol{I})=\frac{1 - e^{-\eta_t\Vert\boldsymbol{k}_t\Vert^2}}{\Vert\boldsymbol{k}_t\Vert^2}\boldsymbol{v}_t \boldsymbol{k}_t^{\top}\end{equation}
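Both identities are easy to check numerically (again a throwaway sketch; the series below is a truncated version of the power-series reading of $\boldsymbol{A}_t^{-1}(e^{\eta_t\boldsymbol{A}_t}-\boldsymbol{I})$):

```python
import math
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(4)
d = 6
k, v, eta = 0.5 * rng.standard_normal(d), rng.standard_normal(d), 0.8
k2 = k @ k
coef = (1 - np.exp(-eta * k2)) / k2
A = -np.outer(k, k)
B = np.outer(v, k)

# e^{eta A} from scipy vs. the rank-1 closed form
assert np.allclose(expm(eta * A), np.eye(d) - coef * np.outer(k, k))

# A^{-1}(e^{eta A} - I) read as the power series of (e^{eta x} - 1)/x at x = A,
# i.e. sum_{n >= 0} eta^{n+1} A^n / (n+1)!   (A is singular, so we never invert it)
g = sum(eta ** (n + 1) / math.factorial(n + 1) * np.linalg.matrix_power(A, n)
        for n in range(30))
assert np.allclose(B @ g, coef * np.outer(v, k))
```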
Personal Thoughts
With this, our introduction to EFLA concludes. The original paper includes experiments showing that EFLA has certain advantages over the original DeltaNet. However, as equation \eqref{eq:ode-deltanet} shows, EFLA still has the same form as DeltaNet, so one should not expect a "revolutionary" leap. Why is EFLA generally slightly better? DeltaNet discards the magnitude of $\boldsymbol{K}$ via L2 Normalize, whereas in equation \eqref{eq:ode-deltanet} the coefficient of $\boldsymbol{v}_t \boldsymbol{k}_t^\top$ still depends on $\Vert\boldsymbol{k}_t\Vert$, giving EFLA an extra degree of freedom and thus a higher theoretical upper bound.
Furthermore, the approach of using exact solutions of differential equations to construct recursions is not new. We mentioned it when introducing SSMs in "Revisiting SSM (II): Legacy Issues of HiPPO". The key result in equation \eqref{eq:S-t-eta} already appeared in HiPPO. EFLA essentially carries out the expanded calculations for the specific case of DeltaNet to obtain a simplified, usable result.
A more profound question is: what do we gain by starting from a differential equation? It is easy to check that the eigenvalues of the transition matrix in equation \eqref{eq:ode-deltanet} automatically fall within $(0, 1]$. In other words, the recursion derived by solving differential equation \eqref{eq:ode} is inherently more stable: the differential equation carries a built-in continuity constraint, and since the matrix $-\boldsymbol{k}_t\boldsymbol{k}_t^\top$ is negative semi-definite, standard stability theory for linear differential equations guarantees that the solution does not blow up.
A classic example in mathematical modeling is the Logistic equation $dx/dt = \alpha x - \beta x^2$. Its solution is simple—the Logistic function. However, the corresponding difference equation $x_{t+1} - x_t = \alpha x_t - \beta x_t^2$ can exhibit chaotic behavior (extreme sensitivity to initial values) under certain settings. Thus, starting from a differential equation can automatically circumvent some anomalous behaviors.
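A tiny numerical illustration (the parameters are chosen purely for demonstration): with $\alpha=\beta=3$, the difference equation is conjugate to the chaotic logistic map $y_{t+1}=4y_t(1-y_t)$, while the exact ODE solution converges smoothly to $\alpha/\beta$ from any positive start:

```python
import numpy as np

alpha, beta, steps = 3.0, 3.0, 40

def ode_solution(x0, t):
    """Closed-form solution of dx/dt = alpha*x - beta*x^2 (a logistic function)."""
    return alpha * x0 * np.exp(alpha * t) / (alpha + beta * x0 * (np.exp(alpha * t) - 1))

def difference_eq(x0):
    """Iterate x_{t+1} = x_t + alpha*x_t - beta*x_t^2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x + alpha * x - beta * x * x)
    return np.array(xs)

# Two nearly identical initial values: the discrete trajectories end up far apart...
a, b = difference_eq(0.5), difference_eq(0.5 + 1e-6)
print(abs(a[-1] - b[-1]))                                        # typically O(1): extreme sensitivity to the initial value
# ...while the continuous solutions both settle at alpha/beta = 1
print(ode_solution(0.5, 10.0), ode_solution(0.5 + 1e-6, 10.0))   # both ~ 1.0
```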
Summary
This article discussed the L2 Normalize in DeltaNet and introduced the idea of reparameterizing DeltaNet starting from differential equations. This can also be viewed as an interpretation of the L2 Normalize operation on $\boldsymbol{K}$ in DeltaNet.