Finding a Substitute for Normalization via Gradient Approximation

By 苏剑林 | April 02, 2025

I wonder if everyone noticed the recent paper 《Transformers without Normalization》? It attempts to replace the Normalization layers in Transformer models with an element-wise operation called DyT, aiming to improve speed while maintaining performance. The topic of fundamental architecture is inherently attractive, and with big names like Kaiming He and Yann LeCun attached, the paper caused quite a stir when it was released, receiving both praise and criticism.

Coincidentally, a paper from last week titled 《The Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions》 interprets DyT from the perspective of gradient analysis and differential equations, and proposes new alternatives. Personally, I find this perspective quite fundamental, so I have studied it and would like to share it here.

A Few Words Up Front

DyT stands for Dynamic Tanh, which replaces the Normalization layer with the following operation:

\begin{equation}\mathop{\text{DyT}}(\boldsymbol{x}) = \boldsymbol{\gamma} \odot \tanh(\alpha \boldsymbol{x}) + \boldsymbol{\beta}\end{equation}

Where $\alpha, \boldsymbol{\beta}, \boldsymbol{\gamma}$ are all learnable parameters. Since $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$ are already part of the standard Normalization layer, the key change is using $\tanh(\alpha \boldsymbol{x})$ to replace the normalization step. $\tanh$ is an element-wise operation, which removes the need to compute the two statistics, mean and variance.
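To make this concrete, here is a minimal sketch of what a DyT layer might look like as a PyTorch module. The class name, the scalar $\alpha$ initialized to 0.5, and the per-channel $\boldsymbol{\gamma}, \boldsymbol{\beta}$ are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Sketch of DyT: gamma * tanh(alpha * x) + beta, applied element-wise."""
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x):
        # Purely element-wise: no mean/variance statistics are computed.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```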

Regarding DyT, I previously expressed some views on Zhihu in the thread "How to evaluate Meta's new paper Transformers without Normalization?". Simply put, I am not very optimistic about it. The reason is that Normalization unconditionally stabilizes the model's forward propagation, leaving more degrees of freedom and possibilities for other aspects of the model (such as performance). Therefore, I do not believe that a universal operation simpler than Normalization can achieve better results (No Free Lunch).

In fact, as far back as 2021 in "Brief Discussion on Initialization, Parameterization and Standardization of Transformers", we discussed the topic of removing Normalization. Related works include SkipInit, ReZero, and Fixup. At that time, I tried some schemes and found that even if they could match Normalization in some respects, they would still have deficiencies in others—for example, pre-training results were okay, but fine-tuning performance was poor—so I didn't pursue it further.

Therefore, I now view such works only as explorations of the limit in the direction of simplification, much like 《nGPT: Normalized Transformer with Representation Learning on the Hypersphere》, which goes in the opposite direction and adds Normalization almost everywhere it can be added; both probe the limits of a particular design direction.

Gradient Calculation

Of course, just because I am not optimistic doesn't mean we can't learn from and analyze it. To find a substitute or approximation for Normalization, the most direct approach is to start from the gradient. After all, deep learning boils down to forward and backward propagation, and backward propagation is all about computing gradients, which often play the more fundamental role.

Next, let's only consider RMS Norm. Its key operation is:

\begin{equation}\boldsymbol{y} = \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert_{RMS}} = \sqrt{d}\times \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert}\label{eq:rms-norm}\end{equation}

Where $\boldsymbol{x}\in\mathbb{R}^d$, and

\begin{equation}\Vert\boldsymbol{x}\Vert_{RMS} = \frac{\Vert\boldsymbol{x}\Vert}{\sqrt{d}},\qquad \Vert\boldsymbol{x}\Vert = \sqrt{\boldsymbol{x}^2} = \sqrt{\sum_{i=1}^d x_i^2}\end{equation}

So, computing the gradient of $\boldsymbol{x} / \Vert\boldsymbol{x}\Vert_{RMS}$ reduces, up to the constant factor $\sqrt{d}$, to computing the gradient of $\boldsymbol{x} / \Vert\boldsymbol{x}\Vert$, which we can do as follows:

\begin{equation}\frac{\boldsymbol{x}+\Delta\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} = \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} + \frac{\Delta\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} \approx \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} + \frac{\Delta\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert}\label{eq:exp-1}\end{equation}

The complicated part is expanding $\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert = \sqrt{(\boldsymbol{x}+\Delta\boldsymbol{x})^2}$:

\begin{equation}\begin{aligned} &\,\sqrt{(\boldsymbol{x}+\Delta\boldsymbol{x})^2} \\ \approx&\, \sqrt{\Vert\boldsymbol{x}\Vert^2+2\boldsymbol{x}\cdot\Delta\boldsymbol{x}} \\ =&\, \Vert\boldsymbol{x}\Vert\sqrt{1+2\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\Vert\boldsymbol{x}\Vert^2} \\ \approx&\, \Vert\boldsymbol{x}\Vert (1+\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\Vert\boldsymbol{x}\Vert^2) \end{aligned} \quad \Rightarrow \quad \begin{aligned} \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} \approx&\, \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert}(1-\boldsymbol{x}\cdot\Delta\boldsymbol{x}/\Vert\boldsymbol{x}\Vert^2) \end{aligned}\end{equation}

Substituting into Equation $\eqref{eq:exp-1}$:

\begin{equation}\frac{\boldsymbol{x}+\Delta\boldsymbol{x}}{\Vert\boldsymbol{x}+\Delta\boldsymbol{x}\Vert} - \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert} \approx \frac{\Delta\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert} - \frac{(\boldsymbol{x}\cdot\Delta\boldsymbol{x})\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert^3}\quad\Rightarrow\quad\nabla_{\boldsymbol{x}} \frac{\boldsymbol{x}}{\Vert\boldsymbol{x}\Vert} = \frac{\boldsymbol{I}}{\Vert\boldsymbol{x}\Vert} - \frac{\boldsymbol{x}\boldsymbol{x}^{\top}}{\Vert\boldsymbol{x}\Vert^3}\end{equation}

Finally, substituting back into Equation $\eqref{eq:rms-norm}$ yields:

\begin{equation}\nabla_{\boldsymbol{x}} \boldsymbol{y} = \sqrt{d}\left(\frac{\boldsymbol{I}}{\Vert\boldsymbol{x}\Vert} - \frac{\boldsymbol{x}\boldsymbol{x}^{\top}}{\Vert\boldsymbol{x}\Vert^3}\right) = \frac{1}{\Vert\boldsymbol{x}\Vert_{RMS}}\left(\boldsymbol{I} - \frac{\boldsymbol{y}\boldsymbol{y}^{\top}}{d}\right)\label{eq:rms-norm-grad}\end{equation}
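As a sanity check on Equation $\eqref{eq:rms-norm-grad}$, the closed-form Jacobian can be compared against a finite-difference estimate. The sketch below uses NumPy with an arbitrary dimension and a random input:

```python
import numpy as np

def rms_norm(x):
    # y = x / ||x||_RMS = sqrt(d) * x / ||x||
    d = x.shape[0]
    return np.sqrt(d) * x / np.linalg.norm(x)

def rms_norm_jacobian(x):
    # Closed form: (I - y y^T / d) / ||x||_RMS
    d = x.shape[0]
    y = rms_norm(x)
    rms = np.linalg.norm(x) / np.sqrt(d)
    return (np.eye(d) - np.outer(y, y) / d) / rms

def numerical_jacobian(f, x, eps=1e-6):
    # Central finite differences, one input coordinate (column) at a time.
    d = x.shape[0]
    J = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x = np.random.randn(8)
err = np.max(np.abs(rms_norm_jacobian(x) - numerical_jacobian(rms_norm, x)))
print(err)  # should be a tiny number (finite-difference error only)
```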

DyT Appears!

Note that both $\boldsymbol{x}$ and $\boldsymbol{y}$ are vectors, so $\nabla_{\boldsymbol{x}} \boldsymbol{y}$ is a matrix (the Jacobian matrix). Now we search for an element-wise approximation for RMS Norm, meaning each component is operated on independently:

\begin{equation}f(\boldsymbol{x}) = [f(x_1),f(x_2),\cdots,f(x_d)]\end{equation}

This independence means its Jacobian matrix must be a diagonal matrix! We want this approximation to preserve the gradient of RMS Norm as much as possible, so we consider preserving only the diagonal part of Equation $\eqref{eq:rms-norm-grad}$:

\begin{equation}\frac{dy_i}{dx_i} = \frac{1}{\Vert\boldsymbol{x}\Vert_{RMS}}\left(1 - \frac{y_i^2}{d}\right)\label{eq:ode-1}\end{equation}

If we further assume that $\rho = \Vert\boldsymbol{x}\Vert_{RMS}$ is a constant, then we can directly solve the above differential equation to obtain:

\begin{equation}y_i = \sqrt{d}\tanh\left(\frac{x_i}{\rho\sqrt{d}}\right)\end{equation}

In this way, we obtain the "T" ($\tanh$) in DyT; the initial condition $y_i(0)=0$ was chosen when solving the equation.
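For completeness, here are the separation-of-variables steps behind this solution (treating $\rho$ as a constant and taking the integration constant to be zero so that $y_i(0)=0$):

\begin{equation}\frac{dy_i}{1 - y_i^2/d} = \frac{dx_i}{\rho}\quad\Rightarrow\quad \sqrt{d}\,\operatorname{artanh}\left(\frac{y_i}{\sqrt{d}}\right) = \frac{x_i}{\rho}\quad\Rightarrow\quad y_i = \sqrt{d}\tanh\left(\frac{x_i}{\rho\sqrt{d}}\right)\end{equation}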

DyT essentially absorbs the leading $\sqrt{d}$ into the $\boldsymbol{\gamma}$ parameter and treats the term $\frac{1}{\rho\sqrt{d}}$ inside the parentheses as the trainable parameter $\alpha$, which alleviates the limitation caused by the assumption that "$\rho = \Vert\boldsymbol{x}\Vert_{RMS}$ is a constant". However, from my perspective, explicitly retaining $\sqrt{d}$ might be more valuable, as long as the $\frac{1}{\rho}$ part is treated as a trainable parameter.
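As a purely illustrative sketch of this parameterization contrast (not code from either paper): DyT learns a single $\alpha$ with $\sqrt{d}$ folded into $\boldsymbol{\gamma}$, whereas the variant suggested here keeps $\sqrt{d}$ explicit and learns only the $1/\rho$ factor:

```python
import torch

d = 1024
x = torch.randn(d)

# DyT-style: alpha absorbs both 1/rho and 1/sqrt(d); sqrt(d) is folded into gamma.
alpha = torch.tensor(0.5, requires_grad=True)
y_dyt = torch.tanh(alpha * x)

# Variant with sqrt(d) kept explicit: only the 1/rho factor is learned.
inv_rho = torch.tensor(1.0, requires_grad=True)
y_alt = d ** 0.5 * torch.tanh(inv_rho * x / d ** 0.5)
```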

DyISRU

I wonder if you noticed that for RMS Norm, we consistently have $y_i = x_i / \Vert\boldsymbol{x}\Vert_{RMS}$, so we can replace $\Vert\boldsymbol{x}\Vert_{RMS}$ in Equation $\eqref{eq:ode-1}$ with $x_i/y_i$, thus obtaining:

\begin{equation}\frac{dy_i}{dx_i} = \frac{y_i}{x_i}\left(1 - \frac{y_i^2}{d}\right)\label{eq:ode-2}\end{equation}

This is an equation containing only $x_i$ and $y_i$, eliminating the need for an approximate treatment of $\Vert\boldsymbol{x}\Vert_{RMS}$. Solving this equation gives:

\begin{equation}y_i = \frac{\sqrt{d}x_i}{\sqrt{x_i^2 + C}}\end{equation}

Where $C$ is an arbitrary constant. This form is known as ISRU (Inverse Square Root Unit, which we previously called SoftSign), originating from the paper 《Improving Deep Learning by Inverse Square Root Linear Units (ISRLUs)》. If $C$ is treated as a trainable parameter, it can be called DyISRU (Dynamic ISRU) by analogy with DyT.
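For the record, the solution above also follows from a routine separation of variables (writing the integration constant as $C = d/k^2$ to match the form above):

\begin{equation}\frac{dy_i}{y_i(1 - y_i^2/d)} = \frac{dx_i}{x_i}\quad\Rightarrow\quad \frac{y_i}{\sqrt{1 - y_i^2/d}} = k x_i\quad\Rightarrow\quad y_i = \frac{k x_i}{\sqrt{1 + k^2 x_i^2/d}} = \frac{\sqrt{d}\,x_i}{\sqrt{x_i^2 + d/k^2}}\end{equation}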

Looking at the path from gradient $\eqref{eq:rms-norm-grad}$ to Equation $\eqref{eq:ode-1}$ and then to $\eqref{eq:ode-2}$, DyISRU is the best result we can achieve with an element-wise function, because apart from the diagonal assumption no additional approximations were made. Formally, DyISRU is actually more intuitive than DyT, because $\Vert\boldsymbol{x}\Vert_{RMS}^2$ is just $\mathbb{E}[x_i^2]$: since we are restricted to an element-wise operation, we are forced to replace $\mathbb{E}[x_i^2]$ with $x_i^2$, and adding $C$ and multiplying by $\sqrt{d}$ merely serve as smoothing corrections:

\begin{equation}\frac{x_i}{\sqrt{\color{red}{\frac{1}{d}\sum\limits_{i=1}^d x_i^2}}}\quad\to\quad \frac{x_i}{\sqrt{\color{green}{x_i^2}}}\quad\to\quad \frac{\color{orange}{\sqrt{d}} x_i}{\sqrt{\color{green}{x_i^2} + \color{orange}{C}}}\end{equation}
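By analogy with the DyT sketch above, a DyISRU layer might look like the following. The class name, keeping $\sqrt{d}$ explicit, and parameterizing $C$ through its logarithm (to keep it positive) are all my own assumptions:

```python
import torch
import torch.nn as nn

class DyISRU(nn.Module):
    """Sketch of DyISRU: gamma * sqrt(d) * x / sqrt(x^2 + C) + beta, element-wise."""
    def __init__(self, dim, c_init=1.0):
        super().__init__()
        self.log_c = nn.Parameter(torch.tensor(c_init).log())  # C = exp(log_c) > 0
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))
        self.dim = dim

    def forward(self, x):
        c = self.log_c.exp()
        return self.gamma * self.dim ** 0.5 * x / torch.sqrt(x ** 2 + c) + self.beta
```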

Related Work

Both $\tanh$ and ISRU can be viewed as smooth approximations of the sign function. Based on them, we can construct smooth approximations of the $\mathop{\text{clip}}$ operation, for example:

\begin{equation}\mathop{\text{clip}}(x, -t, t) = \left\{ \begin{aligned}t,&\,\,\, x > t \\ x,&\,\,\, x\in[-t,t] \\ -t,&\,\,\, x < -t\end{aligned} \right.\quad\approx\quad t\tanh\left(\frac{x}{t}\right)\triangleq \mathop{\text{softcap}}(x, t)\end{equation}

From this, we can also understand DyT as introducing a (smooth) $\mathop{\text{clip}}$ operation to prevent the explosion of forward propagation, thereby stabilizing the model.
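A quick numerical illustration of this smooth clip (the threshold and test values are arbitrary):

```python
import numpy as np

def softcap(x, t):
    # Smooth approximation of clip(x, -t, t): ~identity for |x| << t, saturates at +/-t.
    return t * np.tanh(x / t)

x = np.array([-100.0, -5.0, 0.0, 5.0, 100.0])
print(softcap(x, t=30.0))       # approaches +/-30 instead of growing without bound
print(np.clip(x, -30.0, 30.0))  # the hard clip it approximates
```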

$\mathop{\text{softcap}}$ was proposed by Google in Gemma 2. Its purpose at the time was to be applied to the Attention Logits matrix before the Softmax to prevent excessively large Logits values. However, in our actual tests, we found that although the Logits after $\mathop{\text{softcap}}$ do not explode, the Logits before $\mathop{\text{softcap}}$ still face the risk of explosion. Therefore, using $\mathop{\text{softcap}}$ to prevent Logits explosion merely shifts the problem to another source; it treats the symptoms but not the root cause.

It's unclear whether Google later realized this issue as well, but in their latest Gemma 3, they chose to remove $\mathop{\text{softcap}}$ and use QK-norm instead. Our own experiments also show that QK-norm suppresses the growth of Attention Logits more effectively. This change and its conclusion indirectly convey a pessimistic signal: softcap-style operations such as DyT are difficult to use as a full replacement for Normalization in practice.

Summary

From the perspective of gradient approximation, this article analyzes which element-wise activation functions can, to some extent, replace Normalization layers. From this we can derive DyT, as well as a new alternative, DyISRU.