By 苏剑林 | January 20, 2026
From data whitening as a preprocessing step in the machine learning era to the various normalization methods of the deep learning era, such as BatchNorm, InstanceNorm, LayerNorm, and RMSNorm, all of these essentially reflect our preference for "Isotropy." Why do we lean towards isotropic features? What are the actual benefits? Many answers can be given, such as scale alignment, redundancy reduction, and decorrelation, but most of them remain at the level of surface intuition.
Recently, while reading the paper "The Affine Divergence: Aligning Activation Updates Beyond Normalisation", I gained a new understanding of this question from the perspective of optimization. I believe it comes fairly close to the essence of the matter, so I am writing it down to share and discuss with everyone.
We start from the simplest linear layer:
$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W}$$
where $\boldsymbol{X} \in \mathbb{R}^{b \times d_{in}}$ is the input of the current layer, $\boldsymbol{W} \in \mathbb{R}^{d_{in} \times d_{out}}$ is the weight, and $\boldsymbol{Y} \in \mathbb{R}^{b \times d_{out}}$ is the output. If we denote the loss function as $\mathcal{L}(\boldsymbol{Y}) = \mathcal{L}(\boldsymbol{X}\boldsymbol{W})$, then we have:
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$$
Taking gradient descent as an example, the update rule is:
$$\boldsymbol{W} \leftarrow \boldsymbol{W} - \eta \frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{W} - \eta \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$$
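To make the chain-rule identity above concrete, here is a minimal numpy sketch, assuming a toy quadratic loss $\mathcal{L} = \tfrac{1}{2}\|\boldsymbol{X}\boldsymbol{W} - \boldsymbol{T}\|_F^2$ and arbitrary small sizes of my own choosing; it checks $\frac{\partial \mathcal{L}}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$ against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d_in, d_out = 32, 8, 4                       # toy sizes (assumption)
X = rng.standard_normal((b, d_in))
W = rng.standard_normal((d_in, d_out))
T = rng.standard_normal((b, d_out))             # arbitrary regression targets

loss = lambda W_: 0.5 * np.sum((X @ W_ - T) ** 2)

Y = X @ W
dL_dY = Y - T                                   # gradient of this loss w.r.t. Y
dL_dW = X.T @ dL_dY                             # the claimed gradient w.r.t. W

# Finite-difference check of one entry of dL/dW.
eps = 1e-6
W_pert = W.copy()
W_pert[0, 0] += eps
fd = (loss(W_pert) - loss(W)) / eps
print(abs(fd - dL_dW[0, 0]) < 1e-3)             # True
```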
The basic principle of gradient descent is the widely known statement that "the negative gradient direction is the direction of fastest descent for the loss," but this conclusion rests on premises. The most critical one is that the chosen metric is the Euclidean norm. If we change the norm, the direction of steepest descent also changes. We have already discussed this in articles such as "Muon Sequel: Why do we choose to try Muon?" and "Steepest Descent on Manifolds: 1. SGD + Hypersphere".
This article mainly focuses on another premise that is not so easily noticed: the perspective, or standpoint.
Assuming we agree that "the negative gradient direction is the direction of fastest descent of the loss," the question is: "Whose" gradient? Some readers might say it's naturally the gradient of the parameters; this is indeed the standard answer, but not necessarily the optimal answer. Parameters are essentially a byproduct of the model. What we actually care about is not what the parameters are, but whether the model's inputs and outputs constitute the desired mapping.
Therefore, changes in input and output features are the objects we care more about. If we start from the perspective of features, the conclusion will be different. Specifically, as the parameters change from $\boldsymbol{W}$ to $\boldsymbol{W} - \eta \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$, the variation of the output features $\boldsymbol{Y}$ is:
$$\Delta \boldsymbol{Y} = \boldsymbol{X}(\boldsymbol{W} - \eta \boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}) - \boldsymbol{X}\boldsymbol{W} = -\eta \boldsymbol{X}\boldsymbol{X}^\top \frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$$
According to "the negative gradient direction is the direction of fastest descent for the loss," from the standpoint of $\boldsymbol{Y}$, if we want the change in $\boldsymbol{Y}$ to reduce the loss as fast as possible, it should satisfy $\Delta \boldsymbol{Y} \propto -\frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$. However, an extra Gram matrix $\boldsymbol{X}\boldsymbol{X}^\top$ now appears, which means $\boldsymbol{Y}$ is not moving along its direction of steepest descent.
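A quick numerical check of this identity (a sketch with assumed toy sizes and a random stand-in for $\frac{\partial \mathcal{L}}{\partial \boldsymbol{Y}}$):

```python
import numpy as np

rng = np.random.default_rng(1)
b, d_in, d_out, eta = 32, 8, 4, 0.1             # assumed toy sizes and learning rate
X = rng.standard_normal((b, d_in))
W = rng.standard_normal((d_in, d_out))
dL_dY = rng.standard_normal((b, d_out))         # random stand-in for the output gradient

W_new = W - eta * X.T @ dL_dY                   # one gradient-descent step on W
delta_Y = X @ W_new - X @ W                     # induced change in the output features
print(np.allclose(delta_Y, -eta * X @ X.T @ dL_dY))   # True: the Gram matrix appears
```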
A naive thought is: it would be great if $\boldsymbol{X}\boldsymbol{X}^\top$ were simply the identity matrix (or a scalar multiple of it). We know $\boldsymbol{X} \in \mathbb{R}^{b \times d_{in}}$; if $b \le d_{in}$, then $\boldsymbol{X}\boldsymbol{X}^\top = \boldsymbol{I}$ would mean that the $b$ row vectors of $\boldsymbol{X}$ form an orthonormal set. But in practice we usually have $b > d_{in}$, so $\boldsymbol{X}\boldsymbol{X}^\top = \boldsymbol{I}$ cannot hold exactly.
In this case, we can only hope that these $b$ vectors are distributed as uniformly as possible on the unit hypersphere, so that $\boldsymbol{X}\boldsymbol{X}^\top \approx \boldsymbol{I}$ holds approximately. This is the rationale for Isotropy. That is to say:
If input features can satisfy isotropy, then the steepest descent on parameters can simultaneously be approximately equal to the steepest descent on features, achieving "multiple benefits at once" and significantly improving the learning efficiency of the model.
We can also show that if a random vector follows a $d_{in}$-dimensional standard normal distribution, then besides being isotropic, its norm is highly concentrated around $\sqrt{d_{in}}$, i.e., it lies approximately on a hypersphere of radius $\sqrt{d_{in}}$. Conversely, if the input $\boldsymbol{X}$ can be standardized to have zero mean and unit covariance, we may take $\boldsymbol{X}\boldsymbol{X}^\top \approx d_{in}\boldsymbol{I}$ to hold approximately (a scalar multiple of the identity is just as good, since it merely rescales the effective learning rate), and this standardization is exactly whitening.
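As a rough numerical illustration of both concentration claims (the sizes below are my own, chosen with a large $d_{in}$ so the effect is visible):

```python
import numpy as np

rng = np.random.default_rng(2)
b, d_in = 64, 4096                              # assumed sizes
X = rng.standard_normal((b, d_in))

row_norms = np.linalg.norm(X, axis=1)
print(row_norms.mean() / np.sqrt(d_in))         # ~1.0: norms concentrate around sqrt(d_in)

G = X @ X.T
diag_mean = np.diag(G).mean()                   # ~d_in
offdiag_mean = np.abs(G - np.diag(np.diag(G))).sum() / (b * (b - 1))
print(diag_mean / d_in, offdiag_mean / d_in)    # ~1.0 and ~0.01: X X^T is close to d_in * I
```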
In addition to whitening data beforehand, we can also introduce some normalization operations in the middle of the model to make intermediate features approximately satisfy the desired properties. More specifically, we try to find a $d_{in} \times d_{in}$ transformation matrix $\boldsymbol{A}$ such that the transformed features $\boldsymbol{X}\boldsymbol{A}$ satisfy $(\boldsymbol{X}\boldsymbol{A})(\boldsymbol{X}\boldsymbol{A})^\top \approx \boldsymbol{I}$ as much as possible, which is:
$$\min_{\boldsymbol{A}} \|\boldsymbol{X}\boldsymbol{A}\boldsymbol{A}^\top \boldsymbol{X}^\top - \boldsymbol{I}\|_F^2$$
The solution to this problem can be expressed using the pseudoinverse:
$$\boldsymbol{A}\boldsymbol{A}^{\top} = \boldsymbol{X}^{\dagger}(\boldsymbol{X}^{\top})^{\dagger} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1}$$
assuming here that $\boldsymbol{X}^{\top}\boldsymbol{X}$ is invertible. According to the above equation, a feasible solution is $\boldsymbol{A} = (\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2}$, and the corresponding transformation is:
$$\boldsymbol{X}(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2}$$
This is exactly whitening without centering. Considering that computing $(\boldsymbol{X}^{\top}\boldsymbol{X})^{-1/2}$ is expensive, if we use a diagonal approximation, it corresponds to standardizing each dimension individually, which, depending on the granularity, corresponds to operations like BatchNorm and InstanceNorm. If we care more about the "hypersphere," we can instead standardize the magnitude of each sample individually, which corresponds to operations like LayerNorm and RMSNorm.
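Below is a small sketch of the uncentered whitening transform and its diagonal approximation; the synthetic correlated data and sizes are assumptions, and the inverse matrix square root is computed via an eigendecomposition of $\boldsymbol{X}^{\top}\boldsymbol{X}$.

```python
import numpy as np

rng = np.random.default_rng(3)
b, d_in = 256, 16                               # assumed toy sizes
# Correlated features, so that plain per-dimension scaling is not enough.
X = rng.standard_normal((b, d_in)) @ rng.standard_normal((d_in, d_in))

# Inverse matrix square root of the symmetric positive-definite X^T X.
evals, evecs = np.linalg.eigh(X.T @ X)
A = evecs @ np.diag(evals ** -0.5) @ evecs.T    # A = (X^T X)^{-1/2}

XA = X @ A                                      # whitening without centering
print(np.allclose(XA.T @ XA, np.eye(d_in), atol=1e-6))  # True: uncentered second moment is I

# Diagonal approximation: rescale each dimension by its own norm only
# (in the spirit of BatchNorm without centering); off-diagonals stay uncorrected.
A_diag = np.diag(np.diag(X.T @ X) ** -0.5)
XD = X @ A_diag
print(np.allclose(np.diag(XD.T @ XD), 1.0))     # True, but only on the diagonal
```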
Interestingly, the conclusion that "if input features satisfy isotropy, then steepest descent on parameters is approximately equal to steepest descent on features" applies not only to SGD but also to Muon. Consider Muon without momentum; its update rule is:
$$\newcommand{\msign}{\mathop{\text{msign}}}\boldsymbol{W}\quad\leftarrow\quad \boldsymbol{W} - \eta\msign\left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{W}}\right) = \boldsymbol{W} - \eta \msign\left(\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)$$
According to $\msign(\boldsymbol{M}) = \boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}$, we have:
$$\Delta \boldsymbol{Y} = - \eta \boldsymbol{X} \msign\left(\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right) = - \eta \boldsymbol{X} \boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\vphantom{\bigg\|}^{\top}\boldsymbol{X}\boldsymbol{X}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)^{-1/2}$$
From this it can be seen that if $\boldsymbol{X}\boldsymbol{X}^{\top} \approx \boldsymbol{I}$, then:
$$\Delta \boldsymbol{Y} \approx - \eta \frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}} \left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\vphantom{\bigg\|}^{\top}\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)^{-1/2} = -\eta\msign\left(\frac{\partial \mathcal{L}}{\partial\boldsymbol{Y}}\right)$$
We know that Muon is steepest descent under the spectral norm. The above equation thus implies that if the input features satisfy isotropy, then spectral-norm steepest descent on the parameters is simultaneously, approximately, spectral-norm steepest descent on the features. Very elegant! For other optimizers such as SignSGD, however, no analogous property holds; this difference may be one of the reasons behind Muon's superiority.
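This claim can also be sanity-checked numerically. The sketch below uses assumed toy sizes, implements $\text{msign}$ via the SVD (which coincides with $\boldsymbol{M}(\boldsymbol{M}^{\top}\boldsymbol{M})^{-1/2}$ for full-column-rank $\boldsymbol{M}$), and constructs an $\boldsymbol{X}$ with exactly orthonormal rows so that the approximation becomes an equality.

```python
import numpy as np

def msign(M):
    # msign(M) = U V^T from the SVD M = U S V^T; this equals M (M^T M)^{-1/2}
    # whenever M has full column rank.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(4)
b, d_in, d_out = 16, 64, 8                      # b <= d_in so that X X^T = I can hold exactly
Q, _ = np.linalg.qr(rng.standard_normal((d_in, b)))
X = Q.T                                         # rows of X are orthonormal: X X^T = I
G = rng.standard_normal((b, d_out))             # stand-in for dL/dY

lhs = X @ msign(X.T @ G)                        # feature-space effect of the Muon step on W
rhs = msign(G)                                  # spectral-norm steepest descent direction on Y
print(np.allclose(lhs, rhs))                    # True
```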
In this article, we discussed the question of when steepest descent at the parameter level is simultaneously steepest descent at the feature level. The answer is precisely the "Isotropy" of the title. From this, we obtain an explanation of why we favor isotropy: it synchronizes steepest descent at the two levels, thereby improving training efficiency.