By 苏剑林 | December 21, 2021
The most widely used attention mechanism in current Transformer architectures is the "Scaled Dot-Product Attention." The term "Scaled" refers to dividing the product $QK^{\top}$ by $\sqrt{d}$ before applying the Softmax (without loss of generality, we assume $Q,K,V\in\mathbb{R}^{n\times d}$):
\begin{equation}Attention(Q,K,V) = softmax\left(\frac{QK^{\top}}{\sqrt{d}}\right)V\label{eq:std}\end{equation}
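For concreteness, here is a minimal NumPy sketch of Eq $\eqref{eq:std}$ (single head, no masking; the function and variable names are my own, chosen only for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_o(Q, K, V):
    """Standard Scaled Dot-Product Attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n, n) logits
    return softmax(scores, axis=-1) @ V  # (n, d) outputs
```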
In "Brief Discussion on Initialization, Parameterization, and Standardization of Transformer," we previously explained the reasoning behind dividing by $\sqrt{d}$. In this article, the author will understand this scaling operation from the perspective of "entropy invariance" and derive a new scaling factor. Experiments in MLM (Masked Language Model) show that the new scaling factor possesses better length extrapolation performance.
Entropy Invariance
We rewrite the general Scaled Dot-Product Attention as:
\begin{equation}\boldsymbol{o}_i = \sum_{j=1}^n a_{i,j}\boldsymbol{v}_j,\quad a_{i,j}=\frac{e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}}{\sum\limits_{j'=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_{j'}}}\end{equation}
Where $\lambda$ is the scaling factor, which is independent of $\boldsymbol{q}_i$ and $\boldsymbol{k}_j$, but can in principle be related to parameters like sequence length $n$ and dimension $d$. Currently, the mainstream choice is $\lambda=1/\sqrt{d}$.
This article proposes a viewpoint: To make the model generalize better to unknown lengths, the design of the Attention mechanism should make $a_{i,j}$ satisfy entropy invariance as much as possible.
How should we understand this? First, generalizing to unknown lengths means that the model performs well even when the sequence length at inference time is inconsistent with that during training—for example, training with $n=64$ and extrapolating to $n=128, 256$ for testing. We know that models using relative position encodings like RoPE have inherently good length extrapolation, but we can still enhance this capability through better design, and entropy invariance is one such design.
Specifically, $a_{i,j}$ can be viewed as a conditional distribution where $i$ is the condition and $j$ is the random variable. Its entropy is:
\begin{equation}\mathcal{H}_i = -\sum_{j=1}^n a_{i,j}\log a_{i,j}\end{equation}
Entropy invariance means that $\mathcal{H}_i$ should be insensitive to the sequence length $n$. More specifically, if we add a few more tokens on top of existing ones, the newly calculated $a_{i,j}$ values will naturally change, but we hope that $\mathcal{H}_i$ does not change significantly.
Why do we want entropy to remain invariant? We know that entropy is a measure of uncertainty (refer to "Can't Afford Entropy: From Entropy and the Principle of Maximum Entropy to Maximum Entropy Models (I)"). From another perspective, we can view uncertainty as the "focus degree" of attention: if the entropy is 0, then the attention is focused on a single token; if the entropy is $\log n$, the attention is uniformly distributed across all tokens. By requiring entropy to be invariant, we hope that after introducing new tokens, the existing tokens can still focus on the original tokens in the same way, rather than having the new tokens "dilute" the original attention excessively, which would cause the summation result to change significantly.
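To make the quantity concrete, the sketch below (reusing the NumPy `softmax` helper from the earlier snippet; the names are again my own) computes $\mathcal{H}_i$ for every attention row, which is exactly the entropy we would like to stay roughly stable as $n$ grows:

```python
def attention_entropy(Q, K, lam):
    """Entropy H_i of each attention row a_{i,:} = softmax_j(lam * q_i . k_j)."""
    A = softmax(lam * (Q @ K.T), axis=-1)          # (n, n) attention weights
    return -(A * np.log(A + 1e-12)).sum(axis=-1)   # (n,) entropies, in nats

# Rough illustration with random inputs: with a fixed scale lam = 1/sqrt(d),
# the average H_i typically drifts upward as the sequence length n increases.
d = 64
for n in (64, 128, 256):
    Q, K = np.random.randn(n, d), np.random.randn(n, d)
    print(n, attention_entropy(Q, K, lam=1.0 / np.sqrt(d)).mean())
```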
A New Factor
Based on entropy invariance and several reasonable assumptions, we can derive a new scaling factor, resulting in a new version of Scaled Dot-Product Attention:
\begin{equation}Attention(Q,K,V) = softmax\left(\frac{\kappa \log n}{d}QK^{\top}\right)V\label{eq:ei}\end{equation}
Here $\kappa$ is a hyperparameter independent of $n$ and $d$. The detailed derivation will be introduced in the next section. For convenience, we will refer to the conventional Scaled Dot-Product Attention described in Eq $\eqref{eq:std}$ as "Attention-O" (Original), and the variant described in Eq $\eqref{eq:ei}$ and the following Eq $\eqref{eq:ei2}$ as "Attention-E" (Entropy Invariance).
Some readers might be dissatisfied with the introduction of a new parameter. In fact, this is easy to resolve. We know that the current mainstream pre-training length is 512, so we can assume that mainstream parameters are tuned specifically for $n=512$. Therefore, when $n=512$, the above formula should degenerate into standard Scaled Dot-Product Attention, i.e., $\frac{\kappa \log 512}{d}=\frac{1}{\sqrt{d}}$, which yields $\kappa = \frac{\sqrt{d}}{\log 512}$. Substituting this back and simplifying, we get:
\begin{equation}Attention(Q,K,V) = softmax\left(\frac{\log_{512} n}{\sqrt{d}}QK^{\top}\right)V\label{eq:ei2}\end{equation}
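In detail, the simplification is just $\frac{\kappa \log n}{d} = \frac{\sqrt{d}}{\log 512}\cdot\frac{\log n}{d} = \frac{\log_{512} n}{\sqrt{d}}$, where the last step is the change-of-base identity, so the base of the logarithm does not matter as long as it is used consistently.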
This removes the hyperparameter $\kappa$. The following experiments also use this version.
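As a reference, here is a minimal sketch of Eq $\eqref{eq:ei2}$ in the same NumPy style as before (the `train_len` argument is my own name for the pretraining length of 512; since the base cancels, $\log_{512} n$ is computed as $\ln n / \ln 512$):

```python
def attention_e(Q, K, V, train_len=512):
    """Entropy-invariant variant: softmax((log_{train_len}(n) / sqrt(d)) * Q K^T) V."""
    n, d = Q.shape
    scale = np.log(n) / (np.log(train_len) * np.sqrt(d))  # log_512(n) / sqrt(d)
    return softmax(scale * (Q @ K.T), axis=-1) @ V
```

At $n=\text{train\_len}$ this reduces to the usual $1/\sqrt{d}$ scaling, so Attention-E and Attention-O coincide at the calibration length.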
To verify whether this modification truly improves the extrapolation effect of the Transformer as expected, I trained two small versions of RoFormer using Attention-O and Attention-E respectively. The training task was MLM, the training sequence length was 64, and the MLM accuracy was compared across different validation set lengths. The results are as follows:
Length Extrapolation Experiment for Attention

| MLM acc. (%) | $n=64$ | $n=128$ | $n=256$ | $n=512$ | $n=1024$ |
|---|---|---|---|---|---|
| Attention-O | 43.27 | 36.53 | 23.02 | 15.12 | 11.54 |
| Attention-E | 43.11 | 41.17 | 34.04 | 20.15 | 13.58 |
From the experimental results, it can be seen that when the test length matches the training length ($n=64$), the performance of Attention-O and Attention-E is very similar. However, when extrapolating to larger test lengths, a clear gap emerges. For example, at $n=256$, Attention-E's accuracy is more than 10 percentage points higher than Attention-O's, which is a significant difference.
Derivation Process
In this section, we introduce the derivation of Eq $\eqref{eq:ei}$. In fact, the derivation process and assumptions are almost identical to those in "Principle of Minimum Entropy (VI): How to Choose the Dimension of Word Embeddings?".
First, substituting the expression for $a_{i,j}$ gives us:
\begin{equation}\mathcal{H}_i = -\sum_{j=1}^n a_{i,j}\log a_{i,j}=\log \sum_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j} - \frac{\sum\limits_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}(\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j)}{\sum\limits_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}}\end{equation}
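Here both terms come from $\log a_{i,j} = \lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j - \log Z_i$ with $Z_i = \sum\limits_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}$: the entropy is $\log Z_i$ minus the average of $\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j$ under the distribution $a_{i,j}$.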
Note that we only need to make a semi-quantitative estimation to determine a suitable $\lambda$ to offset part of the length's influence; making entropy completely independent of length is impossible. Thus, we can make some assumptions, such as assuming $\boldsymbol{k}_j$ is a random variable, allowing us to write:
\begin{equation}\sum_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j} = n\times \frac{1}{n}\sum_{j=1}^n e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}\approx n\,\mathbb{E}_j[e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}]\end{equation}
Replacing all summations with similar approximations, we get:
\begin{equation}\mathcal{H}_i \approx \log n + \log \mathbb{E}_j[e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}] - \frac{\lambda\,\mathbb{E}_j[e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}(\boldsymbol{q}_i\cdot \boldsymbol{k}_j)]}{\mathbb{E}_j[e^{\lambda \boldsymbol{q}_i\cdot \boldsymbol{k}_j}]} \end{equation}
Notice that, in general, $\boldsymbol{q}_i, \boldsymbol{k}_j$ are obtained by applying a Dense layer to Layer Norm outputs, and Dense layers act approximately as orthogonal transformations (refer to "Understanding Model Parameter Initialization Strategies from a Geometric Perspective"). Therefore, we approximately assume that $\boldsymbol{q}_i, \boldsymbol{k}_j$ are vectors with norm $\sqrt{d}$, so that $\boldsymbol{q}_i \cdot \boldsymbol{k}_j = d \cos(\boldsymbol{q}_i, \boldsymbol{k}_j)$. Furthermore, assuming $\boldsymbol{k}_j$ is uniformly distributed on a sphere of radius $\sqrt{d}$, the expectation over $\boldsymbol{k}_j$ can be transformed into an expectation over the angle between $\boldsymbol{q}_i$ and $\boldsymbol{k}_j$:
\begin{equation}\mathcal{H}_i \approx \log n + \log \mathbb{E}_{\theta}[e^{\lambda d \cos\theta}] - \frac{\lambda d\,\mathbb{E}_{\theta}[e^{\lambda d \cos\theta}\cos\theta]}{\mathbb{E}_{\theta}[e^{\lambda d \cos\theta}]} \end{equation}
Where $\theta$ follows the distribution of the angle between two random vectors in $d$-dimensional space (with density $p(\theta)\propto \sin^{d-2}\theta$ for $\theta\in[0,\pi]$). Next, following the "Approximate Estimation" section of "Principle of Minimum Entropy (VI)", we can use Laplace's method to obtain:
\begin{equation}\mathcal{H}_i \approx \log n - 0.24\lambda d + \mathcal{O}(1) \end{equation}
Therefore, to offset the impact of sequence length $n$, we let $\log n - 0.24\lambda d = 0$, leading to $\lambda = \log n / (0.24 d)$. Since this is only an estimate, there is no need to keep the coefficient $0.24$; instead, it is better to introduce the hyperparameter $\kappa$ such that:
\begin{equation}\lambda = \frac{\kappa\log n}{d}\end{equation}
This corresponds to Eq $\eqref{eq:ei}$.
Related Results
While reading submissions for ACL 2022, I discovered a paper titled "Overcoming a Theoretical Limitation of Self-Attention" which provided a similar result (Eq 1 in Section 4.3 of the paper):
\begin{equation}Attention(Q,K,V) = softmax\left(\frac{\log n}{\sqrt{d}}QK^{\top}\right)V\end{equation}
However, that paper did not offer a deeper theoretical analysis; instead, it constructed two special cases to test Attention's behavior, found that multiplying the scaling factor by $\log n$ helps with length generalization, and proposed the factor on that basis.
However, it is clear that if $\log$ is taken to the natural base, as is conventional, the above formula is not very reasonable: when $n$ is large the scaling factor becomes too large, saturating the Softmax and causing severe gradient vanishing (for example, at $n=512$ the factor is already $\ln 512/\sqrt{d}\approx 6.2/\sqrt{d}$, more than six times the usual $1/\sqrt{d}$). It seems that the paper only experimented on machine translation with sequences of length around $n=20$, so the gradient vanishing problem did not surface.
Conclusion
This article derives the scaling operation in Scaled Dot-Product Attention from the perspective of entropy invariance, arriving at a new scaling factor. Preliminary experiments show that the new scaling factor leaves performance at the training length essentially unchanged while extrapolating noticeably better to longer sequences.