By 苏剑林 | April 25, 2023
In recent weeks, I have been constantly thinking about the properties of the attention mechanism. During this process, I have gained a deeper understanding of attention and Softmax. In this article, I will briefly share two of those points:
1. Softmax attention can naturally resist certain noise perturbations;
2. Initialization problems can be intuitively understood from the perspective of information entropy.
The attention mechanism based on Softmax normalization can be written as:
\[o = \frac{\sum_{i=1}^n e^{s_i} v_i}{\sum_{i=1}^n e^{s_i}}\]

One day, I suddenly thought of a question: what happens if independent and identically distributed (i.i.d.) noise is added to $s_i$? To this end, we consider:
\[\tilde{o} = \frac{\sum_{i=1}^n e^{s_i + \epsilon_i} v_i}{\sum_{i=1}^n e^{s_i + \epsilon_i}}\]

where $\epsilon_i$ is i.i.d. noise. After a simple analysis, however, I found that the answer is "nothing much happens": the attention mechanism naturally resists this type of noise, i.e., $\tilde{o} \approx o$.
To understand this point, one only needs to realize:
\[\tilde{o} = \frac{\frac{1}{n} \sum_{i=1}^n e^{s_i + \epsilon_i} v_i}{\frac{1}{n} \sum_{i=1}^n e^{s_i + \epsilon_i}} = \frac{\mathbb{E}_i [e^{s_i + \epsilon_i} v_i]}{\mathbb{E}_i [e^{s_i + \epsilon_i}]} \approx \frac{\mathbb{E}_i [e^{s_i} v_i]\, \mathbb{E}[e^\epsilon]}{\mathbb{E}_i [e^{s_i}]\, \mathbb{E}[e^\epsilon]} = \frac{\mathbb{E}_i [e^{s_i} v_i]}{\mathbb{E}_i [e^{s_i}]} = o\]

Here $\mathbb{E}_i[\cdot]$ denotes the average over $i$. The approximate equality uses the fact that $\epsilon_i$ is independent of $s_i$ and $v_i$: when $n$ is large enough, the average of a product is approximately the product of the averages, and the common factor $\mathbb{E}[e^\epsilon]$ then cancels.
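To get a feel for this numerically, here is a minimal NumPy sketch (my own illustration, not from the original post; the sizes and the noise scale are arbitrary choices) that perturbs the attention logits with i.i.d. Gaussian noise and compares the two outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 4096, 64                  # number of positions, value dimension (arbitrary)
s = rng.normal(size=n)           # attention logits s_i
v = rng.normal(size=(n, d))      # value vectors v_i

def attend(logits, values):
    """Softmax-weighted average of the value vectors."""
    w = np.exp(logits - logits.max())   # subtract the max for numerical stability
    w /= w.sum()
    return w @ values

o = attend(s, v)

# Perturb every logit with i.i.d. Gaussian noise and recompute the output.
eps = rng.normal(scale=1.0, size=n)
o_tilde = attend(s + eps, v)

# The relative change is small, and it shrinks further as n grows.
print(np.linalg.norm(o_tilde - o) / np.linalg.norm(o))
```

Rerunning with larger $n$ (or different seeds) shows the relative change shrinking, in line with the averaging argument above.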
If we denote $p_i = e^{s_i} \big/ \sum_{j=1}^n e^{s_j}$, then $p_i$ describes a discrete probability distribution, and we can calculate its information entropy:
\[H = -\sum_{i=1}^n p_i \log p_i \in [0, \log n]\]

In "'Entropy' is Unaffordable: From Entropy and the Principle of Maximum Entropy to Maximum Entropy Models (I)", we discussed that entropy is both a measure of uncertainty and a measure of information. How are the two connected? Entropy is essentially a measure of uniformity: the more uniform the distribution, the more uncertain it is, which is why entropy measures uncertainty. On the other hand, a completely certain distribution has entropy 0, so the entropy of the current distribution is exactly the maximum amount of information we could gain in going from "uncertainty" to "complete certainty"; in this sense, entropy is also a measure of (potential) information.
We know that if the $s_i$ are initialized with too large a scale, then $p_i$ approaches a one-hot distribution, which leads to training failure due to vanishing gradients (refer to "A Brief Talk on the Initialization, Parameterization, and Standardization of Transformer"). I found that this can also be understood very intuitively from the information perspective: model training is itself a process of going from uncertainty (a random model) to certainty (a trained model), and the optimizer is responsible for "extracting" information from the random model. But the information content of a one-hot distribution is 0: the optimizer has "no profit to gain" and may even have to "pay" into it, so naturally it cannot optimize well. Therefore, we should initialize the model to be as uniform as possible, so that the amount of "extractable" information is maximized.
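As a quick numerical illustration (my own sketch, not from the original post; the sizes and scales are arbitrary), the snippet below draws logits with different standard deviations and reports the entropy of the resulting softmax distribution, together with the norm of the softmax Jacobian $\mathrm{diag}(p) - p p^{\top}$, which controls how much gradient can flow back through the attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128  # number of positions (arbitrary)

def softmax(s):
    w = np.exp(s - s.max())
    return w / w.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))  # small epsilon avoids log(0)

for sigma in [0.1, 1.0, 10.0, 100.0]:
    s = rng.normal(scale=sigma, size=n)     # logits at initialization
    p = softmax(s)
    # Jacobian of softmax: diag(p) - p p^T; it vanishes as p approaches one-hot.
    jac = np.diag(p) - np.outer(p, p)
    print(f"sigma={sigma:6.1f}  entropy={entropy(p):6.3f}  "
          f"(max {np.log(n):.3f})  ||Jacobian||={np.linalg.norm(jac):.2e}")
```

With a small initialization scale the entropy sits near its maximum $\log n$; with a large scale the distribution is essentially one-hot, and both the entropy and the Jacobian norm collapse toward zero, which is exactly the vanishing-gradient failure mode described above.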
Of course, besides ensuring that the upper bound of the entropy (the potential information) is large enough, we must also ensure that its lower bound is small enough, so that the amount of "extractable" information is as large as possible. Previously, when I introduced contrastive learning, some readers did not understand the significance of the temperature parameter; this, too, can be interpreted in terms of information. Let:
\[p_i = \frac{e^{(\cos \theta_i) / \tau}}{\sum_{j=1}^n e^{(\cos \theta_j) / \tau}}\]

If $\tau = 1$, the upper bound of the information entropy is $\log n$, but because $\cos \theta_i \in [-1, 1]$ the logits can differ by at most 2, so the distribution cannot become very sharp and the lower bound of the entropy is approximately $\log n - 0.4745$ (refer to the comment section). The amount of information that can be obtained is therefore too small, so we need to decrease $\tau$ to push the lower bound of the entropy toward 0, thereby increasing the amount of information that can be obtained.
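To make the numbers concrete, here is a small check (my own sketch, not from the original post). It restricts attention to the simple extreme family where $k$ of the cosines equal $+1$ and the remaining $n-k$ equal $-1$, and searches over $k$ for the smallest achievable entropy at several temperatures; with $\tau = 1$ the minimum stays near $\log n - 0.4745$, while a small $\tau$ lets it drop toward 0:

```python
import numpy as np

def entropy_two_level(n, k, tau):
    """Entropy of softmax(cos/tau) when k cosines are +1 and n - k are -1."""
    w_hi, w_lo = np.exp(1.0 / tau), np.exp(-1.0 / tau)
    Z = k * w_hi + (n - k) * w_lo
    p_hi, p_lo = w_hi / Z, w_lo / Z
    return -(k * p_hi * np.log(p_hi) + (n - k) * p_lo * np.log(p_lo))

n = 10000  # arbitrary
for tau in [1.0, 0.2, 0.05]:
    h_min = min(entropy_two_level(n, k, tau) for k in range(1, n))
    print(f"tau={tau:4.2f}  min entropy ~ {h_min:.4f}  "
          f"(log n = {np.log(n):.4f}, log n - 0.4745 = {np.log(n) - 0.4745:.4f})")
```

This only explores one family of configurations, so it yields an upper bound on the true minimum, but it is already enough to see the role of the temperature: only by shrinking $\tau$ can the entropy get anywhere near 0.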
This is just a quick, simple post; as you can see, the final conclusion is still "I heard that Attention and Softmax are a better match~".