A Quick Derivation of Entropy-Invariant Softmax

By 苏剑林 | April 11, 2022

In the article "From Entropy Invariance to the Scale Operation in Attention", we derived a version of the attention mechanism with entropy-invariant properties:

\begin{equation}\text{Attention}(Q,K,V) = \text{softmax}\left(\frac{\kappa \log n}{d}QK^{\top}\right)V\label{eq:a}\end{equation}

As can be seen, entropy invariance is achieved mainly by introducing a length-dependent scaling factor $\log n$ into the Softmax. The original derivation was rather cumbersome and relied on several assumptions, which made it hard to grasp intuitively. As a supplement, this article gives a more concise and direct derivation.
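
To make the formula concrete, here is a minimal NumPy sketch of attention with this scale. It assumes single-head, unbatched inputs, an illustrative $\kappa = 1$, and toy shapes; the function name and the test values are ours, not from the original article.

```python
import numpy as np

def entropy_invariant_attention(Q, K, V, kappa=1.0):
    """Softmax attention scaled by kappa * log(n) / d, as in the formula above."""
    n, d = K.shape                                   # n keys of dimension d
    scores = (kappa * np.log(n) / d) * (Q @ K.T)     # length-aware scaling
    scores -= scores.max(axis=-1, keepdims=True)     # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

# Toy usage with random queries, keys and values
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 64))
K = rng.normal(size=(16, 64))
V = rng.normal(size=(16, 64))
print(entropy_invariant_attention(Q, K, V).shape)    # (4, 64)
```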

Derivation Process

We can set aside the background of the attention mechanism and directly assume $s_1, s_2, \dots, s_n \in \mathbb{R}$, defining:

\begin{equation}p_i = \frac{e^{\lambda s_i}}{\sum_{j=1}^n e^{\lambda s_j}}\end{equation}

Obviously, this is the result of applying Softmax to $s_1, s_2, \dots, s_n$ after multiplying them by a scaling factor $\lambda$. Now let us calculate its entropy:

\begin{equation}\begin{aligned}H =&\, -\sum_{i=1}^n p_i \log p_i = \log\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i \\ =&\, \log n + \log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i} - \lambda\sum_{i=1}^n p_i s_i \end{aligned}\end{equation}
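
Up to this point everything is exact, and the decomposition is easy to confirm numerically. The sketch below just checks the identity; the values of $n$, $\lambda$ and the scores are arbitrary test choices.

```python
import numpy as np

# Check: H = log n + log((1/n) * sum_i e^{lam*s_i}) - lam * sum_i p_i * s_i
rng = np.random.default_rng(0)
n, lam = 32, 2.0
s = rng.normal(size=n)

p = np.exp(lam * s) / np.exp(lam * s).sum()
H = -(p * np.log(p)).sum()                         # entropy, computed directly
H_split = np.log(n) + np.log(np.mean(np.exp(lam * s))) - lam * (p * s).sum()
print(np.isclose(H, H_split))                      # True
```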

The term inside the first $\log$ is an "exponentiate then average" operation; we approximate it in a mean-field fashion by "average then exponentiate":

\begin{equation} \log\frac{1}{n}\sum_{i=1}^n e^{\lambda s_i}\approx \log\exp\left(\frac{1}{n}\sum_{i=1}^n \lambda s_i\right) = \lambda \bar{s} \end{equation}
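
This mean-field step is only a rough approximation: by Jensen's inequality the left side never falls below the right, and for roughly Gaussian scores the gap is about $\lambda^2\sigma^2/2$. A small numerical sanity check, with an arbitrary spread and $\lambda$ not taken from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(scale=0.5, size=32)       # scores with modest spread (arbitrary choice)
lam = 1.0

lhs = np.log(np.mean(np.exp(lam * s)))   # exponentiate then average
rhs = lam * s.mean()                     # average then exponentiate (mean field)
print(lhs - rhs, lam**2 * s.var() / 2)   # gap is roughly lam^2 * var / 2 for Gaussian-ish scores
```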

Furthermore, we know that Softmax tends to put most of its weight on the largest element (refer to "Discussion on Function Smoothing: Differentiable Approximation of Non-differentiable Functions"), so the weighted average is approximately the maximum:

\begin{equation}\lambda\sum_{i=1}^n p_i s_i \approx \lambda s_{\max}\end{equation}
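
This approximation becomes more accurate as $\lambda$ grows. A quick check, again with arbitrary scores and an illustrative $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.normal(size=32)
lam = 8.0                            # a fairly large scaling factor
p = np.exp(lam * s - lam * s.max())  # shifted for numerical stability
p /= p.sum()
print((p * s).sum(), s.max())        # the weighted average approaches the max
```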

Therefore:

\begin{equation}H\approx \log n - \lambda(s_{\max} - \bar{s})\end{equation}
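
Both approximations are crude, so this estimate should be read as capturing trends rather than exact values. A quick check with arbitrary Gaussian scores shows that the exact entropy and the estimate both decrease as $\lambda$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64
s = rng.normal(size=n)

for lam in (0.25, 0.5, 1.0):
    p = np.exp(lam * s) / np.exp(lam * s).sum()
    H_exact = -(p * np.log(p)).sum()
    H_approx = np.log(n) - lam * (s.max() - s.mean())
    print(f"lam={lam}: exact={H_exact:.2f}, approx={H_approx:.2f}")
```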

The so-called entropy invariance means that we want to eliminate the influence of the length $n$ as much as possible. Treating $s_{\max} - \bar{s}$ as roughly independent of $n$, the equation above tells us that we need $\lambda \propto \log n$. Placing this back into the context of the attention mechanism, each score has the form $s = \langle \boldsymbol{q}, \boldsymbol{k}\rangle$, whose magnitude scales with the vector dimension $d$, so we also need $\lambda \propto \frac{1}{d}$. Combining the two gives:

\begin{equation}\lambda\propto \frac{\log n}{d}\end{equation}

This is the result shown in Equation $\eqref{eq:a}$ at the beginning of the article.
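
The $n$-dependent part of this conclusion, $\lambda \propto \log n$, can be eyeballed with a toy experiment: for i.i.d. scores of fixed scale, the softmax entropy with a fixed $\lambda$ grows steadily with $n$, while with $\lambda = \kappa\log n$ it varies over a much narrower range. The Gaussian score model and $\kappa = 0.5$ below are arbitrary choices for illustration, not a claim about trained attention distributions.

```python
import numpy as np

def softmax_entropy(s, lam):
    """Entropy of softmax(lam * s)."""
    z = lam * s - (lam * s).max()        # shift for numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(0)
kappa = 0.5
for n in (64, 256, 1024, 4096):
    s = rng.normal(size=n)               # i.i.d. scores of fixed (unit) scale
    H_fixed = softmax_entropy(s, lam=1.0)
    H_logn = softmax_entropy(s, lam=kappa * np.log(n))
    print(f"n={n:5d}: fixed-scale H={H_fixed:.2f}, log-n-scale H={H_logn:.2f}")
```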

Article Summary

This article has given a simpler and more direct derivation of the previously proposed "Entropy-Invariant Softmax".