Entropy Normalization of Probability Distributions

By 苏剑林 | December 24, 2021

In the previous article "From Entropy Invariance to Attention Scaling", we derived a new Attention Scale from the perspective of entropy invariance. Experiments showed that this new scale, which maintains entropy invariance, indeed results in better extrapolation performance for Attention. At that point, a natural question occurred to me:

Is there an operation similar to L2 Normalization that can directly transform a probability distribution such that it maintains its original distribution characteristics while setting its entropy to a specified value?

I searched around but found no similar research. So, I attempted to derive it myself and obtained a basically satisfactory result, which I refer to as "Entropy Normalization." I am recording it here for readers who may find it useful.

Power Transformation

First, assume an $n$-dimensional distribution $(p_1, p_2, \cdots, p_n)$. Its entropy is defined as:

\begin{equation}\mathcal{H} = -\sum_i p_i \log p_i = \mathbb{E}[-\log p_i]\end{equation}

Since $p_i \in [0,1]$, we have $-p_i \log p_i \geq 0$ (with the convention $0 \log 0 = 0$), and thus $\mathcal{H} \geq 0$. When one $p_i$ is 1 and the others are 0 (one-hot), it reaches its minimum value of 0. Furthermore, it can be proven that when all $p_i$ equal $1/n$, $\mathcal{H}$ reaches its maximum value of $\log n$. Therefore, the range of $\mathcal{H}$ is $[0, \log n]$.
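As a quick numerical sanity check (a minimal NumPy sketch added here, not part of the original article), the two extremes can be verified directly:

import numpy as np

def entropy(p):
    p = p[p > 0]  # drop zero entries, treating 0*log(0) as 0
    return -(p * np.log(p)).sum()

one_hot = np.array([1., 0., 0., 0., 0.])
uniform = np.ones(5) / 5
print(entropy(one_hot), entropy(uniform))  # 0.0 and log(5) ~ 1.609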

Thus, we first need to find a transformation that preserves the main characteristics of the distribution while being able to vary its entropy over the full range from $0$ to $\log n$. Here, we choose the power transformation:

\begin{equation}p_i\quad\to\quad \tilde{p}_i = \frac{p_i^{\gamma}}{\sum\limits_i p_i^{\gamma}}\end{equation}

One reason for choosing the power transformation is that it preserves the monotonicity of the distribution; that is, if $p_i > p_j$, then $\tilde{p}_i > \tilde{p}_j$. I believe this is one of the important properties a distribution needs to maintain. Furthermore, when all $p_i$ are non-zero and mutually distinct, the power transformation can indeed sweep the entropy over the full range from $0$ to $\log n$. Without loss of generality, assume $1 > p_1 > p_2 > \cdots > p_n > 0$. Clearly, when $\gamma = 0$, $\tilde{p}_i = 1/n$ and the entropy attains its maximum value $\log n$. When $\gamma \to \infty$:

\begin{equation}\tilde{p}_1 = \lim_{\gamma\to\infty}\frac{p_1^{\gamma}}{\sum\limits_i p_i^{\gamma}} = \lim_{\gamma\to\infty}\frac{1}{1 + \sum\limits_{i > 1} (p_i/p_1)^{\gamma}}=1\end{equation}

This results in a one-hot distribution $(1, 0, \cdots, 0)$, with a corresponding entropy of 0. It can be further proven by taking the derivative that the entropy is monotonically decreasing with respect to $\gamma$. Therefore, as $\gamma$ increases from $0$ to $\infty$, the entropy decreases from $\log n$ to $0$.
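To make this concrete, here is a small NumPy sketch (added for illustration, not from the original article) that sweeps $\gamma$ and shows the entropy falling from $\log n$ toward $0$:

import numpy as np

np.random.seed(0)
p = np.random.random(10)
p /= p.sum()  # a random 10-dimensional distribution

for gamma in [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]:
    q = p**gamma
    q /= q.sum()  # power transformation
    print(gamma, -(q * np.log(q)).sum())  # entropy decreases monotonically in gamma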

Iterative Solution

Having established that the power transformation is a viable tool, we move on to the solving stage: for any specified target $\mathcal{H}^* \in (0, \log n)$, we need to find the $\gamma$ such that the entropy of the transformed distribution equals $\mathcal{H}^*$.

First, we write out:

\begin{equation}\mathcal{H}_{\gamma} = -\sum_i\frac{p_i^{\gamma}}{\sum\limits_i p_i^{\gamma}}\log \frac{p_i^{\gamma}}{\sum\limits_i p_i^{\gamma}}=\log\sum_i p_i^{\gamma} - \frac{\gamma\sum\limits_i p_i^{\gamma}\log p_i}{\sum\limits_i p_i^{\gamma}}\end{equation}

The complexity of the rightmost expression suggests that a closed-form solution for $\gamma$ is unlikely, so we look for an iterative algorithm instead.

We perform a first-order expansion around $\gamma=1$ (mainly using $p_i^{\gamma} \approx p_i + (\gamma-1)p_i\log p_i$):

\begin{equation} \begin{aligned} \mathcal{H}_{\gamma} \approx &\, -\sum_i p_i\log p_i + \left(\left(\sum_i p_i\log p_i\right)^2-\sum_i p_i\left(\log p_i\right)^2\right)(\gamma - 1)\\ =&\, \mathcal{H}_1 + \left(\mathcal{H}_1^2-\mathbb{E}\left[\left(\log p_i\right)^2\right]\right)(\gamma - 1) \end{aligned} \end{equation}
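For completeness, the first-order coefficient can also be checked by direct differentiation (the shorthand $Z(\gamma)=\sum_i p_i^{\gamma}$ and $A(\gamma)=\sum_i p_i^{\gamma}\log p_i$ is introduced here for convenience). Since $\mathcal{H}_{\gamma} = \log Z - \gamma A/Z$, and at $\gamma=1$ we have $Z=1$, $Z'=A=-\mathcal{H}_1$ and $A'=\mathbb{E}\left[\left(\log p_i\right)^2\right]$, it follows that

\begin{equation}\frac{d\mathcal{H}_{\gamma}}{d\gamma}\bigg|_{\gamma=1} = \left(\frac{Z'}{Z} - \frac{A}{Z} - \gamma\,\frac{A'Z - AZ'}{Z^2}\right)\bigg|_{\gamma=1} = -\mathcal{H}_1 + \mathcal{H}_1 - \left(\mathbb{E}\left[\left(\log p_i\right)^2\right] - \mathcal{H}_1^2\right) = \mathcal{H}_1^2 - \mathbb{E}\left[\left(\log p_i\right)^2\right]\end{equation}

which matches the coefficient of $(\gamma - 1)$ above.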

Then:

\begin{equation}\gamma \approx 1 + \frac{\mathcal{H}_{\gamma}-\mathcal{H}_1}{\mathcal{H}_1^2-\mathbb{E}\left[\left(\log p_i\right)^2\right]}\end{equation}

Based on this result, we start from the original distribution and repeatedly apply the following updates, each step re-linearizing around the current distribution (i.e. around $\gamma=1$), to solve for the final distribution:

\begin{equation} \mathcal{H}\leftarrow -\sum_i p_i \log p_i,\quad \gamma \leftarrow 1 + \frac{\mathcal{H}^*-\mathcal{H}}{\mathcal{H}^2-\mathbb{E}\left[\left(\log p_i\right)^2\right]},\quad p_i \leftarrow \frac{p_i^{\gamma}}{\sum\limits_i p_i^{\gamma}} \end{equation}

This is effectively Newton's method for solving a non-linear equation. In experiments, 3 to 4 iterations were found to converge well; if the goal is only to roughly control the range of the entropy, 1 to 2 iterations suffice.

Numpy reference code:

import numpy as np

p = np.random.random(100)
p /= p.sum()  # Simulate a distribution
H_f = np.log(30)  # Desired (target) entropy

for i in range(10):  # 3~4 iterations are usually enough
    H = -(p * np.log(p)).sum()  # Current entropy
    gamma = 1 + (H_f - H) / (H**2 - (p * np.log(p)**2).sum())  # Newton step for gamma
    p = p**gamma  # Power transformation
    p /= p.sum()  # Renormalize
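
After the loop, the entropy of p can be checked against the target (a small verification added here, not part of the original snippet):

print(-(p * np.log(p)).sum(), H_f)  # the two values should agree closely after a few iterations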

Application Ideas

This article mainly introduces the concept of "Entropy Normalization" because I found it interesting and wanted to derive it; as for specific, well-developed application cases, I have not fully worked them out yet.

Lower entropy means the probability is concentrated in a few positions; in other words, the probabilities at other positions are closer to zero. Therefore, to some extent, entropy is a measure of the sparsity of a probability distribution. If we wish to obtain sparser prediction results, we can control this via entropy normalization. On the other hand, a sparser distribution also means the model is more likely to suffer from vanishing gradients. Thus, conversely, we could use entropy normalization to ensure the entropy does not become too small, thereby alleviating the vanishing gradient problem.
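As a sketch of how this could be packaged (the helper name entropy_normalize and its interface are my own hypothetical choices, not from the original article), one could wrap the iteration above and apply it, for example, to a row of attention weights so that its entropy never drops below a chosen floor:

import numpy as np

def entropy_normalize(p, target_entropy, n_iters=4, eps=1e-12):
    # Hypothetical helper: push the entropy of distribution p to target_entropy
    # using the power-transformation Newton iteration described above.
    # (Assumes p is not exactly uniform, otherwise the denominator below vanishes.)
    p = np.clip(p, eps, None)
    p = p / p.sum()
    for _ in range(n_iters):
        log_p = np.log(p)
        H = -(p * log_p).sum()
        gamma = 1 + (target_entropy - H) / (H**2 - (p * log_p**2).sum())
        p = p**gamma
        p = p / p.sum()
    return p

# Example: force an attention row to keep at least half the maximum entropy,
# so it cannot collapse to a near one-hot (and thus near zero-gradient) pattern.
attn_row = np.random.dirichlet(np.ones(64))
floor = 0.5 * np.log(64)
if -(attn_row * np.log(attn_row)).sum() < floor:
    attn_row = entropy_normalize(attn_row, floor)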

Speaking of sparsity, work such as Sparsemax and Sparse Softmax (the latter an idea of my own) comes to mind. Sparsemax achieves sparsity by treating entropy as a penalty term, while Sparse Softmax introduces sparsity through direct truncation; both show better interpretability or performance in certain scenarios. Would the sparsity brought about directly by entropy normalization be similarly effective? This might be a question worth exploring.

Additionally, in the random sampling of auto-regressive models, we often use top-$k$ or top-$p$ truncation. These truncations essentially reduce the entropy of the distribution. Accordingly, we could use entropy normalization to keep the entropy of the distribution consistent at every sampling step, potentially replacing top-$k$ or top-$p$ sampling. This is another possible application.
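A rough sketch of that idea (untested, with hypothetical names; it reuses the same iteration as the reference code above): at each decoding step, instead of truncating with top-$k$/top-$p$, renormalize the next-token distribution to a fixed target entropy and then sample from it.

import numpy as np

def entropy_normalized_sample(probs, target_entropy, n_iters=4, eps=1e-12):
    # Hypothetical alternative to top-k/top-p: fix the entropy of the
    # next-token distribution, then sample a token id from it.
    p = np.clip(np.asarray(probs, dtype=np.float64), eps, None)
    p = p / p.sum()
    for _ in range(n_iters):
        log_p = np.log(p)
        H = -(p * log_p).sum()
        p = p ** (1 + (target_entropy - H) / (H**2 - (p * log_p**2).sum()))
        p = p / p.sum()
    return np.random.choice(len(p), p=p)

# e.g. aim for an "effective vocabulary" of roughly 50 tokens at every step
next_token_probs = np.random.dirichlet(np.ones(1000))
token_id = entropy_normalized_sample(next_token_probs, target_entropy=np.log(50))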

The main issue with using entropy normalization is that there is no clear standard for which value to "normalize to." Currently, I don't have a better idea beyond observing existing experimental results to tune the parameter, but that is not an ideal answer.

Summary

This article introduced the concept of Entropy Normalization, using a direct transformation to set the entropy of a distribution to a specified value, and brainstormed some potential applications.