The Road to Probability Distributions: A Review of Softmax and Its Alternatives

By 苏剑林 | June 14, 2024

Whether in basic classification tasks or in the ubiquitous attention mechanisms of today, the construction of probability distributions is a critical step. Specifically, this involves converting an $n$-dimensional arbitrary vector into an $n$-element discrete probability distribution. As is well known, the standard answer to this problem is Softmax, which takes the form of exponential normalization. It is relatively simple and intuitive, while also possessing many elegant properties, making it the "standard equipment" in most scenarios. Nevertheless, Softmax has some unsatisfying aspects in certain contexts, such as being insufficiently sparse or unable to reach absolute zero. Consequently, many alternatives have emerged. In this article, we will briefly summarize the relevant properties of Softmax and review and compare some of its alternatives.

Softmax Review

First, let's introduce some general notation: $\boldsymbol{x} = (x_1,x_2,\cdots,x_n)\in\mathbb{R}^n$ is the $n$-dimensional vector to be converted into a probability distribution. Its components can be positive or negative, with no fixed upper or lower bounds. $\Delta^{n-1}$ is defined as the set of all $n$-element discrete probability distributions, i.e.,

\begin{equation}\Delta^{n-1} = \left\{\boldsymbol{p}=(p_1,p_2,\cdots,p_n)\left|\, p_1,p_2,\cdots,p_n\geq 0,\sum_{i=1}^n p_i = 1\right.\right\}\end{equation}

The superscript $n-1$ rather than $n$ is used because the constraint $\sum\limits_{i=1}^n p_i = 1$ defines an $(n-1)$-dimensional hyperplane in $n$-dimensional space. Combined with the constraint $p_i\geq 0$, the set of $(p_1,p_2,\cdots,p_n)$ is just a subset of that hyperplane, so the actual number of degrees of freedom is only $n-1$. Based on this notation, the theme of this article can be stated simply as exploring mappings $\mathbb{R}^n\mapsto\Delta^{n-1}$, where $\boldsymbol{x}\in\mathbb{R}^n$ is commonly referred to as the Logits or Scores.

Basic Definition

The definition of Softmax is very simple:

\begin{equation}p_i = softmax(\boldsymbol{x})_i = \frac{e^{x_i}}{\sum\limits_{j=1}^n e^{x_j}}\end{equation}

There are countless origins and interpretations of Softmax, such as energy models, statistical mechanics, or simply as a smooth approximation of $\text{argmax}$. Therefore, it is difficult to verify its earliest source, and we will not attempt to do so here. Often, we also add a temperature parameter, considering $softmax(\boldsymbol{x}/\tau)$, but since $\tau$ itself can be integrated into the definition of $\boldsymbol{x}$, we will not particularly separate the $\tau$ parameter here. The denominator of Softmax is usually denoted as $Z(\boldsymbol{x})$, and its logarithm is the $\text{logsumexp}$ operation found in most deep learning frameworks, which is a smooth approximation of $\max$:

\begin{align}\log Z(\boldsymbol{x}) = \log \sum\limits_{j=1}^n e^{x_j} = \text{logsumexp}(\boldsymbol{x})\\ \lim_{\tau\to 0^+} \tau\,\text{logsumexp}(\boldsymbol{x}/\tau) = \max(\boldsymbol{x})\end{align}

When $\tau=1$, we can still write $\text{logsumexp}(\boldsymbol{x}) \approx \max(\boldsymbol{x})$; the larger the spread of the components of $\boldsymbol{x}$, the better the approximation. Further discussion can be found in "Seeking a Smooth Maximum Function".

Two Properties

In addition to converting arbitrary vectors into probability distributions, Softmax satisfies two properties:

\begin{align}&{\color{red}{\text{Monotonicity}}}:\quad p_i > p_j \Leftrightarrow x_i > x_j,\quad p_i = p_j \Leftrightarrow x_i = x_j \\[5pt] &{\color{red}{\text{Invariance}}}:\quad softmax(\boldsymbol{x}) = softmax(\boldsymbol{x} + c),\,\,\forall c\in\mathbb{R} \end{align}

Monotonicity means Softmax is order-preserving; the maximum/minimum values of $\boldsymbol{x}$ correspond to the maximum/minimum values of $\boldsymbol{p}$. Invariance says that if every component of $\boldsymbol{x}$ is increased by the same constant, the result of Softmax remains unchanged. This is the same property as $\text{argmax}$, i.e., $\text{argmax}(\boldsymbol{x}) = \text{argmax}(\boldsymbol{x} + c)$. Therefore, based on these two properties, we can conclude that Softmax is actually a smooth approximation of $\text{argmax}$ (more accurately, a smooth approximation of $\text{onehot}(\text{argmax}(\cdot))$). Specifically, we have:

\begin{equation}\lim_{\tau\to 0^+} softmax(\boldsymbol{x}/\tau) = \text{onehot}(\text{argmax}(\boldsymbol{x}))\end{equation}

This is likely the origin of the name Softmax. Take care not to confuse the two: Softmax is the smooth approximation of $\text{argmax}$ (in its one-hot form), while the smooth approximation of $\max$ is $\text{logsumexp}$.
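Both limits are easy to check numerically. The following minimal sketch uses SciPy's softmax and logsumexp on an arbitrary example vector:

import numpy as np
from scipy.special import softmax, logsumexp

x = np.array([1.0, 2.0, 0.5, 1.8])
for tau in [1.0, 0.1, 0.01]:
    print(tau,
          softmax(x / tau).round(3),   # tends to onehot(argmax(x)) = [0, 1, 0, 0] as tau -> 0+
          tau * logsumexp(x / tau))    # tends to max(x) = 2.0 as tau -> 0+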

Gradient Calculation

For deep learning, one important way to understand the nature of a function is to understand its gradient. For Softmax, we previously calculated this in "Looking at the Attention Scale Operation from Gradient Maximization":

\begin{equation}\frac{\partial p_i}{\partial x_j} = p_i\delta_{i,j} - p_i p_j = \left\{\begin{aligned} p_i - p_i^2,& \quad i=j\\ - p_i p_j,& \quad i\neq j \end{aligned}\right.\end{equation}

The matrix formed this way is called the Jacobian Matrix of Softmax. Its L1 norm has a simple form:

\begin{equation}\frac{1}{2}\left\Vert\frac{\partial \boldsymbol{p}}{\partial \boldsymbol{x}}\right\Vert_1=\frac{1}{2}\sum_{i,j}\left|\frac{\partial p_i}{\partial x_j}\right|=\frac{1}{2}\sum_i (p_i - p_i^2) + \frac{1}{2}\sum_{i\neq j} p_i p_j = 1 - \sum_i p_i^2\end{equation}

When $\boldsymbol{p}$ is a one-hot distribution, the above formula equals 0. This means that as the result of Softmax gets closer to one-hot, the gradient vanishing phenomenon becomes more severe. Thus, at least during the initialization phase, we cannot initialize Softmax to be close to one-hot. At the same time, the rightmost part of the above formula is related to the concept of Rényi Entropy, which is similar to the common Shannon entropy.
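As a quick numerical check of this identity (a minimal sketch; the random vector is only for illustration):

import numpy as np
from scipy.special import softmax

x = np.random.randn(6)
p = softmax(x)
jacobian = np.diag(p) - np.outer(p, p)  # dp_i/dx_j = p_i * delta_ij - p_i * p_j
print(np.abs(jacobian).sum() / 2)       # half the L1 norm of the Jacobian
print(1 - (p**2).sum())                 # equals the value above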

Reference Implementation

The direct implementation of Softmax is simple: just take the $\exp$ and normalize. The reference NumPy code is:

import numpy as np

def softmax(x):
    y = np.exp(x)
    return y / y.sum()

However, if $\boldsymbol{x}$ contains large components, calculating $\exp$ easily leads to overflow. Therefore, we usually utilize the Invariance property by subtracting the maximum value of all components from each component before calculating Softmax. This ensures that the components passed to $\exp$ are no greater than 0, preventing overflow:

def softmax(x):
    y = np.exp(x - x.max())
    return y / y.sum()

Loss Function

One of the main purposes of constructing probability distributions is for single-label multi-classification tasks. Assuming an $n$-class task where $\boldsymbol{x}$ is the model output, we hope to predict the probability of each class via $\boldsymbol{p}=softmax(\boldsymbol{x})$. To train this model, we need a loss function. Assuming the target class is $t$, a common choice is Cross-Entropy loss:

\begin{equation}\mathcal{L}_t = - \log p_t = - \log softmax(\boldsymbol{x})_t\end{equation}

We can find its gradient:

\begin{equation}-\frac{\partial \log p_t}{\partial x_j} = p_j - \delta_{t,j} = \left\{\begin{aligned} p_t - 1,& \quad j=t\\ p_j,& \quad j\neq t \end{aligned}\right.\end{equation}

Note that $t$ is given, so $\delta_{t,j}$ actually expresses the target distribution $\text{onehot(t)}$, while all $p_j$ collectively are $\boldsymbol{p}$ itself. Thus, the above equation can be written more intuitively as:

\begin{equation}-\frac{\partial \log p_t}{\partial \boldsymbol{x}} = \boldsymbol{p} - \text{onehot(t)}\label{eq:softmax-ce-grad}\end{equation}

In other words, its gradient is exactly the difference between the target distribution and the predicted distribution. As long as the two are not equal, the gradient will always exist, allowing optimization to continue. This is the advantage of cross-entropy. However, in some cases, this is also a disadvantage, because Softmax only results in one-hot when $\tau\to 0^+$. In other words, under normal circumstances, one-hot will never appear, meaning optimization never completely stops, which might lead to over-optimization. This is the motivation for some of the alternatives that follow.
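The gradient formula above can also be verified directly against finite differences; a small sketch (the example values are arbitrary):

import numpy as np
from scipy.special import softmax

x, t = np.random.randn(5), 2
grad_formula = softmax(x) - np.eye(len(x))[t]   # p - onehot(t)

grad_numeric = np.zeros_like(x)
eps = 1e-6
for j in range(len(x)):
    dx = np.zeros_like(x)
    dx[j] = eps
    grad_numeric[j] = (-np.log(softmax(x + dx)[t]) + np.log(softmax(x - dx)[t])) / (2 * eps)

print(np.allclose(grad_formula, grad_numeric, atol=1e-4))   # True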

In addition to cross-entropy, there are other losses available, such as $-p_t$, which can be understood as the negative of a smooth approximation of accuracy. However, it may suffer from gradient vanishing problems, so its optimization efficiency is often inferior to cross-entropy, generally suitable for fine-tuning rather than training from scratch. More discussion can be found in "How to Train Your Accuracy?".

Softmax Variants

Having introduced Softmax, we will now summarize some Softmax variants previously discussed on this blog, such as Margin Softmax, Taylor Softmax, and Sparse Softmax. These are derivatives based on Softmax, focusing on improvements in different aspects such as loss functions, sparsity, and long-tail issues.

Margin Softmax

First, we introduce a series of Softmax variants originating from face recognition, collectively known as Margin Softmax. They were later also applied to Sentence Embedding training. This site discussed one of these variants, AM-Softmax, in "Sentence Similarity Model Based on GRU and AM-Softmax", and later had a more general discussion in "From the Triangle Inequality to Margin Softmax".

Although Margin Softmax bears the name Softmax, it is really more of a modification to the loss function. Taking AM-Softmax as an example, it has two characteristics: first, it constructs the Logits in a $\cos$ form, i.e., $\boldsymbol{x} = [\cos(\boldsymbol{z},\boldsymbol{c}_1),\cos(\boldsymbol{z},\boldsymbol{c}_2),\cdots,\cos(\boldsymbol{z},\boldsymbol{c}_n)]/\tau$. Here the temperature parameter $\tau$ is mandatory, because the range of $\cos$ is only $[-1,1]$, which by itself cannot separate the class probabilities sufficiently. Second, it does not simply use $-\log p_t$ as the loss, but imposes a stronger requirement:

\begin{equation}\mathcal{L} = - \log \frac{e^{[\cos(\boldsymbol{z},\boldsymbol{c}_t)-m]/\tau}}{e^{[\cos(\boldsymbol{z},\boldsymbol{c}_t)-m]/\tau} + \sum_{j\neq t} e^{\cos(\boldsymbol{z},\boldsymbol{c}_j)/\tau}}\end{equation}

Intuitively, while cross-entropy wants $x_t$ to be the largest among all components of $\boldsymbol{x}$, AM-Softmax not only wants $x_t$ to be the largest but also wants it to be at least $m/\tau$ larger than the second largest component. Here, $m/\tau$ is called the Margin.
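For concreteness, here is a minimal NumPy sketch of this loss for a single sample; the function name, the feature/center variables z and C, and the particular values of m and tau are all illustrative assumptions:

import numpy as np
from scipy.special import logsumexp

def am_softmax_loss(z, C, t, m=0.25, tau=0.05):
    z = z / np.linalg.norm(z)                          # normalize the feature vector
    C = C / np.linalg.norm(C, axis=1, keepdims=True)   # normalize the class centers (rows)
    logits = C @ z / tau                               # cos(z, c_j) / tau for every class j
    logits[t] = (C[t] @ z - m) / tau                   # the target logit gets the extra margin m
    return logsumexp(logits) - logits[t]               # = -log softmax(logits)_t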

Why increase the requirement for the target class? This is due to the application scenario. As mentioned, Margin Softmax originated in face recognition and can be used for semantic retrieval in NLP. That is to say, its application scenario is retrieval, but the training method is classification. If standard classification cross-entropy is used, the features encoded by the model may not necessarily satisfy the retrieval requirements well. Therefore, adding a Margin makes the features more compact. For more specific discussion, please refer to "From the Triangle Inequality to Margin Softmax" or related papers.

Taylor Softmax

Next to be introduced is Taylor Softmax, discussed in "Even-degree Taylor Expansions of exp(x) at x=0 are Always Positive". It exploits an interesting property of the Taylor expansion of $\exp(x)$:

For any real number $x$ and even number $k$, it always holds that $f_k(x)\triangleq\sum\limits_{m=0}^k \frac{x^m}{m!} > 0$. That is, the even-degree Taylor expansion of $e^x$ at $x=0$ is always positive.

Utilizing this positivity, we can construct a Softmax variant (where $k > 0$ is any even number):

\begin{equation}taylor\text{-}softmax(\boldsymbol{x}, k)_i = \frac{f_k(x_i)}{\sum\limits_{j=1}^n f_k(x_j)}\end{equation}

Since it is constructed based on the Taylor expansion of $\exp$, Taylor Softmax has a certain approximation relationship with Softmax within a certain range. In some scenarios, Softmax can be replaced by Taylor Softmax. So what are the characteristics of Taylor Softmax? The answer is that it is more long-tailed. Because Taylor Softmax is a normalization of a polynomial function, it decays slower than an exponential function. Thus, for tail classes, Taylor Softmax often assigns a higher probability, which may help alleviate the over-confidence phenomenon of Softmax.
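A minimal NumPy sketch of this definition (the function name is our own):

import numpy as np
from math import factorial

def taylor_softmax(x, k=2):
    assert k % 2 == 0  # k must be even so that f_k(x) > 0
    f = sum(x**m / factorial(m) for m in range(k + 1))  # f_k(x_i), elementwise
    return f / f.sum()

print(taylor_softmax(np.array([1.0, 2.0, 0.5, 1.8])))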

The latest application of Taylor Softmax is replacing the Softmax in Attention, reducing the original $O(n^2)$ complexity to linear complexity. The theoretical derivation can be found in "Transformer Path to Upgrade: 5. Linear Attention as Infinite Dimensionality". The latest practice of this idea is a model named Based, which uses $e^x\approx 1+x+x^2/2$ to linearize Attention, claiming to be more efficient than Attention and better than Mamba. Detailed introductions can be found in the blog posts "Zoology (Blogpost 2): Simple, Input-Dependent, and Sub-Quadratic Sequence Mixers" and "BASED: Simple linear attention language models balance the recall-throughput tradeoff".

Sparse Softmax

Sparse Softmax is a simple sparse variant of Softmax proposed by the author during the 2020 China Legal Research Cup (CAIL). It was first published in the blog post "SPACES: Extract-Generate Long Text Summarization (CAIL Summary)", and later supplementary experiments were conducted for a simple paper "Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation".

We know that in text generation, we often use deterministic Beam Search decoding or stochastic TopK/TopP Sampling. These algorithms share the characteristic of only retaining the top few tokens with the highest predicted probabilities for traversal or sampling, which is equivalent to treating the probabilities of the remaining tokens as zero. However, if Softmax is used during training to construct the probability distribution, there is no possibility of being strictly equal to zero, creating an inconsistency between training and prediction. Sparse Softmax aims to handle this inconsistency. The idea is simple: during training, the probabilities of tokens outside the Top-$k$ are also set to zero:

\begin{array}{c|c|c} \hline & Softmax & Sparse\text{ }Softmax \\ \hline \text{Basic Definition} & p_i = \frac{e^{x_i}}{\sum\limits_{j=1}^n e^{x_j}} & p_i=\left\{\begin{aligned}&\frac{e^{x_i}}{\sum\limits_{j\in\Omega_k} e^{x_j}},\,i\in\Omega_k\\ &\quad 0,\,i\not\in\Omega_k\end{aligned}\right.\\ \hline \text{Loss Function} & \log\left(\sum\limits_{i=1}^n e^{x_i}\right) - x_t & \log\left(\sum\limits_{i\in\Omega_k} e^{x_i}\right) - x_t\\ \hline \end{array}

Where $\Omega_k$ is the set of original indices of the $k$ largest elements after sorting $x_1,x_2,\cdots,x_n$. Simply put, one performs the same truncation operation during training as during prediction. The selection method for $\Omega_k$ can also follow the Top-$p$ method of Nucleus Sampling depending on specific needs. However, it is important to note that Sparse Softmax forcibly truncates the probabilities of the remaining parts, meaning these Logits cannot undergo backpropagation. Therefore, the training efficiency of Sparse Softmax is generally lower than that of Softmax, making it mostly suitable for fine-tuning scenarios rather than training from scratch.
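Below is a minimal NumPy sketch of the table above; the function names are ours, and the loss assumes the target $t$ falls inside $\Omega_k$ (e.g., $k$ is chosen large enough):

import numpy as np
from scipy.special import logsumexp

def sparse_softmax(x, k=10):
    omega = np.argsort(x)[::-1][:k]        # Omega_k: indices of the k largest logits
    p = np.zeros_like(x)
    p[omega] = np.exp(x[omega] - x[omega].max())
    return p / p.sum()

def sparse_softmax_loss(x, t, k=10):
    omega = np.argsort(x)[::-1][:k]
    return logsumexp(x[omega]) - x[t]      # assumes t is in Omega_k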

Perturb Max

This section introduces a new way of constructing probability distributions, which we call Perturb Max. It is a generalization of Gumbel Max. This site first introduced it in the blog post "Building Discrete Probability Distributions from a Reparameterization Perspective". Furthermore, it was discussed in the paper "EXACT: How to Train Your Accuracy". As for earlier sources, the author has not investigated further.

Reflecting on the Problem

First, we know that constructing a mapping $\mathbb{R}^n\mapsto\Delta^{n-1}$ is not difficult. As long as $f(x)$ is a mapping $\mathbb{R}\mapsto \mathbb{R}^*$ (real numbers to non-negative real numbers), such as $x^2$, then letting

\begin{equation}p_i = \frac{f(x_i)}{\sum\limits_{j=1}^n f(x_j)}\end{equation}

will satisfy the conditions. What if we also want the Monotonicity property from the "Two Properties" section? That is not hard either; we just need to choose $f$ to be a monotonically increasing $\mathbb{R}\mapsto \mathbb{R}^*$ function. There are many such functions, like $\text{sigmoid}(x)$. But what if we also want Invariance? Can we easily write down a mapping $\mathbb{R}^n\mapsto\Delta^{n-1}$ that satisfies Invariance? (I certainly can't.)

Readers may ask: why emphasize Monotonicity and Invariance? Indeed, from the perspective of fitting a probability distribution, these two points may not seem strictly necessary; after all, with "brute force" through sufficiently large models, anything can be fitted. But from the perspective of a "Softmax alternative," we hope the newly defined probability distribution can still serve as a smooth approximation of $\text{argmax}$. Thus, we want to maintain as many properties shared with $\text{argmax}$ as possible. This is the main reason we wish to preserve Monotonicity and Invariance.

Gumbel Max

Perturb Max constructs such a class of distributions by generalizing Gumbel Max. Readers unfamiliar with Gumbel Max can first refer to "A Talk on Reparameterization: From Normal Distribution to Gumbel Softmax". Simply put, Gumbel Max discovered that:

\begin{equation}P[\text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon}) = i] = softmax(\boldsymbol{x})_i,\quad \boldsymbol{\varepsilon}\sim Gumbel\text{ }Noise\end{equation}

How to understand this result? First, $\boldsymbol{\varepsilon}\sim Gumbel\text{ }Noise$ means each component of $\boldsymbol{\varepsilon}$ is independently and identically sampled from a Gumbel distribution. Next, we know that given a vector $\boldsymbol{x}$, $\text{argmax}(\boldsymbol{x})$ is normally a deterministic result. But after adding random noise $\boldsymbol{\varepsilon}$, the result of $\text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon})$ becomes stochastic, so each $i$ has its own probability. Finally, Gumbel Max tells us that if Gumbel noise is added, the probability of $i$ occurring is exactly $softmax(\boldsymbol{x})_i$.

The most direct use of Gumbel Max is to provide a way to sample from the distribution $softmax(\boldsymbol{x})$. Of course, if one only needs to sample, there are simpler methods; there is no need to "crack a nut with a sledgehammer." The greatest value of Gumbel Max is reparameterization: it transfers the stochasticity of the problem from a discrete distribution with parameter $\boldsymbol{x}$ to the parameter-free noise $\boldsymbol{\varepsilon}$. Combined with the fact that Softmax is a smooth approximation of $\text{argmax}$, we get that $softmax(\boldsymbol{x} + \boldsymbol{\varepsilon})$ is a smooth approximation of Gumbel Max, namely Gumbel Softmax, a common trick for training models that contain discrete sampling modules with learnable parameters.
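Gumbel Max is easy to verify empirically; a minimal Monte Carlo sketch (the sample count and example vector are arbitrary):

import numpy as np
from scipy.special import softmax

x = np.array([1.0, 2.0, 0.5])
noise = np.random.gumbel(size=(100000, len(x)))   # i.i.d. standard Gumbel noise
freq = np.bincount((x + noise).argmax(axis=1), minlength=len(x)) / 100000
print(freq)         # empirical probabilities of argmax(x + eps) = i
print(softmax(x))   # should be close to the frequencies above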

General Noise

Perturb Max directly stems from Gumbel Max: since Softmax can be derived from the Gumbel distribution, what if we replace the Gumbel distribution with a general distribution, such as the Normal distribution? Wouldn't that derive a new form of probability distribution? That is, we define:

\begin{equation}p_i = P[\text{argmax}(\boldsymbol{x}+\boldsymbol{\varepsilon}) = i],\quad \varepsilon_1,\varepsilon_2,\cdots,\varepsilon_n\sim p(\varepsilon)\end{equation}

Repeating the derivation of Gumbel Max, we can obtain:

\begin{equation}p_i = \int_{-\infty}^{\infty} p(\varepsilon_i)\left[\prod_{j\neq i} \Phi(x_i - x_j + \varepsilon_i)\right]d\varepsilon_i = \mathbb{E}_{\varepsilon}\left[\prod_{j\neq i} \Phi(x_i - x_j + \varepsilon)\right]\end{equation}

where $\Phi(\varepsilon)$ is the cumulative distribution function (CDF) of $p(\varepsilon)$. For general distributions, even the simple standard Normal distribution, the above formula is difficult to solve analytically, so it must be estimated numerically. To obtain deterministic results, we can sample on an inverse-CDF grid: take evenly spaced $t$ in $[0,1]$, then solve $t=\Phi(\varepsilon)$ for $\varepsilon$.
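For example, with standard Normal noise the expectation can be estimated on such an inverse-CDF grid; a sketch, where scipy.stats.norm supplies $\Phi$ and $\Phi^{-1}$ and the grid size is arbitrary:

import numpy as np
from scipy.stats import norm

def perturb_max(x, n_grid=10000):
    t = (np.arange(n_grid) + 0.5) / n_grid   # evenly spaced points in (0, 1)
    eps = norm.ppf(t)                        # epsilon = Phi^{-1}(t)
    p = np.empty(len(x))
    for i in range(len(x)):
        others = np.delete(x, i)
        p[i] = norm.cdf(x[i] - others[None, :] + eps[:, None]).prod(axis=1).mean()
    return p / p.sum()                       # remove the small residual numerical error

print(perturb_max(np.array([1.0, 2.0, 0.5])))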

From the definition of Perturb Max or the final form of $p_i$, we can assert that Perturb Max satisfies Monotonicity and Invariance. We won't expand on that here. So, in which scenarios does it have a unique effect? To be honest, I don't really know. The paper "EXACT: How to Train Your Accuracy" uses it to construct new probability distributions and optimize smooth approximations of accuracy, but the author's own experiments did not show significant effects. Personally, I suspect it might show special effects in certain scenarios requiring reparameterization.

Sparsemax

Next to appear is the probability mapping named Sparsemax, from the 2016 paper "From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification". Like the Sparse Softmax I proposed, it is a modification aimed at sparsity, but the authors' motivation was to provide better interpretability in Attention. Unlike Sparse Softmax which simply forcibly truncates the Top-$k$ components, Sparsemax provides a more adaptive construction of a sparse probability distribution.

Basic Definition

The original paper defines Sparsemax as the solution to the following optimization problem:

\begin{equation}sparsemax(\boldsymbol{x}) = \mathop{\text{argmin}}\limits_{\boldsymbol{p}\in\Delta^{n-1}}\Vert \boldsymbol{p} - \boldsymbol{x}\Vert^2\label{eq:sparsemax-opt}\end{equation}

The exact expression can be solved using Lagrange multipliers. However, this method is not intuitive and does not easily reveal its relationship with Softmax. Below is a derivation I conceived that I believe is more concise.

First, we can discover that Softmax can be equivalently represented as:

\begin{equation}\boldsymbol{p} = softmax(\boldsymbol{x}) = \exp(\boldsymbol{x} - \lambda(\boldsymbol{x}))\label{eq:sparsemax-softmax}\end{equation}

where $\lambda(\boldsymbol{x})$ is a constant such that the sum of the components of $\boldsymbol{p}$ equals 1. For Softmax, we can derive $\lambda(\boldsymbol{x})=\log\sum\limits_i e^{x_i}$.

Next, in the Taylor Softmax section, we mentioned that the even-degree Taylor expansion of $\exp(x)$ is always positive, so it can be used to construct a Softmax variant. But what about odd degrees? For example, $\exp(x)\approx 1 + x$ is not always non-negative, but we can add a $\text{relu}$ to force it to be non-negative, i.e., $\exp(x)\approx \text{relu}(1 + x)$. Replacing $\exp$ in Eq. $\eqref{eq:sparsemax-softmax}$ with this approximation yields Sparsemax:

\begin{equation}\boldsymbol{p} = sparsemax(\boldsymbol{x}) = \text{relu}(1+\boldsymbol{x} - \lambda(\boldsymbol{x}))\end{equation}

where $\lambda(\boldsymbol{x})$ is still a constant that makes the components of $\boldsymbol{p}$ sum to 1. The constant $1$ can also be integrated into $\lambda(\boldsymbol{x})$, so the above is equivalent to:

\begin{equation}\boldsymbol{p} = sparsemax(\boldsymbol{x}) = \text{relu}(\boldsymbol{x} - \lambda(\boldsymbol{x}))\end{equation}

Solving Algorithm

So far, Sparsemax is only a formal definition, because the concrete way to compute $\lambda(\boldsymbol{x})$ has not yet been given; that is the problem we explore in this section. Even so, from the definition alone it is not difficult to see that Sparsemax satisfies the two properties of Monotonicity and Invariance. Readers who are still unsure can try proving it themselves.

Now we turn to the calculation of $\lambda(\boldsymbol{x})$. Without loss of generality, assume the components of $\boldsymbol{x}$ are sorted from largest to smallest, $x_1\geq x_2\geq \cdots\geq x_n$. Suppose we already know $x_k\geq \lambda(\boldsymbol{x})\geq x_{k+1}$. Then it is clear that:

\begin{equation}sparsemax(\boldsymbol{x}) = [x_1 - \lambda(\boldsymbol{x}),\cdots,x_k - \lambda(\boldsymbol{x}),0,\cdots,0]\end{equation}

According to the definition of $\lambda(\boldsymbol{x})$, we have:

\begin{equation}\sum_{i=1}^k [x_i - \lambda(\boldsymbol{x})] = 1\quad\Rightarrow\quad 1 + k\lambda(\boldsymbol{x}) = \sum_{i=1}^k x_i\end{equation}

This allows us to solve for $\lambda(\boldsymbol{x})$. Of course, we don't know $x_k\geq \lambda(\boldsymbol{x})\geq x_{k+1}$ beforehand, but we can iterate through $k=1,2,\cdots,n$ and calculate a $\lambda_k(\boldsymbol{x})$ using the formula. We then take the $\lambda_k(\boldsymbol{x})$ that satisfies $x_k\geq \lambda_k(\boldsymbol{x})\geq x_{k+1}$. This is equivalent to finding the largest $k$ such that $x_k\geq \lambda_k(\boldsymbol{x})$ and returning that corresponding $\lambda_k(\boldsymbol{x})$.

Reference Implementation:

def sparsemax(x):
    x_sort = np.sort(x)[::-1]  # sort the components in descending order
    x_lamb = (np.cumsum(x_sort) - 1) / np.arange(1, len(x) + 1)  # candidate lambda_k for k = 1, ..., n
    lamb = x_lamb[(x_sort >= x_lamb).argmin() - 1]  # pick the last k with x_(k) >= lambda_k
    return np.maximum(x - lamb, 0)

Gradient Calculation

For convenience, we introduce the notation:

\begin{equation}\Omega(\boldsymbol{x}) = \big\{k\big|x_k > \lambda(\boldsymbol{x})\big\}\end{equation}

Then we can write:

\begin{equation}p_i = sparsemax(\boldsymbol{x})_i = \left\{\begin{aligned} &x_i - \frac{1}{|\Omega(\boldsymbol{x})|}\left(-1 + \sum_{j\in\Omega(\boldsymbol{x})}x_j\right),\quad &i\in \Omega(\boldsymbol{x})\\ &0,\quad &i \not\in \Omega(\boldsymbol{x}) \end{aligned}\right.\end{equation}

From this equivalent form, we can see that like Sparse Softmax, Sparsemax only provides gradients for some classes. The Jacobian matrix can be calculated directly:

\begin{equation}\frac{\partial p_i}{\partial x_j} = \left\{\begin{aligned} &1 - \frac{1}{|\Omega(\boldsymbol{x})|},\quad &i,j\in \Omega(\boldsymbol{x}),i=j\\[5pt] &- \frac{1}{|\Omega(\boldsymbol{x})|},\quad &i,j\in \Omega(\boldsymbol{x}),i\neq j\\[5pt] &0,\quad &i \not\in \Omega(\boldsymbol{x})\text{ or }j \not\in \Omega(\boldsymbol{x}) \end{aligned}\right.\end{equation}

From this, we see that for classes inside $\Omega(\boldsymbol{x})$, Sparsemax does not suffer from gradient vanishing because its gradient is constant. However, its total gradient magnitude depends on the number of elements in $\Omega(\boldsymbol{x})$; the fewer elements, the sparser it is, meaning the gradient is also sparser.
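As a sanity check, the Jacobian formula can be compared against finite differences of the sparsemax reference implementation above (a sketch; the random input is assumed not to sit exactly on a support boundary):

import numpy as np

x = np.random.randn(6)
p = sparsemax(x)                    # reference implementation from the previous section
support = p > 0                     # Omega(x)
k = support.sum()
jac = np.zeros((len(x), len(x)))
jac[np.ix_(support, support)] = np.eye(k) - 1.0 / k   # the formula above

eps = 1e-6
jac_numeric = np.stack([(sparsemax(x + eps * np.eye(len(x))[j]) - sparsemax(x - eps * np.eye(len(x))[j])) / (2 * eps)
                        for j in range(len(x))], axis=1)
print(np.allclose(jac, jac_numeric, atol=1e-4))   # True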

Loss Function

Finally, let's discuss the loss function for Sparsemax as a classification output. An intuitive idea is to use cross-entropy $-\log p_t$, just like with Softmax. However, the output of Sparsemax can be strictly equal to 0. To prevent $\log 0$ errors, one would have to add an $\epsilon$ to each component, resulting in $-\log\frac{p_t + \epsilon}{1 + n\epsilon}$. But this is both ugly and not a convex function, making it an unideal choice.

In fact, the reason cross-entropy works so well with Softmax is that its gradient takes the form of Eq. $\eqref{eq:softmax-ce-grad}$. Therefore, for Sparsemax, let's assume the gradient of the loss function is likewise $\boldsymbol{p} - \text{onehot(t)}$, and back-calculate what the loss function should look like:

\begin{equation}\frac{\partial \mathcal{L}_t}{\partial \boldsymbol{x}} = \boldsymbol{p} - \text{onehot(t)}\quad\Rightarrow\quad \mathcal{L}_t = \frac{1}{2} - x_t + \sum_{i\in\Omega(\boldsymbol{x})}\frac{1}{2}\left(x_i^2 - \lambda^2(\boldsymbol{x})\right)\end{equation}

Verification from right to left is simple. Deriving from left to right might be a bit harder, but not excessively so; a bit of piecing things together should yield it. The leading constant $\frac{1}{2}$ ensures the non-negativity of the loss function. We can verify this with the extreme case: suppose the optimization is perfect, so that $\boldsymbol{p}$ is exactly one-hot. Then $\Omega(\boldsymbol{x})=\{t\}$ and $\lambda(\boldsymbol{x}) = x_t - 1$, so:

\begin{equation}- x_t + \sum_{i\in\Omega(\boldsymbol{x})}\frac{1}{2}\left(x_i^2 - \lambda^2(\boldsymbol{x})\right) = -x_t + \frac{1}{2}x_t^2 - \frac{1}{2}(x_t - 1)^2 = -\frac{1}{2}\end{equation}

Thus, we add the constant $\frac{1}{2}$.
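Using the sparsemax reference implementation from earlier, this loss can be written as a short sketch (the helper name is ours):

import numpy as np

def sparsemax_loss(x, t):
    p = sparsemax(x)                        # reference implementation from earlier
    support = p > 0                         # Omega(x)
    lamb = (x[support] - p[support])[0]     # on the support, p_i = x_i - lambda(x)
    return 0.5 - x[t] + 0.5 * (x[support]**2 - lamb**2).sum()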

Entmax-$\alpha$

Entmax-$\alpha$ is a generalization of Sparsemax. Its motivation is that Sparsemax is often excessively sparse, which might lead to low learning efficiency and a decline in final performance. Therefore, Entmax-$\alpha$ introduces the $\alpha$ parameter to provide a smooth transition from Softmax ($\alpha=1$) to Sparsemax ($\alpha=2$). Entmax-$\alpha$ comes from the paper "Sparse Sequence-to-Sequence Models" by Andre F. T. Martins, the same author as Sparsemax. This author has done a lot of work around sparse Softmax and sparse Attention; interested readers can check his homepage.

Basic Definition

Like Sparsemax, the original paper defines Entmax-$\alpha$ as a solution to an optimization problem similar to Eq. $\eqref{eq:sparsemax-opt}$, but this definition involves the concept of Tsallis entropy (which is the source of "Ent" in Entmax). Solving it also requires Lagrange multipliers and is relatively complex, so we will not use that approach here.

Our introduction is likewise based on the approximation $\exp(x)\approx \text{relu}(1 + x)$. For Softmax and Sparsemax, we had:

\begin{align}&{\color{red}{Softmax}}:\quad &\exp(\boldsymbol{x} - \lambda(\boldsymbol{x})) \\[5pt] &{\color{red}{Sparsemax}}:\quad &\text{relu}(1+\boldsymbol{x} - \lambda(\boldsymbol{x})) \end{align}

The reason Sparsemax is too sparse can be understood as the approximation accuracy of $\exp(x)\approx \text{relu}(1 + x)$ not being high enough. We can evolve a higher accuracy approximation from this:

\begin{equation}\exp(x) = \big[\exp(\beta x)\big]^{1/\beta} = \exp^{1/\beta}(\beta x)\approx \text{relu}^{1/\beta}(1 + \beta x)\end{equation}

As long as $0 \leq \beta < 1$, the rightmost term is a better approximation than $\text{relu}(1 + x)$ (think about why). Using this new approximation, we can construct:

\begin{equation}{\color{red}{Entmax\text{-}\alpha}}:\quad \text{relu}^{1/\beta}(1+\beta\boldsymbol{x} - \lambda(\boldsymbol{x}))\end{equation}

Here $\alpha = \beta + 1$ is used to align with the notation in the original paper; in fact, using $\beta$ is more concise. Similarly, the constant $1$ can be absorbed into the definition of $\lambda(\boldsymbol{x})$, so the final definition can be simplified to:

\begin{equation}Entmax_{\alpha}(\boldsymbol{x}) = \text{relu}^{1/\beta}(\beta\boldsymbol{x} - \lambda(\boldsymbol{x}))\end{equation}

Solving Algorithm

For a general $\beta$, solving for $\lambda(\boldsymbol{x})$ is quite troublesome and usually only possible via bisection. First, let $\boldsymbol{z}=\beta\boldsymbol{x}$ and assume $z_1\geq z_2\geq \cdots \geq z_n$. We can find that Entmax-$\alpha$ satisfies Monotonicity and Invariance. Using Invariance, we can set $z_1 = 1$ without loss of generality (if not, subtract $z_1 - 1$ from each $z_i$). Now we can test: when $\lambda=0$, the sum of all components of $\text{relu}^{1/\beta}(\beta\boldsymbol{x} - \lambda)$ is $\geq 1$. When $\lambda=1$, the sum is 0. Thus, the $\lambda(\boldsymbol{x})$ that makes the sum of components equal to 1 must be in $[0,1)$. We can then use bisection to gradually approach the optimal $\lambda(\boldsymbol{x})$.
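Below is a minimal NumPy sketch of this bisection procedure for general $0 < \beta \leq 1$ (the function name and iteration count are our own choices):

import numpy as np

def entmax_alpha(x, beta=0.5, n_iter=50):
    z = beta * x
    z = z - (z.max() - 1)                    # use Invariance so that max(z) = 1
    lo, hi = 0.0, 1.0                        # lambda(x) lies in [0, 1)
    for _ in range(n_iter):
        lamb = (lo + hi) / 2
        total = (np.maximum(z - lamb, 0) ** (1 / beta)).sum()
        lo, hi = (lamb, hi) if total >= 1 else (lo, lamb)
    p = np.maximum(z - (lo + hi) / 2, 0) ** (1 / beta)
    return p / p.sum()                       # normalize away the residual bisection error

For $\beta=1$ this reproduces Sparsemax, and for $\beta=1/2$ it agrees (up to the bisection tolerance) with the exact Entmax-1.5 algorithm below.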

For certain special values of $\beta$, we can obtain an exact solution algorithm. Sparsemax corresponds to $\beta=1$, and we have already given its solution. Another example with an analytical solution is $\beta=1/2$, which is the main case considered in the original paper; unless otherwise noted, "Entmax" typically refers to Entmax-1.5. Following the same logic as Sparsemax, assume we know $z_k\geq \lambda(\boldsymbol{x})\geq z_{k+1}$. Then we have:

\begin{equation}\sum_{i=1}^k [z_i - \lambda(\boldsymbol{x})]^2 = 1\end{equation}

This is just a quadratic equation in $\lambda(\boldsymbol{x})$, which can be solved as:

\begin{equation}\lambda(\boldsymbol{x}) = \mu_k - \sqrt{\frac{1}{k} - \sigma_k^2},\quad \mu_k = \frac{1}{k}\sum_{i=1}^k z_i,\quad \sigma_k^2 = \frac{1}{k}\left(\sum_{i=1}^k z_i^2\right) - \mu_k^2\end{equation}

When we don't know $z_k\geq \lambda(\boldsymbol{x})\geq z_{k+1}$ in advance, we can iterate $k=1,2,\cdots,n$, calculate $\lambda_k(\boldsymbol{x})$ from the formula, and pick the one satisfying $z_k\geq \lambda_k(\boldsymbol{x})\geq z_{k+1}$. Note that in this case, this is not equivalent to finding the largest $k$ such that $z_k\geq \lambda_k(\boldsymbol{x})$.

Complete Reference Implementation:

def entmax(x):
    x_sort = np.sort(x / 2)[::-1]  # z = x/2, sorted in descending order
    k = np.arange(1, len(x) + 1)
    x_mu = np.cumsum(x_sort) / k  # running means mu_k
    x_sigma2 = np.cumsum(x_sort**2) / k - x_mu**2  # running variances sigma_k^2
    x_lamb = x_mu - np.sqrt(np.maximum(1. / k - x_sigma2, 0))  # candidate lambda_k
    x_sort_shift = np.pad(x_sort[1:], (0, 1), constant_values=-np.inf)  # z_{k+1}, with z_{n+1} = -inf
    lamb = x_lamb[(x_sort >= x_lamb) & (x_lamb >= x_sort_shift)][0]  # a k with z_k >= lambda_k >= z_{k+1}
    return np.maximum(x / 2 - lamb, 0)**2

Other Content

The gradient of Entmax-$\alpha$ is largely similar to that of Sparsemax, so we won't discuss it at length here. Readers can derive it themselves or refer to the original paper. As for the loss function, back-calculating from the gradient $\frac{\partial \mathcal{L}_t}{\partial \boldsymbol{x}} = \boldsymbol{p} - \text{onehot(t)}$ is also possible, but its form is quite complex. Interested readers can refer to "Sparse Sequence-to-Sequence Models" and "Learning with Fenchel-Young Losses".

However, from the author's perspective, using a stop_gradient operator to define the loss function is simpler and more universal, avoiding the complex process of finding the primitive function:

\begin{equation}\mathcal{L}_t = \text{stop_gradient}(\boldsymbol{p} - \text{onehot(t)})\cdot \boldsymbol{x}\end{equation}

where $\cdot$ is the vector dot product. The gradient of the loss defined this way is exactly $\boldsymbol{p} - \text{onehot(t)}$. However, note that for this loss function, only the gradient is valid; its numerical value has no reference meaning (e.g., it can be positive or negative, and smaller is not necessarily better). To evaluate training progress and effectiveness, one must establish other metrics (such as cross-entropy or accuracy).
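To see why this works, note that the $\text{stop_gradient}$ factor is treated as a constant during backpropagation, so

\begin{equation}\frac{\partial \mathcal{L}_t}{\partial \boldsymbol{x}} = \frac{\partial}{\partial \boldsymbol{x}}\Big[\text{stop_gradient}(\boldsymbol{p} - \text{onehot(t)})\cdot \boldsymbol{x}\Big] = \text{stop_gradient}(\boldsymbol{p} - \text{onehot(t)}) = \boldsymbol{p} - \text{onehot(t)}\end{equation}

where the last step uses the fact that $\text{stop_gradient}$ acts as the identity in the forward pass.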

Conclusion

This article briefly reviewed and organized Softmax and some of its alternatives. The works covered include the definitions and properties of Softmax, Margin Softmax, Taylor Softmax, Sparse Softmax, Perturb Max, Sparsemax, and Entmax-$\alpha$.