By 苏剑林 | May 25, 2022
Generally, neural network outputs are unconstrained, meaning their range is $\mathbb{R}$. To obtain constrained outputs, activation functions are typically employed. For instance, if we want to output a probability distribution representing the probability of each category, we usually add Softmax as the activation function at the end. A subsequent question arises: besides Softmax, are there any other operations that can generate a probability distribution?
In "A Discussion on Reparameterization: From Normal Distribution to Gumbel Softmax", we introduced the reparameterization operation of Softmax. This article reverses that process—that is, we first define a reparameterization operation and then derive the corresponding probability distribution, thereby obtaining a new perspective on constructing probability distributions.
Problem Definition
Assume the output vector of the model is $\boldsymbol{\mu}=[\mu_1,\cdots,\mu_n]\in\mathbb{R}^n$. Without loss of generality, we assume that the $\mu_i$ are distinct. We hope to transform $\boldsymbol{\mu}$ into an $n$-dimensional probability distribution $\boldsymbol{p}=[p_1,\cdots,p_n]$ through some transformation $\mathcal{T}$, while maintaining certain properties. For example, the most basic requirements are:
\begin{equation}{\color{red}1.}\,p_i\geq 0 \qquad {\color{red}2.}\,\sum_i p_i = 1 \qquad {\color{red}3.}\,p_i \geq p_j \Leftrightarrow \mu_i \geq \mu_j\end{equation}
Of course, these requirements are trivial. As long as $f$ is a monotonically increasing function from $\mathbb{R}$ to $\mathbb{R}^+$ (for Softmax, $f(x)=e^x$), the transformation
\begin{equation}p_i = \frac{f(\mu_i)}{\sum\limits_j f(\mu_j)}\end{equation}
can satisfy the above requirements. Next, we add a less trivial condition:
\begin{equation}{\color{red}4.}\, \mathcal{T}(\boldsymbol{\mu}) = \mathcal{T}(\boldsymbol{\mu} + c\boldsymbol{1})\quad (\forall c \in \mathbb{R})\end{equation}
where $\boldsymbol{1}$ denotes the all-ones vector and $c$ is an arbitrary constant. In other words, adding the same constant to every component of $\boldsymbol{\mu}$ leaves the result of the transformation unchanged. This condition is natural because adding a constant to every component does not change the result of $\mathop{\text{argmax}}$, and ideally $\mathcal{T}$ should inherit this invariance. It is easy to verify that Softmax satisfies this condition (the constant simply cancels, as shown below); beyond Softmax, however, it is hard to think of other such transformations.
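Concretely, for Softmax the constant $c$ cancels between numerator and denominator:
\begin{equation}\frac{e^{\mu_i + c}}{\sum\limits_j e^{\mu_j + c}} = \frac{e^c\,e^{\mu_i}}{e^c\sum\limits_j e^{\mu_j}} = \frac{e^{\mu_i}}{\sum\limits_j e^{\mu_j}}\end{equation}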
Noise Perturbation
Most interestingly, we can use the inverse process of reparameterization to construct such a transformation! Suppose $\boldsymbol{\varepsilon}=[\varepsilon_1,\cdots,\varepsilon_n]$ is a vector obtained by independently and identically sampling $n$ times from a distribution $p(\varepsilon)$. Since $\boldsymbol{\varepsilon}$ is random, $\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})$ is also generally random. Thus, we can define the transformation $\mathcal{T}$ through:
\begin{equation}p_i = P[\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i]\end{equation}
Since the components of $\boldsymbol{\varepsilon}$ are independent and identically distributed, and the definition depends only on $\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})$, which in turn involves only the relative magnitudes of the components, the resulting transformation satisfies all four conditions above.
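Although the rest of this post derives the corresponding closed and semi-closed forms, the definition itself can already be evaluated by Monte Carlo sampling. Below is a minimal NumPy sketch; the function name, the choice of standard normal noise, and the sample count are illustrative assumptions, not part of the original derivation:

```python
# Minimal Monte Carlo sketch: estimate p_i = P[argmax(mu + eps) = i]
# by repeatedly sampling i.i.d. noise and counting argmax outcomes.
import numpy as np

def argmax_probabilities(mu, sample_noise, num_samples=100000):
    mu = np.asarray(mu, dtype=float)
    # Sample all noise vectors at once: shape (num_samples, n)
    eps = sample_noise((num_samples, len(mu)))
    winners = np.argmax(mu + eps, axis=1)
    counts = np.bincount(winners, minlength=len(mu))
    return counts / num_samples

# Example with standard normal noise; the result is a probability
# distribution and is unchanged if a constant is added to mu (Condition 4).
p = argmax_probabilities([1.0, 2.0, 3.0], np.random.standard_normal)
print(p)
```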
We can also judge the properties it satisfies by directly calculating the form of $p_i$. Specifically, $\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i$ means
\begin{equation}\mu_i + \varepsilon_i > \mu_j + \varepsilon_j\quad (\forall j\neq i)\end{equation}
which is equivalent to $\mu_i - \mu_j + \varepsilon_i > \varepsilon_j$. Clearly, the larger $\mu_i$ is, the more likely this inequality is to hold; that is, a larger $\mu_i$ yields a larger $p_i$, which is Condition 3. Specifically, for a fixed $\varepsilon_i$, the probability that the inequality holds for a given $j$ is
\begin{equation}\int_{-\infty}^{\mu_i - \mu_j + \varepsilon_i} p(\varepsilon_j)d\varepsilon_j = \Phi(\mu_i - \mu_j + \varepsilon_i)\end{equation}
Here $\Phi$ is the cumulative distribution function (CDF) of $p(\varepsilon)$. Since the $\varepsilon_j$ are independent and identically distributed, the probabilities for different $j$ can be multiplied together:
\begin{equation}\prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i)\end{equation}
This is the probability that $\mathop{\text{argmax}}(\boldsymbol{\mu}+\boldsymbol{\varepsilon})=i$ for a fixed $\varepsilon_i$. Finally, we only need to take the expectation over $\varepsilon_i$ to get $p_i$:
\begin{equation}p_i = \int_{-\infty}^{\infty} p(\varepsilon_i)\left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i)\right]d\varepsilon_i \label{eq:pi}\end{equation}
From the expression for $p_i$, we can see that it depends only on the differences $\mu_i - \mu_j$, so it clearly satisfies Condition 4 of the definition.
Reviewing the Old to Know the New
Comparing with the introduction of Gumbel Max in "A Discussion on Reparameterization: From Normal Distribution to Gumbel Softmax", we can see that the derivation above is exactly reparameterization run in reverse: we first fix the reparameterization scheme and then work backward to the corresponding probability distribution.
Now we can re-examine our earlier result: when the noise distribution is the Gumbel distribution, does equation $\eqref{eq:pi}$ recover the ordinary Softmax operation? Gumbel noise is obtained from $u\sim U[0,1]$ via $\varepsilon = -\log(-\log u)$. Since the CDF of $u\sim U[0,1]$ is $u$ itself, solving for $u$ gives $u=e^{-e^{-\varepsilon}}$, which is precisely the cumulative distribution function of the Gumbel distribution, i.e., $\Phi(\varepsilon)=e^{-e^{-\varepsilon}}$; the density is its derivative, $p(\varepsilon)=\Phi'(\varepsilon)=e^{-\varepsilon-e^{-\varepsilon}}$.
Substituting the above results into equation $\eqref{eq:pi}$ gives
\begin{equation}\begin{aligned}
p_i =&\, \int_{-\infty}^{\infty} e^{-\varepsilon_i-e^{-\varepsilon_i}} e^{-\sum\limits_{j\neq i}e^{-\varepsilon_i + \mu_j - \mu_i}} d\varepsilon_i \\
=&\, \int_{-\infty}^0 e^{-e^{-\varepsilon_i}\left(1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}\right)} d(-e^{-\varepsilon_i}) \\
=&\, \int_{-\infty}^0 e^{t\left(1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}\right)} dt\\
=&\, \frac{1}{1+\sum\limits_{j\neq i}e^{\mu_j - \mu_i}} = \frac{e^{\mu_i}}{\sum\limits_j e^{\mu_j }}
\end{aligned}\end{equation}
This is exactly Softmax. Thus, we have once again verified the correspondence between Gumbel Max and Softmax.
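As a quick numerical sanity check (not part of the original post), one can sample Gumbel noise via $\varepsilon=-\log(-\log u)$ and compare the empirical argmax frequencies with the ordinary Softmax of $\boldsymbol{\mu}$; the two should agree up to Monte Carlo error:

```python
# Verify numerically that Gumbel-perturbed argmax frequencies match Softmax.
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
num_samples = 200000

u = np.random.rand(num_samples, len(mu))       # u ~ U[0, 1]
eps = -np.log(-np.log(u))                      # Gumbel noise
counts = np.bincount(np.argmax(mu + eps, axis=1), minlength=len(mu))

softmax = np.exp(mu) / np.exp(mu).sum()
print(counts / num_samples)  # Monte Carlo estimate of p
print(softmax)               # should be close to the estimate above
```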
Numerical Computation
Obtaining a closed-form solution like Softmax, as in the Gumbel case, is extremely rare; at least for now, the author cannot find a second example. In most cases, therefore, we can only approximate equation $\eqref{eq:pi}$ numerically. Since $p(\varepsilon)=\Phi'(\varepsilon)$, we can rewrite $p(\varepsilon_i)d\varepsilon_i$ as $d\Phi(\varepsilon_i)$ to obtain:
\begin{equation}p_i = \int_0^1 \left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \varepsilon_i)\right]d\Phi(\varepsilon_i)\end{equation}
Let $t=\Phi(\varepsilon_i)$, then
\begin{equation}\begin{aligned}
p_i =&\, \int_0^1 \left[\prod_{j\neq i} \Phi(\mu_i - \mu_j + \Phi^{-1}(t))\right]dt \\
\approx&\, \frac{1}{K}\sum_{k=1}^K\prod_{j\neq i} \Phi\left(\mu_i - \mu_j + \Phi^{-1}\left(\frac{k}{K+1}\right)\right)
\end{aligned}\end{equation}
where $\Phi^{-1}$ is the inverse function of $\Phi$, known in probability theory as the quantile function (or percent point function).
From the formula above, we can see that as long as we know an analytical expression for $\Phi$, we can compute $p_i$ approximately. Note that we do not need an analytical expression for $\Phi^{-1}$, because the values $\Phi^{-1}\left(\frac{k}{K+1}\right)$ at the sampling points can be precomputed by other numerical methods.
Taking the standard normal distribution as an example, $\Phi(x)=\frac{1}{2} \left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$, and mainstream deep learning frameworks essentially all provide a built-in $\text{erf}$ function, so computing $\Phi(x)$ poses no problem; as for $\Phi^{-1}\left(\frac{k}{K+1}\right)$, it can be precomputed with scipy.stats.norm.ppf. Therefore, when $\boldsymbol{\varepsilon}$ is drawn from a standard normal distribution, computing $p_i$ is straightforward in mainstream deep learning frameworks.
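Putting the pieces together, here is a minimal NumPy/SciPy sketch of the quadrature approximation above for standard normal noise; the function name and the choice $K=100$ are illustrative assumptions, not prescriptions from the original post:

```python
# Quadrature approximation of p_i for standard normal noise:
# p_i ≈ (1/K) * sum_k prod_{j != i} Phi(mu_i - mu_j + Phi^{-1}(k/(K+1)))
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def normal_cdf(x):
    # Phi(x) for the standard normal distribution, via erf
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def normal_noise_softmax(mu, K=100):
    mu = np.asarray(mu, dtype=float)
    # Precompute the quantile points Phi^{-1}(k/(K+1)), k = 1..K
    quantiles = norm.ppf(np.arange(1, K + 1) / (K + 1))
    diff = mu[:, None] - mu[None, :]           # diff[i, j] = mu_i - mu_j
    p = np.zeros_like(mu)
    for q in quantiles:
        vals = normal_cdf(diff + q)
        np.fill_diagonal(vals, 1.0)            # drop the j = i factor
        p += vals.prod(axis=1)
    return p / K

print(normal_noise_softmax([1.0, 2.0, 3.0]))   # an approximately normalized distribution
```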
Summary
This article generalizes Softmax from the perspective of reparameterization, obtaining a class of probability normalization methods with similar properties.