A Few Inequalities for the logsumexp Operation

By 苏剑林 | May 10, 2022

$\text{logsumexp}$ is an operation frequently encountered in machine learning, especially in the implementation and derivation of cross-entropy. At the same time, it is a smooth approximation of the $\max$ function (refer to "Seeking a Smooth Maximum Function"). Let $x=(x_1, x_2, \cdots, x_n)$; $\text{logsumexp}$ is defined as:

\begin{equation}\text{logsumexp}(x) = \log \sum_{i=1}^n e^{x_i}\end{equation}

This article introduces several inequalities related to $\text{logsumexp}$ that may be useful in theoretical derivations.
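As a concrete reference, here is a minimal numpy sketch of the operation (the helper name `logsumexp` and the max-shift trick are implementation choices for numerical stability, not part of the definition; `scipy.special.logsumexp` provides an equivalent ready-made version):

```python
import numpy as np

def logsumexp(x):
    """Compute log(sum(exp(x))) with the usual max-shift for numerical stability."""
    x = np.asarray(x, dtype=float)
    x_max = np.max(x)
    # exp(x - x_max) cannot overflow, and the identity
    # logsumexp(x) = x_max + logsumexp(x - x_max) restores the result.
    return x_max + np.log(np.sum(np.exp(x - x_max)))

x = np.array([1.0, 2.0, 3.0])
print(logsumexp(x))                # ~3.4076
print(np.log(np.sum(np.exp(x))))   # same value, computed naively
```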

Basic Bounds

Let $x_{\max} = \max(x_1, x_2, \cdots, x_n)$. Then (assuming $n \geq 2$, so that at least one more term contributes to the sum) it is obvious that:

\begin{equation}e^{x_{\max}} < \sum_{i=1}^n e^{x_i} \leq \sum_{i=1}^n e^{x_{\max}} = ne^{x_{\max}}\end{equation}

Taking the logarithm of each part gives:

\begin{equation}x_{\max} < \text{logsumexp}(x) \leq x_{\max} + \log n\end{equation}

This is the most basic result regarding the upper and lower bounds of $\text{logsumexp}$. It indicates that the approximation error of $\text{logsumexp}$ relative to $\max$ does not exceed $\log n$, a bound that does not depend on $x$ itself. Applying the same result to $x/\tau$ for a temperature $\tau > 0$ gives:

\begin{equation}x_{\max}/\tau < \text{logsumexp}(x/\tau) \leq x_{\max}/\tau + \log n\end{equation}

Multiplying each part by $\tau$ gives:

\begin{equation}x_{\max} < \tau\text{logsumexp}(x/\tau) \leq x_{\max} + \tau\log n\end{equation}

As $\tau \to 0^+$, the error bound $\tau\log n$ tends to 0. This tells us that we can make the approximation to $\max$ arbitrarily tight by decreasing the temperature parameter $\tau$.
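A quick numerical sanity check of these bounds, on an arbitrary test vector and a few arbitrary temperatures:

```python
import numpy as np
from scipy.special import logsumexp

x = np.array([0.5, -1.2, 3.0, 2.7])
n, x_max = len(x), np.max(x)

# Basic bounds: x_max < logsumexp(x) <= x_max + log(n)
lse = logsumexp(x)
assert x_max < lse <= x_max + np.log(n)

# Temperature scaling: tau * logsumexp(x / tau) approaches x_max as tau -> 0+
for tau in [1.0, 0.1, 0.01]:
    gap = tau * logsumexp(x / tau) - x_max
    print(tau, gap)                       # the gap shrinks roughly in proportion to tau
    assert gap <= tau * np.log(n) + 1e-12
```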

Average Bounds

We know that $e^x$ is a convex function, satisfying Jensen's Inequality $\mathbb{E}[e^x] \geq e^{\mathbb{E}[x]}$. Therefore:

\begin{equation}\frac{1}{n}\sum_{i=1}^n e^{x_i} \geq e^{\bar{x}}\end{equation}

Here $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. Multiplying both sides by $n$ and taking the logarithm yields:

\begin{equation}\text{logsumexp}(x) \geq \bar{x} + \log n\end{equation}

This is another lower bound for $\text{logsumexp}$. It can be further generalized to weighted averages: let $p_1, p_2, \cdots, p_n \geq 0$ with $\sum_{i=1}^n p_i = 1$. By the Cauchy-Schwarz inequality, we have:

\begin{equation}\left[\sum_{i=1}^n (e^{x_i/2})^2\right]\left[\sum_{i=1}^n p_i^2\right] \geq \left[\sum_{i=1}^n p_i e^{x_i/2}\right]^2\end{equation}

Applying Jensen's inequality to the expression inside the brackets on the right side gives:

\begin{equation}\left[\sum_{i=1}^n p_i e^{x_i/2}\right]^2 \geq \left[e^{\left(\sum_{i=1}^n p_i x_i/2\right)}\right]^2 = e^{\left(\sum_{i=1}^n p_i x_i\right)}\end{equation}

Taking the logarithm of both sides and rearranging, we get:

\begin{equation}\text{logsumexp}(x) \geq \sum_{i=1}^n p_i x_i - \log \sum_{i=1}^n p_i^2\end{equation}

If from the start we use the more general Hölder's inequality instead of Cauchy-Schwarz (applying it with conjugate exponents $t$ and $t/(t-1)$ to $p_i$ and $e^{x_i(t-1)/t}$), we can also obtain:

\begin{equation}\text{logsumexp}(x) \geq \sum_{i=1}^n p_i x_i - \frac{1}{t-1}\log\sum_{i=1}^n p_i^t, \quad \forall t > 1\end{equation}

In particular, taking the limit $t \to 1^+$ (by L'Hôpital's rule, $\lim_{t\to 1^+}\frac{1}{t-1}\log\sum_{i=1}^n p_i^t = \sum_{i=1}^n p_i\log p_i$), we obtain:

\begin{equation}\text{logsumexp}(x) \geq \sum_{i=1}^n p_i x_i - \sum_{i=1}^n p_i \log p_i\end{equation}

This can be equivalently rewritten as $\sum_{i=1}^n p_i \log \frac{p_i}{e^{x_i}/Z} \geq 0$, where $Z=e^{\text{logsumexp}(x)}$ is the normalization factor. In other words, the inequality is just the non-negativity of the KL divergence between the distribution $p$ and the softmax distribution $e^{x_i}/Z$.
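These lower bounds are easy to check numerically; in the sketch below the random vector, the Dirichlet-sampled weights, and the grid of $t$ values are arbitrary choices:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
x = rng.normal(size=8)
p = rng.dirichlet(np.ones(8))       # random weights with p_i > 0 and sum p_i = 1

lhs = logsumexp(x)

# Hölder family: logsumexp(x) >= sum(p*x) - log(sum(p**t)) / (t-1) for t > 1;
# t = 2 recovers the Cauchy-Schwarz bound with -log(sum(p**2)).
for t in [1.5, 2.0, 4.0]:
    assert lhs >= np.dot(p, x) - np.log(np.sum(p**t)) / (t - 1)

# t -> 1 limit: logsumexp(x) >= sum(p*x) + H(p), i.e. KL(p || softmax(x)) >= 0.
entropy = -np.dot(p, np.log(p))
assert lhs >= np.dot(p, x) + entropy

# Uniform weights recover the average bound logsumexp(x) >= mean(x) + log(n).
assert lhs >= np.mean(x) + np.log(len(x))
```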

Lipschitz Constraint

Under the infinity norm, $\text{logsumexp}$ is also 1-Lipschitz, namely:

\begin{equation}|\text{logsumexp}(x) - \text{logsumexp}(y)| \leq \|x - y\|_{\infty}\end{equation}

where $\|x - y\|_{\infty} = \max_i |x_i - y_i|$ (actually, writing it as $\|x - y\|_{\max}$ would be more intuitive). The proof is not difficult. Define:

\begin{equation}f(t) = \text{logsumexp}(tx + (1-t)y), \quad t \in [0, 1]\end{equation}

Considering it as a univariate function of $t$, by the Mean Value Theorem, there exists $\varepsilon \in (0, 1)$ such that:

\begin{equation}f'(\varepsilon) = \frac{f(1) - f(0)}{1 - 0} = \text{logsumexp}(x) - \text{logsumexp}(y)\end{equation}

It is not hard to find that:

\begin{equation}f'(\varepsilon) = \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}(x_i - y_i)}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}}\end{equation}

Therefore:

\begin{equation}\begin{aligned} &|\text{logsumexp}(x) - \text{logsumexp}(y)| = \left|\frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}(x_i - y_i)}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}}\right| \\ \leq & \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i} |x_i - y_i|}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}} \leq \frac{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i} \|x - y\|_{\infty}}{\sum_{i=1}^n e^{\varepsilon x_i + (1-\varepsilon)y_i}} = \|x - y\|_{\infty} \end{aligned}\end{equation}
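Both the intermediate derivative formula (a softmax-weighted average of $x_i - y_i$) and the final Lipschitz bound can be checked numerically; the random test vectors and the value $t = 0.3$ below are arbitrary:

```python
import numpy as np
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(1)

# 1-Lipschitz with respect to the infinity norm
for _ in range(1000):
    x, y = rng.normal(size=(2, 10))
    assert abs(logsumexp(x) - logsumexp(y)) <= np.max(np.abs(x - y)) + 1e-12

# f'(t) is a softmax-weighted average of (x_i - y_i): compare with a central difference
x, y, t, h = rng.normal(size=10), rng.normal(size=10), 0.3, 1e-6
numeric = (logsumexp((t + h) * x + (1 - t - h) * y)
           - logsumexp((t - h) * x + (1 - t + h) * y)) / (2 * h)
analytic = np.dot(softmax(t * x + (1 - t) * y), x - y)
assert abs(numeric - analytic) < 1e-5
```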

Convex Function

Finally, a very strong conclusion: $\text{logsumexp}$ is also a convex function! This means that all inequalities related to convex functions apply to $\text{logsumexp}$, such as the most basic Jensen's Inequality:

\begin{equation}\mathbb{E}[\text{logsumexp}(x)] \geq \text{logsumexp}(\mathbb{E}[x])\end{equation}
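For instance, a Monte Carlo check of this Jensen inequality on Gaussian inputs (the mean vector, sample size, and distribution are arbitrary choices):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
samples = rng.normal(loc=[0.0, 1.0, -0.5], scale=1.0, size=(100000, 3))  # draws of x

mean_of_lse = np.mean(logsumexp(samples, axis=1))   # estimate of E[logsumexp(x)]
lse_of_mean = logsumexp(np.mean(samples, axis=0))   # logsumexp of the empirical E[x]
print(mean_of_lse, lse_of_mean)
assert mean_of_lse >= lse_of_mean
```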

To prove that $\text{logsumexp}$ is a convex function, we need to show that for any $t \in [0, 1]$, the following holds:

\begin{equation}t\text{logsumexp}(x) + (1-t)\text{logsumexp}(y) \geq \text{logsumexp}(tx + (1-t)y)\end{equation}

The proof process is essentially a basic application of Hölder's Inequality. Specifically, we have:

\begin{equation}t\text{logsumexp}(x) + (1-t)\text{logsumexp}(y) = \log \left(\sum_{i=1}^n e^{x_i}\right)^t \left(\sum_{i=1}^n e^{y_i}\right)^{(1-t)}\end{equation}

Applying Hölder's Inequality directly, we obtain:

\begin{equation}\log \left(\sum_{i=1}^n e^{x_i}\right)^t \left(\sum_{i=1}^n e^{y_i}\right)^{(1-t)} \geq \log \sum_{i=1}^n e^{tx_i + (1-t)y_i} = \text{logsumexp}(tx + (1-t)y)\end{equation}

This proves that $\text{logsumexp}$ is a convex function.
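The two-point inequality itself can again be verified numerically on random inputs (arbitrary test vectors and interpolation weights):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(3)
for _ in range(1000):
    x, y = rng.normal(size=(2, 6))
    t = rng.uniform()
    left = t * logsumexp(x) + (1 - t) * logsumexp(y)
    right = logsumexp(t * x + (1 - t) * y)
    assert left >= right - 1e-12    # convexity, up to floating-point tolerance
```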

Summary

This article summarized several inequalities related to the $\text{logsumexp}$ operation for future reference.