By 苏剑林 | November 24, 2020
I recently came across an interesting conclusion:
For any real number $x$ and any even number $n$, it always holds that $\sum\limits_{k=0}^n \frac{x^k}{k!} > 0$. That is, the even-order Taylor expansion of $e^x$ at $x=0$ is always positive.
Below, we will look at the proof of this conclusion and its application in finding alternatives to the softmax function.
Proof Process
This appears to be a very strong result. Will the proof be complicated? In fact, the proof is very simple. Let's denote
\begin{equation}f_n(x) = \sum\limits_{k=0}^n \frac{x^k}{k!}\end{equation}
When $n$ is even, we have $\lim\limits_{x\to\pm\infty} f_n(x)=+\infty$, so the function tends to $+\infty$ in both directions and therefore attains a global minimum somewhere on the real line. It suffices to show that this minimum value is greater than 0. Moreover, since $f_n(x)$ is a smooth polynomial, its global minimum must be attained at a local minimum, which is a stationary point. Put differently, we only need to show that the function value at every stationary point (whether a local minimum or a local maximum) is greater than 0.
The standard way to find stationary points is to take the derivative. A beautiful property of $f_n(x)$ is that its derivative satisfies:
\begin{equation}f_n'(x) = f_{n-1}(x)\end{equation}
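This identity follows directly from term-by-term differentiation:
\begin{equation}f_n'(x) = \sum\limits_{k=1}^n \frac{k\,x^{k-1}}{k!} = \sum\limits_{k=1}^n \frac{x^{k-1}}{(k-1)!} = \sum\limits_{k=0}^{n-1} \frac{x^k}{k!} = f_{n-1}(x)\end{equation}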
Stationary points satisfy $f_n'(x)=0$, which implies $f_{n-1}(x)=0$. At these points, we have:
\begin{equation}f_n(x) = f_{n-1}(x) + \frac{x^n}{n!} = \frac{x^n}{n!} \geq 0 \,\,\,( \text{when } n \text{ is even})\end{equation}
Thus the function value at every stationary point is non-negative, and hence $f_n(x) \geq 0$ for all $x$. Moreover, $x=0$ is not a stationary point (since $f_n'(0) = f_{n-1}(0) = 1 \neq 0$), so every stationary point has $x \neq 0$ and therefore $\frac{x^n}{n!} > 0$ there; the inequality can be strengthened to $f_n(x) > 0$. The proof is complete.
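As a quick numerical sanity check (independent of the proof), we can evaluate $f_n$ on a wide grid and confirm positivity. The sketch below is Python/NumPy, with the grid range and orders chosen arbitrarily:

```python
import numpy as np
from math import factorial

def f(x, n):
    """Even-order Taylor polynomial of e^x: sum_{k=0}^n x^k / k!."""
    return sum(x**k / factorial(k) for k in range(n + 1))

xs = np.linspace(-20.0, 20.0, 100001)  # a wide grid of real inputs
for n in (2, 4, 6, 8, 10):
    values = f(xs, n)
    print(n, values.min() > 0)  # prints True for every even n tested
```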
Application Scenario
In fact, I came across this conclusion in a recent arXiv paper titled "Exploring Alternatives to Softmax Function". The original paper gives a relatively involved proof by mathematical induction; the proof above is one I came up with myself, and I find it simpler and clearer.
So why did the original paper need this conclusion? As the name suggests, it was to explore alternatives to softmax. As we know, the standard way in machine learning to turn a $d$-dimensional output $\boldsymbol{x}$ into a probability distribution is to apply the softmax function:
\begin{equation}softmax(\boldsymbol{x})_i = \frac{e^{x_i}}{\sum\limits_{j=1}^d e^{x_j}}\end{equation}
Since $f_n(x) > 0$ when $n$ is even, and $f_n(x)$ approximates $e^x$ near $x=0$, replacing $e^x$ with $f_n(x)$ also yields a valid normalization function:
\begin{equation}taylor\text{-}softmax(\boldsymbol{x}, n)_i = \frac{f_n(x_i)}{\sum\limits_{j=1}^d f_n(x_j)}\end{equation}
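To make the construction concrete, here is a minimal NumPy sketch of both normalizations; the function names and the max-shift stabilization in softmax are my own choices, not taken from the paper:

```python
import numpy as np
from math import factorial

def softmax(x):
    """Standard softmax; shifting by max(x) avoids overflow in exp."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def taylor_softmax(x, n=2):
    """Normalize with the even-order Taylor polynomial f_n instead of exp.
    Positivity of f_n for even n guarantees valid probabilities."""
    assert n % 2 == 0, "n must be even so that f_n(x) > 0 everywhere"
    f = sum(x**k / factorial(k) for k in range(n + 1))
    return f / f.sum()

x = np.array([1.0, 2.0, 5.0])
print(softmax(x))            # heavily dominated by the largest logit
print(taylor_softmax(x, 2))  # noticeably less peaked / over-confident
```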
The original paper conducted several experiments showing that $taylor\text{-}softmax$ provides some improvement over the regular softmax:

Comparison of softmax and its Taylor expansion approximation effects
Brief Review
However, in my view, these experimental results are hardly convincing, especially since the baselines used are quite weak (it's 2020, at least run a ResNet, right?). Furthermore, the original paper does not provide much intuitive insight into why this alternative works; it simply performs basic experiments and states that it "works," which is somewhat crude.
Nevertheless, despite the paper's many shortcomings, I believe the proposed $taylor\text{-}softmax$ might indeed be effective. Going from softmax to $taylor\text{-}softmax$ essentially changes the activation from an exponential function to a polynomial one. What difference does that make? We know that when $|x|$ is large, $e^x$ grows or decays very rapidly, which is a direct cause of softmax's tendency to produce over-confident predictions (probabilities very close to 0 or 1). Polynomials grow far less aggressively, so they are less prone to over-confidence and, in turn, less prone to overfitting.
A similar change appeared in the classic dimensionality-reduction method t-SNE. Its predecessor SNE used an exponential probability distribution analogous to softmax and was found to suffer from the "Crowding Problem" (see "The Principle of Minimum Entropy (IV): 'Birds of a Feather Flock Together' from Libraries to Word Vectors"). t-SNE replaced the exponential kernel with one built from a quadratic, $\frac{1}{1+d^2}$ (the Cauchy/Student-t kernel), which greatly improved the results. I feel the idea behind $taylor\text{-}softmax$ shares some of the same spirit as t-SNE.
Maintaining Monotonicity
In fact, it can also be proven that $f_n(x)$ has only one local minimum point globally, so its graph is characteristically "U-shaped," as shown below:

Graph of $f_n(x)$
Some readers with a touch of obsessive-compulsiveness might be bothered by the non-monotonicity of $f_n(x)$, worrying that a non-monotonic normalizer hides some potential issue. In reality, there is no clear evidence that the transformation producing a probability distribution has to be monotonic. But if it still bothers you, you can simply truncate the function. As mentioned, $f_n(x)$ has a single minimum point $x_n^*$: to the right of it the function is monotonically increasing, and to the left of it we can simply clamp the function at its minimum value. That is, define:
\begin{equation}\tilde{f}_{n}(x)=\left\{\begin{aligned}&f_n(x),\quad x > x_n^*\\ &f_n(x_n^*),\quad x\leq x_n^*\end{aligned}\right.\end{equation}
Then use $\tilde{f}_n(x)$ instead of $f_n(x)$ for normalization. For a fixed $n$, $x_n^*$ and $f_n(x_n^*)$ can be calculated in advance using numerical methods:
| $n$ | $x_n^*$ | $f_n(x_n^*)$ |
|---|---|---|
| 2 | -1 | 0.5 |
| 4 | -1.59607 | 0.270395 |
| 6 | -2.18061 | 0.149325 |
| 8 | -2.759 | 0.0832715 |
| 10 | -3.33355 | 0.0466991 |
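These values are easy to reproduce: $x_n^*$ is just the single real root of $f_n'(x) = f_{n-1}(x)$. The sketch below (NumPy, with function names of my own choosing) recomputes the table and implements the clamped $\tilde{f}_n$:

```python
import numpy as np
from math import factorial

def f(x, n):
    """Even-order Taylor polynomial of e^x."""
    return sum(x**k / factorial(k) for k in range(n + 1))

def x_star(n):
    """Location of the unique minimum of f_n: the single real root of
    f_n'(x) = f_{n-1}(x), a polynomial of odd degree n-1."""
    coeffs = [1.0 / factorial(k) for k in range(n - 1, -1, -1)]  # highest degree first
    roots = np.roots(coeffs)
    return float(roots[np.abs(roots.imag) < 1e-9].real[0])

def f_tilde(x, n):
    """Monotone variant: clamp f_n at its minimum value for x <= x_n^*.
    In practice x_n^* and f_n(x_n^*) would be precomputed constants."""
    xs = x_star(n)
    x = np.asarray(x, dtype=float)
    return np.where(x > xs, f(x, n), f(xs, n))

for n in (2, 4, 6, 8, 10):
    xs = x_star(n)
    print(n, round(xs, 5), round(f(xs, n), 6))  # matches the table above (up to rounding)
```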
Summary
The main purpose of this article was to introduce the interesting conclusion that "the even-order Taylor expansion of $e^x$ at $x=0$ is always positive," and briefly discuss its application in searching for softmax alternatives.