SquarePlus: Possibly the Simplest Algebraic Smooth Approximation of ReLU

By 苏剑林 | December 29, 2021

The ReLU function, namely $\max(x,0)$, is one of the most common activation functions. However, its non-differentiability at $x=0$ is often regarded as a "drawback," and various smooth approximations have been proposed, such as SoftPlus, GeLU, and Swish. All of these smooth approximations involve at least one exponential operation $e^x$ (SoftPlus even requires a logarithm), so from an "efficiency-conscious" perspective their computational cost is not negligible (although with modern GPU acceleration we rarely notice this overhead). Recently, a paper titled "Squareplus: A Softplus-Like Algebraic Rectifier" proposed a simpler approximation called SquarePlus, which we discuss here.

Let me point out upfront that I do not recommend spending too much time on selecting and designing activation functions. I am sharing this paper mainly to provide a reference result and to offer a small exercise for readers to work through.

Definition

The form of SquarePlus is very simple, using only addition, multiplication, division, and square root:

\begin{equation}\text{SquarePlus}(x)=\frac{x+\sqrt{x^2+b}}{2}\end{equation}

where $b > 0$. When $b=0$, it exactly reduces to $\text{ReLU}(x)=\max(x,0)$. The inspiration for SquarePlus comes roughly from:

\begin{equation}\max(x,0)=\frac{x+|x|}{2}=\frac{x+\sqrt{x^2}}{2}\end{equation}

Therefore, to fix the differentiability at $x=0$, a constant $b > 0$ is added inside the square root (to prevent division-by-zero issues in the derivative).
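
As a quick illustration of the definition, here is a minimal NumPy sketch; the function name squareplus and the choice $b=1$ (the same value used in the figures below) are mine for this example, not prescribed by the paper:

import numpy as np

def squareplus(x, b=1.0):
    # SquarePlus(x) = (x + sqrt(x^2 + b)) / 2; with b = 0 this is exactly ReLU(x) = max(x, 0).
    return (x + np.sqrt(x**2 + b)) / 2

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(squareplus(x))        # smooth and everywhere positive for b > 0
print(np.maximum(x, 0))     # ReLU, for comparison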

The original paper points out that, because it uses only addition, multiplication, division, and square root, SquarePlus is faster (primarily on the CPU) than functions like SoftPlus:

[Figure: Speed comparison of SquarePlus with other similar functions]

Of course, if you don't care about this slight speedup, then as mentioned at the beginning of this article, just treat it as a mathematical exercise.

Properties

Like the SoftPlus function ($\log(e^x+1)$), SquarePlus is globally monotonically increasing and strictly greater than ReLU, as shown in the figure below (using $b=1$ for SquarePlus):

[Figure: ReLU, SoftPlus, and SquarePlus function curves (I)]

Direct differentiation also reveals its monotonicity:

\begin{equation}\frac{d}{dx}\text{SquarePlus}(x)=\frac{1}{2}\left(1+\frac{x}{\sqrt{x^2+b}}\right) > 0\end{equation}

As for the second derivative:

\begin{equation}\frac{d^2}{dx^2}\text{SquarePlus}(x)=\frac{b}{2(x^2+b)^{3/2}}\end{equation}

which is also strictly greater than 0, so SquarePlus is a convex function.
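
As a sanity check on these two formulas (not part of the original post, just an illustrative sketch), one can compare them with central finite differences at a few points:

import numpy as np

def squareplus(x, b=1.0):
    return (x + np.sqrt(x**2 + b)) / 2

b, h = 1.0, 1e-4
for x in [-2.0, 0.0, 1.5]:
    d1_fd = (squareplus(x + h, b) - squareplus(x - h, b)) / (2 * h)   # numerical first derivative
    d1 = (1 + x / np.sqrt(x**2 + b)) / 2                              # analytic first derivative
    d2_fd = (squareplus(x + h, b) - 2 * squareplus(x, b) + squareplus(x - h, b)) / h**2   # numerical second derivative
    d2 = b / (2 * (x**2 + b)**1.5)                                    # analytic second derivative
    print(x, round(d1_fd, 6), round(d1, 6), round(d2_fd, 6), round(d2, 6))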

Approximation

Now, there are two exercises we can perform:

1. For what value of $b$ is SquarePlus always greater than or equal to SoftPlus?

2. For what value of $b$ is the error between SquarePlus and SoftPlus minimized?

For the first question, solving $\text{SquarePlus}(x) \geq \text{SoftPlus}(x)$ directly yields:

\begin{equation}b \geq 4\log(e^x+1)\left[\log(e^x+1) - x\right]=4\log(e^x+1)\log(e^{-x}+1)\end{equation}

For this to hold universally, $b$ must be greater than or equal to the maximum value of the right-hand side. We can prove that the maximum value of the right-hand side occurs at $x=0$, so $b \geq 4\log^2 2 = 1.921812\cdots$. Thus, the first question is solved.

Proof: Note that

\begin{equation}\frac{d^2}{dx^2}\log\log(e^x+1)=\frac{e^x(\log(e^x+1)-e^x)}{(e^x+1)^2\log^2(e^x+1)} < 0\end{equation}

Therefore, $\log\log(e^x+1)$ is a concave function. By Jensen's Inequality:

\begin{equation}\frac{1}{2}\left(\log\log(e^x+1) + \log\log(e^{-x}+1)\right) \leq \log\log(e^{(x+(-x))/2}+1)=\log\log 2\end{equation}

This implies $\log\left(\log(e^x+1)\log(e^{-x}+1)\right) \leq 2\log\log 2$, or $\log(e^x+1)\log(e^{-x}+1) \leq \log^2 2$. Multiplying both sides by 4 gives the required conclusion. The equality holds when $x=-x$, i.e., $x=0$.
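
For a quick numerical spot-check of this bound (again just an illustrative sketch, not part of the original argument), one can verify on a grid that SquarePlus with $b = 4\log^2 2$ never drops below SoftPlus:

import numpy as np

b = 4 * np.log(2)**2                 # ≈ 1.921812, the threshold derived above
x = np.linspace(-20, 20, 200001)
squareplus = (x + np.sqrt(x**2 + b)) / 2
softplus = np.logaddexp(x, 0)        # numerically stable log(e^x + 1)
gap = squareplus - softplus
print(gap.min())                     # non-negative up to floating-point error
print(x[np.argmin(gap)])             # the minimum sits at x = 0, matching the equality case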

As for the second question, we first need a criterion for the "error." As in the previous article "How the Two Elementary Function Approximations of GELU Were Derived", we cast it as a $\min\text{-}\max$ problem with no extra parameters:

\begin{equation}\min_{b} \max_x \left|\frac{x+\sqrt{x^2+b}}{2} - \log(e^x+1)\right|\end{equation}

I cannot find an analytical solution for this problem, so I currently solve it numerically:

import numpy as np
from scipy.optimize import minimize

def f(x, a):
    # Absolute error between SquarePlus and SoftPlus, parameterizing b = a**2 so that b >= 0.
    return np.abs((x + np.sqrt(x**2 + a**2)) / 2 - np.log(np.exp(x) + 1))

def g(a):
    # Maximum error over a dense grid of x values.
    x = np.arange(-2, 4, 0.0001)
    return np.max(f(x, a))

# Minimize the maximum error over a with the derivative-free Powell method.
options = {'xtol': 1e-10, 'ftol': 1e-10, 'maxiter': 100000}
result = minimize(g, 0, method='Powell', options=options)
b = result.x**2
print(b)

The final result is $b=1.52382103\cdots$, with a maximum error of $0.075931\cdots$. The comparison is as follows:

[Figure: ReLU, SoftPlus, and SquarePlus function curves (II)]
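
As a quick check (an illustrative sketch only), plugging the fitted value back into the error on the same grid reproduces the maximum error reported above:

import numpy as np

b = 1.52382103
x = np.arange(-2, 4, 0.0001)
err = np.abs((x + np.sqrt(x**2 + b)) / 2 - np.log(np.exp(x) + 1))
print(err.max())        # ≈ 0.075931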

Summary

There isn't much to summarize: this post simply introduced a smooth approximation of ReLU, along with two small exercises on its properties.