Using Dirac Functions to Construct Smooth Approximations of Non-smooth Functions

By 苏剑林 | October 10, 2021

In machine learning, we often encounter non-smooth functions. However, our optimization methods are usually gradient-based, and smooth models, whose gradients are continuous, are generally easier to optimize. This creates a practical need for smooth approximations of non-smooth functions. In fact, this blog has discussed related topics multiple times, such as "Seeking a Smooth Maximum Function" and "On Function Smoothing: Differentiable Approximations of Non-differentiable Functions". However, the methods used in those discussions lacked generality.

Recently, I learned a relatively general approach from the paper "SAU: Smooth activation function using convolution with approximate identities": using Dirac functions to construct smooth approximations. How general is it? In theory, any function with a countable number of discontinuities can be smoothly approximated using this method! I find this quite interesting.

The Dirac Function

In a very early article, "The Eerie Dirac Function", we introduced the Dirac function. In modern mathematics, the Dirac function is defined as a "functional" rather than a "function," but for most readers it is easier to simply treat it as a function.

Simply put, the Dirac function $\delta(x)$ satisfies:

1. $\forall x \neq 0, \delta(x) = 0$;
2. $\delta(0) = \infty$;
3. $\int_{-\infty}^{\infty} \delta(x) dx = 1$.

Intuitively, $\delta(x)$ can be viewed as the probability density of a continuous distribution over all real numbers $\mathbb{R}$ whose probability mass is concentrated entirely at $x=0$: its mean is 0 and its variance is also 0, so any sample drawn from it must be 0. Therefore, the following identity holds:

\begin{equation}\int_{-\infty}^{\infty} f(x)\delta(x) dx = f(0)\end{equation}

Or

\begin{equation}\int_{-\infty}^{\infty} f(y)\delta(x-y) dy = f(x)\label{eq:base}\end{equation}

This is arguably the most important property of the Dirac function and is the primary identity we will use moving forward.

Smooth Approximation

If we can find a smooth approximation of $\delta(x)$, denoted as $\varphi(x) \approx \delta(x)$, then according to $\eqref{eq:base}$, we have:

\begin{equation}g(x) = \int_{-\infty}^{\infty} f(y)\varphi(x-y) dy \approx f(x)\end{equation}

Since $\varphi(x)$ is smooth, $g(x)$ is also smooth. This means that $g(x)$ serves as a smooth approximation of $f(x)$! This is the core idea of constructing a smooth approximation of $f(x)$ by leveraging smooth approximations of the Dirac function. In this process, there are very few restrictions on the form or continuity of $f(x)$; for instance, $f(x)$ is allowed to have a countable number of jump discontinuities (such as the floor function $[x]$).
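
To make this concrete, here is a minimal NumPy sketch (the helper name `smooth_approx` and the parameters `sigma` and `half_width` are my own choices for illustration, not from the paper): it evaluates $g(x) = \int f(y)\varphi(x-y)dy$ by numerical quadrature, using a narrow Gaussian as $\varphi$ and the floor function as the non-smooth $f$.

```python
import numpy as np

def smooth_approx(f, x, sigma=0.05, half_width=5.0, num=20001):
    """Approximate g(x) = ∫ f(y) φ(x - y) dy by a simple Riemann sum,
    with φ a narrow Gaussian of width sigma centered at 0."""
    y = np.linspace(x - half_width, x + half_width, num)
    dy = y[1] - y[0]
    kernel = np.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(f(y) * kernel) * dy

# The floor function has jump discontinuities, yet its smoothed version
# stays close to it away from the jumps and varies smoothly across them.
for x in [0.3, 0.97, 1.5, 2.03]:
    print(x, smooth_approx(np.floor, x))
```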

So, what are some smooth approximations of the Dirac function? There are several readily available ones, such as:

\begin{equation}\delta(x) = \lim_{\sigma\to 0} \frac{e^{-x^2/2\sigma^2}}{\sqrt{2\pi}\sigma}\label{eq:g}\end{equation}

Or

\begin{equation}\delta(x)=\frac{1}{\pi} \lim_{a \to 0}\frac{a}{x^2+a^2}\end{equation}

Simply put, it involves finding a non-negative function with a bell-shaped curve like the normal distribution, and making the width of the bell approach zero while keeping the integral equal to 1. Another approach is to note that:

\begin{equation}\int_{-\infty}^x \delta(t)dt = \theta(x) = \left\{\begin{aligned}1,\,\, (x > 0) \\ 0,\,\, (x < 0)\end{aligned}\right.\end{equation}

That is, the integral of the Dirac function is the "unit step function" $\theta(x)$. If we can find a smooth approximation of $\theta(x)$, then its derivative will yield a smooth approximation of the Dirac function. Smooth approximations of $\theta(x)$ are the so-called "S-shaped" curves, such as the sigmoid function $\sigma(x)=1/(1+e^{-x})$. Thus, we have:

\begin{equation}\delta(x) = \lim_{t\to \infty} \frac{d}{dx}\sigma(tx) = \lim_{t\to \infty} \frac{e^{tx}t}{(1+e^{tx})^2}\label{eq:s}\end{equation}

Equations $\eqref{eq:g}$ and $\eqref{eq:s}$ are the two most commonly used approximations.
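
As a quick numerical sanity check (a sketch of my own, not code from the paper), the snippet below verifies that both kernels integrate to approximately 1 and that their mass concentrates around 0 as $\sigma\to 0$ and $t\to\infty$:

```python
import numpy as np

x = np.linspace(-10, 10, 200001)
dx = x[1] - x[0]

def gaussian_kernel(x, sigma):
    # Gaussian approximation of the Dirac function: a normal density with std sigma
    return np.exp(-x ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def sigmoid_kernel(x, t):
    # Sigmoid-derivative approximation: d/dx sigmoid(t x), in an overflow-safe form
    z = np.exp(-t * np.abs(x))          # always <= 1, so no overflow for large t
    return t * z / (1 + z) ** 2

for sigma, t in [(1.0, 1.0), (0.1, 10.0), (0.01, 100.0)]:
    g, s = gaussian_kernel(x, sigma), sigmoid_kernel(x, t)
    near_zero = np.abs(x) < 0.5
    print(sigma, t,
          np.sum(g) * dx, np.sum(g[near_zero]) * dx,   # total mass vs. mass near 0
          np.sum(s) * dx, np.sum(s[near_zero]) * dx)
```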

ReLU Activation

Now, let's use the aforementioned logic as a tool to derive various smooth approximations for the ReLU activation function $\max(x,0)$.

By using Equation $\eqref{eq:s}$, we get:

\begin{equation}\begin{aligned} \max(x,0)\approx&\, \int_{-\infty}^{\infty} \frac{e^{t(x-y)}t}{(1+e^{t(x-y)})^2} \max(y,0) dy\\ =&\,\int_0^{\infty} \frac{e^{t(x-y)}ty}{(1+e^{t(x-y)})^2}dy=\frac{\log(1+e^{tx})}{t} \end{aligned}\end{equation}

When $t=1$, this is the SoftPlus activation function.
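
To double-check this calculation numerically (my own sketch, with helper names chosen for illustration), the following snippet compares the closed form $\log(1+e^{tx})/t$ with a direct quadrature of the convolution integral at $t=1$:

```python
import numpy as np

def softplus(x, t=1.0):
    # log(1 + e^{t x}) / t, written stably as max(x, 0) + log(1 + e^{-t |x|}) / t
    return np.maximum(x, 0) + np.log1p(np.exp(-t * np.abs(x))) / t

def relu_logistic_convolution(x, t=1.0, half_width=50.0, num=400001):
    # Direct quadrature of ∫ t e^{t(x-y)} / (1 + e^{t(x-y)})^2 · max(y, 0) dy
    y = np.linspace(x - half_width, x + half_width, num)
    dy = y[1] - y[0]
    z = np.exp(-t * np.abs(x - y))                # overflow-safe logistic density
    kernel = t * z / (1 + z) ** 2
    return np.sum(kernel * np.maximum(y, 0)) * dy

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(x, softplus(x), relu_logistic_convolution(x))
```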

If we instead use Equation $\eqref{eq:g}$, the result is:

\begin{equation}\begin{aligned} \max(x,0)\approx&\, \int_{-\infty}^{\infty} \frac{e^{-(x-y)^2/2\sigma^2}}{\sqrt{2\pi}\sigma} \max(y,0) dy\\ =&\,\int_0^{\infty} \frac{e^{-(x-y)^2/2\sigma^2} y}{\sqrt{2\pi}\sigma}dy\\ =&\,\frac{1}{2} \left[x \,\text{erf}\left(\frac{x}{\sqrt{2} \sigma}\right)+x+\sqrt{\frac{2}{\pi }} \sigma e^{-\frac{x^2}{2 \sigma^2}}\right] \end{aligned}\end{equation}

This smooth approximation of ReLU seems not to have been studied much before.
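
The same kind of numerical check works here as well (again a sketch of my own, using `scipy.special.erf`): the erf-based closed form above agrees with direct quadrature of the Gaussian convolution.

```python
import numpy as np
from scipy.special import erf

def relu_gauss_closed_form(x, sigma=1.0):
    # (1/2) [x erf(x / (√2 σ)) + x + √(2/π) σ e^{-x² / (2σ²)}]
    return 0.5 * (x * erf(x / (np.sqrt(2) * sigma)) + x
                  + np.sqrt(2 / np.pi) * sigma * np.exp(-x ** 2 / (2 * sigma ** 2)))

def relu_gauss_convolution(x, sigma=1.0, half_width=12.0, num=200001):
    # Direct quadrature of ∫ φ_σ(x - y) · max(y, 0) dy with a Gaussian kernel φ_σ
    y = np.linspace(x - half_width * sigma, x + half_width * sigma, num)
    dy = y[1] - y[0]
    kernel = np.exp(-(x - y) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.sum(kernel * np.maximum(y, 0)) * dy

for x in [-2.0, 0.0, 1.0, 3.0]:
    print(x, relu_gauss_closed_form(x), relu_gauss_convolution(x))
```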

Of course, for a function as simple as ReLU, there are even simpler approaches. For example, note that $\max(x,0) = x\theta(x)$, where $\theta(x)$ is the unit step function introduced earlier; the problem then shifts to finding a smooth approximation of $\theta(x)$. We already know the sigmoid is one such approximation, so we immediately obtain:

\begin{equation}\max(x,0)\approx x\sigma(tx)\end{equation}

When $t=1$, this is the Swish activation function. If we instead approximate $\theta(x)$ with the Gaussian kernel $\eqref{eq:g}$, we obtain:

\begin{equation}\begin{aligned} \max(x,0)\approx&\, x\int_{-\infty}^{\infty} \frac{e^{-(x-y)^2/2\sigma^2}}{\sqrt{2\pi}\sigma} \theta(y) dy\\ =&\,x\int_0^{\infty} \frac{e^{-(x-y)^2/2\sigma^2}}{\sqrt{2\pi}\sigma}dy =\frac{1}{2}\left[x + x\,\text{erf}\left(\frac{x}{\sqrt{2}\sigma}\right)\right] \end{aligned}\end{equation}

When $\sigma=1$, this is the GeLU activation function.

(Image of ReLU function and its several smooth approximations)
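
A figure along these lines can be reproduced with a few lines of matplotlib (a sketch of my own under the same parameter choices $t=1$ and $\sigma=1$; the output file name is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import erf

x = np.linspace(-4, 4, 801)
relu = np.maximum(x, 0)
softplus = relu + np.log1p(np.exp(-np.abs(x)))     # log(1 + e^x), i.e. t = 1
swish = x / (1 + np.exp(-x))                       # x · sigmoid(x), i.e. t = 1
gelu = 0.5 * (x + x * erf(x / np.sqrt(2)))         # erf form with σ = 1

plt.plot(x, relu, label='ReLU')
plt.plot(x, softplus, label='SoftPlus')
plt.plot(x, swish, label='Swish')
plt.plot(x, gelu, label='GeLU')
plt.legend()
plt.savefig('relu_smooth_approximations.png')
```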

The Rounding Function

Readers might find the preceding examples trivial, since those approximations are already well known and can be derived without the Dirac function. So let us now give a non-trivial example: a smooth approximation of the rounding function.

Rounding to an integer comes in two forms, rounding up (ceiling) and rounding down (floor); they differ in definition but not in essence. Here we take the floor function as our example, defined by:

\begin{equation}[x] = n, \quad \text{where } n \in \mathbb{Z} \text{ is the unique integer such that } x \in [n, n + 1)\end{equation}

Assuming $\varphi(x)$ is some smooth approximation of the Dirac function, then:

\begin{equation} [x] \approx \int_{-\infty}^{\infty} \varphi(x-y)[y]dy = \sum_{n=-\infty}^{\infty}n\int_n^{n+1} \varphi(x-y)dy\end{equation}

Let $\Phi(x) = \int_{-\infty}^x \varphi(t)dt$ be the antiderivative of $\varphi(x)$, i.e., the corresponding smooth approximation of the step function. Then the antiderivative of $\varphi(x-y)$ with respect to $y$ is $-\Phi(x-y)$, and thus:

\begin{equation}\begin{aligned}[] [x]\approx&\,\sum_{n=-\infty}^{\infty}n\big[\Phi(x-n) - \Phi(x-n-1)\big]\\ =&\,\lim_{M,N\to\infty}\sum_{n=-M}^{N}(n-1)\Phi(x-n) - n\Phi(x-n-1) + \Phi(x-n)\\ =&\,\lim_{M,N\to\infty} -N\Phi(x-N-1) - (M+1)\Phi(x+M) + \sum_{n=-M}^{N} \Phi(x-n) \end{aligned}\end{equation}

Since $\Phi(-\infty)=0$ and $\Phi(\infty)=1$, if the range we care about satisfies $-M \ll x \ll N$, then $\Phi(x-N-1)\approx 0$ and $\Phi(x+M)\approx 1$, so within this range:

\begin{equation}\begin{aligned}[] [x]\approx&\, -M-1 + \sum_{n=-M}^{N} \Phi(x-n)\\ =&\,\sum_{n=-M}^0 \big[\Phi(x-n)-1\big] + \sum_{n=1}^N \Phi(x-n) \end{aligned}\end{equation}

Using $\Phi(x)=\sigma(tx)$ as an example, taking $t=10, M=5, N=10$, the result is as follows:

(Visualization of the smooth approximation of the floor function)

As we can see, it is indeed quite close to $[x]$, and increasing $t$ can further improve the degree of approximation.
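
Here is a minimal NumPy implementation of this approximation (the function name `smooth_floor` is my own; it uses the same $\Phi(x)=\sigma(tx)$ with $t=10$, $M=5$, $N=10$):

```python
import numpy as np

def smooth_floor(x, t=10.0, M=5, N=10):
    """Smooth approximation of floor(x), valid roughly for -M << x << N,
    using Phi(x) = sigmoid(t * x) as the smoothed unit step function."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for n in range(-M, N + 1):
        phi = 1.0 / (1.0 + np.exp(-t * (x - n)))   # Phi(x - n)
        total += (phi - 1.0) if n <= 0 else phi    # subtract 1 for terms with n <= 0
    return total

xs = np.array([-1.5, 0.3, 1.5, 2.5, 3.6])
print(np.floor(xs))      # exact floor values
print(smooth_floor(xs))  # smooth approximation, close except very near the jumps
```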

Conclusion

This article introduced a method for constructing smooth approximations using the Dirac function. Its characteristic feature is its generality, placing no strict requirements on the original function. As examples, we used it to derive various common approximations for the ReLU function and a smooth approximation for the floor function.