On the Design of Activation Functions in Neural Networks

By 苏剑林 | October 26, 2017

Activation functions are the source of non-linearity in neural networks. Without these functions, the entire network would consist only of linear operations. Since the composition of linear operations is still linear, the final effect would be equivalent to a single-layer linear model.

So, what are the common activation functions? Furthermore, what principles should guide the choice of an activation function? Can any non-linear function serve as an activation function?

The activation functions explored here are those for hidden layers, not for the output layer. The final output generally uses specific activation functions that cannot be changed arbitrarily; for example, binary classification typically uses the sigmoid function, multi-class classification typically uses softmax, and so on. In contrast, there is more room for choice regarding activation functions in the hidden layers.

Even Floating Point Errors Work!

Theoretically, any non-linear function has the potential to be an activation function. A very convincing example is a recent successful attempt by OpenAI to use floating-point errors as an activation function. For details, please read OpenAI's blog: https://blog.openai.com/nonlinear-computation-in-linear-networks/

Alternatively, read the introduction by Synced (Jiqi Zhixin): https://mp.weixin.qq.com/s/PBRzS4Ol_Zst35XKrEpxdw

Nonetheless, training cost varies from one activation function to another. Although OpenAI's exploration showed that even floating-point error can serve as an activation function, the operation is not differentiable, so they had to train the model with "evolution strategies", i.e. algorithms in the same family as genetic algorithms, which are time-consuming and labor-intensive.

ReLU Paved the Way

Does requiring differentiability, so that the model can be trained by gradient descent, solve every problem? Not necessarily. In the early days of neural networks, the sigmoid function was generally used as the activation function:

\begin{equation}\text{sigmoid}(x)=\sigma(x)=\frac{1}{1+e^{-x}}\end{equation}

The characteristic of this function is that the left end approaches 0 and the right end approaches 1. Both ends are saturated, as shown below:

[Figure: graph of the sigmoid function]

Because of this, its derivative approaches 0 at both ends. Since we optimize with gradient descent, and the size of each update is proportional to the gradient, a near-zero derivative means the updates are tiny and progress stalls. The problem worsens as the number of layers grows: by the chain rule, the gradient reaching the early layers is a product of the per-layer derivatives, so the update shrinks roughly like the $n$-th power of such a derivative. This is why early neural networks could not be made very deep.
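As a quick numerical check of this (not tied to any particular framework), here is a minimal NumPy sketch of the sigmoid, its derivative, and how a product of such derivatives decays with depth; the test points are chosen only for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 (at x = 0) and vanishes towards both ends
for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))  # 0.25, ~0.105, ~0.0066, ~4.5e-05

# With n sigmoid layers, the chain rule multiplies n such factors together,
# so even in the best case the signal shrinks like 0.25 ** n
print(0.25 ** 10)  # ~9.5e-07
```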

A landmark activation function is the ReLU function, whose definition is very simple:

\begin{equation}\text{relu}(x)=\max(x,0)\end{equation}

Its graph is as follows:

[Figure: graph of the ReLU function]

This is a piecewise linear function whose derivative is clearly 1 on the positive axis and 0 on the negative axis. This guarantees that half of the real line is unsaturated. The sigmoid, by contrast, is saturated almost everywhere (the proportion of saturated region tends to 1, where "saturated" means the derivative is very close to 0).
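For concreteness, here is a minimal sketch (in NumPy, purely illustrative) of ReLU and its derivative, with the convention that the derivative at exactly 0 is taken to be 0:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # 1 on the positive axis, 0 on the negative axis (0 chosen at x = 0)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.   0.   0.   0.5  2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```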

Because ReLU is piecewise linear, its non-linearity is rather weak, so ReLU networks generally need to be very deep. That actually suits us well: at the same level of performance, depth usually matters more than width, since deeper models tend to generalize better. Thus, since the advent of ReLU, a variety of deep models have been proposed; a landmark event was probably the success of the VGG model on ImageNet, and subsequent developments need no elaboration here.

Better Swish

Despite ReLU's brilliant track record, some felt that having half of the domain saturated is still a significant shortcoming, so variants such as LeakyReLU and PReLU were proposed; the modifications are all along similar lines.

A few days ago, the Google Brain team proposed a new activation function called Swish. Information can be found here: http://mp.weixin.qq.com/s/JticD0itOWH7Aq7ye1yzvg

Its definition is:

\begin{equation}\text{swish}(x)=x\cdot\sigma(x)=\frac{x}{1+e^{-x}}\end{equation}

Its graph is as follows:

[Figure: graph of the Swish function]

The team's test results show that this function outperforms ReLU in many models.
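Implementation-wise, Swish is a one-liner; the following NumPy sketch (with a few arbitrary test points) shows that it behaves like $x$ for large positive inputs and decays to 0 for large negative ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    # swish(x) = x * sigmoid(x): ~x for large positive x, ~0 for large negative x
    return x * sigmoid(x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(swish(x))  # [-0.0335 -0.2689  0.      0.7311  4.9665]
```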

From the graph, Swish looks similar to ReLU; the only substantial difference lies in the negative region near zero. At the risk of sounding like hindsight, I had considered this activation function myself, since it resembles the GLU activation proposed by Facebook. The GLU activation is:

\begin{equation}(\boldsymbol{W}_1\boldsymbol{x}+\boldsymbol{b}_1)\otimes \sigma(\boldsymbol{W}_2\boldsymbol{x}+\boldsymbol{b}_2)\end{equation}

In other words, two sets of parameters are trained; one branch is passed through a sigmoid and then multiplied element-wise with the other. Here, $\sigma(\boldsymbol{W}_2\boldsymbol{x}+\boldsymbol{b}_2)$ is called the "gate", which is the "G" in GLU. Swish essentially ties the two sets of parameters together, so only one set is trained.
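To make the relationship concrete, here is a small NumPy sketch (layer sizes and random weights are arbitrary, for illustration only) of a GLU layer and of the tied-parameter special case, which reduces to applying Swish to a single pre-activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glu(x, W1, b1, W2, b2):
    # one branch supplies the values, the other (through a sigmoid) acts as the gate
    return (x @ W1 + b1) * sigmoid(x @ W2 + b2)

def swish_layer(x, W, b):
    # tying the two branches together (W1 = W2, b1 = b2) gives swish(xW + b)
    h = x @ W + b
    return h * sigmoid(h)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                      # a toy batch of 4 inputs
W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
b1, b2 = np.zeros(16), np.zeros(16)
print(glu(x, W1, b1, W2, b2).shape)              # (4, 16)
print(np.allclose(glu(x, W1, b1, W1, b1),        # GLU with tied parameters
                  swish_layer(x, W1, b1)))       # True
```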

Improvement Ideas

Swish has stirred up some controversy. Some believe Google Brain has made a mountain out of a molehill—that improving an activation function is something small teams can do, and major teams like Google Brain should pursue more "high-end" directions. Regardless, Google Brain conducted many experiments, and the results all indicate that Swish is superior to ReLU. Therefore, we need to consider: what is the reason behind this?

The following analysis is purely my own subjective conjecture and currently lacks theoretical or experimental proof; please read it with discretion. I believe a very important reason why Swish outperforms ReLU is related to initialization.

Swish is unsaturated near the origin and only saturates far out on the negative axis, whereas ReLU is saturated over half the space even around the origin. When training models, we typically initialize with a uniform or normal distribution, and in either case the mean is generally 0. This means that at initialization roughly half of the pre-activations fall into ReLU's saturated region, so the corresponding parameters are effectively unused from the very start. This is especially true with strategies such as BN (Batch Normalization), whose outputs are automatically close to a zero-mean normal distribution, so again half of the units land in ReLU's saturated zone. By comparison, Swish does slightly better because it keeps a certain unsaturated region on the negative axis, which leads to higher parameter utilization.
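As a rough numerical illustration of this conjecture (nothing more than that), the sketch below draws zero-mean normal pre-activations and estimates the fraction of units whose gradient is essentially zero under ReLU versus Swish; the 0.01 threshold is an arbitrary choice:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu_grad(x):
    return (x > 0).astype(float)

def swish_grad(x):
    # d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s + x * s * (1.0 - s)

# Zero-mean pre-activations, as produced by typical initialization or by BN
x = np.random.randn(1_000_000)

dead = 0.01  # arbitrary threshold for "derivative essentially zero"
print((np.abs(relu_grad(x)) < dead).mean())   # ~0.5: half the units start saturated
print((np.abs(swish_grad(x)) < dead).mean())  # only a few percent
```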

As mentioned earlier, I had considered the Swish activation function but never looked into it deeply. One reason was that it didn't strike me as concise or elegant enough; I even found it a bit ugly. Seeing how good Swish's experimental results are, I wondered whether there is a similar but more aesthetically pleasing activation function, and I came up with one:

\begin{equation}x\cdot\min(1,e^x)\end{equation}

Its graph is:

[Figure: graph of the proposed activation function $x\cdot\min(1,e^x)$]

Actually, it looks quite similar to Swish. The idea was to keep $x$ on the positive axis, and on the negative axis to find a function that first decreases, then increases, and tends to 0. I thought of $xe^{-x}$, and with a slight adjustment I arrived at the function above. In some of my models (specifically my Q&A models), its performance was even slightly better than Swish. Of course, I only ran a handful of experiments and have neither the energy nor the computing power for extensive comparative testing.

[Figure: comparison with Swish; the orange curve is Swish]

Note that if you want to use this function, you cannot implement it literally in this form, because computing $e^x$ can overflow for large $x$. A form that cannot overflow is:

\begin{equation}\max(x, x\cdot e^{-|x|})\end{equation}

Or using the ReLU function, it can be written as:

\begin{equation}x + \text{relu}(x\cdot e^{-|x|}-x)\end{equation}
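As a sanity check, here is a NumPy sketch of both stable forms (the same expressions should carry over to a framework such as Keras or TensorFlow); they agree with $x\cdot\min(1,e^x)$ while never evaluating $e^x$ for large positive inputs:

```python
import numpy as np

def my_act(x):
    # equals x * min(1, exp(x)), but exp(-|x|) <= 1 can never overflow
    return np.maximum(x, x * np.exp(-np.abs(x)))

def my_act_relu_form(x):
    # the equivalent ReLU-based form from the text
    relu = lambda t: np.maximum(t, 0.0)
    return x + relu(x * np.exp(-np.abs(x)) - x)

x = np.linspace(-10.0, 10.0, 101)
print(np.allclose(my_act(x), my_act_relu_form(x)))                  # True
print(np.allclose(my_act(x[x < 0]), x[x < 0] * np.exp(x[x < 0])))   # True: x * e^x on the negative axis
print(my_act(np.array([1000.0])))  # [1000.]; the naive exp(1000) would overflow
```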

Could it be that all effective activation functions resemble the checkmark (√) on our homework?